Production Environment Architecture Proposal
This page collects the requirements for a production environment and proposes architectures that could meet them.
Workload Requirements
- An average user generates about 100 requests per minute
- At peak times we expect at most 2500 active users
- Write load is relatively low, around 1 request per minute per user
- The initial page load is 2 MB; each further request is 3 KB on average (this might change dramatically once image support is introduced)
- Users have a low tolerance for data corruption or loss (aim: tolerate 3 concurrent disk failures, keep a backup log to counter corruption)
- Users have a medium tolerance for availability outages (aim: no prolonged outages; a couple of seconds are okay)
- We should be using as much of our existing infrastructure as possible to reduce costs
- At least two availability zones (AZs) for resilience against natural disasters
- Automated recovery from failures, since we don't have an always-available sysadmin
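To get a feeling for the numbers, the workload figures above can be turned into a rough peak-load estimate. This is only a back-of-the-envelope sketch: it assumes every active user produces the average request rate at peak, that 1 KB means 1000 bytes, and it ignores the 2 MB initial page loads because their frequency is not stated here. The function and key names are made up for illustration.

```python
# Rough peak-load estimate from the requirement figures on this page.
PEAK_USERS = 2500                 # max. active users at peak times
REQUESTS_PER_USER_PER_MIN = 100   # average requests per user
WRITES_PER_USER_PER_MIN = 1       # write requests per user
AVG_RESPONSE_KB = 3               # average size of a non-initial request


def peak_load():
    """Return requests/s, writes/s and steady-state bandwidth at peak."""
    total_rps = PEAK_USERS * REQUESTS_PER_USER_PER_MIN / 60
    write_rps = PEAK_USERS * WRITES_PER_USER_PER_MIN / 60
    # KB/s -> Mbit/s, assuming 1 KB = 1000 bytes (initial page loads excluded)
    bandwidth_mbit_s = total_rps * AVG_RESPONSE_KB * 8 / 1000
    return {
        "total_rps": total_rps,
        "write_rps": write_rps,
        "bandwidth_mbit_s": bandwidth_mbit_s,
    }


if __name__ == "__main__":
    for name, value in peak_load().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions the system has to serve roughly 4200 requests per second at peak, of which only about 42 per second are writes, at around 100 Mbit/s of response traffic.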
Solution 1 - Master/Slave with Master in Aachen and Slave on Titan
We currently have a rented machine named Titan, and according to Ingo, Aachen will soon get a new, relatively fast server. We could use the two to set up a master/slave architecture with the master in Aachen and a read-only slave on Titan.
Detailed specs for both machines are currently not known to me. I also don't know what uplink each server will have, but as far as I know Aachen is in the DFN, which should be enough, and Titan is professionally hosted, which should be enough as well.
- DNS global: Two entries, one for Aachen and one for Titan, effectively splitting requests 50/50
- Service discovery: Make sure requests for a service resolve to the local node/AZ. Traefik needs to be able to see the remote network
- Reverse proxying: Make sure core and alastair on Titan receive nothing but GET requests; reverse-proxy everything else to Aachen (the latency hit on writes is not too tragic)
- MongoDB: Create a 3-node replica set with Titan as a priority: 0 member (won't become primary) and Aachen hosting the primary and one arbiter
- PostgreSQL: Create a master/slave setup with the master in Aachen
- Virtual networks: Unencrypted overlay within each node/AZ; encrypted overlay between the nodes, containing the reverse proxies and the databases
- Storage: RAID-6 with daily backups in Aachen; Titan can be ephemeral. Weekly backup copies from Aachen to Titan (in case Aachen burns down)
- Orchestrator: Kubernetes with master in Aachen
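The MongoDB item above could be sketched as a replica set configuration document. This is a minimal illustration, not a tested setup: the replica set name and all hostnames are hypothetical, and it assumes every member keeps the default of one vote. The `priority` and `arbiterOnly` fields are the standard replica set member options; the two helper functions are made up here to show who can become primary and which site keeps a voting majority.

```python
# Sketch of a replica set config as it could be passed to MongoDB's
# replSetInitiate command. Set name and hostnames are hypothetical.
REPLICA_SET_CONFIG = {
    "_id": "rs0",
    "members": [
        # Data-bearing member in Aachen, the only one that can become primary
        {"_id": 0, "host": "mongo.aachen.example:27017", "priority": 1},
        # Arbiter in Aachen: votes in elections but holds no data
        {"_id": 1, "host": "arbiter.aachen.example:27017", "arbiterOnly": True},
        # Titan: replicates data and votes, but never becomes primary
        {"_id": 2, "host": "mongo.titan.example:27017", "priority": 0},
    ],
}


def electable_hosts(config):
    """Hosts that may become primary: not an arbiter and priority > 0."""
    return [
        m["host"]
        for m in config["members"]
        if not m.get("arbiterOnly", False) and m.get("priority", 1) > 0
    ]


def has_majority(config, reachable_hosts):
    """True if the reachable members hold a strict majority of the votes.

    Assumes one vote per member (the default)."""
    reachable = sum(1 for m in config["members"] if m["host"] in reachable_hosts)
    return reachable > len(config["members"]) // 2
```

With this layout Aachen holds two of the three votes, so after a network split Aachen can keep its primary while Titan on its own cannot elect one and stays a read-only secondary.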
Evaluation
- This assumes the two machines can handle the combined workload [1, 2, 3, 4]
- A failure of Aachen would make the whole system read-only [6, 7]
- A failure of Titan would not affect the system
- A network split would make Titan read-only and would not affect requests to Aachen; split-brain situations are not possible
- A simultaneous failure of 2 live HDDs can be tolerated in Aachen
- A simultaneous failure of 3 live HDDs can be recovered from the local backup, losing at most a day of data
- A complete failure of all storage in Aachen can be recovered from the backup copy on Titan, losing at most a week of data
- Any HDD failure on Titan would not affect the system
- Kubernetes would solve automated recovery in case of service crashes [9]
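The storage-related points of this evaluation can be summarized as a small decision function. It is only an illustrative sketch that encodes the claims above (RAID-6 in Aachen, daily local backups, weekly copies to Titan); the function name and return format are made up for this page.

```python
def storage_outcome(failed_disks_aachen, aachen_storage_destroyed=False):
    """Return (recovery action, maximum data loss in days) for a failure in Aachen.

    Encodes the evaluation above: RAID-6 survives two failed disks, the local
    daily backup covers a third, and the weekly copy on Titan covers a total loss.
    """
    if aachen_storage_destroyed:
        return ("restore from weekly backup copy on Titan", 7)
    if failed_disks_aachen <= 2:
        return ("none needed, RAID-6 keeps running", 0)
    return ("restore from local daily backup", 1)


if __name__ == "__main__":
    for disks in (1, 2, 3):
        print(disks, storage_outcome(disks))
    print("total loss", storage_outcome(0, aachen_storage_destroyed=True))
```

Note that the requirement of tolerating 3 concurrent disk failures is only met through the backup path, with up to a day of data lost, not by the RAID array itself.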
Pro
- Relatively high availability and data durability guarantees
- Everything in our hands
- Cheap, we already have the hardware
- Scaling possible by adding more nodes into Aachen AZ
Contra
- Synchronization traffic between the databases might be problematic (no idea how much that will be in practice) [6]
- Not sure whether a failure of Aachen could be recovered automatically by Kubernetes [9]
- High configuration overhead on databases
Solution 2 - Everything on Azure
This is a pretty straightforward solution. We can use Docker Swarm to set up a couple of worker nodes and otherwise take the development compose configuration largely as it is. I believe we have some kind of sponsorship from Microsoft, so I assume we can use their services for free.
Evaluation
- We can achieve availability and fault-tolerance goals through replication
- Data durability guarantees are provided by Azure
- A failure of the Azure AZ would disable our system completely
- If we lose the sponsorship, we might incur high charges
- The docker-compose configuration could be adapted to a swarm setup with minimal effort
Pro
- Low configuration overhead
- Can meet any kind of availability goal, even if AEGEE grows exponentially
Contra
- Completely dependent on Microsoft for privacy (GDPR), sponsorship (finances), and data (we don't hold our data anywhere ourselves)
- No AZ failure protection
Solution 3 - Master/Slave setup with Azure being master and Aachen slave
Same as Solution 1, but with the roles swapped.
Evaluation
- Combines Solutions 1 and 2
- Azure is more reliable than Aachen, which increases availability
- Easily scalable
- Configuration as in Solution 1
Pro
- AZ failure protection
- We also keep a copy of our data ourselves
- Highest availability of all solutions
Contra
- High configuration overhead
- Dependent on Microsoft for privacy (GDPR) and sponsorship