Production Environment Architecture Proposal

This page collects the requirements for a production environment and proposes architectures to put them into practice.

Workload Requirements

1. An average user generates approx. 100 requests per minute
2. At peak times we expect at most 2500 active users
3. Write load is relatively low, around 1 request per minute per user
4. Initial page load is 2 MB; each further request averages 3 KB (this might change dramatically once image support is introduced)
5. Users have low tolerance for data corruption or loss (aim: tolerate 3 concurrent disk failures, keep a backup log to counter corruption)
6. Users have medium tolerance for availability outages (aim: no prolonged outages; a few seconds are acceptable)
7. We should use as much of our existing infrastructure as possible to reduce costs
8. At least two availability zones (AZs) for resilience against natural disasters
9. Automated recovery from failures, since we do not have a sysadmin who is always available
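
To get a feeling for what these figures mean in aggregate, here is a rough back-of-the-envelope sketch. It uses only the numbers listed above; the function and constant names are made up for illustration, 1 KB is assumed to be 1000 bytes, and the 2 MB initial page loads are deliberately left out because we do not know how often they occur.

```python
# Rough peak-load estimate from the workload requirements.
# All names here are illustrative; the numbers come from the list above.
REQUESTS_PER_USER_PER_MIN = 100  # average user workload
WRITES_PER_USER_PER_MIN = 1      # write load per user
PEAK_ACTIVE_USERS = 2500         # expected maximum at peak times
AVG_REQUEST_KB = 3               # average size of a follow-up request


def peak_load(users=PEAK_ACTIVE_USERS):
    """Return aggregate request rates and bandwidth for `users` active users."""
    total_rps = users * REQUESTS_PER_USER_PER_MIN / 60
    write_rps = users * WRITES_PER_USER_PER_MIN / 60
    # Assumption: 1 KB = 1000 bytes, so KB/s * 8 / 1000 gives Mbit/s.
    bandwidth_mbit = total_rps * AVG_REQUEST_KB * 8 / 1000
    return {
        "total_rps": total_rps,
        "write_rps": write_rps,
        "read_rps": total_rps - write_rps,
        "bandwidth_mbit": bandwidth_mbit,
    }


if __name__ == "__main__":
    for key, value in peak_load().items():
        print(f"{key}: {value:.1f}")
```

Under these assumptions the peak works out to roughly 4200 requests per second overall, of which only about 42 per second are writes, and around 100 Mbit/s of traffic for follow-up requests. The very low write share is what makes a read-only replica attractive.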

Solution 1 - Master/Slave with Master in Aachen and Slave on Titan

We currently rent a machine named Titan, and according to Ingo, Aachen will soon get a new, relatively fast server. We could use both to build a master/slave architecture, with the master in Aachen and a read-only slave on Titan.

I do not currently know the detailed specs of either machine, nor the uplink each server will have. As far as I know, Aachen is connected to the DFN network and Titan is professionally hosted, so both uplinks should be sufficient.

Global DNS: two entries, one for Aachen and one for Titan, effectively splitting requests 50/50
Service discovery: make sure requests for a service resolve to the local node/AZ. Traefik needs to be able to see the remote network
Reverse proxying: make sure core and alastair on Titan receive only GET requests; reverse-proxy everything else to Aachen (the latency hit on writes is tolerable)
MongoDB: create a 3-node replica set with Titan as a priority-0 member (it cannot become primary) and Aachen hosting the primary and one arbiter
PostgreSQL: create a master/slave setup with the master in Aachen
Virtual networks: unencrypted overlay within each node/AZ; encrypted overlay between the nodes, containing the reverse proxies and databases
Storage: RAID 6 with daily backups in Aachen; Titan can be ephemeral. Weekly backup copies from Aachen to Titan (in case Aachen burns down)
Orchestrator: Kubernetes with the master in Aachen
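
The reverse-proxying rule above can be sketched as a small routing decision. This is only an illustration of the intended behaviour, not Traefik configuration; the function name and the AZ labels `"aachen"` and `"titan"` are assumptions made for the example.

```python
# Sketch of the read/write split: reads stay local, writes go to the master AZ.
# The AZ labels and function name are hypothetical, chosen for illustration.
MASTER_AZ = "aachen"
READ_ONLY_METHODS = {"GET"}  # the proposal only lets GET requests stay local


def route(method: str, local_az: str) -> str:
    """Return the AZ whose backend should handle a request arriving at `local_az`."""
    if local_az == MASTER_AZ:
        # The master AZ serves everything itself.
        return MASTER_AZ
    if method.upper() in READ_ONLY_METHODS:
        # Reads are served from the local read-only replica.
        return local_az
    # Anything that may write is proxied to the master AZ.
    return MASTER_AZ


if __name__ == "__main__":
    for method in ("GET", "POST", "PUT", "DELETE"):
        print(method, "arriving at titan ->", route(method, "titan"))
```

Treating every non-GET method as a potential write is the conservative choice here: it costs some latency for requests that would have been harmless, but it never sends a write to the read-only replica.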

Evaluation

We assume the two machines can handle the combined workload [1, 2, 3, 4]
A failure of Aachen would make the whole system read-only [6, 7]
A failure of Titan would not affect the system
A network split would make Titan read-only and would not affect requests to Aachen; split-brain situations are not possible
A simultaneous failure of 2 live HDDs can be tolerated in Aachen
A simultaneous failure of 3 live HDDs can be recovered from the local backup, losing at most a day of data
A complete failure of all storage in Aachen can be recovered from the Titan backup, losing at most a week of data
Any HDD failure on Titan would not affect the system
Kubernetes would provide automated recovery from service crashes [9]
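
The claims about Aachen failures and network splits follow from how MongoDB elects a primary: a member can only become primary if it can reach a majority of the voting members. A minimal sketch of that reasoning for the proposed 3-member layout is below. The host names and the dictionary layout are hypothetical and only model the idea; this is not the actual replica set configuration document.

```python
# Model of the proposed MongoDB layout: primary + arbiter in Aachen,
# priority-0 data node on Titan. Host names are made up for illustration.
MEMBERS = [
    {"host": "aachen-data", "az": "aachen", "priority": 1, "arbiter": False},
    {"host": "aachen-arbiter", "az": "aachen", "priority": 0, "arbiter": True},
    {"host": "titan-data", "az": "titan", "priority": 0, "arbiter": False},
]


def can_elect_primary(reachable_azs):
    """Return True if the members in `reachable_azs` can elect a primary."""
    reachable = [m for m in MEMBERS if m["az"] in reachable_azs]
    # A strict majority of all voting members is required.
    has_majority = len(reachable) > len(MEMBERS) // 2
    # Arbiters and priority-0 members can vote but never become primary.
    has_candidate = any(
        m["priority"] > 0 and not m["arbiter"] for m in reachable
    )
    return has_majority and has_candidate


if __name__ == "__main__":
    print("all healthy:", can_elect_primary({"aachen", "titan"}))
    print("Titan down / split, Aachen side:", can_elect_primary({"aachen"}))
    print("Aachen down / split, Titan side:", can_elect_primary({"titan"}))
```

In this model the Aachen side always keeps two of three votes, so it stays writable during a split, while Titan on its own has neither a majority nor an eligible candidate and can only serve reads. That is also why a failure of Aachen leaves the system read-only rather than promoting Titan.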

Pro

Relatively high availability and data guarantees
Everything in our hands
Cheap, we already have the hardware
Scaling is possible by adding more nodes to the Aachen AZ

Contra

Synchronization traffic between the databases might be problematic (no idea how much it will effectively be) [6]
Not sure whether a failure of Aachen could be recovered automatically by Kubernetes [9]
High configuration overhead on databases

Solution 2 - Everything on Azure

This is a pretty straightforward solution. We can use Docker Swarm to span a couple of worker nodes and otherwise take the development Compose configuration largely as it is. I believe we have some kind of sponsorship from Microsoft, so I assume we can use their services for free.

Evaluation

We can achieve the availability and fault-tolerance goals through replication
Data guarantees are provided by Azure
A failure of the Azure AZ would take our system down completely
If we lose the sponsorship, we might incur high charges
The docker-compose configuration could be adapted into a swarm setup with minimal effort

Pro

Low configuration overhead
Can meet any availability goal, even if AEGEE grows exponentially

Contra

Completely dependent on Microsoft:
- privacy (GDPR)
- sponsorship (finances)
- data (we do not hold a copy of our data anywhere else)
No AZ failure protection

Solution 3 - Master/Slave setup with Azure being master and Aachen slave

Same as Solution 1, but with the roles swapped: Azure hosts the master and Aachen the slave.

Evaluation

Combines Solutions 1 and 2
Azure is more reliable than Aachen, which increases availability
Easily scalable
Configuration as in Solution 1

Pro

AZ failure protection
We also keep a copy of our data ourselves
Highest availability of all solutions

Contra

High configuration overhead
Dependent on Microsoft:
- privacy (GDPR)
- sponsorship