Production Environment Architecture Proposal

This page collects the requirements for a production environment and proposes architectures to put them into practice.


Workload Requirements

  1. An average user generates ca. 100 requests per minute
  2. At peak times we expect max. 2500 active users
  3. Write load is relatively low, around 1 request per minute per user
  4. The initial page load is 2 MB, each further request averages 3 KB (might change dramatically once image support is introduced)
  5. Users have low tolerance for data corruption or loss (aim: tolerate 3 concurrent disk failures, keep a backup log to counter corruption)
  6. Users have medium tolerance for availability outages (aim: no prolonged outages, a couple of seconds are okay)
  7. We should use as much of our existing infrastructure as possible to reduce costs
  8. At least two availability zones (AZs) for resilience against natural disasters
  9. Automated recovery from failures, since we don't have an always-available sysadmin
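
To get a feeling for the numbers, requirements 1 to 4 can be turned into a rough peak-load estimate. The sketch below is only back-of-the-envelope arithmetic under a few assumptions that are not stated above: all 2500 peak users are active at the average rate at the same time, the 1 write per minute is part of the 100 requests per minute, 1 KB is counted as 1000 bytes, and the 2 MB initial page load is ignored. The function and constant names are made up for illustration.

```python
# Rough peak-load estimate derived from requirements 1-4.
PEAK_USERS = 2500                  # requirement 2
REQUESTS_PER_USER_PER_MIN = 100    # requirement 1
WRITES_PER_USER_PER_MIN = 1        # requirement 3 (assumed to be part of the 100)
AVG_RESPONSE_KB = 3                # requirement 4, initial 2 MB load ignored


def peak_load(users=PEAK_USERS):
    """Return the estimated peak request rates and outbound bandwidth."""
    total_rps = users * REQUESTS_PER_USER_PER_MIN / 60
    write_rps = users * WRITES_PER_USER_PER_MIN / 60
    # KB/s -> Mbit/s, assuming 1 KB = 1000 bytes
    bandwidth_mbit_s = total_rps * AVG_RESPONSE_KB * 8 / 1000
    return {
        "total_rps": total_rps,
        "write_rps": write_rps,
        "read_rps": total_rps - write_rps,
        "bandwidth_mbit_s": bandwidth_mbit_s,
    }


if __name__ == "__main__":
    for name, value in peak_load().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions the peak is roughly 4170 requests per second in total, of which about 42 are writes, and about 100 Mbit/s of outbound traffic. Image support would change the bandwidth figure considerably.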


Solution 1 - Master/Slave with Master in Aachen and Slave on Titan

We currently rent a machine named Titan, and according to Ingo, Aachen will soon get a new, relatively fast server. We could use the two to build a master/slave architecture with the master in Aachen and a read-only slave on Titan.

I currently know neither the detailed specs of the two machines nor the uplink each server will have, but afaik Aachen is connected via DFN and Titan is professionally hosted, so both uplinks should be sufficient.

  • DNS global: Two entries, one for Aachen and one for Titan, effectively splitting requests 50/50
  • Service discovery: Make sure requests for a service resolve to the local node/AZ. Traefik needs to be able to see the remote network
  • Reverse proxying: Make sure core and alastair receive nothing but GET requests, and reverse-proxy everything else to Aachen (the latency hit on writes is not too tragic)
  • MongoDB: Create a 3-node distributed setup with Titan being a priority: 0 node (won't become primary) and Aachen hosting one arbiter and the primary
  • PostgreSQL: Create a master-slave setup with the master in Aachen
  • Virtual networks: Unencrypted overlay within a node/AZ, encrypted overlay between the nodes, containing the reverse proxies and databases
  • Storage: RAID-6 with daily backups in Aachen, Titan can be ephemeral. Weekly backup copies from Aachen to Titan (in case Aachen burns down)
  • Orchestrator: Kubernetes with master in Aachen
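
The read/write split from the reverse-proxying item can be sketched as a small routing rule. This is not an actual Traefik configuration, only an illustration of the decision the proxies would have to make. The site names and the function `route` are hypothetical, and only GET is treated as a read because that is the only method mentioned above.

```python
# Minimal sketch of the routing decision at the reverse proxy.
MASTER_SITE = "aachen"     # site hosting the writable databases
READ_METHODS = {"GET"}     # only GET is named in the proposal


def route(method: str, entry_site: str) -> str:
    """Return the site whose backend should handle the request.

    entry_site is the site the client reached through the 50/50 DNS split.
    """
    if method.upper() in READ_METHODS:
        return entry_site      # reads are served by the local replica
    return MASTER_SITE         # everything else is proxied to the master


if __name__ == "__main__":
    print(route("GET", "titan"))    # served locally on Titan
    print(route("POST", "titan"))   # forwarded to Aachen
```

A request that enters through Titan therefore only pays the cross-site latency when it is not a GET.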

Evaluation

  • This assumes the two machines can handle the combined workload [1, 2, 3, 4]
  • A failure of Aachen would turn the whole system read-only [6, 7]
  • A failure of Titan would not affect the system
  • A network split would turn Titan read-only and not affect requests to Aachen; no split-brain situation is possible
  • A simultaneous failure of 2 live HDDs can be tolerated in Aachen
  • A simultaneous failure of 3 live HDDs can be recovered from the local backup, losing max. a day
  • A complete failure of all storage in Aachen can be recovered from the Titan backup, losing max. a week
  • Any HDD failure on Titan would not affect the system
  • Kubernetes would solve automated recovery in case of service crashes [9]
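
The failure cases for the MongoDB setup follow from majority voting: a primary can only exist on the side of a partition that sees a strict majority of the voting members and contains a data-bearing member with a priority above 0. The sketch below is a simplified model of that rule, not MongoDB's actual election code. The host names are placeholders, and it assumes each of the three members holds one vote.

```python
# Simplified model of the proposed 3-member setup; host names are placeholders.
MEMBERS = [
    {"host": "aachen-db", "site": "aachen", "votes": 1, "priority": 1, "arbiter": False},
    {"host": "aachen-arbiter", "site": "aachen", "votes": 1, "priority": 0, "arbiter": True},
    {"host": "titan-db", "site": "titan", "votes": 1, "priority": 0, "arbiter": False},
]


def can_elect_primary(reachable_sites):
    """Check whether the given set of mutually reachable sites can hold a primary."""
    total_votes = sum(m["votes"] for m in MEMBERS)
    visible = [m for m in MEMBERS if m["site"] in reachable_sites]
    # A strict majority of all votes is required, not just of the visible ones.
    has_majority = 2 * sum(m["votes"] for m in visible) > total_votes
    # Arbiters and priority-0 members vote but never become primary.
    has_candidate = any(m["priority"] > 0 and not m["arbiter"] for m in visible)
    return has_majority and has_candidate


if __name__ == "__main__":
    print("Aachen alone:", can_elect_primary({"aachen"}))
    print("Titan alone:", can_elect_primary({"titan"}))
```

In this model Aachen alone keeps a writable primary (2 of 3 votes), while Titan alone cannot elect one (1 of 3 votes, no electable member). That matches the evaluation: a network split or a failure of Aachen leaves Titan read-only, and two primaries can never exist at once.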

Pro

  • Relatively high availability and data promises
  • Everything in our hands
  • Cheap, we already have the hardware
  • Scaling is possible by adding more nodes to the Aachen AZ

Contra

  • Synchronization traffic between the databases might be problematic (no idea how much that will effectively be) [6]
  • Not sure whether a failure of Aachen could be recovered automatically by Kubernetes [9]
  • High configuration overhead on databases


Solution 2 - Everything on Azure

This is a pretty straightforward solution. We can use Docker Swarm to span a couple of worker nodes and basically take the development compose configuration as it is. I believe we have some kind of sponsorship from Microsoft, so I assume we can use their services for free.


Evaluation

  • We can achieve availability and fault-tolerance goals through replication
  • Data promises made by Azure
  • A failure of the Azure AZ would disable our system completely
  • In case we lose the sponsorship we might incur high charges
  • The docker-compose configuration could be adapted to a swarm setup with minimal effort

Pro

  • Low configuration overhead
  • Can meet any kind of availability goal, even if AEGEE grows exponentially

Contra

  • Completely dependent on Microsoft:
    • privacy (GDPR)
    • sponsorship (finances)
    • data (we hold no copy of our data ourselves)
  • No AZ failure protection

Solution 3 - Master/Slave setup with Azure being master and Aachen slave

Same as Solution 1, but with the roles swapped: Azure hosts the master and Aachen the slave.

Evaluation

  • Combines Solutions 1 and 2
  • Azure is more reliable than Aachen → increased availability
  • Easily scalable
  • Configuration as in Solution 1

Pro

  • AZ failure protection
  • We also hold a copy of our data ourselves
  • Highest availability of all solutions

Contra

  • High configuration overhead
  • Dependent on Microsoft:
    • privacy (GDPR)
    • sponsorship