Region-loss recovery drill, summer edition

Last Tuesday at 02:00 UTC we pulled the eu-west region offline as a scheduled drill. Total outage from the customer's perspective: 41 seconds (load balancer failover) plus 6 minutes 33 seconds of degraded write throughput while the secondary region took up the load.

What worked

The failover path itself ran exactly as designed. The synthetic probes detected the outage in 8 seconds; the load balancer drained the failed region in 33 seconds.

What did not

The dashboards we relied on for incident triage were partially served from the failed region. Three of seven critical dashboards were unavailable for the first 90 seconds of the drill.

← All posts