Postmortem on the September edge node shutdown
On September 27 at 14:12 UTC our European edge fleet went into a
rolling restart that we had marked as zero-downtime. It was not. The
fleet returned 5xx errors for 41 minutes before traffic stabilised.
What happened
The restart hook drained connections in batches of four nodes at a
time. The depth of the shared queue meant that in-flight requests
arriving at a draining node had nowhere to fail over because all four
peers in the same rack were also draining. The intended overlap
window was 30 seconds; actual overlap was 4 minutes 12 seconds.
Fixes
We have moved to per-rack rolling restarts, capped concurrent drains
at 25% of any rack, and added a synthetic probe that fails the deploy
if more than two peers in a rack report draining at once.
← All posts