Slack Tries to Handle More Traffic, Self-Destructs Instead
Here’s a summary of the key points from the YouTube transcript:
The Incident:
- Morning: A buggy deployment caused a temporary, minor outage (3 minutes) due to a configuration change impacting Slack’s webapp tier. Autoscaling quickly addressed the initial load increase.
- Afternoon: A complete Slack outage occurred due to a cascading failure stemming from the morning’s events. Users could not connect.
- Root Cause: The root cause was a flaw in how Slack’s load balancer (HAProxy) interacted with their service discovery system (Consul). A background task updating HAProxy’s server list failed when all available slots were filled, leaving HAProxy pointing to outdated, non-existent servers. This wasn’t detected because the system had never reached this condition before.
Technical Details:
- Autoscaling: Slack’s aggressive autoscaling, initially beneficial, exacerbated the problem by exceeding HAProxy’s configured capacity (N slots).
- HAProxy Configuration: Slack used a background task to update HAProxy using its runtime API. The crucial bug was in the order of operations: adding new servers failed (due to full slots), preventing the removal of old, defunct servers. This led to HAProxy gradually becoming out of sync with the actual webapp instances.
- Consul Integration: Consul tracked webapp instances and their health. The interaction between Consul and HAProxy was flawed due to the faulty update logic.
- Monitoring Failure: Monitoring failed to detect the issue because: 1) the system had never experienced this high a load before, so no one tested this failure condition; 2) the monitoring system was scheduled for replacement soon.
Solutions & Aftermath:
- Immediate Fix: A rolling restart of HAProxy instances resolved the immediate outage.
- Short-term Code Fix: Adjusting the order of operations in the server-update script (cleanup before adding servers) or preventing early exit on failure would have avoided the outage.
- Long-term Solution: Slack replaced HAProxy with Envoy Proxy.
Overall: The incident highlights the risks of aggressive autoscaling without proper consideration of all system components and the crucial importance of comprehensive monitoring and robust failure testing, even for systems that have historically functioned reliably. The original configuration error triggered a chain of events which exposed a latent flaw in the load balancing system, ultimately leading to a major outage.