Executive Summary
We would like to extend our sincerest apologies for the Thunderstorm outage on July 23, which caused a viewership drop across multiple channels. We fully recognize the trust you place in us and deeply regret the inconvenience this may have caused.
We experienced a service disruption caused by an internal system (called Linkerd) that helps our services communicate with each other. A sudden spike in activity overwhelmed part of this system, making it difficult for critical services to connect and operate.
We are committed to learning from this event and reinforcing the resilience of our systems and operational processes to prevent such issues in the future. If you have any further questions or would like to discuss this matter in more detail, please feel free to contact our Support team.
Root Cause Analysis (RCA) and Action Plan
Summary of Incident
On July 23, 2025, starting at approximately 11:05 AM EDT (20:35 IST), we observed significant viewer drops on channels hosted in the US E1N2 cluster. Services began stabilizing by approximately 12:15 PM EDT (21:45 IST) after corrective actions.
Timeline of Events
11:05 AM EDT: Major viewer drops detected by monitoring tools.
11:30 AM EDT: After initial analysis, the L2 team escalated the issue to L3 engineering.
11:35 AM - 12:15 PM EDT: Engineering teams migrated high-viewership channels to alternate clusters and added resources to stabilize the Linkerd pods.
Root Cause
The issue originated from instability in the Linkerd load balancer's control plane, which is critical for managing connectivity between microservices. Failures in the Linkerd pods broke service discovery and connectivity checks, which in turn impacted service availability.
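To illustrate why a control-plane failure cascades in this way, the following is a minimal sketch: when service discovery cannot resolve a destination, every dependent request fails, even though the destination service itself may be healthy. All names here (resolve, call_service, the registry dict) are hypothetical and stand in for the actual mesh machinery.

```python
class DiscoveryError(Exception):
    pass

def resolve(service, registry):
    """Look up a service's address via the control plane (modeled as a dict)."""
    if service not in registry:
        raise DiscoveryError(f"no route for {service}")
    return registry[service]

def call_service(service, registry):
    """A request can only proceed if discovery succeeds first."""
    try:
        addr = resolve(service, registry)
        return f"connected to {service} at {addr}"
    except DiscoveryError:
        return f"request to {service} failed: discovery unavailable"

# Healthy control plane: requests succeed.
healthy = {"playout": "10.0.0.5:8080"}
print(call_service("playout", healthy))

# Degraded control plane (empty registry): every dependent request fails.
print(call_service("playout", {}))
```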
Immediate Remediation
Diverted user traffic to other Thunderstorm clusters.
Allocated additional resources to the Linkerd pods to restore stability and channel services.
Preventive Measures
Enhanced Monitoring:
Infrastructure alert thresholds have been adjusted for earlier detection of symptoms.
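The kind of threshold-based detection described above can be sketched as follows. This is illustrative only; the 30% drop threshold, the baseline figure, and the function name are hypothetical, not the actual alerting configuration.

```python
def should_alert(viewer_counts, baseline, drop_threshold=0.30):
    """Fire an alert when the most recent viewership sample falls more than
    drop_threshold below the expected baseline (hypothetical 30% default)."""
    current = viewer_counts[-1]
    drop = (baseline - current) / baseline
    return drop > drop_threshold

# Baseline of 100k viewers; a fall to 60k is a 40% drop -> alert fires.
print(should_alert([95_000, 90_000, 60_000], baseline=100_000))  # True

# Normal fluctuation (4% drop) stays below the threshold -> no alert.
print(should_alert([98_000, 97_000, 96_000], baseline=100_000))  # False
```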
Load Balancer (Linkerd) Improvements:
Upgrading Linkerd to the latest stable version for improved stability.
Integrating Linkerd-specific metrics into monitoring dashboards.
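One metric such dashboards typically derive is per-route success rate from the proxy's response counters (Linkerd's proxies export a `response_total` counter labeled by classification). A minimal sketch of that computation, with hypothetical counter values:

```python
def success_rate(responses):
    """Compute a route's success rate from response counters keyed by
    classification ('success' / 'failure'), as a fraction of total traffic."""
    total = sum(responses.values())
    if total == 0:
        return None  # no traffic observed in this window
    return responses.get("success", 0) / total

# A route serving 2% failures over the sample window:
print(success_rate({"success": 980, "failure": 20}))  # 0.98
```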
Network Load Balancer (NLB) Implementation:
Introducing AWS Network Load Balancers (NLBs) for better resiliency.
Configuring critical services to use NLBs, with Linkerd as a failover mechanism.
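The primary/failover ordering described above can be sketched as simple path-selection logic: prefer the NLB, and fall back to the Linkerd route only when the NLB health check fails. The function and its health-flag inputs are illustrative, not the actual routing implementation.

```python
def choose_path(nlb_healthy, linkerd_healthy):
    """Prefer the AWS NLB; fall back to the Linkerd route only when the
    NLB health check fails. Raises if neither path is available."""
    if nlb_healthy:
        return "nlb"
    if linkerd_healthy:
        return "linkerd"
    raise RuntimeError("no healthy network path")

print(choose_path(True, True))    # nlb    (NLB preferred even when both are up)
print(choose_path(False, True))   # linkerd (failover path)
```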
Cluster-level Failover Feature:
Implementing a new failover capability that bypasses Server-Side Ad Insertion (SSAI) during significant disruptions, ensuring maximum service availability.
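The bypass behavior amounts to a fallback in URL selection: when the SSAI path is degraded, playback is served from the origin stream directly, without ad insertion. A minimal sketch, with hypothetical URLs:

```python
def playback_url(origin_url, ssai_url, ssai_healthy):
    """During a major disruption, bypass Server-Side Ad Insertion and
    serve the origin stream directly so playback continues."""
    return ssai_url if ssai_healthy else origin_url

# Hypothetical endpoints for illustration only.
origin = "https://origin.example.com/channel1/master.m3u8"
ssai = "https://ssai.example.com/channel1/master.m3u8"

print(playback_url(origin, ssai, ssai_healthy=True))   # normal path via SSAI
print(playback_url(origin, ssai, ssai_healthy=False))  # falls back to origin
```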
We sincerely apologize for any inconvenience caused and remain committed to continuous improvement in reliability and performance. Please reach out to support@amagi.com with any further questions or concerns.