Surviving Oncall at Amazon | Realistic Week in the Life of a Software Engineer

Summary

This YouTube transcript chronicles a week in the life of a Software Engineer at Amazon during their on-call rotation. The engineer, let’s call him the “Vlogger,” documents the challenges, stress, and learning experiences associated with being on call.

Key Events and Themes:

  • Introduction to On-Call: The video starts with the Vlogger explaining what being on call means: being the first line of defense for system issues and emergencies, which requires being available at all times. He emphasizes the stress and potential disruption to personal life, recalling a past on-call shift in which he worked from one morning to the next without even eating.
  • First On-Call Incident (Past): The Vlogger recounts a particularly stressful first on-call experience where his team was internally DDoS’d by another team, causing a service outage. He was a new team member (3 months in), his reverse shadow (senior engineer support) was sick, and his manager (stepping in as reverse shadow) had him drive the incident response, screen sharing and coding hotfixes in front of senior engineers. This event is described as one of his top three most stressful experiences.
  • Ops Meeting: The Vlogger describes the weekly Ops meeting, a post-on-call wrap-up session where the previous on-call engineer presents a document summarizing incidents, learnings, and hands over unresolved tickets to the next on-call engineer.
  • Daily On-Call Tasks: Beyond major incidents, the Vlogger explains that on-call duties also include handling support tickets (bugs, user questions) and operational maintenance (“janitor work”) to keep systems running smoothly.
  • First Day of Current On-Call: The video follows the first day of the current on-call week. He works on tickets in the evening, including a DynamoDB issue escalated by an internal customer. He struggles to resolve it, works late into the night (until 12:30 AM), and eventually gets help from his manager to reroute the ticket.
  • Alarm Tickets and Troubleshooting: The Vlogger explains alarm tickets as common and challenging, triggered by system issues like service downtime or error spikes. He describes the detective-like process of troubleshooting: digging through logs, reviewing code changes, checking pipelines, and collaborating with other teams.
  • Team Support & Collaboration: Despite initially trying to solve the DynamoDB issue independently, the Vlogger highlights the importance of team support. He mentions a teammate, Aman, proactively reaching out to help late at night and emphasizes that it’s acceptable to page teammates or managers for assistance during critical issues.
  • Stand-up Meeting and Root Cause Analysis: The next day at work, during the stand-up meeting, the Vlogger discusses the ongoing high-severity issue. With team input, he finally pinpoints the root cause – a caching problem. The focus then shifts to solving the caching issue.
  • Office Hours and Ticket Volume: The Vlogger participates in an office hour session for marketing managers to answer questions about their service. He also discusses the varying intensity of on-call depending on the team, mentioning that his team is on the higher end of the spectrum (7-8 out of 10 stress level).
  • Asking Questions and Seeking Help (Pro Tip): He provides advice on how to effectively ask questions to senior engineers, emphasizing the importance of showing prior effort and explaining what has already been tried.
  • Snowboarding Trip and On-Call Coverage: The Vlogger plans a snowboarding trip for the weekend but needs to find someone to cover his on-call shift. He asks teammates for coverage due to the conflict.
  • Lunch and Learn Session: He describes the bi-weekly lunch and learn session where team members share knowledge, comparing it to a school lecture.
  • Snowboarding Weekend: The video transitions to the weekend snowboarding trip, showing the Vlogger and his friends enjoying themselves. He is paged early in the morning even on the weekend, which highlights the always-on nature of on-call. Despite this, he enjoys the trip, tiring as it is.
  • Reflection and Value: The video concludes with the Vlogger reflecting on the week and on the value of showing a realistic glimpse into the life of an on-call software engineer at a tech company. He acknowledges the challenges but also the learning and growth opportunities. He promotes his Discord community and sponsor, EasyJob AI.

In essence, the transcript offers a realistic portrayal of the on-call experience for a software engineer at Amazon, showcasing the blend of technical problem-solving, stress management, teamwork, and the constant need to be available.
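
The alarm tickets described above are triggered by conditions such as error spikes. As a rough illustration of that idea only, here is a minimal Python sketch of a sliding-window error-rate check. The class name `ErrorRateAlarm`, the window size, and the threshold are all hypothetical choices made for this example; the transcript does not describe how Amazon's alarms are actually implemented.

```python
from collections import deque


class ErrorRateAlarm:
    """Hypothetical alarm: fires when the error rate over the last
    `window` requests exceeds `threshold` (illustrative values only)."""

    def __init__(self, window=100, threshold=0.05):
        self.window = window
        self.threshold = threshold
        # Keeps only the most recent `window` outcomes; True means the request failed.
        self.outcomes = deque(maxlen=window)

    def record(self, failed):
        """Record one request outcome and return True if the alarm should fire."""
        self.outcomes.append(bool(failed))
        if len(self.outcomes) < self.window:
            return False  # not enough data yet to judge
        error_rate = sum(self.outcomes) / len(self.outcomes)
        return error_rate > self.threshold


if __name__ == "__main__":
    alarm = ErrorRateAlarm(window=10, threshold=0.2)
    for _ in range(10):
        alarm.record(False)      # ten healthy requests fill the window
    for _ in range(3):
        fired = alarm.record(True)  # failures start to arrive
    print("alarm fired:", fired)
```

In a real system a fired alarm would typically open a ticket or page the on-call engineer; requiring a full window before firing is one simple way to avoid paging on a handful of early failures.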

Accuracy

The information presented in the transcript is generally accurate and aligns with established knowledge regarding software engineering on-call practices, particularly within large tech companies like Amazon. Here’s a breakdown:

  • Definition of On-Call: The description of on-call duties as the first line of defense, handling emergencies, and requiring availability is consistent with industry standards. The stress and potential for work-life disruption are also accurately depicted.
  • Incident Handling and Escalation: The recounted DDoS incident and the DynamoDB issue accurately reflect the types of problems software engineers on-call might face. The involvement of senior engineers and managers in major incidents, especially for new engineers, is also realistic. Escalation and team collaboration are crucial in such situations.
  • Reverse Shadow/Mentorship: The concept of a reverse shadow or mentor during the first on-call rotation is a common and good practice in many organizations to support new on-call engineers.
  • Ops Meetings and Post-Incident Reviews: Ops meetings or similar post-incident review meetings are standard practice for learning from incidents, sharing knowledge, and improving system reliability.
  • Support Tickets and Operational Tasks: Handling support tickets and performing operational maintenance tasks are indeed typical responsibilities for on-call engineers, alongside addressing critical incidents.
  • Alarm Tickets and Troubleshooting Process: The description of alarm tickets and the troubleshooting process (log analysis, code review, pipeline checks, team collaboration) is accurate and reflects the systematic approach required to resolve technical issues.
  • Stand-up Meetings for Blockers and Collaboration: Utilizing stand-up meetings to discuss blockers and seek team assistance is a common and effective agile practice, particularly relevant for on-call issues that might require wider team input.
  • Lunch and Learn Sessions and Knowledge Sharing: Lunch and Learn sessions are prevalent in tech companies as a way to promote continuous learning and knowledge sharing within teams.
  • Importance of Asking Effective Questions: The advice on how to ask questions by demonstrating prior effort and context is valuable and reflects professional communication best practices in engineering.
  • Finding On-Call Coverage for Time Off: The need to find coverage for on-call shifts when taking time off is a standard logistical consideration in on-call rotations.
  • Varied On-Call Intensity: The mention that on-call intensity varies based on the team and service user base (AWS vs. Audible example) is accurate. Teams managing high-traffic, critical services generally experience more intense on-call rotations.

Minor Points/Nuances:

  • The “janitor work” term might be a slightly informal way to describe operational maintenance, but the underlying concept is valid.
  • The stress levels described are subjective but relatable. On-call can be highly stressful, especially for those new to it or when dealing with complex incidents.

Overall: The transcript provides a credible and realistic depiction of software engineering on-call at a company like Amazon. It accurately reflects the responsibilities, challenges, collaborative aspects, and learning opportunities associated with this critical function in software operations.

Resources

Here are the top 5 most relevant resources to learn more about the subjects presented in the transcript:

  1. Site Reliability Engineering (SRE) books by Google:

    • Resource: “Site Reliability Engineering” and “The Site Reliability Workbook” by Google.
    • Relevance: These books are widely regarded as foundational references for SRE practices, which are deeply intertwined with on-call responsibilities in modern tech companies. They cover topics like incident management, monitoring, alerting, capacity planning, and automation, all highly relevant to the Vlogger’s experience.
    • Why it’s helpful: Provides a comprehensive and authoritative understanding of the principles and practices behind effective on-call and system reliability, directly applicable to the scenarios described in the transcript.
  2. “PagerDuty Incident Response Documentation”:

    • Resource: PagerDuty’s official documentation and learning resources on incident response.
    • Relevance: PagerDuty is a widely used incident management platform in the tech industry. Their documentation offers practical guidance on incident response workflows, best practices for on-call teams, and tools for managing alerts and escalations.
    • Why it’s helpful: Provides practical, tool-oriented knowledge about incident management, a core aspect of on-call. Learning about tools like PagerDuty and incident response processes can be very beneficial for understanding and preparing for on-call responsibilities.
  3. “Effective DevOps” by Jennifer Davis and Ryn Daniels:

    • Resource: Book “Effective DevOps: Building a Culture of Collaboration, Affinity, and Tooling at Scale.”
    • Relevance: DevOps principles emphasize collaboration, automation, and continuous improvement in software development and operations. Understanding DevOps culture and practices is crucial for effective on-call, as it promotes proactive measures to reduce incidents and improve system resilience.
    • Why it’s helpful: Provides a broader context around on-call within the DevOps framework, emphasizing the cultural and collaborative aspects of running reliable systems. It helps understand how on-call fits into a larger strategy of system reliability and team collaboration.
  4. “Software Engineering at Google: Lessons Learned from Programming Over Time” by Titus Winters, Tom Manshreck, and Hyrum Wright:

    • Resource: Book “Software Engineering at Google: Lessons Learned from Programming Over Time.”
    • Relevance: This book draws on Google’s experience to describe the culture, processes, and tools behind maintaining a large codebase over time, including testing, code review, and continuous integration. These practices shape how often on-call engineers at large tech companies get paged and how quickly they can diagnose and fix a problem.
    • Why it’s helpful: Offers real-world insights into engineering practices at scale, which gives context for the code-review, pipeline-check, and hotfix work described in the transcript. For incident management itself, the SRE books above are the more direct source.
  5. “r/devops and r/sre Subreddits on Reddit”:

    • Resource: Online communities on Reddit dedicated to DevOps and SRE (r/devops, r/sre).
    • Relevance: These online communities are active forums where professionals discuss real-world challenges, share experiences, ask questions, and provide advice related to DevOps and SRE practices, including on-call.
    • Why it’s helpful: Provides a dynamic and community-driven learning environment. You can find discussions on current trends, practical tips, and get answers to specific questions related to on-call and system reliability from experienced practitioners. It’s a good place to stay updated on industry trends and learn from peer experiences.

These resources offer a mix of theoretical foundations, practical guides, industry best practices, and community insights to comprehensively learn more about the world of software engineering on-call and related disciplines.
