Master the art of leading incident calls with clear communication, structured responses, and calm leadership
Critical Insight: Effective incident management hinges on clear, calm, and structured leadership, especially during incident calls. As the Incident Manager, your role is crucial to coordinating responses, mitigating impacts, and restoring services quickly.
Drawing on practices described in Google's "Site Reliability Engineering" and in "Accelerate" by Nicole Forsgren, Jez Humble, and Gene Kim, this guide walks through leading an incident call from start to finish.
Understanding Your Role as Incident Manager
As Incident Manager (IM), your core responsibilities are clear communication, rapid decision-making, and team coordination during incident resolution. You're the orchestrator, not necessarily the problem solver.
Key Responsibilities:
- Clearly communicate incident details
- Assign and confirm team roles
- Maintain the focus and direction of the call
- Regularly update stakeholders and management
- Ensure a blame-free, collaborative environment
Preparing for the Call
Before initiating the call:
- Assess Impact: Quickly gauge customer impact, financial implications, and urgency.
- Gather Initial Data: Obtain basic incident information (nature, timing, impact).
- Assemble Participants: Invite essential responders including Developers, Platform Engineers, Product Managers, and Customer Support.
- Set Up Communication Channels: Establish clear internal and external communication channels.
Leading the Incident Call: Detailed Step-by-Step Guide
Step 1: Clearly Open the Call
Begin succinctly to align all participants:
Incident Manager: "Hello everyone, this is [your name], the Incident Manager. We are currently facing a critical incident impacting customer logins globally, starting approximately 20 minutes ago. Our immediate goal is rapid diagnosis and resolution."
Step 2: Quickly Assign and Confirm Roles
Clearly define roles upfront:
Incident Manager: "[Name] will handle platform-level troubleshooting, [Developer Name] will examine recent deployments and changes, [Name from Customer Support] will manage customer updates, and [Name] will document our timeline and actions."
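Assigning roles is only half of this step; each person also needs to confirm. The sketch below is one possible way to track that during a call. The role names mirror the example script above, and the `RoleRoster` class and its methods are hypothetical, written for illustration rather than taken from any particular incident tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical role keys based on the example call script; adjust to your team.
REQUIRED_ROLES = ("platform", "deployments", "customer_updates", "scribe")


@dataclass
class RoleRoster:
    """Tracks who holds each role and whether they have acknowledged it."""

    assignments: dict[str, str] = field(default_factory=dict)
    confirmed: set[str] = field(default_factory=set)

    def assign(self, role: str, person: str) -> None:
        if role not in REQUIRED_ROLES:
            raise ValueError(f"unknown role: {role}")
        self.assignments[role] = person
        # A reassignment needs a fresh confirmation from the new owner.
        self.confirmed.discard(role)

    def confirm(self, role: str) -> None:
        if role not in self.assignments:
            raise KeyError(f"role not assigned: {role}")
        self.confirmed.add(role)

    def unassigned(self) -> list[str]:
        """Roles nobody holds yet."""
        return [r for r in REQUIRED_ROLES if r not in self.assignments]

    def unconfirmed(self) -> list[str]:
        """Roles that are assigned but not yet acknowledged."""
        return [
            r for r in REQUIRED_ROLES
            if r in self.assignments and r not in self.confirmed
        ]


roster = RoleRoster()
roster.assign("platform", "Ana")
roster.assign("scribe", "Ben")
roster.confirm("platform")
print("Still unassigned:", roster.unassigned())
print("Awaiting confirmation:", roster.unconfirmed())
```

Calling `unassigned()` and `unconfirmed()` before moving on to diagnosis gives the Incident Manager a quick prompt for who still needs to be named or to acknowledge.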
Step 3: Structured Communication
Systematically prompt updates to maintain clarity:
Incident Manager: "Platform team, please give us your current findings and next steps."
Platform Engineer: "We've identified elevated database latency. Running further diagnostics."
Incident Manager: "Noted. Development team, any recent deployments that might be linked?"
Step 4: Maintain Call Focus and Flow
Politely redirect off-topic discussions:
Incident Manager: "Let's save this deep dive for our post-mortem review. Right now, focus on immediate actionable steps only."
Step 5: Conduct Regular Check-ins
Regularly summarize to keep everyone aligned:
Incident Manager: "Quick update: we're now 30 minutes into the incident. Database latency is our primary issue, and we are evaluating a rollback of the most recent deployment. Customer Support, please update customers now."
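A check-in like the one above follows a repeatable shape: elapsed time, primary issue, current action, and the time of the next update. As a minimal sketch, assuming a hypothetical helper named `checkin_summary` and a 10-minute default interval (an illustrative value, not a rule), the summary could be generated like this:

```python
from datetime import datetime, timedelta


def checkin_summary(
    started_at: datetime,
    now: datetime,
    primary_issue: str,
    current_action: str,
    interval_minutes: int = 10,  # assumed default cadence
) -> str:
    """Build a one-line status summary for a regular check-in."""
    elapsed_minutes = int((now - started_at).total_seconds() // 60)
    next_update = now + timedelta(minutes=interval_minutes)
    return (
        f"{elapsed_minutes} min into the incident. "
        f"Primary issue: {primary_issue}. "
        f"Current action: {current_action}. "
        f"Next update at {next_update:%H:%M}."
    )


start = datetime(2024, 1, 1, 14, 0)
print(checkin_summary(
    started_at=start,
    now=start + timedelta(minutes=30),
    primary_issue="database latency",
    current_action="running diagnostics",
))
# prints: 30 min into the incident. Primary issue: database latency. Current action: running diagnostics. Next update at 14:40.
```

Stating the next update time out loud sets an expectation for stakeholders and lets responders work uninterrupted in between.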
Step 6: Quick, Decisive Leadership
Make timely decisions when necessary:
Incident Manager: "Proceed with deployment rollback immediately. Platform team, monitor closely and update us every five minutes."
Step 7: Foster a Calm, Blame-Free Atmosphere
Encourage collaboration:
Incident Manager: "Stay calm and collaborative—we're handling a challenging situation together. We'll analyze thoroughly post-incident. Your efforts are greatly appreciated."
Step 8: Explicit Resolution Confirmation
Clearly announce resolution:
Incident Manager: "Good news: the rollback has resolved our latency issues. Systems are stable, and monitoring confirms recovery. Let's communicate this externally and begin drafting our post-incident notes."
Step 9: Immediate Post-Incident Actions
Quickly address post-resolution tasks:
- Confirm everyone understands the resolution clearly.
- Communicate externally to customers and stakeholders.
- Schedule a thorough post-mortem.
Conducting an Effective Post-Mortem
High-performing organizations leverage incidents for continuous improvement. Your post-mortem should include:
- Incident timeline and summary
- Root cause analysis
- Lessons learned
- Concrete action items
- Clear ownership for follow-up tasks
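The checklist above can be treated as a simple completeness check before a post-mortem is published. The following is a sketch only: the `PostMortem` and `ActionItem` classes and their field names are hypothetical and simply mirror the list, not the schema of any specific tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ActionItem:
    description: str
    owner: str = ""  # empty string means nobody owns the follow-up yet


@dataclass
class PostMortem:
    summary: str = ""
    timeline: list[str] = field(default_factory=list)
    root_cause: str = ""
    lessons_learned: list[str] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of sections that are still empty."""
        sections = {
            "summary": self.summary,
            "timeline": self.timeline,
            "root_cause": self.root_cause,
            "lessons_learned": self.lessons_learned,
            "action_items": self.action_items,
        }
        return [name for name, value in sections.items() if not value]

    def unowned_action_items(self) -> list[str]:
        """Descriptions of action items with no assigned owner."""
        return [
            item.description
            for item in self.action_items
            if not item.owner.strip()
        ]


pm = PostMortem(
    summary="Login outage",
    timeline=["14:00 alert fired"],
    root_cause="Elevated database latency after a deployment",
    lessons_learned=["Alert on latency earlier"],
    action_items=[ActionItem("Add latency alert", "Ana"), ActionItem("Review rollback runbook")],
)
print("Missing sections:", pm.missing_sections())
print("Action items without an owner:", pm.unowned_action_items())
```

A review meeting can end once both lists are empty, which makes "clear ownership" something you can verify rather than assume.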
As Google's SRE book puts it, a blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had.
Real-Life Incident Manager Quotes
Use clear language throughout the call:
"Can you summarize your next steps clearly?"
"What do you need right now to resolve this issue?"
"Let's set clear timelines for updates—every 10 minutes."
"Any blockers or external resources you need?"
Tools and Resources for Incident Managers
Essential tools include:
- Collaboration: Slack, Microsoft Teams
- Monitoring: Datadog, New Relic, Prometheus
Recommended reading:
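- Site Reliability Engineering (the Google SRE book), edited by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy
- Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim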
- The Phoenix Project by Gene Kim, Kevin Behr, George Spafford
Common Pitfalls to Avoid
- Ambiguity: Clearly define responsibilities and tasks to avoid confusion.
- Delayed Communication: Promptly communicate both internally and externally to maintain trust.
- Blame Culture: Foster a supportive atmosphere to encourage transparency and rapid resolution.
Final Thoughts
Leading incident calls effectively demands preparation, clarity, calm decision-making, and strong leadership. Your role is pivotal in resolving incidents quickly and leveraging them to improve system resilience. Remember, incident management success extends beyond technical resolution—it's about fostering teamwork, learning, and continuous improvement.