Here's a quick guide to effective incident communication:
- Plan Ahead
- Set Up Clear Communication Channels
- Use a Tiered Alert System
- Provide Timely and Transparent Updates
- Designate Spokespersons and Roles
- Conduct Post-Incident Reviews
Key points:
- Create a detailed plan before incidents occur
- Use multiple communication channels (status page, email, social media)
- Categorize issues by severity and respond accordingly
- Update stakeholders frequently and honestly
- Assign specific roles for communication during incidents
- Learn from each incident to improve future responses
Practice | Purpose | Example |
---|---|---|
Plan Ahead | Prepare for quick response | Pre-written templates |
Clear Channels | Reach all stakeholders | Status page, email, social media |
Tiered Alerts | Prioritize response | P1 (critical) to P4 (minor) |
Timely Updates | Keep everyone informed | Updates every 30 minutes for major issues |
Designated Speakers | Ensure consistent messaging | Tech lead for IT teams, PR for media |
Post-Incident Review | Improve future responses | Team debrief, customer feedback |
This guide helps maintain trust, minimize disruption, and improve incident management.
Basics of Incident Communication
Main Parts of Incident Communication
Incident communication has three key phases:
- First contact: Initial alert about the incident
- Regular updates: Ongoing communication during the incident
- Resolution and post-mortem: Final update and analysis after the incident
To communicate well during incidents:
- Pick a spokesperson or team to handle messages
- Use many channels to reach everyone affected
- Adjust messages for different groups (teams, customers, public)
- Keep information clear and consistent across all platforms
Common Problems When Communicating During Issues
Here are some challenges in incident communication and how to fix them:
Problem | Impact | Fix |
---|---|---|
Slow response | Users get upset | Set up a quick alert system |
Mixed messages | People get confused | Use pre-made templates and one main info source |
Not being open | Hurts company image | Tell people what happened, why, and how you're fixing it |
Not enough updates | People worry and guess | Set a schedule for regular updates |
Too much tech talk | Non-tech people don't understand | Write messages for each group |
To avoid these problems:
- Make a full incident plan before issues happen
- Give clear jobs to communication teams
- Write message templates for different problems
- Use tools like Statuspage for easy updates
- Look at what happened after each incident to do better next time
Real-World Example: Facebook's 2010 Outage
In 2010, Facebook had a big outage that lasted about 2.5 hours. Here's how they handled it:
- What happened: A problem with their database caused the site to go down
- How they communicated: After fixing the issue, a Facebook engineer wrote a detailed post
- What the post included:
  - An apology to users
  - An explanation of what went wrong
  - Steps taken to prevent future incidents
This approach helped Facebook:
- Show they cared about users
- Explain the problem in simple terms
- Build trust by being open about the issue
Tips for Better Incident Communication
- Define incidents clearly: Use a system like the 4-tier severity scale many web companies use. This helps teams know how serious a problem is.
- Prepare ahead of time: Have your communication tools, channels, and message templates ready before incidents happen.
- Tell the right people: Start with your core team, then spread the word to other staff and customers as needed.
- Be extra careful with security issues: If there's a security problem or data loss, tell everyone right away.
- Make your first update count: When you first tell people about an issue:
  - Say you know there's a problem
  - Tell them what's not working
  - Promise to give more updates soon
  - Let them know if their data is safe
1. Plan Ahead
Create a Clear Response Plan
To handle problems well, make a plan before they happen. Here's what to do:
1. Define what counts as an incident for your company
2. Set up a system to rate how bad incidents are
3. Write down how you'll respond, including:
   - Which tools you'll use to talk to people
   - What you'll say in different situations
   - Who does what when there's a problem
List Key People and Their Jobs
Make a list of who does what during an incident:
Role | Job |
---|---|
Incident Manager | Runs the whole response |
Communications Lead | Writes and sends out messages |
Technical Lead | Gives updates on fixing the problem |
Customer Support | Answers user questions |
Executive Sponsor | Approves big announcements |
Make sure everyone knows their job and how to do it when there's a problem.
Make Message Templates
Write messages ahead of time for different types of problems. This saves time and keeps your messages consistent. Make templates for:
- Telling people about a new problem
- Giving updates
- Saying the problem is fixed
- Following up after the problem
Remember to change your messages for different people. Start with your team, then tell others as needed. For security problems or lost data, tell everyone right away.
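As a rough illustration of the four template types listed above, here is a minimal Python sketch using the standard library's `string.Template`. The template wording, the stage names (`new`, `update`, `resolved`, `follow_up`), and the field names (`service`, `detail`, `next_update`) are all hypothetical; adapt them to your own plan.

```python
from string import Template

# Hypothetical templates, one per stage of an incident.
# The wording and placeholder names are illustrative assumptions.
TEMPLATES = {
    "new": Template(
        "We are investigating an issue affecting $service. "
        "Next update by $next_update."
    ),
    "update": Template(
        "We are still working on the issue affecting $service. "
        "$detail Next update by $next_update."
    ),
    "resolved": Template(
        "The issue affecting $service has been resolved. "
        "We apologize for the disruption."
    ),
    "follow_up": Template(
        "We have published a review of the recent $service incident: $detail"
    ),
}


def render(stage: str, **fields: str) -> str:
    """Fill in the template for a stage.

    Raises KeyError if the stage is unknown or a placeholder is missing,
    so an incomplete message is never sent by accident.
    """
    return TEMPLATES[stage].substitute(**fields)


if __name__ == "__main__":
    print(render("new", service="login", next_update="14:30 UTC"))
```

Using `substitute` rather than `safe_substitute` is a deliberate choice here: a missing field fails loudly instead of leaving a raw `$placeholder` in a customer-facing message.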
Use Different Ways to Talk to People
Don't just use one way to tell people about problems. Use many, like:
- A special website just for updates
- Chat tools
- Social media
- Text messages
This helps make sure everyone gets the news.
Real Example: Slack's Big Outage
In 2021, Slack, a popular work chat app, had a big problem. Here's how they handled it:
- They used their status page to give updates every 30 minutes
- They posted on Twitter to reach more people
- They explained what went wrong in simple terms
- They said sorry and thanked people for being patient
Slack's CEO, Stewart Butterfield, said: "We know how much you depend on Slack, and we take our reliability very seriously. We're deeply sorry for this disruption to your work day."
This shows how important it is to:
- Use different ways to talk to people
- Give regular updates
- Explain things simply
- Say sorry when things go wrong
2. Set Up Clear Communication Channels
Choose the Right Communication Tools
Pick tools that help you talk to people quickly when problems happen. A status page, a dedicated website that tells everyone what's going on, is usually the most effective option.
Atlassian, a big software company, uses a status page as their main way to tell people about problems. This helps them:
- Let users sign up for updates
- Answer fewer questions from users
- Keep everyone up to date
Use Many Ways to Talk to People
Don't just use one way to tell people about problems. Use lots of ways:
Way to Talk | What It's For | Example |
---|---|---|
Status Page | Main place for updates | Statuspage |
Email | Long messages | Company email list |
Work Chat | Team updates | Slack or Microsoft Teams |
Social Media | Quick public messages | Twitter or LinkedIn |
Text Messages | Urgent alerts | Text message service |
Make sure you have backup ways to talk if the internet goes down. This could be extra internet connections or even satellite phones.
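One way to picture "use many ways, with backups" is a small fan-out function that tries every channel and reports which ones failed, so a broken channel never blocks the others. This is only a sketch: the `broadcast` function and the channel names are hypothetical, and each sender stands in for whatever real integration you use.

```python
from typing import Callable, Dict, List


def broadcast(message: str, channels: Dict[str, Callable[[str], None]]) -> List[str]:
    """Send a message to every channel and return the names of channels that failed.

    Each value in `channels` is a hypothetical sender function that raises
    an exception when delivery fails.
    """
    failed: List[str] = []
    for name, send in channels.items():
        try:
            send(message)
        except Exception:
            # Keep going: one broken channel must not block the others.
            failed.append(name)
    return failed


if __name__ == "__main__":
    def print_sender(msg: str) -> None:
        print("sent:", msg)

    def offline_sender(msg: str) -> None:
        raise ConnectionError("channel unavailable")

    failures = broadcast(
        "We are investigating login errors.",
        {"status_page": print_sender, "email": offline_sender},
    )
    print("failed channels:", failures)
```

The returned list tells the communications lead which audiences still need to be reached through a backup channel.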
Make One Place for All Messages
Have one place where your team can talk about the problem. This could be:
- A special chat room in your work chat app
- A tool like Jira Service Management
For people outside your company, put status info right on your website. This way, they can see what's happening without going to another page.
Tips for Better Communication
- Pick your tools before problems happen
- Use a status page as your main way to tell people what's going on
- Put status info on your website
- Use many ways to talk to people (email, chat, social media)
- Have a place for your team to talk about the problem
- Keep contact lists up to date
- Test your plan to make sure it works
3. Use a Tiered Alert System
Group Issues by How Serious They Are
A tiered alert system helps teams respond to problems quickly and correctly. Here's how to group issues:
Level | What It Means | Example |
---|---|---|
P1 | Very bad: Whole service down | Website crashes |
P2 | Bad: Big part not working | Can't log in |
P3 | Not great: Small problem | Search is slow |
P4 | Small issue: Doesn't hurt much | Button looks wrong |
This system helps teams know how fast to act and who to call.
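The table above could be encoded so that every team classifies issues the same way. The sketch below is one possible encoding, not a standard: the three yes/no questions used as inputs are assumptions chosen to match the examples in the table.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Four tiers matching the table above; a lower number is more severe."""
    P1 = 1  # whole service down
    P2 = 2  # a big part is not working
    P3 = 3  # small problem
    P4 = 4  # small issue that doesn't hurt much


def classify(service_down: bool, major_feature_broken: bool, degraded: bool) -> Severity:
    """Map three hypothetical yes/no checks to a tier, worst case first."""
    if service_down:
        return Severity.P1
    if major_feature_broken:
        return Severity.P2
    if degraded:
        return Severity.P3
    return Severity.P4


if __name__ == "__main__":
    # "Can't log in" from the table: the service is up, but a major feature is broken.
    print(classify(service_down=False, major_feature_broken=True, degraded=True).name)
```

Checking the worst condition first means an incident that matches several rows always gets the most serious tier.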
Steps for Each Problem Level
For each level, have clear steps:
1. P1 (Very Bad)
   - Tell everyone right away
   - Call the boss
   - Start fixing within 15 minutes
   - Update every 30 minutes
2. P2 (Bad)
   - Tell team leaders
   - Get main team together
   - Update every hour
3. P3 (Not Great)
   - Tell the team in charge
   - Plan to fix soon
   - Update daily
4. P4 (Small Issue)
   - Write it down
   - Fix when there's time
   - Check weekly
Set Up Automatic Alerts
Use tools to send alerts fast:
- Connect monitoring tools to communication tools (like PagerDuty to Slack)
- Set alerts for specific problems
- Make sure alerts escalate to the next person if no one answers
For big problems, the system could:
- Text the on-call team
- Make a new chat room
- Update the status page
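The escalation idea above (alerts move on to someone else if no one answers) can be sketched in a few lines. This is a simplified model, not how any particular paging tool works: `notify` is a hypothetical callback that pages one person and returns `True` only if they acknowledge in time.

```python
from typing import Callable, Optional, Sequence


def escalate(chain: Sequence[str], notify: Callable[[str], bool]) -> Optional[str]:
    """Page each responder in order and stop at the first acknowledgement.

    Returns the name of the responder who acknowledged, or None if the
    whole chain was exhausted without an answer.
    """
    for responder in chain:
        if notify(responder):
            return responder
    return None


if __name__ == "__main__":
    # Simulated acknowledgements: only the secondary on-call answers.
    answers = {"primary-oncall": False, "secondary-oncall": True, "manager": True}
    owner = escalate(
        ["primary-oncall", "secondary-oncall", "manager"],
        lambda person: answers[person],
    )
    print("acknowledged by:", owner)
```

A `None` result is worth treating as its own alarm, since it means nobody in the chain responded.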
Real-World Example: GitHub's 24-Hour Outage
In October 2018, GitHub had a big problem:
- What Happened: A data storage system broke
- How They Used Tiers: They called it a P1 (worst) problem
- What They Did:
  - Told users within 5 minutes on status page
  - Updated every 30 minutes
  - Fixed in 24 hours
- Result: Users trusted them more for being open
GitHub's VP of Engineering, Sam Lambert, said: "We believe in being as transparent as possible about service disruptions."
Tips for Better Alerts
- Make your tiers fit your business
- Train teams on what each tier means
- Test your system often
- Learn from each problem to make the system better
4. Give Quick and Open Updates
Update Frequency Based on Issue Severity
Match your update frequency to how bad the problem is:
Issue Level | How Often to Update |
---|---|
P1 (Very Bad) | Every 30 minutes |
P2 (Bad) | Every hour |
P3 (Not Great) | Once a day |
P4 (Small) | Once a week |
Stick to this schedule. It helps people trust you and stay informed without getting too many messages.
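To make the schedule easy to follow during an incident, the table can be turned into a small lookup that says when the next update is due. The function names below are hypothetical; the intervals are taken directly from the table above.

```python
from datetime import datetime, timedelta

# Update intervals from the table above, keyed by issue level.
UPDATE_INTERVAL = {
    "P1": timedelta(minutes=30),
    "P2": timedelta(hours=1),
    "P3": timedelta(days=1),
    "P4": timedelta(weeks=1),
}


def next_update_due(level: str, last_update: datetime) -> datetime:
    """Return the time by which the next update should be posted."""
    return last_update + UPDATE_INTERVAL[level]


def is_overdue(level: str, last_update: datetime, now: datetime) -> bool:
    """True if the promised update time has already passed."""
    return now > next_update_due(level, last_update)


if __name__ == "__main__":
    last = datetime(2024, 1, 1, 12, 0)
    print(next_update_due("P1", last))
    print(is_overdue("P1", last, now=datetime(2024, 1, 1, 12, 45)))
```

A check like `is_overdue` could feed a reminder for the communications lead, which helps when the team is busy fixing the problem.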
Be Open While Keeping Some Things Private
Tell people:
- What the problem is
- How it affects users
- What you're doing to fix it
- When you think it will be fixed (if you know)
Don't share:
- Secret security info
- Personal details
- Things you're not sure about
Keep your messages clear and factual. This stops people from getting confused or worried for no reason.
Write Clear Messages That Tell People What to Do
Make your messages easy to understand and act on:
- Sum up the problem
- Say what's happening now
- List what you're doing to fix it
- Tell users what to do (if needed)
- Say when the next update will come
Use this format:
[Problem ID]: Short description
Status: Still happening / Fixed
Impact: What's not working
What we're doing: Steps we're taking
What users should do: Actions for users (if any)
Next update: When we'll say more
This helps everyone know what's going on and what to expect.
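If updates are generated by a script or chat bot, the format above maps naturally onto a small data class. The sketch below is one possible shape; the class name, field names, and the "No action needed" default are assumptions, not part of any specific tool.

```python
from dataclasses import dataclass


@dataclass
class StatusUpdate:
    """One update in the format shown above (field names are illustrative)."""
    problem_id: str
    description: str
    resolved: bool
    impact: str
    actions: str
    user_actions: str
    next_update: str

    def render(self) -> str:
        status = "Fixed" if self.resolved else "Still happening"
        return "\n".join([
            f"[{self.problem_id}]: {self.description}",
            f"Status: {status}",
            f"Impact: {self.impact}",
            f"What we're doing: {self.actions}",
            # Fall back to an explicit "no action" line so the field is never blank.
            f"What users should do: {self.user_actions or 'No action needed'}",
            f"Next update: {self.next_update}",
        ])


if __name__ == "__main__":
    update = StatusUpdate(
        problem_id="INC-1",
        description="Login errors",
        resolved=False,
        impact="Some users cannot log in",
        actions="Rolling back the latest deploy",
        user_actions="",
        next_update="15:00 UTC",
    )
    print(update.render())
```

Because every field is required, an update cannot be sent without stating the impact and the time of the next update.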
Real-World Example: Slack's 2021 Outage
On January 4, 2021, Slack had a big outage that lasted about 4 hours. Here's how they handled it:
- They posted updates on their status page every 30 minutes
- They used Twitter to reach more people, with 9 tweets during the outage
- They explained the problem simply: "Customers may have trouble connecting to Slack to send messages and files"
- After fixing it, they said sorry and thanked users for being patient
Slack's CEO, Stewart Butterfield, tweeted: "We're still in a holding pattern. There's no resolution yet, but we'll be sharing more news as soon as we have it. Thanks for your patience."
This shows how to:
- Use different ways to talk to people
- Give updates often
- Keep things simple
- Say sorry when things go wrong
Tips for Better Updates
- Make a list of who needs to know about problems
- Write message templates for different types of issues
- Train your team on how to write clear updates
- Have a backup plan if your main way of talking to people doesn't work
- After each problem, look at how you did and find ways to do better next time
5. Choose Who Speaks and What They Do
Pick People to Talk to Different Groups
Select the right people to talk to each group during a problem:
Role | Talks To | Example |
---|---|---|
Tech Lead | IT teams, developers | CTO or engineering lead |
Support Manager | Users, clients | Head of customer care |
PR Specialist | Media, public | Communications director |
Executive | High-level stakeholders | CEO |
Make sure each person knows who they should talk to and how.
Train Speakers for Emergency Talks
Get your speakers ready for tough situations:
- Do practice runs often
- Learn to give short, clear messages
- Practice answering hard questions calmly
Make a list of key points for each type of problem. This helps speakers stay on track and give the same info to everyone.
Make a Clear Order for Sharing Information
Set up a clear way to share info:
- Problem team → Company staff
- Tech lead → Support manager
- Support manager → Users
- PR person → News and public
- Executive → Big partners
Keep all info in one place that's always up to date. This stops people from saying different things.
Use this plan to share info based on how bad the problem is:
How Bad | Who Tells Who |
---|---|
Small | Team lead → Department head |
Medium | Department head → Division boss |
Big | Division boss → Top leaders |
Very big | Top leaders → Board members |
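The table above can be read as a ladder: a bigger problem climbs further up it. The sketch below encodes that reading. It assumes, which the table does not state outright, that a problem of a given size also triggers all the hand-offs for smaller sizes; the function name is hypothetical.

```python
from typing import List, Tuple

# Hand-offs from the table above: (problem size, who tells, who is told).
HANDOFFS: List[Tuple[str, str, str]] = [
    ("small", "team lead", "department head"),
    ("medium", "department head", "division boss"),
    ("big", "division boss", "top leaders"),
    ("very big", "top leaders", "board members"),
]


def handoffs_up_to(size: str) -> List[Tuple[str, str]]:
    """Return every (teller, told) pair needed for a problem of this size.

    Assumption: larger problems include the hand-offs of all smaller sizes.
    """
    sizes = [name for name, _, _ in HANDOFFS]
    if size not in sizes:
        raise ValueError(f"unknown problem size: {size!r}")
    cutoff = sizes.index(size)
    return [(teller, told) for _, teller, told in HANDOFFS[: cutoff + 1]]


if __name__ == "__main__":
    for teller, told in handoffs_up_to("big"):
        print(f"{teller} -> {told}")
```

If your organization skips levels for the most serious problems, replace the cumulative slice with a direct lookup.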
Real-World Example: Cloudflare's 2019 Outage
In July 2019, Cloudflare, a big internet security company, had a major outage:
- What Happened: A firewall rule change containing a faulty regular expression overloaded CPUs across their network and caused a global outage
- How They Talked About It:
  - CTO John Graham-Cumming wrote a detailed blog post within 24 hours
  - CEO Matthew Prince tweeted updates and answered questions
- What They Did Well:
  - Quick first update (6 minutes after the problem started)
  - Clear, honest explanations of what went wrong
  - Regular updates on their status page
John Graham-Cumming said: "We believe in transparent communication during incidents. It's crucial for maintaining trust with our customers and the broader internet community."
This approach helped Cloudflare:
- Keep users informed
- Show they were working hard to fix the problem
- Build trust by being open about what happened
Tips for Better Speaker Management
- Make a list of who talks to who before problems happen
- Train your speakers regularly
- Use simple words to explain tech issues
- Have backup speakers ready
- After each problem, talk about what went well and what to do better next time
6. Review After the Problem is Fixed
Check How Well Communication Worked
After fixing an issue, it's important to look at how well you talked about it. Use this checklist:
Aspect | Questions to Ask |
---|---|
Speed | How fast did we tell people? |
Updates | Did we give enough updates? |
Clarity | Could everyone understand our messages? |
Channels | Did we reach everyone we needed to? |
Feedback | Did people feel well-informed? |
Use these questions to find what worked and what didn't. This helps you do better next time.
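The "Speed" and "Updates" questions can be answered with numbers if you keep timestamps of when the problem was detected and when each update went out. Here is a minimal sketch under that assumption; the function name and the returned keys are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Union


def communication_metrics(
    detected_at: datetime, update_times: List[datetime]
) -> Dict[str, Union[timedelta, int]]:
    """Summarize how quickly and how regularly updates were posted."""
    if not update_times:
        raise ValueError("no updates were posted")
    times = sorted(update_times)
    # Gaps between consecutive updates; empty when there was only one update.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "time_to_first_update": times[0] - detected_at,
        "longest_gap": max(gaps, default=timedelta(0)),
        "update_count": len(times),
    }


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0)
    posted = [start + timedelta(minutes=m) for m in (5, 35, 95)]
    for name, value in communication_metrics(start, posted).items():
        print(name, value)
```

Comparing `longest_gap` against the promised update interval for that severity level shows at a glance whether the schedule was kept.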
Ask People What They Thought
Get feedback from different groups to learn more:
- Send out surveys to teams, customers, and partners
- Talk one-on-one with key people
- Have a meeting with everyone involved
Ask questions like:
- "Did you get updates fast enough?"
- "Were our messages clear?"
- "Did you feel sure we were fixing the problem?"
- "How can we talk better next time?"
Look at what people say to find ways to improve.
Update the Plan Based on What You Learned
Use what you learned to make your plan better:
- Fix gaps: If some people didn't get messages, change how you reach them
- Improve messages: Update your ready-made messages based on feedback
- Make decisions faster: If things were slow, fix your process
- Train more: Help your speakers or team members if needed
- Get better tools: If your current tools didn't work well, find new ones
Real-World Example: Atlassian's 2022 Outage
In April 2022, Atlassian, a big software company, had a major outage that lasted two weeks. Here's how they handled it:
Action | Result |
---|---|
Quick first update | Told customers within hours |
Regular updates | Posted on status page daily |
Clear explanations | Explained the problem in simple terms |
Executive involvement | Sri Viswanath, Atlassian's CTO, gave updates |
After the outage, Atlassian did a thorough review:
- They talked to affected customers
- They looked at their communication process
- They made changes to prevent similar issues
Scott Farquhar, Atlassian's co-CEO, said: "We've learned a lot from this incident and are taking steps to improve our systems and processes."
This shows how important it is to:
- Act fast when problems happen
- Keep talking to people throughout the issue
- Learn from what went wrong
- Make real changes to do better next time
How to Use These Tips
Steps to Implement These Practices
1. Review Current Process
   - Look at past incidents
   - Check existing tools
   - Ask team members for input
2. Choose What to Fix First
   - Start with easy, quick changes
   - Plan for bigger updates later
3. Update Your Playbook
   - Add these 6 tips to your guide
   - Clearly state who does what
4. Train Your Team
   - Hold workshops on new practices
   - Practice with fake incidents
5. Start Small
   - Use 1-2 new practices at a time
   - Check if they work and get feedback
Common Problems and Fixes
Problem | How to Fix It |
---|---|
Different messages | Make ready-to-use message templates |
Slow alerts | Set up automatic alerts |
Unclear updates | Use a set format for all messages |
Too many messages | Use levels to send only important info |
Poor after-incident reviews | Always have a meeting after big problems |
Real-World Example: Atlassian's 2022 Outage
In April 2022, Atlassian had a major outage lasting two weeks. Here's what they did:
- Told customers within hours
- Posted daily updates on their status page
- Explained the problem simply
- Had their CTO, Sri Viswanath, give updates
After fixing the issue, Atlassian:
- Talked to affected customers
- Looked at how they communicated
- Made changes to prevent similar problems
Scott Farquhar, Atlassian's co-CEO, said: "We've learned a lot from this incident and are taking steps to improve our systems and processes."
Key Points to Remember
- Check and update your plan often
- Ask everyone for honest feedback
- Be ready to change your approach
- Keep training your team on good communication
Wrap-up
Quick List of the 6 Main Tips
Here's a recap of the six best practices for an effective incident communication plan:
- Plan Ahead
- Set Up Clear Communication Channels
- Use a Tiered Alert System
- Provide Timely and Transparent Updates
- Designate Spokespersons and Roles
- Conduct Post-Incident Reviews
Keep Working on Better Communication
To maintain an effective incident communication plan:
- Regular Reviews: Check your plan every 3-6 months.
- Practice Runs: Do mock incidents to test your processes.
- Stay Current: Keep up with new communication tools and best practices.
- Get Feedback: Ask team members and stakeholders for input often.
- Be Ready to Change: Update your plan as your company grows or faces new challenges.
Real-World Example: Slack's 2021 Outage Response
On January 4, 2021, Slack faced a major outage lasting about 4 hours. Here's how they handled it:
Action | Details |
---|---|
Quick Updates | Posted on status page every 30 minutes |
Multiple Channels | Used Twitter, with 9 tweets during the outage |
Clear Communication | Explained the problem simply: "Customers may have trouble connecting to Slack to send messages and files" |
Leadership Involvement | CEO Stewart Butterfield tweeted updates |
After fixing the issue, Slack:
- Apologized to users
- Thanked them for their patience
- Conducted a thorough review to prevent future incidents
This approach helped Slack:
- Keep users informed
- Show they were actively working on the problem
- Build trust through open communication
Key Takeaways
- Act Fast: Tell users about problems quickly.
- Use Many Channels: Reach out through different platforms.
- Keep It Simple: Explain issues in easy-to-understand terms.
- Learn and Improve: Look at what happened and make your plan better.
FAQs
What is an incident communication plan?
An incident communication plan is a key part of IT incident management. It's a detailed guide that covers:
- Technical steps for response teams
- Who does what during an incident
- How to handle the incident quickly
This plan helps teams share information fast and fix problems with less impact on users.
What's a good way to talk to people during an outage?
A good process for talking to people during an outage includes:
- Picking the right ways to reach people
- Saying who will do the talking
- Telling people about the problem right away
- Giving updates often
- Being honest about what's happening
- Saying sorry to affected users
- Reaching out before users ask
These steps help keep trust and keep everyone informed while fixing the problem.
What's in an incident communication strategy?
An incident communication strategy is a plan that says:
- Who's in charge during a problem
- Who talks to users and news people
- How to talk to affected people
The strategy tries to keep things calm by making sure only certain people talk to users. It usually has rules about:
- What to say in messages
- How often to give updates
- Which ways to use for talking (like email or social media)
Real-world example: GitHub's 24-hour outage
In October 2018, GitHub had a major outage:
What Happened | How They Handled It | Result |
---|---|---|
Data storage system broke | Called it their worst-level problem | Users trusted them more |
Lasted 24 hours | Told users in 5 minutes on status page; updated every 30 minutes | |
GitHub's VP of Engineering, Sam Lambert, said: "We believe in being as transparent as possible about service disruptions."
Tips for better incident communication
- Make your plan fit your business
- Train teams on what to do
- Test your system often
- Learn from each problem
- Use simple words to explain tech issues
- Have backup speakers ready
- After each problem, talk about what to do better next time
Key things to remember
- Check and update your plan often
- Ask everyone for honest feedback
- Be ready to change how you do things
- Keep training your team on good communication