Incident Communication Plan: 6 Best Practices

published on 14 August 2024

Here's a quick guide to effective incident communication:

  1. Plan Ahead
  2. Set Up Clear Communication Channels
  3. Use a Tiered Alert System
  4. Provide Timely and Transparent Updates
  5. Designate Spokespersons and Roles
  6. Conduct Post-Incident Reviews

Key points:

  • Create a detailed plan before incidents occur
  • Use multiple communication channels (status page, email, social media)
  • Categorize issues by severity and respond accordingly
  • Update stakeholders frequently and honestly
  • Assign specific roles for communication during incidents
  • Learn from each incident to improve future responses

Practice | Purpose | Example
Plan Ahead | Prepare for quick response | Pre-written templates
Clear Channels | Reach all stakeholders | Status page, email, social media
Tiered Alerts | Prioritize response | P1 (critical) to P4 (minor)
Timely Updates | Keep everyone informed | Updates every 30 minutes for major issues
Designated Speakers | Ensure consistent messaging | Tech lead for IT teams, PR for media
Post-Incident Review | Improve future responses | Team debrief, customer feedback

This guide helps maintain trust, minimize disruption, and improve incident management.

Basics of Incident Communication

Main Parts of Incident Communication

Incident communication has three key phases:

  1. First contact: Initial alert about the incident
  2. Regular updates: Ongoing communication during the incident
  3. Resolution and post-mortem: Final update and analysis after the incident

To communicate well during incidents:

  • Pick a spokesperson or team to handle messages
  • Use many channels to reach everyone affected
  • Adjust messages for different groups (teams, customers, public)
  • Keep information clear and consistent across all platforms

Common Problems When Communicating During Issues

Here are some challenges in incident communication and how to fix them:

Problem | Impact | Fix
Slow response | Users get upset | Set up a quick alert system
Mixed messages | People get confused | Use pre-made templates and one main info source
Not being open | Hurts company image | Tell people what happened, why, and how you're fixing it
Not enough updates | People worry and guess | Set a schedule for regular updates
Too much tech talk | Non-tech people don't understand | Write messages for each group

To avoid these problems:

  1. Make a full incident plan before issues happen
  2. Give clear jobs to communication teams
  3. Write message templates for different problems
  4. Use tools like Statuspage for easy updates
  5. Look at what happened after each incident to do better next time

Real-World Example: Facebook's 2010 Outage

In 2010, Facebook had a big outage that lasted about 2.5 hours. Here's how they handled it:

  • What happened: A problem with their database caused the site to go down
  • How they communicated: After fixing the issue, a Facebook engineer wrote a detailed post
  • What the post included:
    • An apology to users
    • An explanation of what went wrong
    • Steps taken to prevent future incidents

This approach helped Facebook:

  • Show they cared about users
  • Explain the problem in simple terms
  • Build trust by being open about the issue

Tips for Better Incident Communication

  1. Define incidents clearly: Use a system like the 4-tier severity scale many web companies use. This helps teams know how serious a problem is.

  2. Prepare ahead of time: Have your communication tools, channels, and message templates ready before incidents happen.

  3. Tell the right people: Start with your core team, then spread the word to other staff and customers as needed.

  4. Be extra careful with security issues: If there's a security problem or data loss, tell everyone right away.

  5. Make your first update count: When you first tell people about an issue:

    • Say you know there's a problem
    • Tell them what's not working
    • Promise to give more updates soon
    • Let them know if their data is safe

1. Plan Ahead

Create a Clear Response Plan

To handle problems well, make a plan before they happen. Here's what to do:

1. Define what counts as an incident for your company

2. Set up a system to rate how bad incidents are

3. Write down how you'll respond, including:

  • Which tools you'll use to talk to people
  • What you'll say in different situations
  • Who does what when there's a problem

List Key People and Their Jobs

Make a list of who does what during an incident:

Role | Job
Incident Manager | Runs the whole response
Communications Lead | Writes and sends out messages
Technical Lead | Gives updates on fixing the problem
Customer Support | Answers user questions
Executive Sponsor | Approves big announcements

Make sure everyone knows their job and how to do it when there's a problem.

Make Message Templates

Write messages ahead of time for different types of problems. This saves time and keeps your messages the same. Make templates for:

  • Telling people about a new problem
  • Giving updates
  • Saying the problem is fixed
  • Following up after the problem

Remember to tailor your messages to each audience. Start with your team, then tell others as needed. For security problems or lost data, tell everyone right away.
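As a rough sketch, the four template types could live in one small lookup so nobody has to write from scratch during an incident. The wording of each template, the field names (`service`, `next_update`), and the `render` helper are all assumptions made up for this example, not part of any tool.

```python
from string import Template

# Hypothetical templates, one per message type listed above.
TEMPLATES = {
    "new_problem": Template(
        "We are investigating an issue affecting $service. "
        "Next update by $next_update."
    ),
    "update": Template(
        "We are still working on the issue affecting $service. "
        "Next update by $next_update."
    ),
    "resolved": Template(
        "The issue affecting $service has been resolved. "
        "Thank you for your patience."
    ),
    "follow_up": Template(
        "We have published a review of the recent $service incident."
    ),
}


def render(kind: str, **fields: str) -> str:
    """Fill in a template; raises KeyError if a required field is missing."""
    return TEMPLATES[kind].substitute(**fields)


message = render("new_problem", service="login", next_update="14:30 UTC")
print(message)
```

Using `substitute` rather than `safe_substitute` is a deliberate choice here: a missing field fails loudly instead of sending a message with a blank in it.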

Use Different Ways to Talk to People

Don't just use one way to tell people about problems. Use many, like:

  • A special website just for updates
  • Email
  • Chat tools
  • Social media
  • Text messages

This helps make sure everyone gets the news.

Real Example: Slack's Big Outage

In 2021, Slack, a popular work chat app, had a big problem. Here's how they handled it:

  • They used their status page to give updates every 30 minutes
  • They posted on Twitter to reach more people
  • They explained what went wrong in simple terms
  • They said sorry and thanked people for being patient

Slack's CEO, Stewart Butterfield, said: "We know how much you depend on Slack, and we take our reliability very seriously. We're deeply sorry for this disruption to your work day."

This shows how important it is to:

  • Use different ways to talk to people
  • Give regular updates
  • Explain things simply
  • Say sorry when things go wrong

2. Set Up Clear Communication Channels

Choose the Right Communication Tools

Pick tools that help you talk to people quickly when problems happen. A status page is usually the best place to start. It's a dedicated website that tells everyone what's going on.

Atlassian, a big software company, uses a status page as their main way to tell people about problems. This helps them:

  • Let users sign up for updates
  • Answer fewer questions from users
  • Keep everyone up to date

Use Many Ways to Talk to People

Don't just use one way to tell people about problems. Use lots of ways:

Way to Talk | What It's For | Example
Status Page | Main place for updates | Statuspage
Email | Long messages | Company email list
Work Chat | Team updates | Slack or Microsoft Teams
Social Media | Quick public messages | Twitter or LinkedIn
Text Messages | Urgent alerts | Text message service

Make sure you have backup ways to talk if the internet goes down. This could be extra internet connections or even satellite phones.

Make One Place for All Messages

Have one place where your team can talk about the problem.

For people outside your company, put status info right on your website. This way, they can see what's happening without going to another page.

Tips for Better Communication

  1. Pick your tools before problems happen
  2. Use a status page as your main way to tell people what's going on
  3. Put status info on your website
  4. Use many ways to talk to people (email, chat, social media)
  5. Have a place for your team to talk about the problem
  6. Keep contact lists up to date
  7. Test your plan to make sure it works

3. Use a Tiered Alert System

Group Issues by How Serious They Are

A tiered alert system helps teams respond to problems quickly and correctly. Here's how to group issues:

Level | What It Means | Example
P1 | Very bad: Whole service down | Website crashes
P2 | Bad: Big part not working | Can't log in
P3 | Not great: Small problem | Search is slow
P4 | Small issue: Doesn't hurt much | Button looks wrong

This system helps teams know how fast to act and who to call.
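Here is a minimal sketch of how the four tiers might look in code. The `Severity` enum mirrors the table above, but the `classify` function and its three yes/no inputs are assumptions for illustration. Your own rules will depend on your service.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Lower number means more serious, matching the P1 to P4 table."""
    P1 = 1  # whole service down
    P2 = 2  # big part not working
    P3 = 3  # small problem
    P4 = 4  # minor issue


def classify(service_down: bool, major_feature_broken: bool, degraded: bool) -> Severity:
    """Pick the most serious tier whose condition is true (assumed rules)."""
    if service_down:
        return Severity.P1
    if major_feature_broken:
        return Severity.P2
    if degraded:
        return Severity.P3
    return Severity.P4


# "Can't log in" from the table: a big part is not working.
level = classify(service_down=False, major_feature_broken=True, degraded=False)
print(level.name)  # P2
```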

Steps for Each Problem Level

For each level, have clear steps:

1. P1 (Very Bad)

  • Tell everyone right away
  • Call the boss
  • Start fixing in 15 minutes
  • Update every 30 minutes

2. P2 (Bad)

  • Tell team leaders
  • Get main team together
  • Update every hour

3. P3 (Not Great)

  • Tell the team in charge
  • Plan to fix soon
  • Update daily

4. P4 (Small Issue)

  • Write it down
  • Fix when there's time
  • Check weekly
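The steps for each level can be written down as data, so that tools and people read from the same source. This is only a sketch: the `ResponsePolicy` fields and the wording of who to notify are assumptions based on the lists above.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class ResponsePolicy:
    notify: str                                  # who hears about it first
    update_every: timedelta                      # how often to post updates
    start_fix_within: Optional[timedelta] = None  # only P1 sets a deadline above


# Values taken from the per-level steps above.
POLICIES = {
    "P1": ResponsePolicy("everyone, including leadership",
                         timedelta(minutes=30), timedelta(minutes=15)),
    "P2": ResponsePolicy("team leaders", timedelta(hours=1)),
    "P3": ResponsePolicy("the team in charge", timedelta(days=1)),
    "P4": ResponsePolicy("nobody, just log it", timedelta(weeks=1)),
}


def policy_for(level: str) -> ResponsePolicy:
    """Look up the policy for a level such as 'P1' (case-insensitive)."""
    return POLICIES[level.upper()]


print(policy_for("p1").notify)
```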

Set Up Automatic Alerts

Use tools to send alerts fast:

  1. Connect monitoring and alerting tools to chat tools (like PagerDuty to Slack)
  2. Set alerts for specific problems
  3. Make sure alerts escalate to the next person if no one responds

For big problems, the system could:

  • Text the on-call team
  • Make a new chat room
  • Update the status page
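The idea of passing an alert on when no one answers can be sketched in a few lines. The on-call chain and the 5-minute wait below are assumptions for this example. Alerting tools have their own escalation settings, and this does not use any of their APIs.

```python
from datetime import timedelta

# Hypothetical on-call chain, in the order people should be paged.
ESCALATION_CHAIN = ["primary on-call", "secondary on-call", "engineering manager"]

# Assumed wait before moving to the next person.
ACK_TIMEOUT = timedelta(minutes=5)


def who_to_page(elapsed_without_ack: timedelta) -> str:
    """Return who should be paged after this much time with no answer."""
    step = elapsed_without_ack // ACK_TIMEOUT  # whole timeouts that have passed
    # Stay on the last person once the chain runs out.
    return ESCALATION_CHAIN[min(step, len(ESCALATION_CHAIN) - 1)]


print(who_to_page(timedelta(minutes=7)))  # secondary on-call
```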

Real-World Example: GitHub's 24-Hour Outage

In October 2018, GitHub had a big problem:

  • What Happened: A data storage system broke
  • How They Used Tiers: They called it a P1 (worst) problem
  • What They Did:
    • Told users in 5 minutes on status page
    • Updated every 30 minutes
    • Fixed in 24 hours
  • Result: Users trusted them more for being open

GitHub's VP of Engineering, Sam Lambert, said: "We believe in being as transparent as possible about service disruptions."

Tips for Better Alerts

  1. Make your tiers fit your business
  2. Train teams on what each tier means
  3. Test your system often
  4. Learn from each problem to make the system better

4. Give Quick and Open Updates

Update Frequency Based on Issue Severity

Match your update frequency to how bad the problem is:

Issue Level | How Often to Update
P1 (Very Bad) | Every 30 minutes
P2 (Bad) | Every hour
P3 (Not Great) | Once a day
P4 (Small) | Once a week

Stick to this schedule. It helps people trust you and stay informed without too many messages.
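One way to stick to the schedule is to work out when the next update is due and flag it when it is late. The sketch below uses the intervals from the table. The function names are made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Update intervals from the table above.
UPDATE_INTERVAL = {
    "P1": timedelta(minutes=30),
    "P2": timedelta(hours=1),
    "P3": timedelta(days=1),
    "P4": timedelta(weeks=1),
}


def next_update_due(level: str, last_update: datetime) -> datetime:
    """When the next update should go out for this level."""
    return last_update + UPDATE_INTERVAL[level]


def is_overdue(level: str, last_update: datetime, now: datetime) -> bool:
    """True if the promised update time has already passed."""
    return now > next_update_due(level, last_update)


last = datetime(2024, 8, 14, 12, 0, tzinfo=timezone.utc)
now = last + timedelta(minutes=45)
print(is_overdue("P1", last, now))  # True: a P1 update was due at 12:30
print(is_overdue("P2", last, now))  # False: a P2 update is due at 13:00
```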

Be Open While Keeping Some Things Private

Tell people:

  • What the problem is
  • How it affects users
  • What you're doing to fix it
  • When you think it will be fixed (if you know)

Don't share:

  • Secret security info
  • Personal details
  • Things you're not sure about

Keep your messages clear and factual. This stops people from getting confused or worried for no reason.

Write Clear Messages That Tell People What to Do

Make your messages easy to understand and act on:

  1. Sum up the problem
  2. Say what's happening now
  3. List what you're doing to fix it
  4. Tell users what to do (if needed)
  5. Say when the next update will come

Use this format:

[Problem ID]: Short description
Status: Still happening / Fixed
Impact: What's not working
What we're doing: Steps we're taking
What users should do: Actions for users (if any)
Next update: When we'll say more

This helps everyone know what's going on and what to expect.
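If you want every update to follow the format above, a small helper can build it for you. This is a sketch only: the `StatusUpdate` class, its field names, and the sample incident are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class StatusUpdate:
    problem_id: str
    description: str
    resolved: bool
    impact: str
    actions: str
    user_steps: str
    next_update: str

    def render(self) -> str:
        """Build the message line by line in the format shown above."""
        status = "Fixed" if self.resolved else "Still happening"
        return "\n".join([
            f"{self.problem_id}: {self.description}",
            f"Status: {status}",
            f"Impact: {self.impact}",
            f"What we're doing: {self.actions}",
            f"What users should do: {self.user_steps}",
            f"Next update: {self.next_update}",
        ])


# Hypothetical incident used only to show the output shape.
update = StatusUpdate(
    problem_id="INC-101",
    description="Login errors",
    resolved=False,
    impact="Some users can't log in",
    actions="Restarting the login service",
    user_steps="No action needed",
    next_update="In 30 minutes",
)
print(update.render())
```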

Real-World Example: Slack's 2021 Outage

On January 4, 2021, Slack had a big outage that lasted about 4 hours. Here's how they handled it:

  • They posted updates on their status page every 30 minutes
  • They used Twitter to reach more people, with 9 tweets during the outage
  • They explained the problem simply: "Customers may have trouble connecting to Slack to send messages and files"
  • After fixing it, they said sorry and thanked users for being patient

Slack's CEO, Stewart Butterfield, tweeted: "We're still in a holding pattern. There's no resolution yet, but we'll be sharing more news as soon as we have it. Thanks for your patience."

This shows how to:

  • Use different ways to talk to people
  • Give updates often
  • Keep things simple
  • Say sorry when things go wrong

Tips for Better Updates

  1. Make a list of who needs to know about problems
  2. Write message templates for different types of issues
  3. Train your team on how to write clear updates
  4. Have a backup plan if your main way of talking to people doesn't work
  5. After each problem, look at how you did and find ways to do better next time

5. Choose Who Speaks and What They Do

Pick People to Talk to Different Groups

Select the right people to talk to each group during a problem:

Role | Talks To | Example
Tech Lead | IT teams, developers | CTO or engineering lead
Support Manager | Users, clients | Head of customer care
PR Specialist | Media, public | Communications director
Executive | High-level stakeholders | CEO

Make sure each person knows who they should talk to and how.

Train Speakers for Emergency Talks

Get your speakers ready for tough situations:

  1. Do practice runs often
  2. Learn to give short, clear messages
  3. Practice answering hard questions calmly

Make a list of key points for each type of problem. This helps speakers stay on track and give the same info to everyone.

Make a Clear Order for Sharing Information

Set up a clear way to share info:

  1. Problem team → Company staff
  2. Tech lead → Support manager
  3. Support manager → Users
  4. PR person → News and public
  5. Executive → Big partners

Keep all info in one place that's always up to date. This stops people from saying different things.

Use this plan to share info based on how bad the problem is:

How Bad | Who Tells Who
Small | Team lead → Department head
Medium | Department head → Division boss
Big | Division boss → Top leaders
Very big | Top leaders → Board members

Real-World Example: Cloudflare's 2019 Outage

In July 2019, Cloudflare, a big internet security company, had a major outage:

  • What Happened: A config change caused 50% of their network to go down
  • How They Talked About It:
    • CTO John Graham-Cumming wrote a detailed blog post within 24 hours
    • CEO Matthew Prince tweeted updates and answered questions
  • What They Did Well:
    • Quick first update (6 minutes after the problem started)
    • Clear, honest explanations of what went wrong
    • Regular updates on their status page

John Graham-Cumming said: "We believe in transparent communication during incidents. It's crucial for maintaining trust with our customers and the broader internet community."

This approach helped Cloudflare:

  • Keep users informed
  • Show they were working hard to fix the problem
  • Build trust by being open about what happened

Tips for Better Speaker Management

  1. Make a list of who talks to who before problems happen
  2. Train your speakers regularly
  3. Use simple words to explain tech issues
  4. Have backup speakers ready
  5. After each problem, talk about what went well and what to do better next time

6. Review After the Problem is Fixed

Check How Well Communication Worked

After fixing an issue, it's important to look at how well you talked about it. Use this checklist:

Aspect | Questions to Ask
Speed | How fast did we tell people?
Updates | Did we give enough updates?
Clarity | Could everyone understand our messages?
Channels | Did we reach everyone we needed to?
Feedback | Did people feel well-informed?

Use these questions to find what worked and what didn't. This helps you do better next time.
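The "Speed" and "Updates" questions can be answered with numbers if you keep the timestamps of each message. Here is a small sketch. The function name and the sample times are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def communication_metrics(
    started: datetime, updates: List[datetime]
) -> Tuple[timedelta, timedelta]:
    """Return (time to first update, longest gap between updates)."""
    if not updates:
        raise ValueError("no updates were sent")
    ordered = sorted(updates)
    time_to_first = ordered[0] - started
    # Gaps between each update and the one after it.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    longest_gap = max(gaps, default=timedelta(0))
    return time_to_first, longest_gap


# Hypothetical timeline: problem starts at 12:00, three updates follow.
started = datetime(2024, 8, 14, 12, 0)
updates = [
    datetime(2024, 8, 14, 12, 6),
    datetime(2024, 8, 14, 12, 36),
    datetime(2024, 8, 14, 13, 30),
]
first, longest = communication_metrics(started, updates)
print(first, longest)
```

Comparing the longest gap against the promised update interval for that severity level shows quickly whether the schedule was kept.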

Ask People What They Thought

Get feedback from different groups to learn more:

  1. Send out surveys to teams, customers, and partners
  2. Talk one-on-one with key people
  3. Have a meeting with everyone involved

Ask questions like:

  • "Did you get updates fast enough?"
  • "Were our messages clear?"
  • "Did you feel sure we were fixing the problem?"
  • "How can we talk better next time?"

Look at what people say to find ways to improve.

Update the Plan Based on What You Learned

Use what you learned to make your plan better:

  1. Fix gaps: If some people didn't get messages, change how you reach them
  2. Improve messages: Update your ready-made messages based on feedback
  3. Make decisions faster: If things were slow, fix your process
  4. Train more: Help your speakers or team members if needed
  5. Get better tools: If your current tools didn't work well, find new ones

Real-World Example: Atlassian's 2022 Outage

In April 2022, Atlassian, a big software company, had a major outage that lasted two weeks. Here's how they handled it:

Action | Result
Quick first update | Told customers within hours
Regular updates | Posted on status page daily
Clear explanations | Explained the problem in simple terms
Leadership involvement | Sri Viswanath, Atlassian's CTO at the time, gave updates

After the outage, Atlassian did a thorough review:

  1. They talked to affected customers
  2. They looked at their communication process
  3. They made changes to prevent similar issues

Scott Farquhar, Atlassian's co-CEO, said: "We've learned a lot from this incident and are taking steps to improve our systems and processes."

This shows how important it is to:

  • Act fast when problems happen
  • Keep talking to people throughout the issue
  • Learn from what went wrong
  • Make real changes to do better next time

How to Use These Tips

Steps to Implement These Practices

1. Review Current Process

  • Look at past incidents
  • Check existing tools
  • Ask team members for input

2. Choose What to Fix First

  • Start with easy, quick changes
  • Plan for bigger updates later

3. Update Your Playbook

  • Add these 6 tips to your guide
  • Clearly state who does what

4. Train Your Team

  • Hold workshops on new practices
  • Practice with fake incidents

5. Start Small

  • Use 1-2 new practices at a time
  • Check if they work and get feedback

Common Problems and Fixes

Problem | How to Fix It
Different messages | Make ready-to-use message templates
Slow alerts | Set up automatic alerts
Unclear updates | Use a set format for all messages
Too many messages | Use levels to send only important info
Poor after-incident reviews | Always have a meeting after big problems

Real-World Example: Atlassian's 2022 Outage

In April 2022, Atlassian had a major outage lasting two weeks. Here's what they did:

  • Told customers within hours
  • Posted daily updates on their status page
  • Explained the problem simply
  • Had their CTO at the time, Sri Viswanath, give updates

After fixing the issue, Atlassian:

  1. Talked to affected customers
  2. Looked at how they communicated
  3. Made changes to prevent similar problems

Scott Farquhar, Atlassian's co-CEO, said: "We've learned a lot from this incident and are taking steps to improve our systems and processes."

Key Points to Remember

  • Check and update your plan often
  • Ask everyone for honest feedback
  • Be ready to change your approach
  • Keep training your team on good communication

Wrap-up

Quick List of the 6 Main Tips

Here's a recap of the six best practices for an effective incident communication plan:

  1. Plan Ahead
  2. Set Up Clear Communication Channels
  3. Use a Tiered Alert System
  4. Provide Timely and Transparent Updates
  5. Designate Spokespersons and Roles
  6. Conduct Post-Incident Reviews

Keep Working on Better Communication

To maintain an effective incident communication plan:

  1. Regular Reviews: Check your plan every 3-6 months.
  2. Practice Runs: Do mock incidents to test your processes.
  3. Stay Current: Keep up with new communication tools and best practices.
  4. Get Feedback: Ask team members and stakeholders for input often.
  5. Be Ready to Change: Update your plan as your company grows or faces new challenges.

Real-World Example: Slack's 2021 Outage Response

On January 4, 2021, Slack faced a major outage lasting about 4 hours. Here's how they handled it:

Action | Details
Quick Updates | Posted on status page every 30 minutes
Multiple Channels | Used Twitter, with 9 tweets during the outage
Clear Communication | Explained the problem simply: "Customers may have trouble connecting to Slack to send messages and files"
Leadership Involvement | CEO Stewart Butterfield tweeted updates

After fixing the issue, Slack:

  • Apologized to users
  • Thanked them for their patience
  • Conducted a thorough review to prevent future incidents

This approach helped Slack:

  • Keep users informed
  • Show they were actively working on the problem
  • Build trust through open communication

Key Takeaways

  1. Act Fast: Tell users about problems quickly.
  2. Use Many Channels: Reach out through different platforms.
  3. Keep It Simple: Explain issues in easy-to-understand terms.
  4. Learn and Improve: Look at what happened and make your plan better.

FAQs

What is an incident communication plan?

An incident communication plan is a key part of IT incident management. It's a detailed guide that covers:

  • Technical steps for response teams
  • Who does what during an incident
  • How to handle the incident quickly

This plan helps teams share information fast and fix problems with less impact on users.

What's a good way to talk to people during an outage?

A good process for talking to people during an outage includes:

  1. Picking the right ways to reach people
  2. Saying who will do the talking
  3. Telling people about the problem right away
  4. Giving updates often
  5. Being honest about what's happening
  6. Saying sorry to affected users
  7. Reaching out before users ask

These steps help keep trust and keep everyone informed while fixing the problem.

What's in an incident communication strategy?

An incident communication strategy is a plan that says:

  • Who's in charge during a problem
  • Who talks to users and news people
  • How to talk to affected people

The strategy tries to keep things calm by making sure only certain people talk to users. It usually has rules about:

  • What to say in messages
  • How often to give updates
  • Which ways to use for talking (like email or social media)

Real-world example: GitHub's 24-hour outage

In October 2018, GitHub had a major outage:

What Happened | How They Handled It | Result
Data storage system broke | Called it their worst-level problem | Users trusted them more
Lasted 24 hours | Told users in 5 minutes on status page |
 | Updated every 30 minutes |

GitHub's VP of Engineering, Sam Lambert, said: "We believe in being as transparent as possible about service disruptions."

Tips for better incident communication

  1. Make your plan fit your business
  2. Train teams on what to do
  3. Test your system often
  4. Learn from each problem
  5. Use simple words to explain tech issues
  6. Have backup speakers ready
  7. After each problem, talk about what to do better next time

Key things to remember

  • Check and update your plan often
  • Ask everyone for honest feedback
  • Be ready to change how you do things
  • Keep training your team on good communication
