SLIs & SLOs in AIOps: Best Practices Guide

published on 15 August 2024

This guide covers essential best practices for implementing Service Level Indicators (SLIs) and Service Level Objectives (SLOs) in AIOps:

  • What SLIs and SLOs are and why they matter
  • How to create effective SLIs and set realistic SLOs
  • Implementing SLIs/SLOs in AIOps systems
  • Monitoring, reporting, and continuously improving
  • Common pitfalls to avoid
  • Future trends in SLIs/SLOs for AIOps

Key takeaways:

  • Focus on user-centric SLIs aligned with business goals
  • Set achievable SLOs and review regularly
  • Use AIOps tools to automate data collection and monitoring
  • Implement error budgets to balance reliability and innovation
  • Continuously improve based on feedback and data

Best Practice | Description | Example
User-centric SLIs | Map to specific user journeys | API response time
Realistic SLOs | Start conservative, increase gradually | 99.9% availability
AIOps automation | Use AI for predictive analysis | Google Cloud Operations AI
Error budgets | Guide feature releases vs. stability | Allow 0.1% downtime per month
Regular reviews | Assess quarterly with business leaders | Adjust targets based on KPIs

2. Basics of SLIs and SLOs

2.1. What are Service Level Indicators (SLIs)?

SLIs are metrics that measure service quality for users. They focus on key aspects of performance that directly impact user experience.

Common SLI types:

Service Type | SLI Examples
Response/Request | Availability: Server response success rate; Latency: Time for server to respond; Throughput: Number of requests handled
Storage | Availability: Data access on demand; Latency: Read/write speed; Durability: Data persistence
Pipeline | Correctness: Accuracy of returned data; Freshness: Time for new data to appear
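As a rough illustration of the request/response SLIs listed above, the sketch below derives availability, a latency percentile, and throughput from a handful of request records. The `Request` record, the function names, and the sample values are assumptions made for this example; a real system would read these from its monitoring data.

```python
import math
from dataclasses import dataclass

@dataclass
class Request:
    ok: bool           # True if the server returned a successful response
    latency_ms: float  # time the server took to respond

def availability(requests):
    """Share of requests that succeeded (0.0 to 1.0)."""
    return sum(r.ok for r in requests) / len(requests)

def latency_percentile(requests, pct):
    """Nearest-rank percentile of latency, e.g. pct=95 for p95."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]

def throughput(requests, window_seconds):
    """Requests handled per second over the observation window."""
    return len(requests) / window_seconds

# Made-up sample: four requests observed over two seconds
sample = [Request(True, 80), Request(True, 120), Request(False, 900), Request(True, 95)]
print(availability(sample))            # 0.75
print(latency_percentile(sample, 50))  # 95
print(throughput(sample, 2))           # 2.0
```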

When using SLIs:

  • Focus on metrics closest to user experience
  • Keep it simple and clear for IT teams

2.2. What are Service Level Objectives (SLOs)?

SLOs are targets for service quality measured by SLIs. They're usually shown as a percentage over time. For example: 99.9% of requests processed within 100 milliseconds over 30 days.
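To make the example target concrete, here is a minimal sketch that checks whether a window of latency measurements meets "99.9% of requests within 100 milliseconds". The function names and the sample window are illustrative assumptions, not part of any specific tool.

```python
def slo_compliance(latencies_ms, threshold_ms=100.0):
    """Fraction of requests served within the latency threshold."""
    good = sum(1 for t in latencies_ms if t <= threshold_ms)
    return good / len(latencies_ms)

def meets_slo(latencies_ms, target=0.999, threshold_ms=100.0):
    """True if the share of fast-enough requests reaches the SLO target."""
    return slo_compliance(latencies_ms, threshold_ms) >= target

# Made-up 30-day window: 10,000 requests, 8 of them slower than 100 ms
window = [50.0] * 9992 + [250.0] * 8
print(meets_slo(window))  # True (99.92% of requests were within 100 ms)
```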

Key points about SLOs:

  • Help balance product development and operations
  • 100% reliability isn't practical; SLOs find the right balance
  • Often use "nines" notation (e.g., 99.9% = "three nines")

Tips for setting SLOs:

  • Be realistic, don't aim for 100%
  • Update targets based on performance
  • Focus on what matters most to users

2.3. How SLIs and SLOs Work Together

SLIs and SLOs team up to ensure good service:

  1. SLIs measure performance
  2. SLOs set targets for these measures
  3. Regular checks help spot areas to improve
  4. SLOs guide where to focus efforts

Example from an e-commerce business:

User Journey | SLO
Login success | 99.99%
Search response time | < 200ms for 99.9% of requests
Checkout completion | 99.95%

By tracking these, the business knows where to make things better for users.
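One way to picture this: keep the targets from the table in a small lookup and compare them with what was measured. In the sketch below the targets come from the example above, while the measured values and the helper name are made up for illustration.

```python
# SLO targets from the e-commerce example, as fractions of good events
slo_targets = {
    "login_success": 0.9999,
    "search_under_200ms": 0.999,
    "checkout_completion": 0.9995,
}

# Hypothetical measurements for one review period (invented for this sketch)
measured = {
    "login_success": 0.99995,
    "search_under_200ms": 0.9982,
    "checkout_completion": 0.9996,
}

def journeys_missing_slo(targets, observed):
    """Return (journey, shortfall) pairs for journeys below target, worst first."""
    misses = [(name, targets[name] - observed[name])
              for name in targets if observed[name] < targets[name]]
    return sorted(misses, key=lambda pair: pair[1], reverse=True)

for name, gap in journeys_missing_slo(slo_targets, measured):
    print(f"{name}: {gap:.2%} below target")
```

With these sample numbers only the search journey falls short, so that is where improvement work would start.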

2.4. SLOs vs. Service Level Agreements (SLAs)

While related, SLOs and SLAs serve different purposes:

Aspect | SLOs | SLAs
What they are | Internal targets | Contracts with customers
Purpose | Guide internal work | Set customer expectations
How often they change | Can change often | Usually fixed for contract
What happens if missed | Internal changes | Possible penalties

Best practices:

  • Set SLOs slightly higher than SLAs
  • Use SLOs to prevent SLA breaches
  • Review SLOs regularly
  • Keep SLAs achievable

"SLOs are often set slightly higher than SLAs to provide IT teams with a buffer to resolve issues before breaching the SLA." - Industry best practice

For example, an SLO might aim for 99.7% uptime, while the SLA promises 99.5%. This gives IT time to fix problems without breaking the agreement.
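The size of that buffer can be worked out directly. The sketch below converts both targets into allowed downtime, assuming a 30-day measurement window (the window length is an assumption; the text does not state one).

```python
def allowed_downtime_minutes(target_pct, days=30):
    """Minutes of downtime a given uptime target permits over the window."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - target_pct) / 100

slo_minutes = allowed_downtime_minutes(99.7)  # internal target
sla_minutes = allowed_downtime_minutes(99.5)  # customer promise

print(round(slo_minutes, 1))                # 129.6
print(round(sla_minutes, 1))                # 216.0
print(round(sla_minutes - slo_minutes, 1))  # 86.4 minutes of buffer
```

Under these assumptions, the team has roughly 86 minutes per month between missing the internal SLO and breaching the SLA.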

3. SLIs and SLOs in AIOps

3.1. How AIOps Uses SLIs and SLOs

AIOps uses SLIs and SLOs to link IT operations with business goals. This approach helps:

  • Measure service performance that matters to the business
  • Set clear, number-based goals for good performance
  • Make sure IT work matches what users need

Here are some good SLIs for AIOps:

SLI Type | What It Measures
Speed | How fast the system responds
Quality | How often errors happen
Uptime | How often the system is working

These SLIs help set useful SLOs. For example:

  • 95% of actions should take 500 milliseconds or less
  • 99% of actions should work without errors
  • The system should be up 99.9% of the time during work hours

3.2. Why SLIs and SLOs Help in AIOps

Using SLIs and SLOs in AIOps has several good points:

  1. Better teamwork: IT and business teams can understand each other better.
  2. Clearer view: It's easier to see how well services are working.
  3. Faster fixes: Teams can quickly spot and fix problems.
  4. Easier talks: Everyone can discuss service performance using the same terms.
  5. Handles odd cases: Expressing targets as a percentage of good events (or as percentiles) keeps a few unusual data points from distorting the overall results.

3.3. Problems with Adding SLIs and SLOs to AIOps

Even though SLIs and SLOs are helpful, they can be hard to use:

  1. Picking the right measures: It's tough to choose SLIs that really show service quality and business impact.
  2. Too much info: Teams might track too many things, leading to too many alerts.
  3. Matching business needs: IT teams need to work closely with business teams to set the right goals.
  4. Getting good data: AIOps needs correct information to work well.
  5. Changing how people think: Teams need to focus more on what users experience, not just on technical details.

A real example shows why this matters:

A New Relic customer got in trouble for slow systems. They tracked thousands of technical details but missed what really mattered. This led to constant alerts and unhappy business leaders.

To avoid these issues, teams should:

  • Pick SLIs that clearly relate to how users experience the service
  • Regularly check if SLOs still make sense
  • Try to combine multiple SLIs into one SLO when possible
  • Make sure they can collect and analyze data correctly
  • Help everyone focus on making things better for users

For more info on setting good SLOs, check out Chapter 4 of the Google SRE book.

4. How to create good SLIs for AIOps

4.1. Find key user paths

To make good SLIs for AIOps:

  1. Look at user data to see what people use most
  2. Ask business leaders what's important
  3. Map out how users use your main services
  4. Focus on paths that affect users and money the most

For example, an online store might focus on:

  • How people buy things
  • How they search for products
  • How they manage their accounts

4.2. Choose the right metrics

Pick metrics that show how well your key paths work. These should match what users want and what the business needs.

Metric Type | What It Measures | Why It Matters for AIOps
Speed | How fast things work | Shows if the system is quick enough
Uptime | How often things work | Shows if the system is reliable
Capacity | How much work the system can do | Shows if the system can handle demand
Accuracy | If things work correctly | Shows if the system does what it should

When picking metrics:

  • Make sure you can count them
  • Choose ones that affect users directly
  • Pick ones you can improve
  • Make sure they help the business

4.3. Make sure SLIs can be measured and matter

For SLIs to work in AIOps:

  1. Set up good ways to watch and collect data
  2. Make sure the data is correct
  3. Check if the SLIs match user experience and help the business
  4. Keep checking if the SLIs still make sense

For example, a cloud company might track how fast their API responds. They'd need to:

  • Set up exact timing tools
  • Gather data from all their endpoints
  • Check if faster responses make customers happier

4.4. Keep SLIs thorough but simple

Cover all important parts of your service, but don't use too many SLIs. Try to balance covering everything and keeping it simple.

Tips:

  • Use 3-5 main SLIs for each key user path
  • Combine similar metrics when you can
  • Regularly check and remove SLIs you don't need
  • Use clear names for your SLIs

5. How to set good SLOs for AIOps

5.1. Align SLOs with business goals

To set good Service Level Objectives (SLOs) for AIOps:

  1. List key business goals
  2. Turn these goals into IT metrics
  3. Create SLOs that support these metrics
  4. Check SLOs with stakeholders

For example, if you want to make customers happier, you might set an SLO to fix problems 30% faster using AIOps tools.

5.2. Set reachable targets

To make SLOs you can actually meet:

  • Measure how things work now
  • Set small goals to improve
  • Think about what your team and tech can do
  • Look at past data and what other companies do

Try this step-by-step plan:

  1. Start with goals just above where you are now
  2. Slowly raise goals as you get better
  3. Aim high once your AIOps system is working well

5.3. Use different SLO levels

Having different levels of SLOs can help you manage what people expect and where to focus. Try these levels:

  1. Basic: The least you'll accept
  2. Standard: What you aim for most of the time
  3. Premium: The best service for your most important systems

This helps you use your resources well and tell people what to expect.

5.4. Check and update SLOs often

Look at your SLOs regularly:

  • Check how you're doing each month or every three months
  • Look for patterns in how well you're meeting SLOs
  • Ask users and stakeholders what they think
  • Change SLOs based on new tech and business needs

SLO Review Step | Frequency | Action
Performance check | Monthly | Compare actual vs target
Trend analysis | Quarterly | Look for patterns over time
Stakeholder feedback | Quarterly | Get input from users and leaders
Tech update review | Semi-annually | Adjust for new tools and capabilities

6. Adding SLIs and SLOs to AIOps systems

6.1. Putting SLIs and SLOs into AIOps tools

To add Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to AIOps tools:

  1. Pick AIOps platforms that work with SLIs and SLOs
  2. Set clear SLIs that match your business goals
  3. Make realistic SLOs based on past performance
  4. Set up your AIOps tool to collect SLI data
  5. Create dashboards and alerts to watch SLOs

Datadog's AIOps platform lets users set custom SLIs and track them against SLOs in real-time. This helps teams see how well services are working right away.

6.2. Using machine learning to gather and analyze data

Here's how to use machine learning for data in AIOps:

  • Set up automatic data collection across your IT systems
  • Use ML to find patterns and anomalies in SLI data
  • Use statistical forecasting to estimate when SLOs might be missed
  • Use AI to read logs and incident reports

Google Cloud's Operations suite uses ML to group related issues. This has helped some companies fix problems up to 50% faster.

6.3. Using AI to predict issues

AI can help prevent problems in AIOps:

  1. Train AI on old SLI data and problem records
  2. Set up AI to spot anomalies as they happen
  3. Use AI to link different SLIs and find what's causing problems
  4. Make AI chatbots to help fix issues faster

IBM's Watson AIOps can predict IT outages up to 2 weeks before they happen. This lets teams fix things before they break.

6.4. Keeping data accurate and trustworthy

To make sure SLI and SLO data is correct:

What to do | How it helps | Example
Check data | Finds wrong numbers | Splunk's Data Quality Center flags odd metrics
Look at data often | Makes sure everything's right | Netflix checks data quality every month
Track changes | Keeps SLI definitions clear | GitLab uses version control for SLI/SLO settings
Make data rules | Keeps reporting the same across teams | Atlassian has rules to make sure SLI reports match

7. Watching and reporting on SLIs and SLOs in AIOps

7.1. Real-time SLI and SLO monitoring

To check SLIs and SLOs in real-time:

  1. Set up ongoing data collection from all systems
  2. Use AIOps tools with built-in monitoring
  3. Create alerts for SLO issues
  4. Use AI to spot unusual SLI patterns
  5. Predict possible SLO problems

For example, Dynatrace's Davis AI can detect and alert on SLO violations within seconds, helping teams fix issues before users notice.

7.2. Clear data visualization

Make SLI and SLO data easy to understand:

  • Use simple charts (line graphs, gauges)
  • Add colors (green = good, red = bad)
  • Group related metrics
  • Allow users to dig deeper into data
  • Make dashboards work on all devices

New Relic's dashboards let teams create custom views of SLI and SLO data, which helped Zendesk cut their incident response time by 92%.

7.3. Smart alerting

Set up good alerts:

  1. Choose clear alert levels
  2. Use multi-step alerts (warning, critical)
  3. Send alerts to the right people
  4. Use AI to group related alerts
  5. Connect alerts to problem-solving tools

Alert Level | Threshold | Action
Warning | 90% of SLO | Notify team lead
Critical | 95% of SLO | Page on-call engineer

OpsGenie's integration with Datadog reduced false alerts by 70% for Delivery Hero, helping them focus on real issues.
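The warning and critical levels from the table above can be expressed as a small routing function. The table does not say exactly what "90% of SLO" is measured against, so this sketch assumes the thresholds refer to the share of the error budget already consumed; that interpretation, the function name, and the routing messages are assumptions.

```python
# Who gets told at each level, taken from the alert table
ACTIONS = {
    "warning": "notify team lead",
    "critical": "page on-call engineer",
}

def alert_level(budget_consumed, warning=0.90, critical=0.95):
    """Map the consumed share of the error budget (0.0 to 1.0+) to an alert level.

    Assumes thresholds mean "fraction of error budget used", which is one
    possible reading of the table.
    """
    if budget_consumed >= critical:
        return "critical"
    if budget_consumed >= warning:
        return "warning"
    return "ok"

for consumed in (0.50, 0.92, 0.97):
    level = alert_level(consumed)
    print(consumed, level, ACTIONS.get(level, "no action"))
```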

7.4. Team reports

Make useful SLI and SLO reports:

  • Create weekly or monthly summaries
  • Show trends over time
  • Compare current and past performance
  • Make different reports for different team members
  • Use AI to explain data in plain language

Splunk's reporting tools helped Expedia Group cut their mean time to resolve (MTTR) by 50% by giving teams clear, actionable insights from their SLI and SLO data.

8. Error budgets and SLOs in AIOps

8.1. What are error budgets?

Error budgets are a tool in AIOps that allow for some downtime or failures within a set time frame. They work with Service Level Objectives (SLOs) to help teams manage system reliability.

For example, if an SLO aims for 99.99% uptime, the error budget allows for 0.01% downtime. This helps teams:

  • Balance new features with system stability
  • Use failures to improve service quality
  • Guide system development efforts

8.2. How to use error budgets

To use error budgets effectively:

  1. Set a clear SLO for your service
  2. Figure out how much failure is okay based on your SLO
  3. Keep track of how well your system is actually doing
  4. See how much of your error budget you've used
  5. Use this info to decide what to work on next
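The steps above can be sketched for a request-based SLO as follows. The function name, the report fields, and the traffic figures are illustrative assumptions.

```python
def error_budget_report(slo_target, total_events, bad_events):
    """Summarize error budget usage for one measurement window."""
    # Step 2: how many failures the SLO tolerates in this window
    allowed_bad = (1 - slo_target) * total_events
    # Steps 3-4: compare actual failures with the allowance
    consumed = bad_events / allowed_bad if allowed_bad else float("inf")
    return {
        "allowed_bad_events": allowed_bad,
        "budget_consumed": consumed,
        "budget_remaining": max(0.0, 1 - consumed),
    }

# Hypothetical window: 2,000,000 requests, 1,200 failures, 99.9% SLO
report = error_budget_report(slo_target=0.999, total_events=2_000_000, bad_events=1_200)
print(f"{report['budget_consumed']:.0%} of the error budget used")  # 60% of the error budget used
```

A team seeing 60% of the budget used partway through the window would still have room for releases, while a figure near 100% would point toward reliability work (step 5).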

Google uses this method to set limits on how unreliable a service can be in a quarter. This gives teams a clear guide for acceptable risk.

8.3. Making choices with error budgets

Error budgets help teams decide between improving reliability and adding new features:

When to focus on reliability | When to add new features
Error budget is almost used up | Plenty of error budget left
Many recent system failures | System has been stable
Users complaining about reliability | Users asking for new features

Jason Walker from BigPanda explains that small errors might not need action if the SLO is still being met. But if errors keep happening and use up the budget, teams need to act.

8.4. Keeping systems stable while adding new features

To balance stability and new features:

  1. Make clear rules about what to do when nearing or exceeding error budgets
  2. Have business leaders work with IT to set error budgets that fit business needs
  3. Use AI tools to measure error budgets and spot issues faster
  4. Use reports to track error rates and talk about priorities with product teams

Kit Merker from Nobl9 says: "We want to deliver excellence to our customers. But we also have a business to run. Error budgets help us find the balance between excellence and what's practical."

A 2021 report found that while 50% of teams keep improving their SLOs, only 20% regularly use error budgets. This shows there's room for more teams to benefit from this approach.

9. Always improving SLIs and SLOs in AIOps

9.1. Getting and using feedback

To improve SLIs and SLOs in AIOps:

  1. Ask users about their experience
  2. Talk to IT teams after fixing problems
  3. Meet with business leaders every 3 months

Use this info to make SLIs and SLOs better match what users and the business need.

9.2. Checking progress regularly

Look at SLIs and SLOs often:

Time | What to do
Every week | Quick check on SLOs
Every month | Look at SLI trends
Every 3 months | Big review of SLOs

Use AIOps tools to gather data and make reports for these checks.

9.3. Adjusting to new business needs and tech

Keep SLIs and SLOs up to date:

  • Change SLOs when business goals change
  • Update SLIs when you use new tech
  • Look at what other companies are doing

Be ready to add new SLIs or remove old ones as your AIOps gets better.

9.4. Building a team that likes to improve

Help your team get better at SLIs and SLOs by letting team members suggest ways to improve them.

10. Common mistakes and how to avoid them

10.1. Too many SLIs and SLOs

Many companies track too many metrics, leading to confusion and ineffective monitoring. For example, a New Relic study found that 68% of organizations track more than 100 metrics, with 32% tracking over 500.

To avoid this:

  • Pick 3-5 key SLIs that directly impact users
  • Review and remove unnecessary metrics quarterly
  • Use AIOps tools to automate data collection and analysis

10.2. Focusing on tech metrics over user experience

Teams often prioritize technical indicators over user-centric ones. A 2022 Gartner survey showed that 62% of IT teams primarily track infrastructure metrics rather than user experience metrics.

To fix this:

  • Map SLIs to specific user journeys
  • Conduct regular user surveys to inform SLO targets
  • Use tools like Google's Customer Experience Indicators (CXI) to link tech metrics to user outcomes

10.3. Misaligned SLOs and business goals

Disconnected SLOs can lead to wasted efforts. A 2023 Dynatrace report found that 78% of CIOs struggle to link IT metrics to business outcomes.

Best practices:

  • Review SLOs with business leaders quarterly
  • Create a table linking each SLO to specific business KPIs
  • Adjust SLO targets based on changing business priorities

SLO | Business KPI | Target
API response time | Customer satisfaction | < 200ms for 99.9% of requests
System uptime | Revenue | 99.99% availability
Error rate | Customer retention | < 0.1% errors

10.4. Neglecting the human element in AIOps

While AIOps brings powerful automation, overlooking people can hinder adoption. A 2022 PagerDuty survey revealed that 63% of organizations faced resistance when implementing AIOps due to lack of proper training and unclear roles.

To address this:

  • Provide hands-on training for AIOps tools (e.g., Datadog, Splunk)
  • Set up cross-functional teams for SLO monitoring
  • Create clear AIOps roles and responsibilities

"Successful AIOps implementation requires a balance of technology and human expertise. Teams that invest in both see a 30% faster incident resolution time on average." - Nancy Gohring, Senior Analyst at 451 Research

11. What's next for SLIs and SLOs in AIOps

11.1. New tech and its effects on SLIs and SLOs

Edge computing is changing how we handle SLIs and SLOs in AIOps. A 2024 Gartner report says that by 2025, 75% of enterprise data will be processed at the edge, up from 10% in 2018. This means we need new ways to measure performance across spread-out systems.

Quantum computing might also change SLO calculations. In July 2024, IBM announced a new quantum processor that can solve complex problems 100 times faster than regular computers. This could help adjust SLOs in real-time based on how systems are working.

11.2. Using data to guess and fix SLO issues

AIOps is getting better at predicting problems. In January 2024, Google Cloud launched Operations AI, which can predict SLO issues 30 minutes before they happen, with 92% accuracy. This helps teams fix problems before users notice.

Fixing issues automatically is also becoming more common. In March 2024, Amazon introduced AWS Auto Remediation. This service can change resource allocation or switch to backup systems when it thinks an SLO might be broken. It cuts down manual work by up to 60%.

11.3. Working with other IT management systems

AIOps tools are getting better at working with other IT systems. In June 2024, Splunk updated its AIOps Suite to work directly with over 50 popular DevOps tools. This gives a full view of SLIs across all stages of software development.

Managing SLOs across different platforms is also improving. ServiceNow's latest update includes a dashboard that shows SLO data from multiple cloud providers and on-site systems in one place.

11.4. How AI might change SLIs and SLOs

AI is starting to change how we set up and manage SLIs and SLOs. In August 2024, DeepMind published research showing an AI model that can find the best SLIs on its own, based on how the system works and what users say.

AI is also helping to improve SLOs. In July 2024, Microsoft started testing Azure AI for SLOs. This tool uses AI to constantly adjust SLO targets based on business needs and what the system can do. Early users say it's helped them use resources 25% better while keeping or improving service quality.

AI Advancement | Company | Release Date | Key Benefit
Operations AI | Google Cloud | January 2024 | Predicts SLO issues 30 minutes ahead
Auto Remediation | Amazon AWS | March 2024 | Reduces manual work by 60%
AIOps Suite Update | Splunk | June 2024 | Integrates with 50+ DevOps tools
Azure AI for SLOs | Microsoft | July 2024 (Beta) | Improves resource use by 25%
Autonomous SLI Identification | DeepMind | August 2024 | Finds optimal SLIs without human input

These new technologies are making SLIs and SLOs in AIOps smarter, more predictive, and self-improving, which should lead to more reliable services and happier users.

12. Wrap-up

12.1. Key takeaways for SLIs and SLOs in AIOps

  • Focus on user-centric SLIs that align with business goals
  • Set realistic SLOs and review them regularly
  • Use AIOps tools to automate data collection and monitoring
  • Implement error budgets to balance reliability and innovation
  • Continuously improve SLIs and SLOs based on feedback and data

12.2. Recent developments in SLIs and SLOs for AIOps

Technology | Impact on SLIs/SLOs | Example
Edge computing | Distributed SLI monitoring | Gartner: 75% of enterprise data processed at edge by 2025
Quantum computing | Faster SLO calculations | IBM: New processor solves problems 100x faster than classical computers
AI-powered analytics | Predictive issue resolution | Google Cloud's Operations AI: 92% accuracy in predicting SLO issues 30 minutes ahead
Cross-platform integration | Holistic SLO management | ServiceNow: Dashboard showing SLO data from multiple cloud providers
AI-driven SLI/SLO optimization | Autonomous identification and adjustment | Microsoft Azure AI for SLOs: 25% better resource use while maintaining service quality

12.3. Practical tips for effective SLI and SLO implementation

1. Define user-centric SLIs

  • Map SLIs to specific user journeys
  • Use tools like Google's Customer Experience Indicators (CXI)

2. Set achievable SLOs

  • Start conservative, then gradually increase targets
  • Use historical data to inform SLO levels

3. Leverage AIOps tools

  • Implement real-time monitoring (e.g., Dynatrace's Davis AI for instant SLO violation alerts)
  • Use AI for predictive analysis (e.g., AWS Auto Remediation for proactive issue fixing)

4. Foster continuous improvement

  • Review SLIs and SLOs quarterly with business leaders
  • Create a table linking each SLO to specific business KPIs

SLO | Business KPI | Target
API response time | Customer satisfaction | < 200ms for 99.9% of requests
System uptime | Revenue | 99.99% availability
Error rate | Customer retention | < 0.1% errors

5. Balance stability and innovation

  • Use error budgets to guide feature releases
  • Implement clear rules for actions when nearing or exceeding error budgets

"We want to deliver excellence to our customers. But we also have a business to run. Error budgets help us find the balance between excellence and what's practical." - Kit Merker, Nobl9

FAQs

Is 100% a good SLO?

No, setting an SLO of 100% is not a good idea. Here's why:

Reason | Explanation
Not realistic | Systems change, which can cause failures
Stops new features | Can't add anything new if you're always fixing things
Only fixing problems | Teams can only react to issues, not prevent them
Not good for growth | Focuses too much on keeping things the same

Instead of aiming for 100%, it's better to:

1. Find a balance between keeping things working and making them better

2. Set SLOs that make customers happy but still let you improve your product

3. Use SLOs to help your team get better at fixing and preventing problems

"An SLO of 100% means you only have time to be reactive. You literally cannot do anything other than react to < 100% availability, which is guaranteed to happen."

This quote shows why 100% SLOs are not helpful. They force teams to always be fixing things instead of making their product better.

A good tip is to set your SLO at a level where:

  • Customers are happy with how well your service works
  • Your team can still add new features and fix old problems
  • You can plan for the future instead of always putting out fires
