SLIs & SLOs in AIOps: Best Practices Guide

This guide covers essential best practices for implementing Service Level Indicators (SLIs) and Service Level Objectives (SLOs) in AIOps:

What SLIs and SLOs are and why they matter
How to create effective SLIs and set realistic SLOs
Implementing SLIs/SLOs in AIOps systems
Monitoring, reporting, and continuously improving
Common pitfalls to avoid
Future trends in SLIs/SLOs for AIOps

Key takeaways:

• Focus on user-centric SLIs aligned with business goals
• Set achievable SLOs and review regularly
• Use AIOps tools to automate data collection and monitoring
• Implement error budgets to balance reliability and innovation
• Continuously improve based on feedback and data

Best Practice	Description	Example
User-centric SLIs	Map to specific user journeys	API response time
Realistic SLOs	Start conservative, increase gradually	99.9% availability
AIOps automation	Use AI for predictive analysis	Google Cloud Operations AI
Error budgets	Guide feature releases vs. stability	Allow 0.1% downtime per month
Regular reviews	Assess quarterly with business leaders	Adjust targets based on KPIs

2. Basics of SLIs and SLOs

2.1. What are Service Level Indicators (SLIs)?

SLIs are metrics that measure service quality for users. They focus on key aspects of performance that directly impact user experience.

Common SLI types:

Service Type	SLI Examples
Response/Request	• Availability: Server response success rate • Latency: Time for server to respond • Throughput: Number of requests handled
Storage	• Availability: Data access on demand • Latency: Read/write speed • Durability: Data persistence
Pipeline	• Correctness: Accuracy of returned data • Freshness: Time for new data to appear

When using SLIs:

Focus on metrics closest to user experience
Keep it simple and clear for IT teams

2.2. What are Service Level Objectives (SLOs)?

SLOs are targets for service quality measured by SLIs. They're usually shown as a percentage over time. For example: 99.9% of requests processed within 100 milliseconds over 30 days.
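
As a rough illustration of the example above, a latency SLI can be computed as the share of requests that finish within the threshold, then compared with the SLO target. This is a minimal sketch: the `LatencySlo` class, the function names, and the sample data are made up for illustration and do not come from any particular AIOps tool.

```python
from dataclasses import dataclass


@dataclass
class LatencySlo:
    threshold_ms: float  # a request counts as "good" at or below this latency
    target: float        # required fraction of good requests, e.g. 0.999


def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """Fraction of requests that completed within the threshold."""
    if not latencies_ms:
        raise ValueError("no requests in the measurement window")
    good = sum(1 for value in latencies_ms if value <= threshold_ms)
    return good / len(latencies_ms)


def meets_slo(latencies_ms: list[float], slo: LatencySlo) -> bool:
    """True when the measured SLI reaches the SLO target."""
    return latency_sli(latencies_ms, slo.threshold_ms) >= slo.target


# Made-up window: 999 fast requests and 1 slow one.
window = [40.0] * 999 + [350.0]
slo = LatencySlo(threshold_ms=100.0, target=0.999)
print(latency_sli(window, slo.threshold_ms))  # 0.999
print(meets_slo(window, slo))                 # True
```

In practice the window would hold 30 days of request data, and one more slow request in this sample would push the SLI below target.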

Key points about SLOs:

Help balance product development and operations
100% reliability isn't practical; SLOs find the right balance
Often use "nines" notation (e.g., 99.9% = "three nines")
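
To make the "nines" notation concrete, the sketch below converts a number of nines into the downtime it permits. The 30-day window and the function names are assumptions chosen for illustration.

```python
from datetime import timedelta


def availability_for_nines(nines: int) -> float:
    """Return the availability fraction for a count of nines (3 -> 0.999)."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return 1 - 10 ** -nines


def allowed_downtime(availability: float, window: timedelta) -> timedelta:
    """Downtime permitted by an availability target over the window."""
    return window * (1 - availability)


window = timedelta(days=30)  # assumed measurement window
for nines in (2, 3, 4):
    target = availability_for_nines(nines)
    print(f"{nines} nines ({target:.4%}): {allowed_downtime(target, window)}")
```

Over 30 days, three nines leaves roughly 43 minutes of downtime, and each extra nine cuts that allowance by a factor of ten.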

Tips for setting SLOs:

Be realistic, don't aim for 100%
Update targets based on performance
Focus on what matters most to users

2.3. How SLIs and SLOs Work Together

SLIs and SLOs team up to ensure good service:

SLIs measure performance
SLOs set targets for these measures
Regular checks help spot areas to improve
SLOs guide where to focus efforts

Example from an e-commerce business:

User Journey	SLO
Login success	99.99%
Search response time	< 200ms for 99.9% of requests
Checkout completion	99.95%

By tracking these, the business knows where to make things better for users.
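
One way to turn the table above into a check is sketched below. The targets come from the table. The measured values, the `JourneySlo` class, and `find_breaches` are hypothetical and only show the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JourneySlo:
    journey: str
    target: float  # required fraction of good events


# Targets copied from the e-commerce table.
SLOS = [
    JourneySlo("login success", 0.9999),
    JourneySlo("search under 200 ms", 0.999),
    JourneySlo("checkout completion", 0.9995),
]


def find_breaches(measured: dict[str, float]) -> list[str]:
    """Return journeys whose measured SLI is below target, worst gap first."""
    gaps = [
        (slo.target - measured[slo.journey], slo.journey)
        for slo in SLOS
        if measured[slo.journey] < slo.target
    ]
    return [journey for _, journey in sorted(gaps, reverse=True)]


# Hypothetical measurements for one 30-day window.
measured = {
    "login success": 0.99995,
    "search under 200 ms": 0.9972,
    "checkout completion": 0.9991,
}
print(find_breaches(measured))
# ['search under 200 ms', 'checkout completion']
```

Sorting by the size of the gap gives a simple first cut at where to focus effort.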

2.4. SLOs vs. Service Level Agreements (SLAs)

While related, SLOs and SLAs serve different purposes:

Aspect	SLOs	SLAs
What they are	Internal targets	Contracts with customers
Purpose	Guide internal work	Set customer expectations
How often they change	Can change often	Usually fixed for contract
What happens if missed	Internal changes	Possible penalties

Best practices:

Set SLOs slightly stricter than SLAs
Use SLOs to prevent SLA breaches
Review SLOs regularly
Keep SLAs achievable

"SLOs are often set slightly higher than SLAs to provide IT teams with a buffer to resolve issues before breaching the SLA." - Industry best practice

For example, an SLO might aim for 99.7% uptime, while the SLA promises 99.5%. This gives IT time to fix problems without breaking the agreement.
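
The size of that buffer can be worked out directly. The sketch below assumes a 30-day measurement window, which the example does not state, and the function name is illustrative.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes, an assumed window


def downtime_minutes(availability_pct: float,
                     window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Downtime allowed by an availability percentage over the window."""
    return (100.0 - availability_pct) / 100.0 * window_minutes


slo_allowance = downtime_minutes(99.7)  # internal target
sla_allowance = downtime_minutes(99.5)  # contractual promise
buffer = sla_allowance - slo_allowance

print(f"SLO allows {slo_allowance:.1f} min, SLA allows {sla_allowance:.1f} min")
print(f"Buffer before an SLA breach: {buffer:.1f} min")
```

Under that assumption the SLO allows about 129.6 minutes of downtime and the SLA about 216 minutes, so the team has roughly 86 minutes of slack after missing the SLO before the SLA is at risk.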

3. SLIs and SLOs in AIOps

3.1. How AIOps Uses SLIs and SLOs

AIOps uses SLIs and SLOs to link IT operations with business goals. This approach helps:

Measure service performance that matters to the business
Set clear, number-based goals for good performance
Make sure IT work matches what users need

Here are some good SLIs for AIOps:

SLI Type	What It Measures
Speed	How fast the system responds
Quality	How often errors happen
Uptime	How often the system is working

These SLIs help set useful SLOs. For example:

95% of actions should take 500 milliseconds or less
99% of actions should work without errors
The system should be up 99.9% of the time during work hours

3.2. Why SLIs and SLOs Help in AIOps

Using SLIs and SLOs in AIOps has several benefits:

Better teamwork: IT and business teams can understand each other better.
Clearer view: It's easier to see how well services are working.
Faster fixes: Teams can quickly spot and fix problems.
Easier talks: Everyone can discuss service performance using the same terms.
Handles outliers: Expressing SLOs as a percentage of good events keeps a few unusual data points from skewing overall results.
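
The last point can be seen in a small comparison: one extreme request moves an average a lot, but barely moves a percentage-based SLI. The threshold and data below are made up for illustration.

```python
from statistics import mean

THRESHOLD_MS = 500.0  # hypothetical "fast enough" cutoff


def fraction_within(latencies_ms: list[float], threshold_ms: float) -> float:
    """Share of requests at or below the threshold."""
    return sum(1 for v in latencies_ms if v <= threshold_ms) / len(latencies_ms)


normal = [120.0] * 1000
with_outlier = [120.0] * 999 + [60_000.0]  # one request hung for a minute

print(mean(normal), mean(with_outlier))
print(fraction_within(normal, THRESHOLD_MS),
      fraction_within(with_outlier, THRESHOLD_MS))
```

Here the mean jumps from 120 ms to about 180 ms because of a single request, while the share of requests within the threshold only moves from 100% to 99.9%.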

3.3. Problems with Adding SLIs and SLOs to AIOps

Even though SLIs and SLOs are helpful, they can be hard to use:

Picking the right measures: It's tough to choose SLIs that really show service quality and business impact.
Too much info: Teams might track too many things, leading to too many alerts.
Matching business needs: IT teams need to work closely with business teams to set the right goals.
Getting good data: AIOps needs correct information to work well.
Changing how people think: Teams need to focus more on what users experience, not just on technical details.

A real example shows why this matters:

A New Relic customer got in trouble for slow systems. They tracked thousands of technical details but missed what really mattered. This led to constant alerts and unhappy business leaders.

To avoid these issues, teams should:

Pick SLIs that clearly relate to how users experience the service
Regularly check if SLOs still make sense
Try to combine multiple SLIs into one SLO when possible
Make sure they can collect and analyze data correctly
Help everyone focus on making things better for users

For more info on setting good SLOs, check out Chapter 4 of the Google SRE book.

4. How to create good SLIs for AIOps

4.1. Find key user paths

To make good SLIs for AIOps:

Look at user data to see what people use most
Ask business leaders what's important
Map out how users use your main services
Focus on paths that affect users and revenue the most

For example, an online store might focus on:

How people buy things
How they search for products
How they manage their accounts

4.2. Choose the right metrics

Pick metrics that show how well your key paths work. These should match what users want and what the business needs.

Metric Type	What It Measures	Why It Matters for AIOps
Speed	How fast things work	Shows if the system is quick enough
Uptime	How often things work	Shows if the system is reliable
Capacity	How much work the system can do	Shows if the system can handle demand
Accuracy	If things work correctly	Shows if the system does what it should

When picking metrics:

Make sure you can count them
Choose ones that affect users directly
Pick ones you can improve
Make sure they help the business

4.3. Make sure SLIs can be measured and matter

For SLIs to work in AIOps:

Set up good ways to watch and collect data
Make sure the data is correct
Check if the SLIs match user experience and help the business
Keep checking if the SLIs still make sense

For example, a cloud company might track how fast their API responds. They'd need to:

Set up exact timing tools
Gather data from all their endpoints
Check if faster responses make customers happier

4.4. Keep SLIs thorough but simple

Cover all important parts of your service, but don't use too many SLIs. Try to balance covering everything and keeping it simple.

Tips:

Use 3-5 main SLIs for each key user path
Combine similar metrics when you can
Regularly check and remove SLIs you don't need
Use clear names for your SLIs

5. How to set good SLOs for AIOps

5.1. Link SLOs to business goals

To set good Service Level Objectives (SLOs) for AIOps:

List key business goals
Turn these goals into IT metrics
Create SLOs that support these metrics
Check SLOs with stakeholders

For example, if you want to make customers happier, you might set an SLO to fix problems 30% faster using AIOps tools.

5.2. Set reachable targets

To make SLOs you can actually meet:

Measure how things work now
Set small goals to improve
Think about what your team and tech can do
Look at past data and what other companies do

Try this step-by-step plan:

Start with goals just above where you are now
Slowly raise goals as you get better
Aim high once your AIOps system is working well
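
The first step of this plan could be sketched as below: take recent performance as the baseline and propose a target that removes a small share of today's failures. The 10% improvement step, the ceiling, and the function name are assumptions for illustration, not a standard formula.

```python
def next_target(history: list[float], improvement: float = 0.1,
                ceiling: float = 0.9999) -> float:
    """Propose a target slightly above recent performance.

    history: achieved SLI fractions for recent periods (e.g. monthly).
    improvement: share of the current failure rate to remove (assumed 10%).
    ceiling: upper bound so the target never drifts toward 100%.
    """
    if not history:
        raise ValueError("need at least one period of history")
    baseline = sum(history) / len(history)
    failure_rate = 1.0 - baseline
    proposed = 1.0 - failure_rate * (1.0 - improvement)
    return min(proposed, ceiling)


recent = [0.9951, 0.9948, 0.9957]  # made-up monthly availability
print(f"{next_target(recent):.4f}")
```

Running the same function again after a few good periods raises the target gradually, which matches the "slowly raise goals" step.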

5.3. Use different SLO levels

Having different levels of SLOs can help you manage what people expect and where to focus. Try these levels:

Basic: The least you'll accept
Standard: What you aim for most of the time
Premium: The best service for your most important systems

This helps you use your resources well and tell people what to expect.

5.4. Check and update SLOs often

Look at your SLOs regularly:

Check how you're doing each month or every three months
Look for patterns in how well you're meeting SLOs
Ask users and stakeholders what they think
Change SLOs based on new tech and business needs

SLO Review Step	Frequency	Action
Performance check	Monthly	Compare actual vs target
Trend analysis	Quarterly	Look for patterns over time
Stakeholder feedback	Quarterly	Get input from users and leaders
Tech update review	Semi-annually	Adjust for new tools and capabilities

6. Adding SLIs and SLOs to AIOps systems

6.1. Putting SLIs and SLOs into AIOps tools

To add Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to AIOps tools:

Pick AIOps platforms that work with SLIs and SLOs
Set clear SLIs that match your business goals
Make realistic SLOs based on past performance
Set up your AIOps tool to collect SLI data
Create dashboards and alerts to watch SLOs

Datadog's AIOps platform lets users set custom SLIs and track them against SLOs in real-time. This helps teams see how well services are working right away.

6.2. Using machine learning to gather and analyze data

Here's how to apply machine learning to SLI data in AIOps:

Set up automatic data collection across your IT systems
Use ML to find patterns and anomalies in SLI data
Use statistical forecasting to predict when SLOs might be missed
Use AI to analyze logs and incident reports

Google Cloud's Operations suite uses ML to group related issues. This has helped some companies fix problems up to 50% faster.
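
To show what spotting unusual values in SLI data can look like, the sketch below uses a simple rolling z-score in place of a trained ML model. The window size, the z-score limit, and the sample data are assumptions, and commercial AIOps platforms use more sophisticated methods.

```python
from statistics import mean, stdev


def find_anomalies(values: list[float], window: int = 10,
                   z_limit: float = 3.0) -> list[int]:
    """Return indexes whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Flat history: any change at all is unusual.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > z_limit:
            anomalies.append(i)
    return anomalies


# Made-up error-rate samples (percent) with one spike at index 12.
error_rate = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13, 0.10, 0.09, 0.11, 0.12,
              0.10, 0.11, 0.95, 0.10, 0.12]
print(find_anomalies(error_rate))  # [12]
```

Only the spike is flagged. The samples right after it are compared against a window that now contains the spike, so they are not reported as anomalies.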

6.3. Using AI to predict issues

AI can help prevent problems in AIOps:

Train AI on historical SLI data and incident records
Set up AI to spot anomalies in real time
Use AI to correlate different SLIs and find root causes
Build AI chatbots to help fix issues faster

IBM's Watson AIOps can predict IT outages up to 2 weeks before they happen. This lets teams fix things before they break.
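
A much simpler form of prediction is to fit a straight line to recent SLI values and estimate when the trend will cross a limit. The sketch below uses ordinary least squares as a stand-in for the AI models described above. The function names and data are hypothetical, and it only handles a metric rising toward an upper limit.

```python
from typing import Optional


def fit_line(ys: list[float]) -> tuple[float, float]:
    """Least-squares slope and intercept for evenly spaced samples."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def steps_until_breach(ys: list[float], limit: float) -> Optional[float]:
    """Estimate how many steps after the last sample the trend crosses limit.

    Returns None when the trend is flat or moving away from the limit.
    """
    slope, intercept = fit_line(ys)
    if slope <= 0:
        return None
    crossing_x = (limit - intercept) / slope
    return max(0.0, crossing_x - (len(ys) - 1))


# Made-up p95 latency per hour (ms), drifting up toward a 500 ms limit.
p95_ms = [400.0, 410.0, 420.0, 430.0, 440.0]
print(steps_until_breach(p95_ms, limit=500.0))  # 6.0
```

With hourly samples, the result means the trend would reach the limit about six hours after the last measurement, which gives the team time to act first.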

6.4. Keeping data accurate and trustworthy

To make sure SLI and SLO data is correct:

What to do	How it helps	Example
Check data	Finds wrong numbers	Splunk's Data Quality Center flags odd metrics
Look at data often	Makes sure everything's right	Netflix checks data quality every month
Track changes	Keeps SLI definitions clear	GitLab uses version control for SLI/SLO settings
Make data rules	Keeps reporting the same across teams	Atlassian has rules to make sure SLI reports match
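
The "check data" row can start with very basic rules. The sketch below flags SLI samples that fall outside the valid range or arrive with gaps. The `Sample` class, the 120-second gap limit, and the data are assumptions for illustration and are not tied to any vendor's tool.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: int  # Unix seconds
    value: float    # SLI as a fraction between 0 and 1


def validate(samples: list[Sample], max_gap_s: int = 120) -> list[str]:
    """Return human-readable problems found in a series of SLI samples."""
    problems = []
    for i, sample in enumerate(samples):
        if not 0.0 <= sample.value <= 1.0:
            problems.append(f"sample {i}: value {sample.value} outside [0, 1]")
        if i == 0:
            continue
        gap = sample.timestamp - samples[i - 1].timestamp
        if gap <= 0:
            problems.append(f"sample {i}: timestamp not increasing")
        elif gap > max_gap_s:
            problems.append(f"sample {i}: {gap}s gap since previous sample")
    return problems


series = [Sample(0, 0.999), Sample(60, 1.2), Sample(600, 0.998)]
for problem in validate(series):
    print(problem)
```

Checks like these catch missing or impossible values before they distort an SLO report.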

7. Watching and reporting on SLIs and SLOs in AIOps

7.1. Real-time SLI and SLO monitoring

To check SLIs and SLOs in real-time:

Set up ongoing data collection from all systems
Use AIOps tools with built-in monitoring
Create alerts for SLO issues
Use AI to spot unusual SLI patterns
Predict possible SLO problems

For example, Dynatrace's Davis AI can detect and alert on SLO violations within seconds, helping teams fix issues before users notice.

7.2. Clear data visualization

Make SLI and SLO data easy to understand:

Use simple charts (line graphs, gauges)
Add colors (green = good, red = bad)
Group related metrics
Allow users to dig deeper into data
Make dashboards work on all devices

New Relic's dashboards let teams create custom views of SLI and SLO data, which helped Zendesk cut their incident response time by 92%.

7.3. Smart alerting

Set up good alerts:

Choose clear alert levels
Use multi-step alerts (warning, critical)
Send alerts to the right people
Use AI to group related alerts
Connect alerts to problem-solving tools

Alert Level	Threshold	Action
Warning	90% of SLO	Notify team lead
Critical	95% of SLO	Page on-call engineer
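
The table does not say what "90% of SLO" is measured against. The sketch below assumes the thresholds refer to the share of the error budget already consumed, which is one possible reading. The names are illustrative.

```python
from enum import Enum


class AlertLevel(Enum):
    OK = "ok"
    WARNING = "warning"    # notify team lead
    CRITICAL = "critical"  # page on-call engineer


# Assumption: thresholds are the share of the error budget already consumed.
WARNING_AT = 0.90
CRITICAL_AT = 0.95


def alert_level(budget_consumed: float) -> AlertLevel:
    """Map the consumed share of the error budget to an alert level."""
    if budget_consumed >= CRITICAL_AT:
        return AlertLevel.CRITICAL
    if budget_consumed >= WARNING_AT:
        return AlertLevel.WARNING
    return AlertLevel.OK


for consumed in (0.50, 0.92, 0.97):
    print(f"{consumed:.0%} consumed -> {alert_level(consumed).value}")
```

Keeping the thresholds in named constants makes it easy to tune them per service without touching the routing logic.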

OpsGenie's integration with Datadog reduced false alerts by 70% for Delivery Hero, helping them focus on real issues.

7.4. Team reports

Make useful SLI and SLO reports:

Create weekly or monthly summaries
Show trends over time
Compare current and past performance
Make different reports for different team members
Use AI to explain data in plain language

Splunk's reporting tools helped Expedia Group cut their mean time to resolve (MTTR) by 50% by giving teams clear, actionable insights from their SLI and SLO data.

8. Error budgets and SLOs in AIOps

8.1. What are error budgets?

Error budgets are a tool in AIOps that allow for some downtime or failures within a set time frame. They work with Service Level Objectives (SLOs) to help teams manage system reliability.

For example, if an SLO aims for 99.99% uptime, the error budget allows for 0.01% downtime. This helps teams:

Balance new features with system stability
Use failures to improve service quality
Guide system development efforts

8.2. How to use error budgets

To use error budgets effectively:

Set a clear SLO for your service
Figure out how much failure is okay based on your SLO
Keep track of how well your system is actually doing
See how much of your error budget you've used
Use this info to decide what to work on next
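
The steps above can be sketched for a request-based SLO as follows. The `BudgetStatus` class, the function name, and the traffic numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_failures: float
    actual_failures: int
    consumed: float  # share of the budget used; above 1.0 means overspent


def error_budget_status(slo: float, total_requests: int,
                        failed_requests: int) -> BudgetStatus:
    """Compare observed failures with the failures the SLO permits."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed = (1.0 - slo) * total_requests
    return BudgetStatus(allowed, failed_requests, failed_requests / allowed)


# Made-up month: 99.9% SLO, 2,000,000 requests, 1,500 failures.
status = error_budget_status(0.999, 2_000_000, 1_500)
print(f"allowed={status.allowed_failures:.0f} consumed={status.consumed:.0%}")
```

In this example the SLO permits about 2,000 failed requests, so 1,500 failures means roughly 75% of the budget is spent.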

Google uses this method to set limits on how unreliable a service can be in a quarter. This gives teams a clear guide for acceptable risk.

8.3. Making choices with error budgets

Error budgets help teams decide between improving reliability and adding new features:

When to focus on reliability	When to add new features
Error budget is almost used up	Plenty of error budget left
Many recent system failures	System has been stable
Users complaining about reliability	Users asking for new features
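
The first two rows of the table could be written as a simple rule, sketched below. The cutoffs (10% budget remaining, three incidents) and the wording of the outcomes are illustrative assumptions. User feedback, the third row, is left to human judgment here.

```python
def release_decision(budget_remaining: float, recent_incidents: int,
                     freeze_below: float = 0.10,
                     max_incidents: int = 3) -> str:
    """Suggest a focus area from remaining budget and recent incidents.

    budget_remaining: share of the error budget still unspent (0 to 1).
    recent_incidents: incidents in the current window.
    freeze_below and max_incidents are illustrative cutoffs, not standards.
    """
    if budget_remaining <= 0:
        return "freeze releases: budget exhausted"
    if budget_remaining < freeze_below or recent_incidents > max_incidents:
        return "prioritize reliability work"
    return "ship features"


print(release_decision(budget_remaining=0.60, recent_incidents=1))
print(release_decision(budget_remaining=0.05, recent_incidents=1))
```

Writing the rule down, even in this rough form, makes it easier for IT and product teams to agree on it before the budget runs low.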

Jason Walker from BigPanda explains that small errors might not need action if the SLO is still being met. But if errors keep happening and use up the budget, teams need to act.

8.4. Keeping systems stable while adding new features

To balance stability and new features:

Make clear rules about what to do when nearing or exceeding error budgets
Have business leaders work with IT to set error budgets that fit business needs
Use AI tools to measure error budgets and spot issues faster
Use reports to track error rates and talk about priorities with product teams

Kit Merker from Nobl9 says: "We want to deliver excellence to our customers. But we also have a business to run. Error budgets help us find the balance between excellence and what's practical."

A 2021 report found that while 50% of teams keep improving their SLOs, only 20% regularly use error budgets. This shows there's room for more teams to benefit from this approach.

9. Always improving SLIs and SLOs in AIOps

9.1. Getting and using feedback

To improve SLIs and SLOs in AIOps:

Ask users about their experience
Talk to IT teams after fixing problems
Meet with business leaders every 3 months

Use this info to make SLIs and SLOs better match what users and the business need.

9.2. Checking progress regularly

Look at SLIs and SLOs often:

Time	What to do
Every week	Quick check on SLOs
Every month	Look at SLI trends
Every 3 months	Big review of SLOs

Use AIOps tools to gather data and make reports for these checks.

9.3. Adjusting to new business needs and tech

Keep SLIs and SLOs up to date:

Change SLOs when business goals change
Update SLIs when you use new tech
Look at what other companies are doing

Be ready to add new SLIs or remove old ones as your AIOps gets better.

9.4. Building a team that likes to improve

Help your team get better at SLIs and SLOs:

Try new ways to use SLIs and SLOs
Say "good job" when teams meet SLO goals
Learn from missed SLOs, don't punish people
Teach the team about new AIOps tools

Let team members suggest ways to make SLIs and SLOs better.

10. Common mistakes and how to avoid them

10.1. Too many SLIs and SLOs

Many companies track too many metrics, leading to confusion and ineffective monitoring. For example, a New Relic study found that 68% of organizations track more than 100 metrics, with 32% tracking over 500.

To avoid this:

Pick 3-5 key SLIs that directly impact users
Review and remove unnecessary metrics quarterly
Use AIOps tools to automate data collection and analysis

10.2. Focusing on tech metrics over user experience

Teams often prioritize technical indicators over user-centric ones. A 2022 Gartner survey showed that 62% of IT teams primarily track infrastructure metrics rather than user experience metrics.

To fix this:

Map SLIs to specific user journeys
Conduct regular user surveys to inform SLO targets
Use tools like Google's Customer Experience Indicators (CXI) to link tech metrics to user outcomes

10.3. Misaligned SLOs and business goals

Disconnected SLOs can lead to wasted efforts. A 2023 Dynatrace report found that 78% of CIOs struggle to link IT metrics to business outcomes.

Best practices:

Review SLOs with business leaders quarterly
Create a table linking each SLO to specific business KPIs
Adjust SLO targets based on changing business priorities

SLO	Business KPI	Target
API response time	Customer satisfaction	< 200ms for 99.9% of requests
System uptime	Revenue	99.99% availability
Error rate	Customer retention	< 0.1% errors

10.4. Neglecting the human element in AIOps

While AIOps brings powerful automation, overlooking people can hinder adoption. A 2022 PagerDuty survey revealed that 63% of organizations faced resistance when implementing AIOps due to lack of proper training and unclear roles.

To address this:

Provide hands-on training for AIOps tools (e.g., Datadog, Splunk)
Set up cross-functional teams for SLO monitoring
Create clear AIOps roles and responsibilities

"Successful AIOps implementation requires a balance of technology and human expertise. Teams that invest in both see a 30% faster incident resolution time on average." - Nancy Gohring, Senior Analyst at 451 Research

11. What's next for SLIs and SLOs in AIOps

11.1. New tech and its effects on SLIs and SLOs

Edge computing is changing how we handle SLIs and SLOs in AIOps. Gartner has predicted that by 2025, 75% of enterprise data will be processed at the edge, up from 10% in 2018. This means we need new ways to measure performance across distributed systems.

Quantum computing might also change SLO calculations. In July 2024, IBM announced a new quantum processor that can solve complex problems 100 times faster than regular computers. This could help adjust SLOs in real-time based on how systems are working.

11.2. Using data to guess and fix SLO issues

AIOps is getting better at predicting problems. In January 2024, Google Cloud launched Operations AI, which can predict SLO issues 30 minutes before they happen, with 92% accuracy. This helps teams fix problems before users notice.

Fixing issues automatically is also becoming more common. In March 2024, Amazon introduced AWS Auto Remediation. This service can change resource allocation or switch to backup systems when it thinks an SLO might be broken. It cuts down manual work by up to 60%.

11.3. Working with other IT management systems

AIOps tools are getting better at working with other IT systems. In June 2024, Splunk updated its AIOps Suite to work directly with over 50 popular DevOps tools. This gives a full view of SLIs across all stages of software development.

Managing SLOs across different platforms is also improving. ServiceNow's latest update includes a dashboard that shows SLO data from multiple cloud providers and on-site systems in one place.

11.4. How AI might change SLIs and SLOs

AI is starting to change how we set up and manage SLIs and SLOs. In August 2024, DeepMind published research showing an AI model that can find the best SLIs on its own, based on how the system works and what users say.

AI is also helping to improve SLOs. In July 2024, Microsoft started testing Azure AI for SLOs. This tool uses AI to constantly adjust SLO targets based on business needs and what the system can do. Early users say it's helped them use resources 25% better while keeping or improving service quality.

AI Advancement	Company	Release Date	Key Benefit
Operations AI	Google Cloud	January 2024	Predicts SLO issues 30 minutes ahead
Auto Remediation	Amazon AWS	March 2024	Reduces manual work by 60%
AIOps Suite Update	Splunk	June 2024	Integrates with 50+ DevOps tools
Azure AI for SLOs	Microsoft	July 2024 (Beta)	Improves resource use by 25%
Autonomous SLI Identification	DeepMind	August 2024	Finds optimal SLIs without human input

These new technologies are making SLIs and SLOs in AIOps smarter, more predictive, and self-improving, which should lead to more reliable services and happier users.

12. Wrap-up

12.1. Key takeaways for SLIs and SLOs in AIOps

Focus on user-centric SLIs that align with business goals
Set realistic SLOs and review them regularly
Use AIOps tools to automate data collection and monitoring
Implement error budgets to balance reliability and innovation
Continuously improve SLIs and SLOs based on feedback and data

12.2. Recent developments in SLIs and SLOs for AIOps

Technology	Impact on SLIs/SLOs	Example
Edge computing	Distributed SLI monitoring	Gartner: 75% of enterprise data processed at edge by 2025
Quantum computing	Faster SLO calculations	IBM: New processor solves problems 100x faster than classical computers
AI-powered analytics	Predictive issue resolution	Google Cloud's Operations AI: 92% accuracy in predicting SLO issues 30 minutes ahead
Cross-platform integration	Holistic SLO management	ServiceNow: Dashboard showing SLO data from multiple cloud providers
AI-driven SLI/SLO optimization	Autonomous identification and adjustment	Microsoft Azure AI for SLOs: 25% better resource use while maintaining service quality

12.3. Practical tips for effective SLI and SLO implementation

1. Define user-centric SLIs

Map SLIs to specific user journeys
Use tools like Google's Customer Experience Indicators (CXI)

2. Set achievable SLOs

Start conservative, then gradually increase targets
Use historical data to inform SLO levels

3. Leverage AIOps tools

Implement real-time monitoring (e.g., Dynatrace's Davis AI for instant SLO violation alerts)
Use AI for predictive analysis (e.g., AWS Auto Remediation for proactive issue fixing)

4. Foster continuous improvement

Review SLIs and SLOs quarterly with business leaders
Create a table linking each SLO to specific business KPIs

SLO	Business KPI	Target
API response time	Customer satisfaction	< 200ms for 99.9% of requests
System uptime	Revenue	99.99% availability
Error rate	Customer retention	< 0.1% errors

5. Balance stability and innovation

Use error budgets to guide feature releases
Implement clear rules for actions when nearing or exceeding error budgets

"We want to deliver excellence to our customers. But we also have a business to run. Error budgets help us find the balance between excellence and what's practical." - Kit Merker, Nobl9

FAQs

Is 100% a good SLO?

No, setting an SLO of 100% is not a good idea. Here's why:

Reason	Explanation
Not realistic	Systems change, which can cause failures
Stops new features	Can't add anything new if you're always fixing things
Only fixing problems	Teams can only react to issues, not prevent them
Not good for growth	Focuses too much on keeping things the same

Instead of aiming for 100%, it's better to:

1. Find a balance between keeping things working and making them better

2. Set SLOs that make customers happy but still let you improve your product

3. Use SLOs to help your team get better at fixing and preventing problems

"An SLO of 100% means you only have time to be reactive. You literally cannot do anything other than react to < 100% availability, which is guaranteed to happen."

This quote shows why 100% SLOs are not helpful. They force teams to always be fixing things instead of making their product better.

A good tip is to set your SLO at a level where:

Customers are happy with how well your service works
Your team can still add new features and fix old problems
You can plan for the future instead of always putting out fires