Adding automated anomaly detection to Datadog

Anomaly detection identifies unusual data patterns or behaviors in systems and applications, allowing you to detect potential issues early before they cause major problems.

Benefits:

Early issue detection
Reduced downtime
Streamlined incident response
Deeper insights into system behavior and performance

How it Works:

Anomaly detection analyzes a metric's historical data to learn its normal behavior patterns. It uses algorithms that account for seasonality and trends to determine the expected value range. When the metric's value falls outside this range, it is flagged as an anomaly.

Normal Behavior	Anomaly
Metric values follow the expected pattern or trend	Metric values significantly deviate from the normal pattern or trend
Consistent with historical data and baselines	Inconsistent with historical data and baselines
Within the defined range or threshold	Outside the defined range or threshold
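
To make the idea of an "expected value range" concrete, here is a minimal Python sketch. It is not Datadog's algorithm: it assumes the range is simply the historical mean plus or minus a multiple of the standard deviation, and the function names and sample values are made up for illustration.

```python
from statistics import mean, stdev


def expected_range(history, bounds=2.0):
    """Return (low, high) as mean +/- bounds * sample standard deviation."""
    center = mean(history)
    spread = stdev(history)
    return center - bounds * spread, center + bounds * spread


def is_anomaly(value, history, bounds=2.0):
    """True when `value` falls outside the range derived from `history`."""
    low, high = expected_range(history, bounds)
    return value < low or value > high


# Hypothetical metric samples that hover around 100.
history = [100, 102, 98, 101, 99, 100, 103, 97]

print(expected_range(history))
print(is_anomaly(101, history))  # close to the historical values
print(is_anomaly(150, history))  # far from the historical values
```

A real detector also has to handle seasonality and trends, which a plain mean and standard deviation ignore.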

Getting Started:

Learn about Datadog's anomaly detection algorithms, configuration options, and integration with other features.
Set up custom monitors to detect anomalies in your metrics, logs, and traces.
Automate incident response to reduce mean time to detect (MTTD) and mean time to resolve (MTTR).
Continuously refine your setup to improve detection accuracy and reduce false positives.

Prerequisites

Datadog Account

First, you'll need an active Datadog account to access the monitoring tools. If you don't have one yet, sign up for a Datadog account.

Set Up Metrics

To enable anomaly detection, you must have metrics sending data to Datadog. These metrics can come from your applications, infrastructure, or services. Make sure you have the right integrations and configurations to collect and send metric data to Datadog.

Historical Data

Anomaly detection works best with enough historical data for your metrics. Datadog's algorithms use this data to learn normal patterns and trends. Aim for at least 2-3 weeks of historical data; more is better, because a longer history lets the algorithms see each seasonal cycle several times and identify anomalies more accurately.

Requirement	Description
Datadog Account	You need an active Datadog account to access monitoring features.
Metric Setup	Configure your applications, infrastructure, and services to send metric data to Datadog.
Historical Data	Provide at least 2-3 weeks of historical metric data for accurate anomaly detection. More data is better.

Once you have these requirements in place, you can enable automated anomaly detection in Datadog. This will help you quickly identify and address potential issues, improving the reliability of your systems.

Understanding anomaly detection in Datadog

Anomaly detection in Datadog helps you find unknown issues in your applications and infrastructure. It does this by analyzing metrics, traces, and logs to identify unusual data points or patterns. This feature is useful for discovering problems you didn't know existed, making it an essential tool for monitoring.

Anomaly detection algorithms

Datadog offers three main anomaly detection algorithms:

Algorithm	Description
Basic	Uses a simple rolling quantile calculation to determine the expected value range. It adjusts quickly to changes but doesn't account for seasonality or long-term trends.
Agile	A robust version of the SARIMA (seasonal autoregressive integrated moving average) algorithm. It's sensitive to seasonality and can quickly adjust to level shifts in the metric.
Robust	A seasonal-trend decomposition algorithm that works best for seasonal metrics with a relatively stable baseline. Its predictions are very stable, so long-lasting anomalies don't unduly influence the forecast.
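
The rolling-quantile idea behind the Basic algorithm can be sketched in a few lines of Python. This is an illustration of the general technique only, not Datadog's implementation: the window size, the choice of the 25th and 75th percentiles, and the way `bounds` widens the band are all assumptions.

```python
def quantile(sorted_vals, q):
    """Nearest-rank style quantile of a pre-sorted list, with 0 <= q <= 1."""
    idx = round(q * (len(sorted_vals) - 1))
    return sorted_vals[idx]


def rolling_band(values, window=5, bounds=2.0):
    """For each point after the first `window` points, return
    (index, low, high, value), where the band comes from quantiles
    of the preceding `window` points widened by `bounds`."""
    out = []
    for i in range(window, len(values)):
        recent = sorted(values[i - window:i])
        q_lo, q_hi = quantile(recent, 0.25), quantile(recent, 0.75)
        margin = bounds * (q_hi - q_lo)
        out.append((i, q_lo - margin, q_hi + margin, values[i]))
    return out


# Hypothetical series with a single spike at index 6.
series = [10, 12, 11, 13, 10, 12, 30, 11, 13, 10]
flagged = [i for i, low, high, v in rolling_band(series) if not low <= v <= high]
print(flagged)
```

Because the band depends only on the most recent window, it adapts quickly to changes but has no memory of daily or weekly cycles, which matches the trade-off described for Basic in the table.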

Seasonality and trends

Datadog's anomaly detection algorithms consider seasonality and trends in metrics. For example, a metric may peak during business hours on weekdays and drop at night, with a lull on weekends. The algorithm can accurately forecast the metric's value, including peaks, because the pattern repeats weekly.
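
One simple way to see why a repeating weekly pattern makes forecasting possible is to average the values seen at the same point in earlier cycles. The sketch below does that in Python. It is a deliberately naive stand-in for seasonal algorithms, and the daily values are invented for the example.

```python
from statistics import mean


def seasonal_forecast(history, period, steps_ahead=1):
    """Forecast a future point by averaging the values observed at the
    same phase of earlier cycles (for example, the same day of the week)."""
    target_index = len(history) + steps_ahead - 1
    phase = target_index % period
    same_phase = history[phase::period]
    return mean(same_phase)


# Three weeks of hypothetical daily values: busy weekdays, quiet weekends.
history = [
    100, 110, 105, 108, 102, 20, 25,
    104, 112, 103, 110, 98, 22, 27,
    102, 108, 107, 106, 100, 18, 23,
]

print(seasonal_forecast(history, period=7))                 # next Monday
print(seasonal_forecast(history, period=7, steps_ahead=6))  # next Saturday
```

The forecast for the coming Saturday stays low even though the most recent weekday values are high, which is the behavior a seasonality-aware algorithm needs.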

Configuration options

You can tune anomaly detection in Datadog with the following options:

The bounds parameter in the query editor sets the tolerance of the algorithm, determining the width of the "normal" gray band. Set bounds to 2 or 3 to capture most "normal" points.
Adjust alert windows and seasonality settings to fine-tune anomaly detection for your needs.
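
These options end up as arguments to the anomalies() function in a monitor query. The Python helper below assembles such a query string. The helper itself is hypothetical, and the argument names (direction, interval, alert_window, count_default_zero, seasonality) reflect the query format as I understand it, so check them against the current Datadog documentation before relying on them.

```python
ALGORITHMS = {"basic", "agile", "robust"}


def anomaly_query(metric_query, algorithm="agile", bounds=2, *,
                  eval_window="last_4h", alert_window="last_15m",
                  direction="both", interval=60, seasonality=None):
    """Build an anomaly monitor query string from its parts."""
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unknown algorithm: {algorithm!r}")
    args = [
        metric_query,
        f"'{algorithm}'",
        str(bounds),
        f"direction='{direction}'",
        f"interval={interval}",
        f"alert_window='{alert_window}'",
        "count_default_zero='true'",
    ]
    # Seasonality is only meaningful for the seasonal algorithms.
    if seasonality is not None:
        args.append(f"seasonality='{seasonality}'")
    return f"avg({eval_window}):anomalies({', '.join(args)}) >= 1"


print(anomaly_query("avg:system.cpu.user{*}", "agile", 2, seasonality="weekly"))
```

Keeping the query in a helper like this makes it easier to try different bounds or algorithms for the same metric without retyping the whole expression.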

Setting up an anomaly detection monitor

Choosing metrics

When setting up an anomaly detection monitor, choose metrics that are important for your application or infrastructure. Look for metrics that tend to fluctuate or have high variability, such as application throughput, web requests, or user logins. Datadog offers a wide range of metrics, including custom metrics you can create based on your needs.

Selecting an algorithm

Datadog provides three anomaly detection algorithms:

Algorithm	Best For
Basic	Metrics with no repeating seasonal pattern
Agile	Seasonal metrics whose baseline is expected to shift
Robust	Seasonal metrics with a stable baseline, less prone to false positives

Select the algorithm that best suits the type of metric you're monitoring.

Configuring alerts

After choosing the metric and algorithm, configure the alert conditions:

Bounds: Determines the tolerance of the algorithm, defining the "normal" range.
Alert window: How long the metric must be anomalous before the alert triggers.
Recovery window: How long the metric must stay within the normal range before the alert resolves.

You can also adjust settings like algorithm sensitivity and notification preferences.
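
If you manage monitors as code, the same alert conditions can be expressed as a request body for Datadog's monitor API. The Python sketch below only builds and prints the payload; it does not call the API. The monitor name, the query, and the `@oncall@example.com` handle are placeholders, and the field names under `options` follow the API as I understand it, so verify them against the API reference.

```python
import json


def anomaly_monitor_payload(name, query, message,
                            trigger_window="last_15m",
                            recovery_window="last_15m"):
    """Return a dict describing an anomaly monitor."""
    return {
        "name": name,
        "type": "query alert",
        "query": query,
        "message": message,
        "options": {
            # The anomalies() query evaluates to 1 when the alert condition is met.
            "thresholds": {"critical": 1.0, "critical_recovery": 0.0},
            "threshold_windows": {
                "trigger_window": trigger_window,
                "recovery_window": recovery_window,
            },
        },
    }


payload = anomaly_monitor_payload(
    name="CPU usage is anomalous",
    query=("avg(last_4h):anomalies(avg:system.cpu.user{*}, 'agile', 2, "
           "direction='both', interval=60, alert_window='last_15m', "
           "count_default_zero='true') >= 1"),
    message="CPU usage is outside its expected range. @oncall@example.com",
)
print(json.dumps(payload, indent=2))
```

In this sketch the `alert_window` inside the query and the `trigger_window` option are kept at the same value, since both describe the alert window.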

Setting notifications

Set up notifications for the monitor to receive timely alerts when an anomaly is detected. Datadog offers notification options like email, Slack, and PagerDuty. Choose the channel that works best for you and configure the notification settings accordingly.
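
Notifications are driven by the monitor's message text, where @-handles choose the channel and template variables control what is sent on alert versus recovery. The Python sketch below builds such a message. The handles are invented examples, and you should confirm the exact handle format for each integration in your own Datadog account.

```python
def build_message(metric_name, handles):
    """Return a monitor message that mentions `handles` on alert and on recovery."""
    mentions = " ".join(handles)
    lines = [
        "{{#is_alert}}",
        f"{metric_name} is outside its expected range. {mentions}",
        "{{/is_alert}}",
        "{{#is_recovery}}",
        f"{metric_name} is back within its expected range. {mentions}",
        "{{/is_recovery}}",
    ]
    return "\n".join(lines)


# Hypothetical handles: a Slack channel, a PagerDuty service, and an email address.
message = build_message(
    "web.requests",
    ["@slack-ops-alerts", "@pagerduty-web-oncall", "@oncall@example.com"],
)
print(message)
```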

Analyzing anomaly detection results

When Datadog detects an anomaly, it's crucial to analyze the results to understand the root cause. Here's how to read anomaly graphs, use historical context, and investigate anomalies.

Reading anomaly graphs

Anomaly graphs visually show the detected anomaly, with:

The gray band representing the predicted range
The red line indicating the actual metric value

Look for:

Band width: A narrower band means higher prediction confidence, while a wider band suggests lower confidence.
Distance from predicted range: A larger distance indicates a more severe deviation from the norm.
Duration: Shorter anomalies may be minor issues, while longer ones could signify bigger problems.
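
The distance and duration signals above can also be computed from raw data if you export the metric values together with the band. This Python sketch groups consecutive out-of-band points into runs and reports each run's length and largest distance from the band. The input format, a list of (value, low, high) tuples, is an assumption made for this example.

```python
def anomaly_runs(points):
    """Group consecutive out-of-band points into runs.

    `points` is a list of (value, low, high) tuples. Each run is a dict
    with its start index, its length in points, and the largest distance
    from the band seen during the run.
    """
    runs = []
    current = None
    for i, (value, low, high) in enumerate(points):
        distance = max(low - value, value - high, 0)
        if distance > 0:
            if current is None:
                current = {"start": i, "length": 0, "max_distance": 0}
                runs.append(current)
            current["length"] += 1
            current["max_distance"] = max(current["max_distance"], distance)
        else:
            current = None
    return runs


# Hypothetical samples against a constant band of 8 to 12.
points = [(10, 8, 12), (15, 8, 12), (18, 8, 12), (11, 8, 12), (5, 8, 12)]
for run in anomaly_runs(points):
    print(run)
```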

Using historical context

Datadog's evaluation previews show how the metric has behaved over time, helping you:

Identify seasonal patterns or trends contributing to the anomaly
Determine if it's a one-time event or part of a larger issue
Compare current behavior to past behavior to spot changes or shifts

Investigating anomalies

To investigate the root cause:

Review the metric's configuration for accuracy
Check for recent application or infrastructure changes
Investigate related metrics for anomalies
Use Datadog's log and trace analysis to identify underlying issues

Optimizing Anomaly Detection

Adjusting Settings

To optimize anomaly detection, you need to understand what normal and abnormal behavior looks like for your metric. Monitor the metric closely after making changes to ensure you get the desired detection results.

Consider these adjustments:

Bounds: Adjust the bounds of your algorithm to better capture normal and abnormal behavior.
Algorithms: Try different algorithms to find the one that best suits your metric's behavior.

Setting Alert Windows

Setting appropriate alert and recovery times is crucial to minimize false positives and false negatives:

Setting	Best Practice
Alert Windows	Set windows long enough to capture anomalies but short enough to minimize false positives.
Recovery Times	Set recovery windows long enough that the metric has clearly returned to its normal range before the alert resolves, so alerts don't flap.
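
To see how window lengths trade off false positives against detection delay, it helps to simulate them. The Python sketch below is a simplified model, not Datadog's evaluation logic: it assumes an alert triggers after a fixed number of consecutive anomalous evaluations and recovers after a fixed number of consecutive normal ones.

```python
def alert_states(anomalous, trigger_window=3, recovery_window=2):
    """Return "OK" or "ALERT" for each evaluation.

    `anomalous` is a list of booleans, one per evaluation. The alert
    triggers after `trigger_window` consecutive anomalous evaluations and
    recovers after `recovery_window` consecutive normal ones.
    """
    states = []
    alerting = False
    bad_streak = ok_streak = 0
    for flag in anomalous:
        if flag:
            bad_streak += 1
            ok_streak = 0
        else:
            ok_streak += 1
            bad_streak = 0
        if not alerting and bad_streak >= trigger_window:
            alerting = True
        elif alerting and ok_streak >= recovery_window:
            alerting = False
        states.append("ALERT" if alerting else "OK")
    return states


# One isolated blip, then a sustained anomaly with a brief dip back to normal.
flags = [False, True, False, True, True, True, True, False, True, False, False]
print(alert_states(flags))
```

With these settings the isolated blip never alerts, and the brief return to normal in the middle of the incident does not resolve it early. The cost is that the alert fires only on the third consecutive anomalous evaluation.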

Handling False Alerts

False positives and false negatives can undermine your anomaly detection system. Here are some techniques to address them:

Regular Reviews: Regularly review and update your settings to align with your metric's behavior.
Investigate Anomalies: Investigate anomalies to determine their root cause and adjust settings accordingly.
Multiple Detection Methods: Combine more than one detection method, for example a machine learning model alongside a simpler statistical check, and compare their results to reduce both false positives and false negatives.

Integrating with other Datadog features

Combining anomaly detection with other Datadog tools can enhance your monitoring and incident response abilities. Here's how to integrate anomaly detection with dashboards, automated incident response, and log and trace analysis.

Dashboards and visualizations

Add anomaly detection metrics to custom dashboards to visualize abnormal behavior and spot trends easily. For example, create a dashboard showing the top 10 anomalous metrics to prioritize investigations.
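
A "top anomalous metrics" view needs some way to rank metrics against each other. One possible scoring, sketched in Python below, divides how far a value sits outside its band by the band's width so that metrics with different scales are comparable. The scoring rule and the metric names are assumptions for illustration, not a Datadog feature.

```python
def top_anomalous(metrics, n=10):
    """Rank metrics by how far they sit outside their expected band.

    `metrics` maps a metric name to (value, low, high). The score is the
    distance outside the band divided by the band width. Metrics inside
    their band are left out.
    """
    scored = []
    for name, (value, low, high) in metrics.items():
        distance = max(low - value, value - high, 0)
        if distance > 0:
            scored.append((name, distance / (high - low)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n]


# Hypothetical current values with their expected bands.
metrics = {
    "web.requests": (150, 80, 120),
    "db.latency": (40, 10, 20),
    "cpu.user": (50, 30, 70),
    "user.logins": (0, 20, 40),
}
for name, score in top_anomalous(metrics):
    print(f"{name}: {score:.2f}")
```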

Automated incident response

Automate detection and response by integrating anomaly detection with incident response workflows. This reduces workload and ensures prompt incident handling.

Log and trace analysis

Analyze logs and traces alongside anomaly detection for deeper insights into root causes. This integrated approach helps identify anomaly sources and prevent future occurrences.

Integration	Benefits
Dashboards	Visualize anomalies, identify trends
Automated Response	Streamline incident handling, reduce MTTD/MTTR
Log and Trace Analysis	Gain deeper insights into root causes

Summary

What is Anomaly Detection?

Anomaly detection helps you find unusual patterns or behaviors in your systems and applications. It automatically identifies data points that deviate from the expected norm or baseline. This allows you to detect potential issues early before they cause major problems.

Benefits of Automated Anomaly Detection

Datadog's automated anomaly detection offers several key advantages:

Benefit	Description
Early Issue Detection	Identify problems before they impact users or customers
Reduced Downtime	Quickly detect and respond to anomalies, minimizing downtime
Streamlined Incident Response	Automate incident response workflows for faster issue resolution
Deeper Insights	Gain better visibility into system behavior and performance

Getting Started

To get started with anomaly detection in Datadog:

Learn about Datadog's anomaly detection algorithms, configuration options, and integration with other features.
Set up custom monitors to detect anomalies in your metrics, logs, and traces.
Automate incident response to reduce mean time to detect (MTTD) and mean time to resolve (MTTR).
Continuously refine your setup to improve detection accuracy and reduce false positives.

FAQs

What is anomaly detection in Datadog?

Anomaly detection in Datadog is a tool that helps you find unusual patterns or behaviors in your systems and applications. It automatically identifies data points that deviate from the expected normal range or baseline. This allows you to detect potential issues early before they cause major problems.

How does Datadog anomaly detection work?

Datadog's anomaly detection analyzes a metric's historical data to understand its normal behavior patterns. It uses algorithms that consider seasonality and trends to determine the expected value range. When a metric's value falls outside this range, it is flagged as an anomaly. This provides context for why an alert was triggered, allowing you to quickly investigate and resolve the issue.

Normal Behavior	Anomaly
Metric values follow the expected pattern or trend	Metric values significantly deviate from the normal pattern or trend
Consistent with historical data and established baselines	Inconsistent with historical data and established baselines
Within the defined range or threshold	Outside the defined range or threshold
