Outgrowing Datadog? You're not alone. As systems get more complex, many companies find Datadog can't keep up. Here's what you need to know:
- Costs skyrocket
- Data retention is limited
- Performance slows down
Signs it's time to move on:
- High costs
- Increasing MTTR
- Using multiple observability tools
Ready to scale up? Here's how:
- Review current needs
- Set new goals
- Improve data handling
- Boost query performance
- Implement large-scale tracing
- Explore AIOps and open-source tools
- Create custom dashboards
- Standardize data formats
- Use smart alerting
- Build an observability culture
Remember: Observability is ongoing. Keep improving as your needs change.
Strategy | Benefit |
---|---|
Regular reviews | Align with business goals |
Incorporate feedback | Learn from incidents |
Stay updated on new tools | Leverage new tech |
1. Identifying Scaling Problems
As you grow, Datadog might struggle. Let's look at common issues and signs it's time to explore other options.
1.1 Common Datadog Scaling Issues
1. High Data Volume Handling
Datadog can buckle under a massive data influx:
- Slower queries
- Increased latency
- Higher costs
2. Cost Escalation
Datadog's pricing can become a burden:
- SKU-based pricing is hard to predict
- Custom metrics start at $0.05 per metric per month, so 100,000 custom metrics alone adds roughly $5,000 per month
- No full-platform pricing advertised
3. Performance Slowdowns
As you expand, you might notice:
- Longer dashboard load times
- Delayed alerts
- Slower API responses
1.2 Signs to Consider Alternatives
1. Cluster Health Issues
Watch your cluster status. Consistent red or yellow? That's a problem.
Cluster Status | Meaning | Action |
---|---|---|
Green | All good | None |
Yellow | Some issues | Investigate |
Red | Serious problems | Urgent action needed |
2. Data Node Disk Space
Set alerts for 80% disk usage. Constantly adding nodes or removing data? Time to reassess.
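As a rough, vendor-neutral illustration, a check like the one below can back that 80% alert; the mount point and threshold are assumptions to adapt to your own data nodes.

```python
import shutil

DATA_MOUNT = "/"    # replace with the mount point where node data actually lives
THRESHOLD = 0.80    # alert once 80% of the disk is used

def disk_usage_ratio(path: str) -> float:
    """Return the fraction of disk space currently in use at `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

if __name__ == "__main__":
    ratio = disk_usage_ratio(DATA_MOUNT)
    if ratio >= THRESHOLD:
        print(f"WARNING: {DATA_MOUNT} is {ratio:.0%} full - reassess capacity")
    else:
        print(f"OK: {DATA_MOUNT} is {ratio:.0%} full")
```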
3. Rising Costs
If Datadog expenses outpace your growth, it's an issue.
4. MTTR Increase
MTTR over an hour? Not good, but common:
"In 2021, 64% of DevOps Pulse Survey respondents reported their MTTR during production incidents was over an hour."
5. Tool Overload
Using multiple observability tools is common:
- 66% use 2-4 systems
- 24% use 5-10 systems
But more tools often mean more problems.
If this sounds familiar, it might be time to look beyond Datadog.
2. Getting Ready to Change
Ready to move on? Let's review your setup and plan for the future.
2.1 Reviewing Current Observability Needs
Take stock of what you have:
1. Data Volume: How much data do you handle?
Recurly manages 70TB of logs monthly.
2. Data Types: What are you collecting?
Data Type | Example |
---|---|
Logs | 18-60 MB/s in production
Metrics | 4 million time series (1.5 TB/month)
Traces | Request paths through the system |
3. Current Pain Points: What's not working?
- Slow queries?
- High costs?
- Limited scalability?
4. Tool Usage: How many tools are you juggling?
66% of organizations use 2-4 systems for observability.
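To put concrete numbers on the data-volume question, a quick back-of-the-envelope calculation helps. The sketch below converts a sustained log ingest rate into monthly volume; the 18-60 MB/s rates come from the table above, while the 30-day month and decimal units are assumptions.

```python
SECONDS_PER_MONTH = 60 * 60 * 24 * 30  # assume a 30-day month

def monthly_terabytes(mb_per_second: float) -> float:
    """Convert a sustained ingest rate in MB/s into TB per month (decimal units)."""
    return mb_per_second * SECONDS_PER_MONTH / 1_000_000  # 1 TB = 1,000,000 MB

for rate in (18, 60):  # MB/s figures from the table above
    print(f"{rate} MB/s ≈ {monthly_terabytes(rate):.0f} TB/month")
# Roughly 47 and 156 TB/month - a useful sanity check when sizing storage and budgets.
```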
2.2 Setting New Observability Goals
Set targets for your new system:
1. Scalability: Plan for growth. Wise handles 50+ terabytes of telemetry data daily.
2. Cost Efficiency: Aim for better pricing.
"Middleware has proven to be a more cost-effective and user-friendly alternative to New Relic, enabling us to capture comprehensive telemetry across our platform." - John D'Emic, Revenium CTO
3. Performance Improvement: Set clear targets. 64% of organizations using observability tools saw a 25% or better MTTR improvement.
4. Data Integration: Look for tools that unify telemetry data.
5. Future-Proofing: Think ahead. Will you need real-time streaming or new data sources soon?
3. Methods for Scaling Up
Let's dive into four key areas for scaling up:
3.1 Improving Data Collection
Collect only what you need:
- Use REST APIs or data access tools
- Define clear data requirements
- Automate the process
A pipeline tool such as Datadog's Observability Pipelines Worker can intercept and reshape data before it is forwarded.
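As a minimal sketch of "collect only what you need", the snippet below drops noisy debug events and strips fields nobody queries before anything is forwarded. The field allow-list and level names are assumptions, not part of any specific pipeline product.

```python
import json

REQUIRED_FIELDS = {"timestamp", "service", "level", "message", "trace_id"}  # assumed allow-list
DROPPED_LEVELS = {"DEBUG"}  # assumed noise you don't need downstream

def filter_event(raw_event: str) -> str | None:
    """Parse a JSON log event, drop noisy levels, and keep only required fields.

    Returns the slimmed-down event as JSON, or None if it should be dropped entirely.
    """
    event = json.loads(raw_event)
    if event.get("level") in DROPPED_LEVELS:
        return None
    slim = {key: value for key, value in event.items() if key in REQUIRED_FIELDS}
    return json.dumps(slim)

sample = '{"timestamp": "2024-01-01T00:00:00Z", "service": "api", "level": "INFO", "message": "ok", "big_debug_blob": "..."}'
print(filter_event(sample))  # only the allow-listed fields survive
```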
3.2 Better Data Storage
Manage storage for performance and cost:
- Implement data retention policies
- Use tiered storage (a tiering sketch follows the table below)
- Apply compression and deduplication
Compute-optimized instance guide:
Cloud Provider | Recommended Instance |
---|---|
AWS | c6i.2xlarge |
Azure | f8 |
Google Cloud | c2 (8 vCPUs, 16 GiB) |
Private | 8 vCPUs, 16 GiB |
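To make retention and tiering concrete, here is a minimal sketch of a policy function; the tier names and age cut-offs (7/30/365 days) are assumptions to tune against your own query patterns and budget.

```python
from datetime import datetime, timedelta, timezone

# Assumed tiers: fast storage for recent data, cheaper tiers for older data.
TIERS = [
    (timedelta(days=7), "hot"),     # fast, expensive storage for fresh data
    (timedelta(days=30), "warm"),   # compressed, cheaper storage
    (timedelta(days=365), "cold"),  # archival object storage
]

def storage_tier(event_time: datetime, now: datetime | None = None) -> str:
    """Return the storage tier an event belongs in, or 'delete' once past retention."""
    now = now or datetime.now(timezone.utc)
    age = now - event_time
    for max_age, tier in TIERS:
        if age <= max_age:
            return tier
    return "delete"

now = datetime.now(timezone.utc)
print(storage_tier(now - timedelta(days=3)))    # hot
print(storage_tier(now - timedelta(days=90)))   # cold
print(storage_tier(now - timedelta(days=400)))  # delete
```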
3.3 Faster Query Performance
To improve response times:
- Optimize queries
- Implement caching (see the sketch below)
- Use real-time data ingestion
The Observability Pipelines Worker scales automatically to use all available vCPUs.
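One cheap caching pattern is to memoize repeated dashboard queries for a short window. The sketch below is an illustration built on the standard library; the backend call is a placeholder and the 60-second TTL is an arbitrary assumption.

```python
import time
from functools import lru_cache

def _ttl_bucket(ttl_seconds: int = 60) -> int:
    """Coarse time bucket so cached results expire after roughly `ttl_seconds`."""
    return int(time.time() // ttl_seconds)

@lru_cache(maxsize=1024)
def _cached_query(query: str, bucket: int) -> tuple:
    print(f"running against backend: {query}")  # placeholder for the real query call
    return (("rows", 42),)

def run_query(query: str) -> tuple:
    """Serve repeated queries from cache for up to ~60 seconds."""
    return _cached_query(query, _ttl_bucket())

run_query("avg latency by service, last 5m")  # hits the backend
run_query("avg latency by service, last 5m")  # served from cache, no backend call
```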
3.4 Large-Scale Distributed Tracing
For effective distributed tracing:
- Use tools supporting serverless and containerized ecosystems
- Enrich trace data (see the sketch after this list)
- Automate problem baselining
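The section doesn't prescribe a specific tool; as one common open-source route, the OpenTelemetry Python SDK can enrich spans with business attributes so traces stay searchable at scale. A minimal sketch, assuming a console exporter and made-up attribute names:

```python
# Requires: pip install opentelemetry-api opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter for illustration; in practice you would export to an OTLP endpoint.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer(__name__)

def handle_checkout(order_id: str, customer_tier: str) -> None:
    # Enrich the span with business context so traces can be filtered by it later.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order.id", order_id)            # attribute names are assumptions
        span.set_attribute("customer.tier", customer_tier)
        # ... calls to downstream payment and inventory services would go here ...

handle_checkout("ord-1234", "enterprise")
```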
4. New Observability Techniques
Explore new methods for better insights and automation:
4.1 Using AIOps
AIOps combines machine learning with observability:
- Reduce alert noise by up to 90%
- Detect unusual patterns
- Provide early warnings
BigPanda uses ML to correlate alerts into high-level incidents.
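This isn't BigPanda's algorithm, but a simple statistical check illustrates the kind of building block behind "detect unusual patterns": flag a metric value that drifts several standard deviations away from its recent history. A minimal sketch with made-up numbers:

```python
from statistics import mean, stdev

def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it is more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold

# Example: error rates hover around 2%, then spike to 9%.
error_rates = [2.1, 1.9, 2.0, 2.2, 1.8, 2.0]
print(is_anomalous(error_rates, 2.1))  # False - within normal variation
print(is_anomalous(error_rates, 9.0))  # True  - an early warning worth surfacing
```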
4.2 Open-Source Tools
Flexible, cost-effective options include:
Tool | Key Features |
---|---|
Prometheus | Self-sufficient, fast querying with PromQL |
Grafana | Customizable dashboards, multiple data sources |
SigNoz | Integrates logs, traces, and metrics |
SigNoz uses ClickHouse for faster aggregation queries.
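To show what adopting a tool like Prometheus looks like from the application side, here is a minimal sketch using the official prometheus-client library; the metric names, labels, and port are assumptions.

```python
# Requires: pip install prometheus-client
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled", ["endpoint"])
LATENCY = Histogram("app_request_seconds", "Request latency in seconds")

def handle_request(endpoint: str) -> None:
    REQUESTS.labels(endpoint).inc()
    with LATENCY.time():                       # records how long the block takes
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at http://localhost:8000/metrics
    while True:
        handle_request("/checkout")
```

Prometheus scrapes the /metrics endpoint, and a PromQL query such as `rate(app_requests_total[5m])` can then drive a Grafana panel.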
4.3 Custom Dashboards
Create tailored dashboards:
- Choose a platform (e.g., Google Cloud Console, Grafana)
- Select relevant widgets
- Organize the layout
- Add filters for troubleshooting
Share dashboards for better collaboration.
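Grafana dashboards are ultimately JSON, which makes tailored dashboards easy to version-control and share. Below is a sketch-level, one-panel definition built in Python; the fields follow Grafana's dashboard JSON model but are not exhaustive, and the PromQL query reuses the assumed metric from the earlier Prometheus example.

```python
import json

# Minimal one-panel dashboard in Grafana's JSON model (a sketch, not exhaustive).
dashboard = {
    "title": "Checkout Service Overview",
    "panels": [
        {
            "title": "Request rate by endpoint",
            "type": "timeseries",
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
            "targets": [
                {"expr": "rate(app_requests_total[5m])", "refId": "A"}  # assumed metric
            ],
        }
    ],
    "time": {"from": "now-6h", "to": "now"},
}

with open("checkout_dashboard.json", "w") as f:
    json.dump(dashboard, f, indent=2)
print("Wrote checkout_dashboard.json - import it via Grafana's 'Import dashboard' flow")
```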
5. Tips for Large-Scale Observability
Key strategies for scaling:
5.1 Consistent Data Formats
Standardize your logging format:
- Create a uniform structure
- Define clear fields
- Use consistent naming conventions
Wise Payments built in-house expertise to keep its data formats consistent.
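A uniform structure is easiest to enforce in code. The sketch below uses Python's standard logging module with a JSON formatter so every line shares the same fields; the field and service names are assumptions.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit every log line with the same structure and field names."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "checkout",  # assumed service name
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment authorized")
# -> {"timestamp": "...", "level": "INFO", "service": "checkout", "message": "payment authorized"}
```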
5.2 Smart Alerting
Reduce alert noise:
Alert Type | Action |
---|---|
Critical Errors | Immediate notifications |
Warning Levels | Daily or weekly digests |
Informational | Log for analysis, no real-time alerts |
BigPanda helped companies reduce alert noise by over 90%.
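The routing policy in the table above is straightforward to encode. A minimal sketch, where the notification and digest mechanisms are placeholders:

```python
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"

digest_queue: list[str] = []  # flushed daily or weekly by a separate job (assumed)

def route_alert(severity: Severity, message: str) -> None:
    """Route an alert according to the policy in the table above."""
    if severity is Severity.CRITICAL:
        print(f"PAGE ON-CALL NOW: {message}")  # stand-in for a paging/chat integration
    elif severity is Severity.WARNING:
        digest_queue.append(message)           # batched into a daily or weekly digest
    else:
        print(f"logged for later analysis: {message}")

route_alert(Severity.CRITICAL, "checkout error rate above 5%")
route_alert(Severity.WARNING, "disk usage at 82% on data node 3")
route_alert(Severity.INFO, "deployment completed for api v2.3")
```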
5.3 Building an Observability Culture
Create a culture of observability:
- Provide training
- Encourage collaboration
- Make data accessible
"Observability is defined as a concept, a goal, and direction that will help your organization to gain the most insight from the data you can collect." - Cribl
Conclusion
Key strategies for scaling observability:
- Identify scaling issues early
- Set clear goals
- Improve data handling
- Embrace new techniques
- Standardize data formats
- Implement smart alerting
- Build an observability culture
Keep improving as your needs change:
Action | Benefit |
---|---|
Regular reviews | Ensure alignment with goals |
Incorporate feedback | Learn from incidents |
Stay updated on new tools | Leverage advancements |