Moving Beyond Datadog: Scaling Your Observability

published on 27 August 2024

Outgrowing Datadog? You're not alone. As systems get more complex, many companies find Datadog can't keep up. Here's what you need to know:

  • Costs skyrocket
  • Data retention is limited
  • Performance slows down

Signs it's time to move on:

  • High costs
  • Increasing MTTR
  • Using multiple observability tools

Ready to scale up? Here's how:

  1. Review current needs
  2. Set new goals
  3. Improve data handling
  4. Boost query performance
  5. Implement large-scale tracing
  6. Explore AIOps and open-source tools
  7. Create custom dashboards
  8. Standardize data formats
  9. Use smart alerting
  10. Build an observability culture

Remember: Observability is ongoing. Keep improving as your needs change.

Strategy | Benefit
Regular reviews | Align with business goals
Incorporate feedback | Learn from incidents
Stay updated on new tools | Leverage new tech

1. Identifying Scaling Problems

As you grow, Datadog might struggle. Let's look at common issues and signs it's time to explore other options.

1.1 Common Datadog Scaling Issues

1. High Data Volume Handling

Datadog can buckle under massive data influx:

  • Slower queries
  • Increased latency
  • Higher costs

2. Cost Escalation

Datadog's pricing can become a burden:

  • SKU-based pricing is hard to predict
  • Custom metrics start at $0.05 per metric per month
  • No full-platform pricing advertised
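
To see how this adds up, here's a rough back-of-the-envelope sketch in Python using the $0.05 figure above. The metric counts are made up, and real bills depend on your plan's included allotment and tag cardinality.

```python
# Back-of-the-envelope custom metric cost, using the $0.05 figure above.
# Counts are made up; real bills depend on plan allotments and tag cardinality.

PRICE_PER_CUSTOM_METRIC = 0.05  # USD per custom metric per month (assumed)

def monthly_custom_metric_cost(metric_names: int, avg_tag_combinations: int) -> float:
    """Each unique metric name + tag-value combination bills as one custom metric."""
    return metric_names * avg_tag_combinations * PRICE_PER_CUSTOM_METRIC

# 200 metric names, each with ~100 tag combinations -> 20,000 billable metrics
print(f"${monthly_custom_metric_cost(200, 100):,.2f}/month")  # $1,000.00/month
```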

3. Performance Slowdowns

As you expand, you might notice:

  • Longer dashboard load times
  • Delayed alerts
  • Slower API responses

1.2 Signs to Consider Alternatives

1. Cluster Health Issues

Watch your cluster status. Consistent red or yellow? That's a problem.

Cluster Status | Meaning | Action
Green | All good | None
Yellow | Some issues | Investigate
Red | Serious problems | Urgent action needed

2. Data Node Disk Space

Set alerts for 80% disk usage. Constantly adding nodes or removing data? Time to reassess.
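
A minimal sketch of that 80% check, using only the Python standard library. In practice your monitoring agent would do this, and the path is a placeholder for your data node's data directory.

```python
# Minimal check for the 80% threshold; a monitoring agent would normally do
# this. The path is a placeholder for your data node's data directory.
import shutil

THRESHOLD = 0.80  # alert when disk usage crosses 80%

def disk_usage_ratio(path: str) -> float:
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

if disk_usage_ratio("/") > THRESHOLD:  # replace "/" with your data path
    print("ALERT: disk usage above 80% -- add capacity or prune old data")
```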

3. Rising Costs

If Datadog expenses outpace your growth, it's an issue.

4. MTTR Increase

MTTR over an hour? Not good, but common:

"In 2021, 64% of DevOps Pulse Survey respondents reported their MTTR during production incidents was over an hour."

5. Tool Overload

Using multiple observability tools is common:

  • 66% use 2-4 systems
  • 24% use 5-10 systems

But more tools often mean more problems.

If this sounds familiar, it might be time to look beyond Datadog.

2. Getting Ready to Change

Ready to move on? Let's review your setup and plan for the future.

2.1 Reviewing Current Observability Needs

Take stock of what you have:

1. Data Volume: How much data do you handle?

Recurly manages 70TB of logs monthly.

2. Data Types: What are you collecting?

Data Type | Example
Logs | 18-60 MB/s in production
Metrics | 4 million time series (1.5 TB/month)
Traces | Request paths through the system

3. Current Pain Points: What's not working?

  • Slow queries?
  • High costs?
  • Limited scalability?

4. Tool Usage: How many tools are you juggling?

66% of organizations use 2-4 systems for observability.
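
To translate an ingest rate into a storage budget, a quick back-of-the-envelope calculation helps. This sketch uses the 18-60 MB/s log rate from the table above and a 30-day month:

```python
# Convert an ingest rate into a monthly volume (30-day month, decimal units).
SECONDS_PER_MONTH = 60 * 60 * 24 * 30

def monthly_terabytes(mb_per_second: float) -> float:
    return mb_per_second * SECONDS_PER_MONTH / 1_000_000  # MB -> TB

for rate in (18, 60):
    print(f"{rate} MB/s -> about {monthly_terabytes(rate):.0f} TB/month")
# 18 MB/s -> about 47 TB/month; 60 MB/s -> about 156 TB/month
```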

2.2 Setting New Observability Goals

Set targets for your new system:

1. Scalability: Plan for growth. Wise handles 50+ terabytes of telemetry data daily.

2. Cost Efficiency: Aim for better pricing.

"Middleware has proven to be a more cost-effective and user-friendly alternative to New Relic, enabling us to capture comprehensive telemetry across our platform." - John D'Emic, Revenium CTO

3. Performance Improvement: Set clear targets. 64% of organizations using observability tools saw a 25% or better MTTR improvement.

4. Data Integration: Look for tools that unify telemetry data.

5. Future-Proofing: Think ahead. Will you need real-time streaming or new data sources soon?

3. Methods for Scaling Up

Let's dive into four key areas for scaling up:

3.1 Improving Data Collection

Collect only what you need:

  • Use REST APIs or data access tools
  • Define clear data requirements
  • Automate the process

Try a pipeline tool like Observability Pipelines Worker to intercept and transform data before forwarding it.
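
Here's a minimal sketch of that loop: declare the metrics you need up front, pull them from a REST endpoint on a schedule, and drop everything else. The endpoint URL and metric names are placeholders, not a real API.

```python
# Declare required metrics up front, pull them from a REST endpoint, and drop
# the rest. The URL and metric names are placeholders, not a real API.
import requests

REQUIRED_METRICS = ["http_request_duration_seconds", "queue_depth"]
ENDPOINT = "https://metrics.internal.example.com/api/v1/snapshot"  # placeholder

def collect_once() -> dict:
    resp = requests.get(ENDPOINT, timeout=10)
    resp.raise_for_status()
    data = resp.json()
    # Keep only the fields you actually use downstream.
    return {name: data[name] for name in REQUIRED_METRICS if name in data}

if __name__ == "__main__":
    # Automate with cron, a systemd timer, or your scheduler of choice.
    print(collect_once())
```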

3.2 Better Data Storage

Manage storage for performance and cost:

  • Implement data retention policies
  • Use tiered storage
  • Apply compression and deduplication
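
As a concrete illustration of retention plus tiering, here's a sketch that compresses logs older than a week onto cheaper storage and deletes archives after 90 days. The paths and windows are examples only.

```python
# Compress logs older than 7 days onto cheaper storage, delete archives
# after 90 days. Paths and retention windows are examples only.
import gzip
import shutil
import time
from pathlib import Path

HOT_DIR = Path("/var/log/app")        # fast local disk (hot tier)
ARCHIVE_DIR = Path("/mnt/cold/logs")  # cheaper storage (cold tier)
COMPRESS_AFTER = 7 * 86400            # seconds
DELETE_AFTER = 90 * 86400

def apply_retention() -> None:
    now = time.time()
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    for log in HOT_DIR.glob("*.log"):
        if now - log.stat().st_mtime > COMPRESS_AFTER:
            with log.open("rb") as src, gzip.open(ARCHIVE_DIR / (log.name + ".gz"), "wb") as dst:
                shutil.copyfileobj(src, dst)   # compress into the cold tier
            log.unlink()                       # free space on the hot tier
    for archive in ARCHIVE_DIR.glob("*.gz"):
        if now - archive.stat().st_mtime > DELETE_AFTER:
            archive.unlink()                   # retention window expired
```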

Compute-optimized instance guide:

Cloud Provider | Recommended Instance
AWS | c6i.2xlarge
Azure | f8
Google Cloud | c2 (8 vCPUs, 16 GiB)
Private | 8 vCPUs, 16 GiB

3.3 Faster Query Performance

To improve response times:

  • Optimize queries
  • Implement caching
  • Use real-time data ingestion

Observability Pipelines Worker scales automatically to use all vCPUs.
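
Caching is often the quickest win: identical dashboard queries within a short window can reuse the previous result. A minimal sketch, where run_query() stands in for your real backend call:

```python
# Identical queries within 30 seconds reuse the previous result.
# run_query() is a stand-in for your real backend call.
import time

CACHE_TTL = 30  # seconds
_cache: dict[str, tuple[float, object]] = {}

def run_query(query: str) -> object:
    ...  # placeholder: call your datastore here

def cached_query(query: str) -> object:
    now = time.time()
    hit = _cache.get(query)
    if hit and now - hit[0] < CACHE_TTL:
        return hit[1]                  # still fresh: serve from cache
    result = run_query(query)
    _cache[query] = (now, result)
    return result
```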

3.4 Large-Scale Distributed Tracing

For effective distributed tracing:

  • Use tools supporting serverless and containerized ecosystems
  • Enrich trace data
  • Automate problem baselining
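
A minimal tracing sketch using the OpenTelemetry Python SDK (pip install opentelemetry-sdk). The service and span names are examples, and the console exporter is only for demonstration:

```python
# Minimal OpenTelemetry tracing sketch (pip install opentelemetry-sdk).
# The console exporter is for demonstration; in production you would export
# to your tracing backend over OTLP.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # example service name

def handle_request(order_id: str) -> None:
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("order.id", order_id)  # enrich the trace with context
        with tracer.start_as_current_span("charge_payment"):
            pass  # a real downstream call would propagate the trace context

handle_request("order-123")
```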

4. New Observability Techniques

Explore new methods for better insights and automation:

4.1 Using AIOps

AIOps combines machine learning with observability:

  • Reduce alert noise by up to 90%
  • Detect unusual patterns
  • Provide early warnings

BigPanda uses ML to correlate alerts into high-level incidents.
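
Real AIOps platforms use ML models and topology data, but the core idea of correlation can be shown with a deliberately simplified sketch that groups alerts sharing a service within a five-minute window:

```python
# Group alerts that share a service and arrive within a 5-minute window
# into one incident. Real platforms add ML and topology; this is the idea only.
from collections import defaultdict

WINDOW_SECONDS = 300

def correlate(alerts: list[dict]) -> list[list[dict]]:
    """alerts: [{'service': str, 'timestamp': float, 'message': str}, ...]"""
    by_service: dict[str, list[dict]] = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        by_service[alert["service"]].append(alert)

    incidents: list[list[dict]] = []
    for service_alerts in by_service.values():
        current = [service_alerts[0]]
        for alert in service_alerts[1:]:
            if alert["timestamp"] - current[-1]["timestamp"] <= WINDOW_SECONDS:
                current.append(alert)       # same burst: fold into the incident
            else:
                incidents.append(current)   # gap too large: start a new incident
                current = [alert]
        incidents.append(current)
    return incidents
```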

4.2 Open-Source Tools

Flexible, cost-effective options include:

Tool | Key Features
Prometheus | Self-sufficient, fast querying with PromQL
Grafana | Customizable dashboards, multiple data sources
SigNoz | Integrates logs, traces, and metrics

SigNoz uses ClickHouse for faster aggregation queries.
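
For example, Prometheus exposes an HTTP API you can query directly. This sketch runs an instant PromQL query against a local server; the expression and URL are examples.

```python
# Run an instant PromQL query against Prometheus's HTTP API.
# /api/v1/query is the standard endpoint; the expression and URL are examples.
import requests

PROMETHEUS_URL = "http://localhost:9090"

def instant_query(expr: str) -> list[dict]:
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
                        params={"query": expr}, timeout=10)
    resp.raise_for_status()
    body = resp.json()
    if body["status"] != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]

# p95 request latency over the last 5 minutes, broken out by service
for series in instant_query(
    "histogram_quantile(0.95, "
    "sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))"
):
    print(series["metric"].get("service"), series["value"])
```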

4.3 Custom Dashboards

Create tailored dashboards:

  1. Choose a platform (e.g., Google Cloud Console, Grafana)
  2. Select relevant widgets
  3. Organize the layout
  4. Add filters for troubleshooting

Share dashboards for better collaboration.
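
Dashboards can also live "as code" in version control. The sketch below generates a Grafana-style dashboard definition as JSON; it is a stripped-down illustration rather than the full schema, so treat the field names as indicative.

```python
# Generate a Grafana-style dashboard definition and keep it in version control.
# This is a stripped-down illustration, not the full Grafana schema.
import json

def panel(title: str, expr: str, x: int, y: int) -> dict:
    return {
        "type": "timeseries",
        "title": title,
        "gridPos": {"x": x, "y": y, "w": 12, "h": 8},
        "targets": [{"expr": expr}],  # PromQL query backing the panel
    }

dashboard = {
    "title": "Checkout service",
    "panels": [
        panel("Request rate", "sum(rate(http_requests_total[5m]))", 0, 0),
        panel("Error rate", 'sum(rate(http_requests_total{status=~"5.."}[5m]))', 12, 0),
    ],
}

print(json.dumps(dashboard, indent=2))
```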

5. Tips for Large-Scale Observability

Key strategies for scaling:

5.1 Consistent Data Formats

Standardize your logging format:

  • Create a uniform structure
  • Define clear fields
  • Use consistent naming conventions

Wise Payments built in-house expertise to keep its data formats consistent.
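
One practical way to enforce a uniform structure is a JSON formatter on your application logger, so every line carries the same named fields. A sketch using the Python standard library; the service name is an example.

```python
# A JSON formatter on the standard library logger keeps every line in the
# same structure with the same field names. The service name is an example.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "checkout",        # agreed-on, consistent field names
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("payments").info("charge succeeded")
# {"timestamp": "...", "level": "INFO", "service": "checkout", "logger": "payments", "message": "charge succeeded"}
```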

5.2 Smart Alerting

Reduce alert noise:

Alert Type | Action
Critical Errors | Immediate notifications
Warning Levels | Daily or weekly digests
Informational | Log for analysis, no real-time alerts

BigPanda helped companies reduce alert noise by over 90%.
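
The routing table above can be expressed directly in code. A sketch, where the notification sinks are placeholders for your real pager, digest, and log integrations:

```python
# Route alerts by severity: page on critical, batch warnings, log the rest.
# The sink functions are placeholders for your pager, digest, and log systems.

def notify_pager(alert: dict) -> None: ...   # immediate notification
def digest_queue(alert: dict) -> None: ...   # rolled into a daily/weekly digest
def event_log(alert: dict) -> None: ...      # stored for analysis, no real-time alert

ROUTES = {
    "critical": notify_pager,
    "warning": digest_queue,
    "info": event_log,
}

def route(alert: dict) -> None:
    handler = ROUTES.get(alert.get("severity", "info"), event_log)
    handler(alert)

route({"severity": "critical", "message": "error rate above 5% on checkout"})
```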

5.3 Building an Observability Culture

Create a culture of observability:

  • Provide training
  • Encourage collaboration
  • Make data accessible

"Observability is defined as a concept, a goal, and direction that will help your organization to gain the most insight from the data you can collect." - Cribl

Conclusion

Key strategies for scaling observability:

  1. Identify scaling issues early
  2. Set clear goals
  3. Improve data handling
  4. Embrace new techniques
  5. Standardize data formats
  6. Implement smart alerting
  7. Build an observability culture

Keep improving as your needs change:

Action | Benefit
Regular reviews | Ensure alignment with goals
Incorporate feedback | Learn from incidents
Stay updated on new tools | Leverage advancements
