Outgrowing Datadog? You're not alone. As systems get more complex, many companies find Datadog can't keep up. Here's what you need to know:
- Costs skyrocket
- Data retention is limited
- Performance slows down
Signs it's time to move on:
- High costs
- Increasing MTTR
- Using multiple observability tools
Ready to scale up? Here's how:
- Review current needs
- Set new goals
- Improve data handling
- Boost query performance
- Implement large-scale tracing
- Explore AIOps and open-source tools
- Create custom dashboards
- Standardize data formats
- Use smart alerting
- Build an observability culture
Remember: Observability is ongoing. Keep improving as your needs change.
Strategy | Benefit |
---|---|
Regular reviews | Align with business goals |
Incorporate feedback | Learn from incidents |
Stay updated on new tools | Leverage new tech |
1. Identifying Scaling Problems
As you grow, Datadog might struggle. Let's look at common issues and signs it's time to explore other options.
1.1 Common Datadog Scaling Issues
1. High Data Volume Handling
Datadog can buckle under a massive data influx:
- Slower queries
- Increased latency
- Higher costs
2. Cost Escalation
Datadog's pricing can become a burden:
- SKU-based pricing is hard to predict
- Custom metrics start at $0.05 per metric per month, so 100,000 custom metrics alone adds roughly $5,000 per month
- No full-platform pricing advertised
3. Performance Slowdowns
As you expand, you might notice:
- Longer dashboard load times
- Delayed alerts
- Slower API responses
1.2 Signs to Consider Alternatives
1. Cluster Health Issues
Watch your cluster status. Consistent red or yellow? That's a problem.
Cluster Status | Meaning | Action |
---|---|---|
Green | All good | None |
Yellow | Some issues | Investigate |
Red | Serious problems | Urgent action needed |
2. Data Node Disk Space
Set alerts for 80% disk usage. Constantly adding nodes or removing data? Time to reassess.
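As a rough, vendor-neutral illustration, a check like the one below can back that 80% alert; the mount point and threshold are assumptions to adapt to your own data nodes.

```python
import shutil

DATA_MOUNT = "/"    # replace with the mount point where node data actually lives
THRESHOLD = 0.80    # alert once 80% of the disk is used

def disk_usage_ratio(path: str) -> float:
    """Return the fraction of disk space currently in use at `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

if __name__ == "__main__":
    ratio = disk_usage_ratio(DATA_MOUNT)
    if ratio >= THRESHOLD:
        print(f"WARNING: {DATA_MOUNT} is {ratio:.0%} full - reassess capacity")
    else:
        print(f"OK: {DATA_MOUNT} is {ratio:.0%} full")
```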
3. Rising Costs
If Datadog expenses outpace your growth, it's an issue.
4. MTTR Increase
MTTR over an hour? Not good, but common:
"In 2021, 64% of DevOps Pulse Survey respondents reported their MTTR during production incidents was over an hour."
5. Tool Overload
Using multiple observability tools is common:
- 66% use 2-4 systems
- 24% use 5-10 systems
But more tools often mean more problems.
If this sounds familiar, it might be time to look beyond Datadog.
2. Getting Ready to Change
Ready to move on? Let's review your setup and plan for the future.
2.1 Reviewing Current Observability Needs
Take stock of what you have:
1. Data Volume: How much data do you handle?
Recurly manages 70TB of logs monthly.
2. Data Types: What are you collecting?
Data Type | Example |
---|---|
Logs | 18-60 MB/s in production
Metrics | 4 million time series (1.5 TB/month)
Traces | Request paths through the system |
3. Current Pain Points: What's not working?
- Slow queries?
- High costs?
- Limited scalability?
4. Tool Usage: How many tools are you juggling?
66% of organizations use 2-4 systems for observability.
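To put concrete numbers on the data-volume question, a quick back-of-the-envelope calculation helps. The sketch below converts a sustained log ingest rate into monthly volume; the 18-60 MB/s rates come from the table above, while the 30-day month and decimal units are assumptions.

```python
SECONDS_PER_MONTH = 60 * 60 * 24 * 30  # assume a 30-day month

def monthly_terabytes(mb_per_second: float) -> float:
    """Convert a sustained ingest rate in MB/s into TB per month (decimal units)."""
    return mb_per_second * SECONDS_PER_MONTH / 1_000_000  # 1 TB = 1,000,000 MB

for rate in (18, 60):  # MB/s figures from the table above
    print(f"{rate} MB/s ≈ {monthly_terabytes(rate):.0f} TB/month")
# Roughly 47 and 156 TB/month - a useful sanity check when sizing storage and budgets.
```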
2.2 Setting New Observability Goals
Set targets for your new system:
1. Scalability: Plan for growth. Wise handles 50+ terabytes of telemetry data daily.
2. Cost Efficiency: Aim for better pricing.
"Middleware has proven to be a more cost-effective and user-friendly alternative to New Relic, enabling us to capture comprehensive telemetry across our platform." - John D'Emic, Revenium CTO
3. Performance Improvement: Set clear targets. 64% of organizations using observability tools saw a 25% or better MTTR improvement.
4. Data Integration: Look for tools that unify telemetry data.
5. Future-Proofing: Think ahead. Will you need real-time streaming or new data sources soon?
3. Methods for Scaling Up
Let's dive into four key areas for scaling up:
3.1 Improving Data Collection
Collect only what you need:
- Use REST APIs or data access tools
- Define clear data requirements
- Automate the process
A pipeline tool such as Datadog's Observability Pipelines Worker can intercept and reshape data before it is forwarded.
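As a minimal sketch of "collect only what you need", the snippet below drops noisy debug events and strips fields nobody queries before anything is forwarded. The field allow-list and level names are assumptions, not part of any specific pipeline product.

```python
import json

REQUIRED_FIELDS = {"timestamp", "service", "level", "message", "trace_id"}  # assumed allow-list
DROPPED_LEVELS = {"DEBUG"}  # assumed noise you don't need downstream

def filter_event(raw_event: str) -> str | None:
    """Parse a JSON log event, drop noisy levels, and keep only required fields.

    Returns the slimmed-down event as JSON, or None if it should be dropped entirely.
    """
    event = json.loads(raw_event)
    if event.get("level") in DROPPED_LEVELS:
        return None
    slim = {key: value for key, value in event.items() if key in REQUIRED_FIELDS}
    return json.dumps(slim)

sample = '{"timestamp": "2024-01-01T00:00:00Z", "service": "api", "level": "INFO", "message": "ok", "big_debug_blob": "..."}'
print(filter_event(sample))  # only the allow-listed fields survive
```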
3.2 Better Data Storage
Manage storage for performance and cost:
- Implement data retention policies
- Use tiered storage (a tiering sketch follows the table below)
- Apply compression and deduplication
Compute-optimized instance guide:
Cloud Provider | Recommended Instance |
---|---|
AWS | c6i.2xlarge |
Azure | f8 |
Google Cloud | c2 (8 vCPUs, 16 GiB) |
Private | 8 vCPUs, 16 GiB |
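To make retention and tiering concrete, here is a minimal sketch of a policy function; the tier names and age cut-offs (7/30/365 days) are assumptions to tune against your own query patterns and budget.

```python
from datetime import datetime, timedelta, timezone

# Assumed tiers: fast storage for recent data, cheaper tiers for older data.
TIERS = [
    (timedelta(days=7), "hot"),     # fast, expensive storage for fresh data
    (timedelta(days=30), "warm"),   # compressed, cheaper storage
    (timedelta(days=365), "cold"),  # archival object storage
]

def storage_tier(event_time: datetime, now: datetime | None = None) -> str:
    """Return the storage tier an event belongs in, or 'delete' once past retention."""
    now = now or datetime.now(timezone.utc)
    age = now - event_time
    for max_age, tier in TIERS:
        if age <= max_age:
            return tier
    return "delete"

now = datetime.now(timezone.utc)
print(storage_tier(now - timedelta(days=3)))    # hot
print(storage_tier(now - timedelta(days=90)))   # cold
print(storage_tier(now - timedelta(days=400)))  # delete
```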
3.3 Faster Query Performance
To improve response times:
- Optimize queries
- Implement caching (see the sketch below)
- Use real-time data ingestion
The Observability Pipelines Worker scales automatically to use all available vCPUs.
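One cheap caching pattern is to memoize repeated dashboard queries for a short window. The sketch below is an illustration built on the standard library; the backend call is a placeholder and the 60-second TTL is an arbitrary assumption.

```python
import time
from functools import lru_cache

def _ttl_bucket(ttl_seconds: int = 60) -> int:
    """Coarse time bucket so cached results expire after roughly `ttl_seconds`."""
    return int(time.time() // ttl_seconds)

@lru_cache(maxsize=1024)
def _cached_query(query: str, bucket: int) -> tuple:
    print(f"running against backend: {query}")  # placeholder for the real query call
    return (("rows", 42),)

def run_query(query: str) -> tuple:
    """Serve repeated queries from cache for up to ~60 seconds."""
    return _cached_query(query, _ttl_bucket())

run_query("avg latency by service, last 5m")  # hits the backend
run_query("avg latency by service, last 5m")  # served from cache, no backend call
```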
3.4 Large-Scale Distributed Tracing
For effective distributed tracing:
- Use tools supporting serverless and containerized ecosystems
- Enrich trace data (see the sketch after this list)
- Automate problem baselining
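The section doesn't prescribe a specific tool; as one common open-source route, the OpenTelemetry Python SDK can enrich spans with business attributes so traces stay searchable at scale. A minimal sketch, assuming a console exporter and made-up attribute names:

```python
# Requires: pip install opentelemetry-api opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter for illustration; in practice you would export to an OTLP endpoint.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer(__name__)

def handle_checkout(order_id: str, customer_tier: str) -> None:
    # Enrich the span with business context so traces can be filtered by it later.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order.id", order_id)            # attribute names are assumptions
        span.set_attribute("customer.tier", customer_tier)
        # ... calls to downstream payment and inventory services would go here ...

handle_checkout("ord-1234", "enterprise")
```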
4. New Observability Techniques
Explore new methods for better insights and automation:
4.1 Using AIOps
AIOps combines machine learning with observability:
- Reduce alert noise by up to 90%
- Detect unusual patterns
- Provide early warnings
BigPanda uses ML to correlate alerts into high-level incidents.
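This isn't BigPanda's algorithm, but a simple statistical check illustrates the kind of building block behind "detect unusual patterns": flag a metric value that drifts several standard deviations away from its recent history. A minimal sketch with made-up numbers:

```python
from statistics import mean, stdev

def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it is more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold

# Example: error rates hover around 2%, then spike to 9%.
error_rates = [2.1, 1.9, 2.0, 2.2, 1.8, 2.0]
print(is_anomalous(error_rates, 2.1))  # False - within normal variation
print(is_anomalous(error_rates, 9.0))  # True  - an early warning worth surfacing
```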
4.2 Open-Source Tools
Flexible, cost-effective options include:
Tool | Key Features |
---|---|
Prometheus | Self-sufficient, fast querying with PromQL |
Grafana | Customizable dashboards, multiple data sources |
SigNoz | Integrates logs, traces, and metrics |
SigNoz uses ClickHouse for faster aggregation queries.
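To show what adopting a tool like Prometheus looks like from the application side, here is a minimal sketch using the official prometheus-client library; the metric names, labels, and port are assumptions.

```python
# Requires: pip install prometheus-client
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled", ["endpoint"])
LATENCY = Histogram("app_request_seconds", "Request latency in seconds")

def handle_request(endpoint: str) -> None:
    REQUESTS.labels(endpoint).inc()
    with LATENCY.time():                       # records how long the block takes
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at http://localhost:8000/metrics
    while True:
        handle_request("/checkout")
```

Prometheus scrapes the /metrics endpoint, and a PromQL query such as `rate(app_requests_total[5m])` can then drive a Grafana panel.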
4.3 Custom Dashboards
Create tailored dashboards:
- Choose a platform (e.g., Google Cloud Console, Grafana)
- Select relevant widgets
- Organize the layout
- Add filters for troubleshooting
Share dashboards for better collaboration.
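Grafana dashboards are ultimately JSON, which makes tailored dashboards easy to version-control and share. Below is a sketch-level, one-panel definition built in Python; the fields follow Grafana's dashboard JSON model but are not exhaustive, and the PromQL query reuses the assumed metric from the earlier Prometheus example.

```python
import json

# Minimal one-panel dashboard in Grafana's JSON model (a sketch, not exhaustive).
dashboard = {
    "title": "Checkout Service Overview",
    "panels": [
        {
            "title": "Request rate by endpoint",
            "type": "timeseries",
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
            "targets": [
                {"expr": "rate(app_requests_total[5m])", "refId": "A"}  # assumed metric
            ],
        }
    ],
    "time": {"from": "now-6h", "to": "now"},
}

with open("checkout_dashboard.json", "w") as f:
    json.dump(dashboard, f, indent=2)
print("Wrote checkout_dashboard.json - import it via Grafana's 'Import dashboard' flow")
```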
5. Tips for Large-Scale Observability
Key strategies for scaling:
5.1 Consistent Data Formats
Standardize your logging format:
- Create a uniform structure
- Define clear fields
- Use consistent naming conventions
Wise Payments built in-house expertise to keep its data formats consistent.
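A uniform structure is easiest to enforce in code. The sketch below uses Python's standard logging module with a JSON formatter so every line shares the same fields; the field and service names are assumptions.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit every log line with the same structure and field names."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "checkout",  # assumed service name
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment authorized")
# -> {"timestamp": "...", "level": "INFO", "service": "checkout", "message": "payment authorized"}
```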
5.2 Smart Alerting
Reduce alert noise:
Alert Type | Action |
---|---|
Critical Errors | Immediate notifications |
Warning Levels | Daily or weekly digests |
Informational | Log for analysis, no real-time alerts |
BigPanda helped companies reduce alert noise by over 90%.
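The routing policy in the table above is straightforward to encode. A minimal sketch, where the notification and digest mechanisms are placeholders:

```python
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"

digest_queue: list[str] = []  # flushed daily or weekly by a separate job (assumed)

def route_alert(severity: Severity, message: str) -> None:
    """Route an alert according to the policy in the table above."""
    if severity is Severity.CRITICAL:
        print(f"PAGE ON-CALL NOW: {message}")  # stand-in for a paging/chat integration
    elif severity is Severity.WARNING:
        digest_queue.append(message)           # batched into a daily or weekly digest
    else:
        print(f"logged for later analysis: {message}")

route_alert(Severity.CRITICAL, "checkout error rate above 5%")
route_alert(Severity.WARNING, "disk usage at 82% on data node 3")
route_alert(Severity.INFO, "deployment completed for api v2.3")
```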
5.3 Building an Observability Culture
Create a culture of observability:
- Provide training
- Encourage collaboration
- Make data accessible
"Observability is defined as a concept, a goal, and direction that will help your organization to gain the most insight from the data you can collect." - Cribl
Conclusion
Key strategies for scaling observability:
- Identify scaling issues early
- Set clear goals
- Improve data handling
- Embrace new techniques
- Standardize data formats
- Implement smart alerting
- Build an observability culture
Keep improving as your needs change:
Action | Benefit |
---|---|
Regular reviews | Ensure alignment with goals |
Incorporate feedback | Learn from incidents |
Stay updated on new tools | Leverage advancements |