2:14am. Checkout P99 is on fire. Slack wants a root cause in 10 minutes.
You open Grafana. Jaeger. Logs. Topology. Notes doc. Twenty minutes later you think you know, but you're still not sure if it's order-service, MySQL, or that sketchy downstream RPC.
Metrics didn't fail you. The workflow did.
We built DataBuff, an open-source, OpenTelemetry-native APM, to fix that: one question in, evidence-backed root cause out. Not a chat box glued to dashboards. An AI Brain that dispatches specialists (metrics, traces, topology, inspection) and merges real telemetry into an incident-ready answer.
Try it first (5 minutes)
curl -fsSL https://databuff.ai/databuff/ai-apm-install.sh | bash
curl -fsSL https://databuff.ai/databuff/ai-apm-demo-install.sh | bash
Open http://YOUR_HOST:27403 → log in as admin / Databuff@123 → add an LLM key under Settings → AI model.
Paste this:
Which services had errors in the last hour? For the slowest one,
show me a typical trace, the slowest span, and what I should do first.
That's the product. Below: architecture, agent squad, and a real demo output.
Before vs after
Before (tab safari)
- 5–6 tools, zero shared context
- You translate the question into 12 queries
- Senior engineer stitches the story by hand
- Slack gets guesses and screenshots
After (DataBuff)
- One UI, unified Doris storage
- You ask in plain English
- AI Brain + agents query live metrics, traces, topology
- Slack gets root cause, TraceId, P0/P1 actions
You already pay for observability. You're still paying in human attention at 2am.
AI-native ≠ ChatGPT on Grafana
Most "AI observability" in 2024:
- Only sees what you paste
- Cannot query your trace store
- Guesses under incident pressure
DataBuff agents call real tools. Every claim should trace to evidence.
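What "every claim should trace to evidence" can look like in code: a minimal sketch, not DataBuff's actual data model. The `Evidence` and `Claim` types and the `backed_claims` helper are hypothetical names made up for illustration. The idea is simply that a claim with no attached evidence never reaches the answer.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Evidence:
    source: str     # hypothetical label, e.g. "metrics", "trace", "topology"
    reference: str  # e.g. a TraceId or the query that produced the number


@dataclass
class Claim:
    statement: str
    evidence: List[Evidence] = field(default_factory=list)


def backed_claims(claims: List[Claim]) -> List[Claim]:
    """Keep only claims that cite at least one piece of evidence."""
    return [claim for claim in claims if claim.evidence]


claims = [
    Claim(
        "Slowest span is the MySQL SELECT in service-b",
        [Evidence("trace", "12e3a078bdbe183d567a2f7e888fe7b3")],
    ),
    Claim("It is probably the network"),  # a guess: no evidence attached
]
kept = backed_claims(claims)
```

Here only the first claim survives, because it points at a concrete trace.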
The agent squad
- AI Brain: plans, dispatches, synthesizes
- Smart query: P99, error rate, QPS from Doris
- Trace analyst: slow traces, hottest spans
- Topology: upstream/downstream blast radius
- Inspection: sustained pain vs one-off spikes
- Report: incident summary you can forward
Apache 2.0 · self-hosted · data stays on your network.
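The dispatch-and-merge pattern behind the squad can be sketched in a few lines. This is an illustration under loud assumptions: the agent functions are stubs returning canned strings, and `ai_brain`, `report`, and `SPECIALISTS` are hypothetical names, not DataBuff's API. Real agents would query live telemetry instead.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical specialist stubs. Each takes the question and returns a finding.
def smart_query(question: str) -> str:
    return "metrics: P99, error rate, QPS"


def trace_analyst(question: str) -> str:
    return "traces: slow traces, hottest spans"


def topology(question: str) -> str:
    return "topology: upstream/downstream blast radius"


def inspection(question: str) -> str:
    return "inspection: sustained pain vs one-off spikes"


SPECIALISTS = {
    "smart_query": smart_query,
    "trace_analyst": trace_analyst,
    "topology": topology,
    "inspection": inspection,
}


def ai_brain(question: str) -> dict:
    """Dispatch every specialist in parallel, then merge findings by agent name."""
    with ThreadPoolExecutor(max_workers=len(SPECIALISTS)) as pool:
        futures = {
            name: pool.submit(agent, question)
            for name, agent in SPECIALISTS.items()
        }
        # result() blocks until that agent has finished.
        return {name: future.result() for name, future in futures.items()}


def report(findings: dict) -> str:
    """Render merged findings as a forwardable summary, one line per agent."""
    return "\n".join(f"[{name}] {text}" for name, text in findings.items())


findings = ai_brain("Why did service-a slow down in the last 30 minutes?")
summary = report(findings)
```

Threads fit here because each agent mostly waits on I/O (queries against storage), so running them side by side cuts wall-clock time without extra machinery.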
Architecture: 3 components, not 13
Data flow: OTLP apps → Ingest (4317/4318) → Apache Doris → Web UI (:27403) + AI Brain
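To point an app at the ingest, the standard OpenTelemetry SDK environment variables should be enough. A minimal sketch, assuming the ingest follows the usual OTLP convention of gRPC on 4317 and HTTP/protobuf on 4318 (the post lists the ports but not which is which). `otlp_env` is a hypothetical helper; the variable names themselves come from the OpenTelemetry spec.

```python
def otlp_env(host: str, service_name: str, protocol: str = "grpc") -> dict:
    """Build standard OTel SDK env vars for an OTLP endpoint on `host`."""
    # Assumption: conventional OTLP ports (4317 = gRPC, 4318 = HTTP/protobuf).
    ports = {"grpc": 4317, "http/protobuf": 4318}
    if protocol not in ports:
        raise ValueError(f"unsupported OTLP protocol: {protocol!r}")
    return {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_EXPORTER_OTLP_PROTOCOL": protocol,
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"http://{host}:{ports[protocol]}",
    }


env = otlp_env("YOUR_HOST", "service-a")
for key, value in env.items():
    print(f"export {key}={value}")
```

Export the printed variables before starting an OTel-instrumented service and its telemetry should flow to the ingest without code changes.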
Legacy stack vs DataBuff
- Components: 10+ → 3
- RAM: 16 GB+ → ~8 GB
- First deploy: 1–2 days → ~5 min (one install script)
- Storage: siloed → unified Doris
- AI: bolt-on chat → multi-agent native
OpenTelemetry in. Unified storage. Agents on top.
Product at a glance
Demo walkthrough
After demo install you get service-a → service-b with real traces.
1. Service health
2. Ask the hard question
Analyze why service-a calls slowed in the last 30 minutes.
Find highest-P99 endpoint, a typical slow trace, slowest span,
root cause (app vs DB vs downstream), impact, and P0 fixes.
3. Watch agents work in parallel
4. Real demo output
Smart query
service-a GET /demo/checkout ~240ms avg (17 reqs)
service-b HTTP 100ms + Dubbo 80ms ~ 75% of latency
Inspection
Sustained slowness, not a spike; service-a JVM/errors normal
Trace 12e3a078bdbe183d567a2f7e888fe7b3
Slowest span: service-b -> MySQL SELECT demo_order (~45ms)
Root cause
Downstream service-b + slow SQL (not service-a)
P0: dedupe service-b double calls; fix MySQL slow queries
P1: fix InsufficientStockException on inventory path
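The ~75% in the smart query output is plain arithmetic on the numbers above: two service-b calls (100ms HTTP + 80ms Dubbo) against a ~240ms average. A quick sketch to check it, with `downstream_share` as a hypothetical helper and the approximate demo figures treated as exact:

```python
def downstream_share(total_ms: float, downstream_ms: dict) -> float:
    """Fraction of end-to-end latency spent in the listed downstream calls."""
    return sum(downstream_ms.values()) / total_ms


calls = {"service-b HTTP": 100, "service-b Dubbo": 80}
share = downstream_share(240, calls)
print(f"{share:.0%}")  # 75%
```

180ms of 240ms sits in service-b, which is why the root cause points downstream rather than at service-a.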
5. Topology proof
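Blast radius is a graph walk: start at the slow dependency and follow call edges backwards. A minimal sketch, assuming only the demo's edges (service-a → service-b, service-b → MySQL); `CALLS` and `upstream_of` are hypothetical names, not how DataBuff stores topology.

```python
from collections import deque

# Caller -> callees, taken from the demo: service-a calls service-b,
# and service-b queries MySQL.
CALLS = {"service-a": ["service-b"], "service-b": ["MySQL"]}


def upstream_of(target: str, calls: dict) -> list:
    """Return every service that directly or transitively calls `target`."""
    # Invert the edges so we can walk from callee back to caller.
    callers = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)

    seen, queue = [], deque([target])
    while queue:
        node = queue.popleft()
        for caller in callers.get(node, []):
            if caller not in seen:  # also guards against call cycles
                seen.append(caller)
                queue.append(caller)
    return seen


affected = upstream_of("MySQL", CALLS)
```

For the demo graph, a slow MySQL reaches service-b first and service-a second, which matches the root cause above.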
6. Call graph
vs SigNoz / Datadog
DataBuff wins on: multi-agent AI RCA built-in, OTel-native, self-host in minutes, you own the data.
SigNoz wins on: mature classic OSS APM, huge community.
Datadog wins on: SaaS polish, enterprise integrations.
We're betting the next moat is orchestrated agents on OTel data, for teams without a 24/7 SRE bench.
Who should try this?
- Running OTel but still living in 5 tabs during incidents
- Self-hosting (finance, gov, privacy-sensitive)
- Want agents that use tools, not vibes
- Apache 2.0 you can audit
Your turn
curl -fsSL https://databuff.ai/databuff/ai-apm-install.sh | bash
curl -fsSL https://databuff.ai/databuff/ai-apm-demo-install.sh | bash
GitHub: https://github.com/databufflabs/databuff
- What would you ask the agent squad first? (comment below)
- What OTel signals are we missing?
- Star the repo if 2am tab-hopping should die.
Built in public · Apache 2.0





















