Last Tuesday an agent I shipped decided, mid-conversation, that the user's name was "Export CSV." It wasn't. Seven turns earlier, a tool result had come back with a quoted header row where a username field should have been, and the model silently absorbed that string as ground truth. Every subsequent turn degraded quietly — apologetic tone, subtle hallucinations, a refusal that referenced "your account, Export CSV."
The per-call logs looked fine. The latencies were green. Token usage was nominal. The only way to see the break was to reconstruct the whole conversation as a causal graph and follow the poison forward.
This is the 8-turn problem. It's the single most expensive class of bug I ship, and most of the observability stacks I've tried were built for a world where requests are independent. They aren't.
## Why request-level monitoring lies
Traditional APM assumes a request is a closed unit: it came in, it did something, it came out, and if you aggregate p99 and error rate you know whether the system is healthy. That model was fine for stateless services. It's openly broken for agents.
An agent request carries state that isn't in the HTTP payload. It carries the conversation. It carries the tool results that previous turns wrote into context. It carries the model's own prior outputs, which are now training the next inference. A turn that looks locally correct — valid JSON, successful tool call, reasonable response — can be the exact moment your agent quietly goes off the rails for the next 40 minutes of user conversation.
I watch three numbers more than I watch latency:
- Turn-over-turn intent drift: does turn N still match the user's original ask?
- Tool result contamination rate: how often does a tool response contain strings that look like instructions?
- Session success rate, not request success rate: did the user actually get what they came for?
None of those are visible from a metrics dashboard that aggregates individual calls. You need traces that span the whole session, and you need them structured so you can walk them backward from the failure.
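The second number, tool result contamination, is the easiest to start measuring. A crude heuristic goes a long way: scan tool responses for instruction-shaped strings before they enter the context window. A minimal sketch, where the patterns are illustrative starting points rather than a vetted list:

```python
import re

# Heuristic patterns for "strings that look like instructions" inside
# tool output. Illustrative only; tune for your own tool surface.
INSTRUCTION_PATTERNS = [
    re.compile(r"\bignore (all|any|previous) instructions\b", re.I),
    re.compile(r"\byou are now\b", re.I),
    re.compile(r"\bsystem prompt\b", re.I),
    re.compile(r"^\s*(assistant|system)\s*:", re.I | re.M),
]

def looks_contaminated(tool_result: str) -> bool:
    """True if a tool response contains instruction-shaped text."""
    return any(p.search(tool_result) for p in INSTRUCTION_PATTERNS)

def contamination_rate(tool_results: list[str]) -> float:
    """Fraction of tool responses flagged across a session."""
    if not tool_results:
        return 0.0
    flagged = sum(looks_contaminated(r) for r in tool_results)
    return flagged / len(tool_results)
```

Emit the rate as a session-level metric and alert when it moves; the point is the trend line, not any single catch.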
## What a useful trace actually looks like
The OpenTelemetry GenAI SIG has been converging on gen_ai.* semantic conventions, which is good. The prevailing shape: each tool call, each LLM invocation, and each retrieval is a child span of the turn, which is itself a child of the session. Do that, and your trace tree tells the story of the reasoning chain.
A few things people get wrong here:
Don't put prompts in span attributes. Attributes are indexed, have size caps, and leak straight into your observability backend as PII. Use span events. Events can be sampled, redacted, or dropped at the Collector without touching app code. This one change will save you a compliance conversation later.
Parent spans by the turn, not just the call. If every LLM call is a root span, you lose the conversational structure. The causal chain from turn 3 to turn 7 is the thing you actually want to trace. If you're building this yourself, each session gets a trace_id, each turn gets a span_id under it, and tool calls and inferences nest under the turn.
Emit a "decision" span. The LLM call itself is one span, but what the agent did with the output — picked a tool, rephrased, escalated — is a different concern and worth its own span. This is where drift shows up first.
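Put together, the layout looks like this. A hand-rolled sketch of the parenting scheme, not the OTel SDK itself; the span kinds and the gen_ai.content.prompt event name follow the GenAI conventions in spirit, but check the current spec before relying on exact names:

```python
from dataclasses import dataclass, field
import itertools

_ids = itertools.count(1)

@dataclass
class Span:
    name: str
    kind: str                    # "session" | "turn" | "llm" | "tool" | "decision"
    parent: "Span | None" = None
    span_id: int = field(default_factory=lambda: next(_ids))
    events: list = field(default_factory=list)    # prompts/completions go here
    children: list = field(default_factory=list)

    def child(self, name: str, kind: str) -> "Span":
        s = Span(name, kind, parent=self)
        self.children.append(s)
        return s

    def add_event(self, name: str, body: str) -> None:
        # Prompts live in events, not attributes, so the Collector
        # can redact or drop them without touching app code.
        self.events.append((name, body))

# One session trace: session -> turn -> {llm, decision, tool} spans.
session = Span("session:abc123", "session")
turn = session.child("turn:3", "turn")
llm = turn.child("chat gpt-4o", "llm")
llm.add_event("gen_ai.content.prompt", "<redacted-at-collector>")
decision = turn.child("decide:pick_tool", "decision")
tool = turn.child("tool:lookup_user", "tool")
```

Walking `session.children` depth-first reproduces the conversation in causal order, which is exactly the reconstruction the opening anecdote needed.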
At RapidClaw we default to this layout and bolt on session-level rollups so you can ask "which turn did this fail at?" without scrolling through 40 spans.
## The debugging workflow that actually works
When a user reports an agent did something weird, the temptation is to grep logs for the error. There's usually no error. Here's the loop I run instead:
1. Pull the full session trace. Not the failing turn — the whole conversation, from the first user message forward.
2. Diff the system state between turns. What changed in memory, in the scratchpad, in the retrieved context? This is where you find the poisoned field.
3. Replay from the suspected turn with the same tool responses. Most agent frameworks let you rehydrate a session; if yours doesn't, you need to fix that before anything else.
4. Mutate one variable at a time. Change the tool response. Change the model. Change the system prompt. Bisect until the behavior flips.
5. Write the regression test at the session level. Not a unit test on a single call — a full conversation fixture with expected final state.
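Step 2 doesn't need tooling to start; a field-level diff over per-turn state snapshots is enough to surface the poisoned field. A sketch, assuming you snapshot agent state (memory, scratchpad, retrieved context) as a dict at the end of each turn:

```python
def diff_turn_state(before: dict, after: dict) -> dict:
    """Field-level diff between two turn snapshots.
    Returns {field: (old_value, new_value)} for every changed field."""
    changed = {}
    for key in before.keys() | after.keys():
        old, new = before.get(key), after.get(key)
        if old != new:
            changed[key] = (old, new)
    return changed

def walk_session(turn_states: list[dict]) -> list[dict]:
    """Diff every adjacent pair of turns; the poisoned field shows up
    as the turn where a value first mutates into something wrong."""
    return [diff_turn_state(a, b) for a, b in zip(turn_states, turn_states[1:])]
```

Run it over the session from the opening anecdote and the turn where `user` flips to "Export CSV" jumps straight out.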
Step 3 is where most teams stall. If you can't replay a session deterministically, you're guessing. The replay and re-simulate workflow is the single feature I'd build first in any agent observability tool, including ones I don't run.
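The core of a replay harness is small: capture tool responses during the live session, then feed them back verbatim so the agent sees byte-identical inputs. A minimal sketch; `agent_step(msg, tools)` is a hypothetical single-turn entry point you'd adapt to your framework:

```python
class RecordedTools:
    """Replays tool responses captured from the original session.
    A per-tool queue keeps ordering honest across repeated calls."""
    def __init__(self, recorded: dict[str, list[str]]):
        self._queues = {name: list(calls) for name, calls in recorded.items()}

    def call(self, name: str, **kwargs) -> str:
        if not self._queues.get(name):
            raise RuntimeError(f"no recorded response left for tool {name!r}")
        return self._queues[name].pop(0)

def replay_session(agent_step, user_messages, tools, start_turn=0):
    """Re-run a session from start_turn with canned tool responses.
    Deterministic inputs are what make the bisection in step 4 possible."""
    outputs = []
    for msg in user_messages[start_turn:]:
        outputs.append(agent_step(msg, tools))
    return outputs
```

To bisect, swap one recorded response for a mutated one and re-run; everything else stays fixed, so a behavior flip points at exactly that variable.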
## Practical hygiene for small teams
I run a small operation — think five agents in production, not five hundred — and the infrastructure choices reflect that. A few things that have held up at this scale:
One OTLP pipeline, everything flows through it. Don't run a separate tracing stack for agents. Emit gen_ai.* spans into the same Collector your regular services use, then branch at the exporter if you want a specialized backend for LLM-specific analysis. Vendor lock-in is a real risk and OTel is the escape hatch.
Sample aggressively on success, keep everything on failure. Full-conversation traces are expensive. A 1% tail-based sampler plus 100% retention for sessions that flagged any of: tool error, user thumbs-down, abnormal turn count, or model refusal — that gives you the signal without drowning.
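In production this policy belongs in the Collector's tail-sampling stage, but it's worth seeing the decision itself in plain code. A sketch of the same keep/drop rule, made once per completed session; the flag names are this article's examples, not a standard:

```python
import random

# Any of these flags forces full retention for the session.
FAILURE_FLAGS = {
    "tool_error",
    "user_thumbs_down",
    "abnormal_turn_count",
    "model_refusal",
}

def keep_session(flags: set[str], success_rate: float = 0.01, rng=random) -> bool:
    """Tail-based keep/drop decision: retain 100% of flagged sessions,
    sample everything else at success_rate (1% by default)."""
    if flags & FAILURE_FLAGS:
        return True
    return rng.random() < success_rate
```

The decision is "tail-based" because it runs after the session ends, when all the flags are known — which is the only point at which "any turn refused" is answerable.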
Tag sessions with the outcome, not just the request. Instrument your app to send a session-end event with "did the user get what they wanted?" If you can't answer that, instrument it first. Every other metric is downstream.
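The session-end event itself is tiny. A sketch, where `span_emit` stands in for whatever event-emission call your telemetry client exposes (hypothetical; the field names are illustrative):

```python
def end_session(span_emit, session_id: str, goal_met: bool, turns: int) -> None:
    """Emit one session-end event carrying the outcome, so every
    downstream metric can be cut by 'did the user get what they wanted?'"""
    span_emit("session.end", {
        "session.id": session_id,
        "session.goal_met": goal_met,
        "session.turn_count": turns,
    })
```
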
Treat evals and tracing as the same system. Evaluation runs are just traces with known expected outputs. The moment you split them into different tools you start writing glue code that never gets maintained.
## The uncomfortable part
Most agent reliability issues I've seen in the last six months aren't model issues. They're context management issues. The model is doing its job — taking what's in the window and producing a plausible next token. The bug is upstream, in what we let accumulate in that window.
Observability for agents is, practically, observability for the context window over time. If your tooling can't show you how a single field mutated across seven turns, it can't help you debug the 8-turn problem. And the 8-turn problem is most of the bugs.
If you want to see how we handle session-level tracing in practice, the RapidClaw quickstart walks through instrumenting a LangGraph agent in about ten minutes. But the principle matters more than the tool: trace the session, not the request, and save yourself the compliance conversation by keeping prompts out of attributes.
Your agents are going to hallucinate. The question is whether you find out at turn 3 or turn 73.