I kept seeing the same pattern across production agent systems. An agent would start processing a document, extract the key clauses from the first page, then on the next step ask for the document title again. It was working. It just had no memory.
That experience taught me something most tutorials get backwards. The hard part of building reliable AI agents isn't prompt engineering. It's memory. Specifically, how you manage context across multiple LLM calls when your agent needs to reason about a conversation, a document, or a multi-step task.
Here's what I've learned from shipping production agents that don't forget.
The Context Window Trap
The most common mistake I see is treating the LLM's context window as your agent's memory. It's not. It's a scratchpad that gets wiped clean on every new call.
Suppose your agent processes a 50-page legal document, extracts key clauses, then generates a summary. If you shove the full document into the system prompt on every call, you'll hit token limits fast and pay for tokens you already processed. Worse, the model's attention dilutes across irrelevant text. I've seen agents miss critical clauses simply because the relevant sentence was buried 40 pages deep in the prompt.
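To make the cost concrete, here is a back-of-the-envelope sketch. Every figure in it is an illustrative assumption rather than a measurement: roughly 500 tokens per page, 10 LLM calls per run, and 5 retrieved chunks of about 300 tokens each. The helper name `totalPromptTokens` is hypothetical.

```typescript
// Rough estimate of context tokens sent across one agent run.
// All figures below are illustrative assumptions, not measured values.
interface RunShape {
  steps: number;         // number of LLM calls in the run
  tokensPerStep: number; // context tokens sent on each call
}

function totalPromptTokens(run: RunShape): number {
  return run.steps * run.tokensPerStep;
}

// Assumption: a 50-page document at roughly 500 tokens per page.
const fullDocumentTokens = 50 * 500;

// Strategy A: resend the whole document on every call.
const resendEverything = totalPromptTokens({
  steps: 10,
  tokensPerStep: fullDocumentTokens
});

// Strategy B: send only 5 retrieved chunks of ~300 tokens on each call.
const retrieveOnly = totalPromptTokens({
  steps: 10,
  tokensPerStep: 5 * 300
});

console.log(resendEverything); // 250000
console.log(retrieveOnly);     // 15000
```

Under these assumed numbers, resending the document costs about 17 times more context tokens than retrieving per step, before counting any quality loss from diluted attention.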
The fix is obvious in hindsight: separate your agent's working memory (what it's doing right now) from its long-term memory (what it already processed). But most frameworks don't enforce this distinction, so teams build agents that work in demos and collapse in production.
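One way to make that separation explicit in code is sketched below. The names `LongTermMemory`, `WorkingMemory`, and `buildWorkingMemory` are hypothetical, and the keyword lookup is only a stand-in for the vector search a real system would use.

```typescript
// Long-term memory: everything the agent has already processed,
// kept outside the prompt.
interface MemoryRecord {
  documentId: string;
  text: string;
}

class LongTermMemory {
  private records: MemoryRecord[] = [];

  add(record: MemoryRecord): void {
    this.records.push(record);
  }

  // Stand-in lookup by keyword; a real system would use vector similarity.
  search(keyword: string, limit = 3): MemoryRecord[] {
    const needle = keyword.toLowerCase();
    return this.records
      .filter(r => r.text.toLowerCase().includes(needle))
      .slice(0, limit);
  }
}

// Working memory: only what the current step needs.
// It is rebuilt for every LLM call and never carried over.
interface WorkingMemory {
  task: string;
  context: string[];
}

function buildWorkingMemory(
  task: string,
  keyword: string,
  store: LongTermMemory
): WorkingMemory {
  return { task, context: store.search(keyword).map(r => r.text) };
}
```

The point of the design is that nothing enters the prompt unless `buildWorkingMemory` put it there for the current task, so the prompt size stays bounded no matter how much the agent has already processed.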
Consider a document analysis pipeline processing a batch of contracts. The agent correctly identifies risks in the first few documents, then starts hallucinating clauses on later ones. The root cause isn't the model. It's context pollution. Leftover data from earlier documents bleeds into the prompt for the current one. I've watched teams chase prompt tweaks for weeks when the real fix was memory architecture.
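A minimal sketch of the guard against that kind of bleed is below, assuming each stored chunk carries the ID of the document it came from. The helpers `scopeToDocument` and `buildPromptsForBatch` are hypothetical names used for illustration.

```typescript
interface StoredChunk {
  documentId: string;
  text: string;
}

// Keep only chunks that belong to the document currently being analyzed.
function scopeToDocument(
  chunks: StoredChunk[],
  documentId: string
): StoredChunk[] {
  return chunks.filter(c => c.documentId === documentId);
}

// Build one prompt context per document. The context array is created
// fresh inside the loop, so nothing carries over between documents.
function buildPromptsForBatch(
  documentIds: string[],
  allChunks: StoredChunk[]
): Record<string, string> {
  const prompts: Record<string, string> = {};
  for (const id of documentIds) {
    const context = scopeToDocument(allChunks, id).map(c => c.text);
    prompts[id] = context.join("\n");
  }
  return prompts;
}
```

The failure this prevents is usually an accumulator declared outside the loop, or a retrieval query with no document filter, either of which lets contract three inherit text from contract one.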
What Actually Works: Vector Store Compaction
The production pattern I now use for every agent system is vector store compaction with provenancing.
Here's the idea. Instead of dumping raw text into the context window, you chunk and embed your documents, store them in a vector database, and retrieve only the relevant chunks for each step. But naive retrieval has its own failure mode: you pull 20 chunks, 3 are relevant, and the model wastes attention on noise.
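The chunk, embed, and retrieve steps can be sketched as follows. This is a simplified illustration, not a production retriever: chunking is by character count, the vectors are assumed to be precomputed by whatever embedding model you use, and the names `chunkText`, `cosineSimilarity`, and `topK` are hypothetical. A vector database does the ranking for you; the code only shows what that ranking amounts to.

```typescript
// Split text into fixed-size character chunks with some overlap,
// so a sentence cut at a boundary still appears whole in one chunk.
function chunkText(text: string, size: number, overlap: number): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("require size > 0 and 0 <= overlap < size");
  }
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

interface EmbeddedChunk {
  text: string;
  vector: number[]; // assumed to come from your embedding model
}

// Return the k chunks most similar to the query vector, best first.
function topK(
  queryVector: number[],
  chunks: EmbeddedChunk[],
  k: number
): EmbeddedChunk[] {
  return chunks
    .map(chunk => ({ chunk, score: cosineSimilarity(queryVector, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map(scored => scored.chunk);
}
```

Note that `topK` always returns k results if k chunks exist, whether or not they are actually relevant. With k set generously, the tail of that ranking is mostly noise.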
Compaction solves this. After retrieval, you run a lightweight LLM call that filters and deduplicates the chunks, keeping only what's needed for the current reasoning step. The output is a compact, deduplicated context block.
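import OpenAI from "openai";

// The client reads OPENAI_API_KEY from the environment by default.
const openai = new OpenAI();

// Assumed minimal chunk shape; extend it with whatever metadata your store returns.
interface Chunk {
  text: string;
}

// Filters and deduplicates retrieved chunks for the current reasoning step.
async function compactContext(
  chunks: Chunk[],
  currentTask: string
): Promise<string> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `You are a context compaction tool. Given a list of text chunks and the agent's current task, return only the chunks that are directly relevant. Remove duplicates. Merge overlapping information. Output as a single, deduplicated text block.`
      },
      {
        role: "user",
        content: JSON.stringify({
          task: currentTask,
          chunks: chunks.map(c => c.text)
        })
      }
    ],
    response_format: { type: "text" }
  });
  // content is typed as string | null, so fall back to an empty string.
  return response.choices[0].message.content ?? "";
}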
This single function dramatically cut hallucination rates in the systems I've shipped. The key insight is that compaction happens on every reasoning step, not once at the start. As the agent's task changes, the relevant context changes too. A stale compaction is almost as bad as no compaction.
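Compacting on every step, rather than once up front, can be sketched as a loop like the one below. The retriever, compactor, and step runner are injected as plain functions so the loop itself has no dependencies; `runSteps` and the three type names are hypothetical, and the compactor here takes plain strings for brevity where `compactContext` takes chunk objects.

```typescript
type Retriever = (query: string) => Promise<string[]>;
type Compactor = (chunks: string[], task: string) => Promise<string>;
type StepRunner = (task: string, context: string) => Promise<string>;

// Runs each task with context that is retrieved and compacted
// for that task alone. Nothing from a previous step's context is reused.
async function runSteps(
  tasks: string[],
  retrieve: Retriever,
  compact: Compactor,
  runStep: StepRunner
): Promise<string[]> {
  const outputs: string[] = [];
  for (const task of tasks) {
    const chunks = await retrieve(task);         // fresh retrieval per step
    const context = await compact(chunks, task); // fresh compaction per step
    outputs.push(await runStep(task, context));
  }
  return outputs;
}
```

The trade-off is one extra lightweight LLM call per step. In exchange, the context for "generate summary" is never the leftover context from "extract clauses".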
Provenancing: The Anti-Black Box Pattern
The second problem I kept hitting was the black box. An agent would make a decision, and I had no way to trace why. Did it find the relevant information in the vector store? Did it hallucinate a fact? Was it acting on stale context from an earlier step?
Provenancing means every piece of information the agent uses carries a source identifier. When the agent retrieves a chunk, the chunk's metadata includes its document ID, section heading, and chunk index. When the agent makes a claim, it cites the source.
I built this into every agent pipeline after a production incident where an agent was generating recommendations based on the wrong location data. It had pulled a chunk from a different geographic area's document during retrieval, and there was no audit trail to catch it. That was the moment provenancing became non-negotiable for me.
interface ProvenancedChunk {
  text: string;
  source: {
    documentId: string;
    section: string;
    chunkIndex: number;
  };
}

// Every retrieval response includes provenance metadata
async function retrieveWithProvenance(
  query: string,
  topK: number = 5
): Promise<ProvenancedChunk[]> {
  const results = await vectorStore.similaritySearch(query, topK);
  return results.map(r => ({
    text: r.pageContent,
    source: {
      documentId: r.metadata.docId,
      section: r.metadata.section,
      chunkIndex: r.metadata.chunkIndex
    }
  }));
}
When the agent generates output, the system prompt instructs it to include source references inline. This isn't for the end user. It's for debugging. When something goes wrong, you can trace the agent's reasoning back to the exact chunk it used.
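Here is one way the inline references and the trace-back check might look. The tag format `[documentId#chunkIndex]` and the helper names are assumptions for illustration; any format works as long as it is unambiguous and machine-checkable.

```typescript
interface SourcedChunk {
  text: string;
  source: { documentId: string; section: string; chunkIndex: number };
}

// Hypothetical tag format: [documentId#chunkIndex]
function sourceTag(chunk: SourcedChunk): string {
  return `[${chunk.source.documentId}#${chunk.source.chunkIndex}]`;
}

// Prefix each chunk with its tag so the model can cite it inline.
function formatContext(chunks: SourcedChunk[]): string {
  return chunks.map(c => `${sourceTag(c)} ${c.text}`).join("\n");
}

// Pull every distinct tag out of the model's output.
function extractCitations(output: string): string[] {
  const matches: string[] = output.match(/\[[^\[\]\s]+#\d+\]/g) ?? [];
  return matches.filter((tag, i) => matches.indexOf(tag) === i);
}

// Citations that do not match any retrieved chunk. A non-empty result
// means the agent cited something it was never given.
function findUnknownCitations(
  output: string,
  retrieved: SourcedChunk[]
): string[] {
  const known = retrieved.map(sourceTag);
  return extractCitations(output).filter(tag => !known.includes(tag));
}
```

Logging the result of `findUnknownCitations` on every step gives you the audit trail: an empty array means every cited source was actually in the context, and anything else is a lead worth investigating.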
Eval-Driven Testing Catches the Silent Failures
The hardest lesson was that memory bugs are silent. The agent doesn't crash. It just produces worse results over time. A document analysis agent might miss the third clause in a contract. An outreach agent might reference outdated information. These failures compound across steps.
The only reliable catch is eval-driven testing with synthetic data. I now maintain a test suite that simulates long-running agent sessions and checks for memory consistency.
describe("Agent memory consistency", () => {
  it("should not forget user context across steps", async () => {
    const session = createTestSession({
      userName: "Alice",
      industry: "healthcare",
      preferences: { format: "detailed" }
    });

    const step1 = await agent.process(session, "analyze document");
    const step2 = await agent.process(session, "generate summary");

    // The summary should reference Alice and healthcare
    expect(step2.output).toContain("Alice");
    expect(step2.output).toContain("healthcare");
    expect(step2.output).toContain("detailed");
  });

  it("should not hallucinate data from unrelated chunks", async () => {
    const session = createTestSession({
      documents: [
        { id: "doc-1", content: "Revenue: $10M" },
        { id: "doc-2", content: "Revenue: $50M" }
      ]
    });

    // Agent should only reference doc-1 when asked about it
    const result = await agent.answer(session, "What is the revenue in doc-1?");
    expect(result.citations).toContain("doc-1");
    expect(result.citations).not.toContain("doc-2");
  });
});
These tests catch the failures that feel like the model is "getting dumber" over time. It's not. The context is degrading, and the agent is making decisions on incomplete or polluted information.
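To see degradation over a long session rather than at a single step, one option is to plant synthetic facts at the start and score how many survive in each step's output. The sketch below assumes substring matching is a good enough signal for a first pass; `makeSyntheticFacts`, `recallScore`, and `firstDegradedStep` are hypothetical helpers, and the 0.75 threshold in the usage note is arbitrary.

```typescript
interface SyntheticFact {
  key: string;
  value: string;
}

// Generate distinct marker values that are unlikely to occur by accident.
// The "-end" suffix keeps "token-1-end" from matching inside "token-10-end".
function makeSyntheticFacts(count: number): SyntheticFact[] {
  return Array.from({ length: count }, (_, i) => ({
    key: `fact-${i}`,
    value: `token-${i}-end`
  }));
}

// Fraction of planted facts whose value appears in the output.
function recallScore(output: string, facts: SyntheticFact[]): number {
  if (facts.length === 0) return 1;
  const recalled = facts.filter(f => output.includes(f.value)).length;
  return recalled / facts.length;
}

// Index of the first step whose recall falls below the threshold,
// or -1 if recall never degrades that far.
function firstDegradedStep(scores: number[], threshold: number): number {
  return scores.findIndex(score => score < threshold);
}
```

In a test, you would map each step's output through `recallScore` and assert that `firstDegradedStep(scores, 0.75)` returns -1. When it does not, the returned index points at the step where context started to rot.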
The Pattern That Ships
After shipping a handful of production agent systems, the pattern I default to is simple.
One, separate working memory from long-term memory. Working memory lives in the current LLM call. Long-term memory lives in a vector store.
Two, compact on every reasoning step. Don't retrieve once and assume relevance.
Three, provenance everything. Every fact the agent uses must be traceable to a source.
Four, test for memory consistency, not just output quality.
If your team is building agents that need to hold context across multiple steps and finding that reliability drops as the session length grows, that's exactly the kind of thing I help with. Happy to compare notes on what's worked in production.
Written by Abdul Rehman, full-stack AI engineer building production SaaS, MVPs, and AI automation. More at PrimeStrides.