The Rise of Agentic Engineering — Part 4: Fixing Context & Multi-Agent Systems

Part 4 of a chronological survey of the craft around large language models. Part 3 named the field and catalogued the four ways contexts fail. This installment covers the response: a toolkit for repairing a context — and how the most powerful fix became an architecture.

TL;DR: Breunig's six tactics (RAG, tool loadout, quarantine, pruning, summarization, offloading) all serve one rule: context is not free. The strongest, isolating context across separate agents, grew into multi-agent systems; Anthropic's research setup beat a single agent by 90.2% on its internal eval. The catch, foreshadowing Part 7: multi-agent systems burned about 15× the tokens of an ordinary chat interaction. Multi-agent isn't magic; it's spending enough to brute-force the problem.

From diagnosis to treatment

Once the failure modes had names, the obvious next question was what to do about them. Drew Breunig's follow-up, How to Fix Your Context, organized the scattered remedies into six tactics. His framing throughout was a return to an old programming adage — "garbage in, garbage out" — and a single governing principle: context is not free. Every token in the context influences the response, for better or worse, so the work is information management. As Karpathy put it, the job is to "pack the context windows just right."

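The six tactics:

RAG: retrieve only the information relevant to the task, rather than loading everything.
Tool Loadout: expose only the tool definitions relevant to the current task.
Context Quarantine: isolate contexts in their own dedicated threads.
Context Pruning: remove irrelevant or no-longer-needed information.
Context Summarization: compress accrued context into a condensed summary.
Context Offloading: store information outside the context, via a tool that holds it for later reference.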

Each addresses one or more of the failure modes from Part 3. We'll take them in turn, then follow the third one — quarantine — into the larger story of multi-agent systems.

RAG, reconsidered

Part 2 covered RAG's near-death experience during the context-window arms race. By 2025 its role had clarified: not a workaround for small windows, but a permanent technique for keeping the signal-to-noise ratio of a context high. Breunig's treatment is brief precisely because the point is settled — "it's very much alive" — and the reason is exactly the Part 2 finding: if you treat the context like a junk drawer, the junk influences the response. RAG is the discipline of not putting the whole junk drawer in.

Tool Loadout: retrieving the right tools

"Loadout" is a gaming term — the specific set of weapons and equipment you select before a match, tailored to the situation. Applied to agents, it means selecting only the tool definitions relevant to the current task, rather than exposing every tool at once. This directly targets Context Confusion (Part 3): more tools measurably degrade performance.

The cleanest treatment is RAG-MCP (Tiantian Gan and Qiyao Sun, 2025), which applies retrieval to the tools themselves. The motivating problem is "prompt bloat": as the Model Context Protocol ecosystem expanded rapidly after Anthropic's late-2024 release of MCP — with thousands of server implementations appearing across the community — agents were increasingly drowning in tool descriptions. RAG-MCP stores tool descriptions in an external index and, for each query, semantically retrieves only the most relevant ones before the LLM is ever engaged; only those selected descriptions enter the prompt.
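The retrieval step can be sketched in a few lines. This is a minimal illustration of the idea, not the RAG-MCP implementation: the tool names and descriptions are invented, and a bag-of-words cosine similarity stands in for the semantic embedding model a real system would use.

```python
import math
import re
from collections import Counter

# Hypothetical tool registry: names and descriptions are invented for illustration.
TOOLS = {
    "get_weather": "Get the current weather forecast for a city",
    "search_flights": "Search airline flights between two airports on a date",
    "send_email": "Send an email message to a recipient",
    "query_database": "Run a SQL query against the sales database",
    "convert_currency": "Convert an amount of money between two currencies",
}


def embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# The "external index": built once, outside any prompt.
INDEX = {name: embed(desc) for name, desc in TOOLS.items()}


def select_loadout(query: str, k: int = 2) -> list[str]:
    """Return the names of the k tools whose descriptions best match the query."""
    q = embed(query)
    ranked = sorted(INDEX, key=lambda name: cosine(q, INDEX[name]), reverse=True)
    return ranked[:k]


def build_prompt(query: str, k: int = 2) -> str:
    """Only the retrieved tool descriptions are placed in the prompt."""
    tools = "\n".join(f"- {name}: {TOOLS[name]}" for name in select_loadout(query, k))
    return f"Available tools:\n{tools}\n\nUser request: {query}"
```

With `k=1`, a weather question yields a prompt that lists only `get_weather`; the other four descriptions never reach the model, which is the whole point of the loadout.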

The results quantify how much the bloat was costing. In an MCP stress test varying the tool pool from 1 up to 11,100 servers, RAG-MCP cut prompt tokens by roughly half (49.2%; the paper's abstract cites "over 50%") and more than tripled tool-selection accuracy (43.13% versus a 13.62% baseline). Breunig adds the threshold detail from the paper's analysis: selecting tools becomes critical past about 30 tools, where descriptions begin to overlap and confuse the model; beyond roughly 100 tools, failure was nearly guaranteed without retrieval.

For smaller models the problem starts even earlier. The "Less is More" / GeoEngine study from Part 3 — where Llama 3.1 8B failed with 46 tools but succeeded with 19 — built a dynamic, fine-tuning-free tool selector that reduces the tool set to a smaller, more relevant loadout before the model call. On the GeoEngine benchmark this raised Llama 3.1 8B's success rate to about 56% (with tool-selection accuracy improving similarly). The paper also noted side benefits that matter at the edge (running models on phones or laptops): execution time fell by up to roughly 40% and power consumption by around 12%. Smaller contexts are not just more accurate; they are cheaper and faster.

Context Quarantine: the bridge to multi-agent systems

Context Quarantine is the act of isolating contexts in their own dedicated threads, each used separately by one or more LLMs. This is the tactic that turned out to be far more than a tactic.

The logic is simple. We get better results when contexts aren't too long and don't carry irrelevant content (Parts 2 and 3). One way to guarantee that is to break a task into smaller, isolated jobs, each with its own clean context. And once you do that, you are no longer fixing a single context — you are designing a system of agents, each with its own context, tools, and instructions. The fix becomes an architecture. That architecture is the subject of the rest of this part.

Quarantine a context hard enough and it stops being a tactic. It becomes an architecture.

Context Pruning, Summarization, and Offloading

The remaining three tactics all manage the context as it accumulates over an agent's run.

Pruning — removing irrelevant or no-longer-needed information. Breunig points to Provence (2025), "an efficient and robust context pruner for question answering" — a small, fast DeBERTa-based model that, given a question, edits a document down to only the relevant portions. In his test on a Wikipedia article it cut roughly 95% of the content while preserving exactly what mattered. Pruning is a strong argument, he notes, for keeping a structured version of your context (in a dictionary or similar) from which you compile the prompt before each call — so you can prune the document or history sections while protecting the core instructions and goals.
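A minimal sketch of that structured-context idea follows. The section names, the sample content, and the keyword filter are all assumptions made for illustration; a learned pruner such as Provence would take the place of `keep_relevant`.

```python
# A structured context: each section is kept separately so it can be edited on its own.
context = {
    "instructions": "You are a support agent. Always cite the policy you rely on.",
    "goal": "Resolve the customer's refund request.",
    "documents": [
        "Refund policy: purchases can be refunded within 30 days.",
        "Office party: the summer picnic is on Friday.",
    ],
    "history": ["user: hi", "assistant: hello", "user: I want a refund for my order"],
}

PROTECTED = ("instructions", "goal")  # sections that pruning never touches


def keep_relevant(items, keywords):
    """Toy relevance filter; a learned pruner would replace this keyword match."""
    return [s for s in items if any(k in s.lower() for k in keywords)]


def prune(ctx, keywords, max_history=2):
    """Return a new context with documents filtered and history truncated."""
    pruned = {key: ctx[key] for key in PROTECTED}
    pruned["documents"] = keep_relevant(ctx["documents"], keywords)
    pruned["history"] = ctx["history"][-max_history:]
    return pruned


def compile_prompt(ctx):
    """Assemble the prompt text from the structured sections, in a fixed order."""
    parts = [ctx["instructions"], f"Goal: {ctx['goal']}"]
    parts += [f"Document: {d}" for d in ctx["documents"]]
    parts += ctx["history"]
    return "\n".join(parts)


prompt = compile_prompt(prune(context, keywords=["refund"]))
```

Because the prompt is compiled fresh before each call, the pruning step can be swapped or evaluated on its own without any risk of deleting the instructions or the goal.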

Summarization — compressing accrued context into a condensed summary. This began as a way to stay under context limits (the familiar "ask the chatbot to recap, then paste into a fresh thread"), but acquired a second rationale once Context Distraction was understood: even when you could keep everything, you often shouldn't, because length itself degrades reasoning past the distraction ceiling (the Gemini Pokémon agent's ~100k-token threshold from Part 3). Breunig's practical advice is to make summarization its own dedicated, evaluated step, since deciding what to preserve is hard and worth optimizing directly.
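One way to make summarization a dedicated, separately testable step is to pass the summarizer in as a function. The sketch below assumes a whitespace word count in place of a real tokenizer and a toy summarizer in place of a model call; the function names and the budget values are illustrative only.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def toy_summarizer(messages: list[str]) -> str:
    """Placeholder for a model call: keeps the first five words of each message."""
    return " / ".join(" ".join(m.split()[:5]) for m in messages)


def compact(
    history: list[str],
    summarize: Callable[[list[str]], str],
    budget: int,
    keep_recent: int = 2,
) -> list[str]:
    """Summarize older messages once the history exceeds the token budget."""
    total = sum(count_tokens(m) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return history  # nothing to do: under budget or too short to split
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [f"[summary] {summarize(older)}"] + recent
```

Keeping `summarize` as a parameter means the same harness can compare several summarizers against a fixed set of histories, which is the kind of direct evaluation the advice calls for.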

Offloading — storing information outside the context, via a tool that holds it for later reference. Breunig's favorite example, "so simple you don't believe it will work," is Anthropic's "think" tool — effectively a scratchpad where the model writes notes that stay out of the main context but remain available.

Anthropic reported that pairing the think tool with a domain-specific prompt yielded up to a 54% improvement on a specialized-agent benchmark. It helps most in three situations: analyzing tool outputs before acting (with room to backtrack), navigating policy-heavy environments that need compliance checks, and sequential decision-making where each step builds on the last and mistakes are costly.

(Breunig's aside — that the tool would be clearer if it were simply called scratchpad — is a nice illustration of the series' recurring theme that naming shapes understanding.)
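The offloading pattern described above can be sketched as a small note store that lives outside the message list. This is an illustration of the general idea under assumed names (`Scratchpad`, `write`, `read`), not Anthropic's implementation of the think tool.

```python
class Scratchpad:
    """Holds notes outside the message list; only short acknowledgements go back in."""

    def __init__(self) -> None:
        self._notes: list[str] = []

    def write(self, note: str) -> str:
        """Store a note and return a brief acknowledgement for the context."""
        self._notes.append(note)
        return f"saved note #{len(self._notes)}"

    def read(self, keyword: str = "") -> list[str]:
        """Return stored notes containing the keyword (all notes if empty)."""
        return [n for n in self._notes if keyword.lower() in n.lower()]


pad = Scratchpad()
messages = ["user: Can I change my flight?"]

# The agent reasons at length, but only the acknowledgement enters the context.
ack = pad.write("Policy check: basic economy fares cannot be changed; verify fare class first.")
messages.append(f"tool: {ack}")

# Later, the agent pulls back just the notes it needs.
relevant = pad.read("policy")
```

The design choice is that writing is cheap for the context (a one-line acknowledgement) while reading is selective, so long reasoning never accumulates in the window by default.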

Context quarantine becomes architecture: Anthropic's multi-agent research system

The single most consequential elaboration of these ideas in 2025 was Anthropic's write-up of how it built the multi-agent research system behind Claude's Research feature (June 2025). It is the moment context quarantine stops being a tactic and becomes a design pattern with its own engineering discipline.

The system uses an orchestrator-worker pattern. A lead agent analyzes the user's query, develops a strategy, and spawns specialized subagents that explore different aspects of the question in parallel, each in its own context window.

Anthropic's own framing connects this directly to the context-engineering ideas of the era:

The essence of search is compression: distilling insights from a vast corpus. Subagents
facilitate compression by operating in parallel with their own context windows, exploring
different aspects of the question simultaneously before condensing the most important tokens
for the lead research agent. Each subagent also provides separation of concerns — distinct
tools, prompts, and exploration trajectories — which reduces path dependency and enables
thorough, independent investigations.

In other words: each subagent is a context quarantine. Its window stays clean and focused; it distills its findings into a compact summary; and only that summary returns to the lead agent — which is itself context offloading and summarization operating at the level of agents rather than strings.
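A toy version of the pattern makes the isolation visible. Everything here is a stand-in: `run_subagent` fakes its tool calls and its summary, and the function names are chosen for this sketch rather than taken from Anthropic's system.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(subtask: str) -> str:
    """Work in a private context and return only a compact summary."""
    context = ["system: research one narrow question", f"task: {subtask}"]
    for step in range(3):  # stand-in for real tool calls
        context.append(f"tool result {step}: raw page text about {subtask}")
    # A real subagent would ask a model to condense `context`; this stub just reports on it.
    return f"{subtask}: condensed from {len(context)} context entries"


def orchestrate(query: str, subtasks: list[str]) -> list[str]:
    """Lead agent: fan out subtasks in parallel, then collect only the summaries."""
    lead_context = [f"user: {query}"]
    with ThreadPoolExecutor(max_workers=max(1, len(subtasks))) as pool:
        summaries = list(pool.map(run_subagent, subtasks))  # map preserves input order
    lead_context.extend(f"subagent: {s}" for s in summaries)
    return lead_context


lead = orchestrate("Who sits on the boards of these companies?", ["Company A", "Company B"])
```

The lead context ends up with one entry per subagent and none of the raw tool results, which is the quarantine, summarization, and offloading combination in miniature.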

The payoff was large. On Anthropic's internal research eval, a multi-agent system with Claude Opus 4 as lead and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2%. Their example: asked to identify all the board members of the companies in the Information Technology sector of the S&P 500, the multi-agent system decomposed the task across subagents and found the answer, while the single agent ground through slow sequential searches and failed.

The economics, stated plainly

Anthropic was equally candid about the cost, which becomes a recurring theme in later parts of this series. In their data, agents use about 4× more tokens than chat interactions, and multi-agent systems about 15× more tokens than chat. Three factors (token usage, number of tool calls, and model choice) explained 95% of the performance variance on the BrowseComp evaluation, with token usage alone explaining 80%. The blunt conclusion: multi-agent systems "work mainly because they help spend enough tokens to solve the problem," so they only make economic sense for high-value tasks that genuinely parallelize. They explicitly flagged that most coding tasks, with fewer truly independent subtasks and heavy shared context, were a worse fit than research at the time. (Part 7 returns to exactly this tension when coding agents start running many-in-parallel anyway.)
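To see what those multipliers mean per task, here is a back-of-the-envelope calculation. The chat token count and the price are hypothetical placeholders; only the 4× and 15× multipliers come from the figures above.

```python
CHAT_TOKENS = 2_000        # hypothetical tokens for one chat interaction
PRICE_PER_MILLION = 10.0   # hypothetical blended price in USD per million tokens

# Multipliers relative to a chat interaction, as quoted in the text.
MULTIPLIERS = {"chat": 1, "single agent": 4, "multi-agent": 15}


def cost_usd(mode: str) -> float:
    """Estimated cost of one task in the given mode, under the assumptions above."""
    return CHAT_TOKENS * MULTIPLIERS[mode] * PRICE_PER_MILLION / 1_000_000


for mode in MULTIPLIERS:
    print(f"{mode:>12}: ${cost_usd(mode):.2f}")

# Both multipliers are measured against chat, so multi-agent versus a
# single agent works out to 15 / 4 = 3.75x by these figures.
multi_vs_single = MULTIPLIERS["multi-agent"] / MULTIPLIERS["single agent"]
```

Note that the 15× figure is relative to chat, not to a single agent, so the incremental cost of going from one agent to many is smaller than the headline number suggests.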

The +90% came at ~15× the tokens of a chat interaction. Multi-agent isn't magic; it's spending enough to brute-force the problem.

Lessons that recur

Several findings from Anthropic's write-up echo forward through the series:

Prompt the orchestrator to delegate well. Vague subagent instructions ("research the semiconductor shortage") caused duplicated work and gaps; subagents need an objective, an output format, tool guidance, and clear boundaries.
Scale effort to complexity. Early agents spawned 50 subagents for simple queries; the fix was embedding explicit scaling heuristics in prompts (simple fact-finding: one agent, 3–10 calls; complex research: 10+ subagents).
Let agents improve themselves. Given a prompt and a failure mode, Claude 4 models could diagnose the failure and suggest fixes; a tool-testing agent that rewrote a flawed tool's description cut task completion time by 40% for later agents. (This self-improvement thread becomes its own topic later.)
LLM-as-judge, plus human eyes. Free-form research output was best graded by a single LLM-judge call against a rubric (factual accuracy, citation accuracy, completeness, source quality, tool efficiency) — but humans still caught what automation missed, like a bias toward SEO-optimized content farms over authoritative sources.
Errors compound. In agentic systems, "minor issues for traditional software can derail agents entirely"; one wrong step sends an agent down a divergent trajectory. This motivated durable execution, checkpoints, and full production tracing.
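The first two lessons, explicit delegation and effort scaling, lend themselves to a small sketch. The class and function names are hypothetical, the sample brief is invented, and the dictionary layout is an assumption; only the four brief fields and the two effort tiers come from the list above.

```python
from dataclasses import dataclass


@dataclass
class SubagentBrief:
    """The four things a delegated task needs, per the delegation lesson."""
    objective: str
    output_format: str
    tool_guidance: str
    boundaries: str

    def render(self) -> str:
        """Turn the brief into the instruction text handed to a subagent."""
        return (
            f"Objective: {self.objective}\n"
            f"Output format: {self.output_format}\n"
            f"Tools: {self.tool_guidance}\n"
            f"Boundaries: {self.boundaries}"
        )


# Effort tiers from the figures quoted above; the layout is an assumption.
EFFORT_RULES = {
    "simple": {"min_subagents": 1, "tool_calls": (3, 10)},
    "complex": {"min_subagents": 10, "tool_calls": None},  # no call range quoted for this tier
}


def effort_for(complexity: str) -> dict:
    """Look up the effort tier, failing loudly on labels the rules do not cover."""
    if complexity not in EFFORT_RULES:
        raise ValueError(f"unknown complexity tier: {complexity!r}")
    return EFFORT_RULES[complexity]


brief = SubagentBrief(
    objective="Identify the most-cited causes of the semiconductor shortage.",
    output_format="Bulleted list, one source per bullet.",
    tool_guidance="Prefer web search; skip internal tools.",
    boundaries="Causes only; do not cover forecasts or policy responses.",
)
```

Compared with a bare "research the semiconductor shortage", the rendered brief tells the subagent what to produce, how, with which tools, and where to stop, while `effort_for` keeps the orchestrator from over-spawning on simple queries.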

The same pattern, arrived at independently: Microsoft's Magentic-One

Anthropic was not alone in landing on the orchestrator-worker shape. Seven months earlier, in November 2024, Microsoft Research released Magentic-One (Fourney et al., arXiv:2411.04468), a generalist multi-agent system built on the open-source AutoGen framework.

Its architecture is strikingly parallel. A lead Orchestrator agent plans, tracks progress in ledgers, and re-plans to recover from errors, while directing four specialists: a WebSurfer (browser), a FileSurfer (local files), a Coder (writes code), and a ComputerTerminal (executes it). The Orchestrator runs two loops: an outer loop that maintains the task ledger (facts, guesses, and the current plan) and an inner loop that maintains a progress ledger and assigns the next action to a specialist.
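The two-loop structure can be sketched with stub agents. This is a toy under stated assumptions, not Magentic-One's code: the agents are plain functions, the stall counter and the "ANSWER:" convention are invented for the example, and reversing the plan is only a placeholder for real re-planning.

```python
from typing import Callable, Optional

Agent = Callable[[str, list], Optional[str]]


def web_surfer(task: str, facts: list) -> Optional[str]:
    """Stub specialist: finds one fact, then has nothing more to add."""
    return "found docs page" if "found docs page" not in facts else None


def coder(task: str, facts: list) -> Optional[str]:
    """Stub specialist: can only finish once the docs have been found."""
    return "ANSWER: script written" if "found docs page" in facts else None


def orchestrate(task: str, agents: dict, plan: list, max_replans: int = 2, max_stalls: int = 2) -> dict:
    task_ledger = {"task": task, "facts": [], "plan": list(plan)}
    for _ in range(max_replans + 1):            # outer loop: owns the task ledger
        progress = {"step": 0, "stalls": 0}     # inner-loop bookkeeping
        while progress["stalls"] < max_stalls:  # inner loop: assign the next action
            name = task_ledger["plan"][progress["step"] % len(task_ledger["plan"])]
            result = agents[name](task, task_ledger["facts"])
            progress["step"] += 1
            if result is None:                  # no progress from this agent
                progress["stalls"] += 1
                continue
            progress["stalls"] = 0
            task_ledger["facts"].append(result)
            if result.startswith("ANSWER:"):
                return task_ledger
        task_ledger["plan"].reverse()           # stalled: toy stand-in for re-planning
    return task_ledger


AGENTS = {"WebSurfer": web_surfer, "Coder": coder}
ledger = orchestrate("write a script", AGENTS, plan=["Coder", "WebSurfer"], max_stalls=1)
```

With `max_stalls=1` and the Coder scheduled first, the inner loop stalls immediately, the outer loop revises the plan, and the second pass succeeds; that recovery path is the reason for separating the two loops.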

The result: Magentic-One reached statistically competitive performance with the state of the art on three demanding agentic benchmarks (GAIA, AssistantBench, WebArena) without modifying the core agents or how they collaborate from one benchmark to the next.

That two major labs independently converged on "a planning lead agent directing specialized workers with isolated tools" is a strong signal — the pattern is a genuine structural answer to context management, not one company's idiosyncrasy.

The lesson from Anthropic's list, that small errors compound catastrophically across a long agent run, is the seed of everything Parts 5 through 7 call harness and loop engineering. If a stray token or a single wrong turn can derail an agent (Part 3's "lost in conversation"; this part's compounding errors), then making agents reliable requires building structure around the model: guides, sensors, feedback loops, and recovery paths.

Programming the context, not writing it

One more thread from this era points directly at Part 6. Breunig argued that the whole list of tactics is an argument for programming your contexts rather than hand-writing them — assembling each prompt from a structured representation, with dedicated, separately-evaluated stages for summarization, pruning, and tool selection. The same logic later shows up in research on agents that manage their own context autonomously — for example, 2026 work on "self-compacting" agents that decide for themselves when and what to summarize, reportedly improving task performance while cutting token costs by a third to two-thirds. The trajectory is consistent: from a human choosing what goes in the window, toward systems that make that choice programmatically.

Part 5 picks up the compounding-error problem head-on. As coding agents went mainstream in early 2026, the field's attention widened from the context you feed a model to the entire harness you build around it — and a new term arrived to name that work.

Key sources for Part 4

Drew Breunig, How to Fix Your Context (2025) — the six tactics (RAG, tool loadout, context quarantine, pruning, summarization, offloading); "context is not free."
Tiantian Gan & Qiyao Sun, RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation (arXiv:2505.03275, 2025) — semantic tool retrieval; >50% token reduction; 43.13% vs 13.62% selection accuracy; MCP stress test to 11,100 servers.
Less is More / GeoEngine tool study (arXiv:2411.15399) — fine-tuning-free dynamic tool selection; Llama 3.1 8B success rate ~56% on GeoEngine; ~40% execution-time and ~12% power reduction on edge hardware (Jetson AGX Orin).
Provence context pruner (arXiv:2501.16214, 2025) — fast QA-oriented pruning; ~95% content cut while preserving relevance. [Cited via Breunig.]
Anthropic, The "think" tool (2025) — context offloading / scratchpad; up to 54% improvement on a specialized-agent benchmark. [Cited via Breunig.]
Anthropic, How we built our multi-agent research system (2025) — orchestrator-worker architecture; +90.2% over single-agent on internal eval; 4×/15× token economics; delegation, effort-scaling, self-improvement, LLM-as-judge, compounding errors.
Self-Compacting Language Model Agents (arXiv:2606.23525, 2026) — autonomous context management; reported 33–67% token savings. [Forward-looking reference.]

Next up · Part 5 — Harness Engineering Emerges: "Agent = Model + Harness," guides and sensors, and what it looks like when whole companies build software with agents writing every line.