The Loop, Not the Model: Inside the Architecture Behind Every AI Agent

Ask someone to describe an AI agent and they'll talk about the model. GPT-this, Claude-that, Llama-the-other. The model is the part with a name, a benchmark score, a release announcement.

It's also not where most of the engineering happens.

A model is stateless: send it a prompt, get a completion, and as far as the weights are concerned the conversation is over. What turns that into something that can read a file, decide what to do next, call a tool, check the result, and try again — dozens of times, unattended, until a task is actually finished — isn't the model. It's the loop wrapped around it.

That loop is underrated, mostly because nobody writes admiring posts about control flow. But underrating the loop isn't the same as the model not mattering. A brilliant loop wrapped around a model that can't reliably pick the right tool, or loses the plan six steps in, doesn't produce a good agent — it produces a very well-instrumented failure. The honest claim isn't "the loop instead of the model." It's that the loop and the model solve two different problems, and collapsing them into one is how most explanations of agents go wrong.

Two different engineering problems

Model intelligence determines what's possible: can it correctly interpret an ambiguous instruction, choose the right tool from a crowded list, notice when a result looks wrong, and recover instead of confidently continuing down a bad path. No amount of orchestration fixes a model that's bad at these things. It can only contain the damage.

Agent architecture determines what's reliable: does the system remember what happened last time, retry sensibly when something fails, stay within its context budget, log enough to debug later. A weak model in a strong architecture fails safely and visibly. A strong model in a weak architecture fails unpredictably, often silently.

Treat those as the same problem and you get takes like "agents are just a wrapper, the model does the real work," or its mirror image, "the model barely matters, it's all in the harness." Neither survives contact with a production system. The architecture half is the more interesting engineering story — and the one this piece focuses on — but it's a complement to model quality, not a substitute for it.

What the loop replaces

Before agents, using an LLM meant a human in the loop by default. Ask a question, read the answer, decide what to do with it, type the next prompt. The model never acted on anything; it only ever responded.

The shift that created "AI agents" as a category — formalized in 2022 by the ReAct paper out of Princeton and Google Research, and popularized through projects like Auto-GPT and BabyAGI in 2023 — was simple to describe and hard to engineer well: let the model's own output decide what happens next. Instead of a human reading a response and deciding to run a command, the agent runtime reads it, detects an intended action, runs it automatically, and feeds the result straight back in.

Repeat that enough times and a single prompt — "research this, draft a summary, file it" — turns into a chain of model calls, each one informed by everything that came before. The user sees one task. The system underneath ran a loop.

Anatomy of one iteration

Most agent frameworks keep this logic private. Nous Research's Hermes Agent is unusually transparent about it — its developer documentation lays out the loop step by step, which makes it a useful concrete example of what's normally a black box.

A single pass looks roughly like this:

flowchart TD
    A[User message arrives] --> B[Append to history]
    B --> C[Build or reuse cached system prompt]
    C --> D{Context over ~50%?}
    D -->|Yes| E[Compress history]
    D -->|No| F[Convert history to API message format]
    E --> F
    F --> G[Inject ephemeral context:<br/>budget warnings, pressure signals]
    G --> H[Send request to model]
    H --> I{Tool calls in response?}
    I -->|Yes| J[Execute tools, append results]
    J -.->|loop back| F
    I -->|No| K[Persist session]
    K --> L[Return to user]

The branch at the bottom is the entire mechanism. A response with a tool call doesn't go back to the person who asked the question — it goes back into the model, with the tool's output now part of the context. Only a response with no further tool calls actually ends the loop.

Notice where the loop-back arrow lands: at message formatting, not back at the start. The expensive setup — checking whether to compress, rebuilding the system prompt — happens once per user turn, not once per tool call inside it. A turn that makes ten internal tool-call round trips only pays that setup cost once; the other nine passes just reassemble messages and re-send. It's a small detail, and it's exactly the kind of efficiency that separates a documented production loop from a toy implementation.

A request that searches a file, reads it, and summarizes it isn't one inference call. It's three or four passes through this loop, each one waiting on the last, with the model deciding after every single one whether it's actually done.

Where the time actually goes

"Waiting on the last" is doing a lot of work in that sentence, and it's worth unpacking — because it isn't just model inference time.

Every iteration stacks three latency components: the model's own generation time, the execution time of whatever tool got called, and the network round-trip to wherever that tool lives. A model generating at 50 tokens/second sounds fast in isolation. If the tool it calls is a web search hitting an external API, that single step can cost two or three seconds regardless of how fast the model itself runs — the model is sitting idle, waiting on the network, not computing.

This is the detail that conversations about agent speed tend to skip. A faster model shortens the inference portion of every iteration, and that compounds across a long loop. It does nothing for the tool-execution portion — which for browsing agents, API-calling agents, or anything hitting a database is often the larger share of wall-clock time. An agent bottlenecked on a slow third-party API feels just as sluggish on a fast model as a slow one. The fix there is concurrency or caching, not a faster model.

Why context management is the quiet bottleneck

Loops accumulate history. Every tool call, every result, every intermediate thought gets appended to the same conversation the model re-reads on the next pass. Left unmanaged, that conversation eventually exceeds the model's context window — and well before it does, a bloated context degrades response quality and slows every subsequent call.

Hermes's answer is a compression trigger: once a conversation crosses roughly half its context budget, the system compresses history before the next request, rather than waiting for a hard limit. It's a small design choice, and it's the difference between a loop that degrades gracefully over a long task and one that simply falls over.

There's a second, less visible constraint: message role alternation. Agent frameworks built on OpenAI-style chat formats enforce a strict pattern — user, then assistant, then user again — except during tool execution, where assistant and tool messages can chain together before control returns to the user. Violate that pattern and providers reject the request outright. It's the kind of detail that never shows up in a demo and matters enormously in production.

The real cost curve: quadratic, not exponential

It's common to hear that agent loops get "exponentially" more expensive the longer they run. That's the wrong word — and the right answer is more useful than the wrong one.

Without optimization, each iteration resends the entire accumulated conversation, because that's how stateless APIs work: the model has no memory of a previous call beyond what's in the prompt. If context grows by roughly a fixed amount each turn, total tokens processed across N iterations is the sum of an arithmetic sequence — iteration one reprocesses a little, iteration N reprocesses nearly everything, and the total comes out proportional to N². That's quadratic growth, not exponential: bad enough to matter, but a different curve with a different fix.

The fix is prompt caching — persisting the model's internal representation of prior context (the KV cache) across turns, so each new request only computes the newest increment instead of reprocessing everything from scratch. Done well, this flattens the curve from roughly N² back toward N. That's exactly why caching went from a latency nicety to a load-bearing requirement the moment people started running loops with double-digit iteration counts. Hermes applies caching markers for this reason; so does essentially every serious agent runtime once it leaves the demo stage.

This also reframes the right unit of measurement. It isn't "tasks completed" — that metric hides whether a task took two passes through the loop or two hundred, which is the engineering equivalent of measuring a sales team by deals closed without ever asking how many calls each deal took. Two systems that look identical on a task-completion dashboard can have wildly different cost and latency profiles, and the difference is invisible until you start counting loop passes instead of finished tasks.

Memory: the hardest infrastructure problem in the stack

A loop that forgets everything between sessions is just a longer single conversation. What separates a genuine agent from a chatbot with extra steps is what happens to all that accumulated context after the task ends — and this is where the engineering gets disproportionately harder than anything described so far.

Tool systems have a clean contract: a defined input schema, a defined output schema, a request that succeeds or fails observably. You can write a test for a tool. Memory doesn't get that luxury. What's worth remembering is a judgment call, not a schema. A fact that was true last week can be false this week with no signal that it changed. Two stored memories can quietly contradict each other with no error thrown anywhere.

Hermes Agent structures memory in three distinct layers, each suited to a different kind of recall:

Layer	What it stores	How it's retrieved
Prompt memory	Persistent facts about the user and project, kept in plain files	Loaded into every system prompt
Episodic archive	Full session transcripts	Searched via SQLite full-text search when relevant
Skill memory	Reusable solved-task procedures	Loaded when a similar task recurs

This is a meaningfully different bet than the two approaches that dominate most "AI memory" marketing. One is a RAG pipeline that retrieves loosely related past conversations and hopes the embedding search finds something relevant. The other is a scratchpad the model is told to maintain itself, which works until the model forgets to update it. A three-layer system with a dedicated, queryable archive and a separate procedural store doesn't solve staleness and contradiction outright — nothing fully does — but it gives each kind of memory a place to live instead of one undifferentiated pile.

From loop to library: skill formation

The most distinctive piece of this architecture is what happens after a hard task succeeds. When a task crosses a threshold of tool calls — five or more, by Nous Research's own design — the agent doesn't just complete it and move on. It writes a skill file: a markdown document describing the approach it took, the edge cases it hit, and the domain knowledge it had to work out along the way.

The next time a similar task appears, the loop can load that file instead of re-deriving the same solution from a blank context. Solve a tricky deployment script once, and the next deployment starts from a remembered procedure instead of from scratch. Nous Research has also positioned this around an open skill-portability standard, so a skill built in one agent runtime isn't necessarily locked to it.

This is the part that earns the label "self-improving" rather than just "persistent." A loop with session memory remembers what happened. A loop with skill formation gets measurably better at recurring categories of task — and sidesteps the staleness problem somewhat, since a skill is a procedure rather than a claim about the world, so it ages better than a remembered fact does.

The tool layer: a registry, not a hardcoded list

None of the above works without a consistent way to actually execute actions, and this is where agent frameworks tend to either stay simple (a small fixed toolset) or sprawl badly (every tool wired in by hand, brittle to change).

Hermes resolves this with a registry pattern: each tool self-registers at import time into a shared catalog, covering dozens of toolsets and far more individual tools by current documentation. The loop doesn't need to know what's available in advance; it queries the registry, gets a schema, and dispatches. Execution itself is backend-agnostic — the same tool call can run against a local shell, a Docker container, a remote SSH session, or an isolated sandbox, depending on configuration, without changing how the loop calls it.

That abstraction is what makes "give the agent a workspace, not just a conversation" a meaningful design goal instead of a slogan. The loop stays the same regardless of where the action executes — which also means the registry, not the loop, is usually where new capability actually gets added.

Beyond react-and-repeat

Everything described so far is one architectural family: reason, act, observe, repeat — interleaving thought and action one step at a time, the pattern formalized as ReAct. It's the dominant approach for a reason: it's simple, and the model can course-correct after every single step.

It isn't the only pattern, and it isn't free of downsides. Interleaved looping pays for a fresh round of reasoning on every step even when the plan hasn't actually changed, which is a large part of why the cost curve above matters so much. Two responses to that have emerged:

Reflection. Frameworks built on the Reflexion approach add a verification pass after a failed attempt — the agent generates a verbal self-critique of what went wrong, stores it, and retries with that critique as added context. It's reinforcement through language rather than weight updates, and it catches a category of error pure ReAct loops miss: the model executing a flawed plan competently, rather than failing in an obvious way.

Plan, then execute. A different family, exemplified by ReWOO, decouples reasoning from acting entirely — generate a full multi-step plan in a single pass, execute the tool calls it specifies, and only loop back into the model if something genuinely needs reconsidering. This trades some adaptability for a real reduction in repeated reasoning, closer to linear cost growth instead of the per-step reasoning tax interleaved looping pays every time.

Production systems increasingly mix these: plan up front for the predictable parts of a task, interleave reasoning and action for the parts that aren't, and add a reflection pass when a branch fails outright. None of this shows up in a basic loop diagram, and a meaningful share of current agent-engineering effort is going into exactly this.

Why most production failures are orchestration failures

Loops that call themselves automatically carry an obvious risk: nothing technically stops one from looping forever. A tool that returns an ambiguous result, a model that misreads success as failure, or a malformed response that breaks the parser can all turn a five-step task into an unbounded one. Iteration caps, per-turn cost budgets, and strict format enforcement — reject a malformed tool call rather than guess at it, terminate after N consecutive failures — guard against this, and none of it is glamorous. Hermes, for instance, defaults to a 90-iteration budget per top-level task, with any spawned sub-agents capped independently; cross that and the loop stops and reports back rather than running indefinitely.

But the bigger, less-discussed truth is that runaway loops aren't usually what actually shows up in production. The more common failure is quieter: a tool call that partially succeeds and gets retried as if it fully failed, duplicating a side effect. A compression event that fires mid-task and drops a detail the next iteration needed. Two concurrent loop branches writing to the same memory file, one silently overwriting the other's update. None of these are the model reasoning incorrectly — the model can perform exactly as designed and the task still fails, because the state underneath it got corrupted.

That distinction matters because it points debugging effort in the right direction. Teams that assume failures mean "the model got confused" spend their time tweaking prompts. Teams that assume failures mean "something in state management broke" spend their time on idempotency, locking, and compression boundaries — and that's usually where the actual bug lives.

Observability: debugging a process nobody watches

A chatbot failure is visible immediately — the answer is wrong, right there on screen. A loop failure can happen on iteration eleven of forty, invisible behind a spinner that just says the agent is thinking, and by the time a wrong final answer surfaces, the actual point of failure is several steps and several minutes back in a conversation nobody was reading in real time.

This is why production agent systems increasingly treat each iteration as its own logged span — an ID, inputs, outputs, timing — rather than treating the whole loop as one opaque call. Replay tooling that reconstructs exactly what the model saw at iteration eleven, not just what it ultimately answered, is the difference between fixing a bug in minutes and reconstructing it from vibes. The more a system relies on autonomous iteration, the less optional this becomes — "it worked in testing" stops meaning much once a loop behaves differently every time depending on what a tool happens to return.

Where this is heading

This general shape — cached prompts, context-budget-aware compression, a tool-call branch, layered memory, skill accumulation, increasingly blended with planning and reflection passes — isn't unique to one project. NVIDIA's RTX AI Garage team has written about pairing comparable local-agent architectures with their own hardware, and the broader ecosystem of agent frameworks is converging on overlapping pieces even where they differ on emphasis: some prioritize breadth of integrations, others prioritize the depth and reliability of what the loop remembers, others lean further into plan-first design to control cost.

That convergence, not any single framework's design, is the real signal. When independent teams arrive at similar answers to "how do you keep a loop from collapsing under its own context" and "how do you stop it from forgetting everything it learns," it stops looking like one team's clever idea and starts looking like the actual shape of the problem.

The loop sets the ceiling on reliability. The model sets the ceiling on everything else.

Swap the model behind a well-built loop and the agent still runs. The compression strategy still works, the memory layers still write, the tool registry still dispatches. In that narrow sense, models have become more interchangeable than they were three years ago, sitting behind the same OpenAI-compatible interface regardless of who trained them.

But swap in a meaningfully weaker model and the same loop reflects that weakness back, faithfully and at scale: more failed tool calls retried more often, more reflection passes triggered by more mistakes, more skill files full of procedures learned from bad attempts. The loop doesn't compensate for a weak model. It makes a weak model's failures more visible, more frequent, and more expensive.

So the real claim isn't that the model doesn't matter. It's narrower, and more useful: the loop is what makes intelligence reliable enough to run unattended, and the model is what determines how much intelligence there was to make reliable in the first place. Get the loop wrong, and a brilliant model produces an unreliable agent. Get the model wrong, and a brilliant loop produces a reliable machine for doing the wrong thing, faster.