A local model opened 41 of our pull requests in five weeks. The model is the least interesting part.

This was originally published on the LLMKube blog.

Here is the claim, up front and checkable: between May 21 and June 25, 2026, a fleet of local models opened 41 pull requests that we merged into LLMKube, our open-source Kubernetes operator for self-hosted inference. No code or prompts left the building. The marginal inference cost was a few cents of electricity. Across those five weeks, those 41 PRs were about a fifth of everything merged into the repo, and closer to half in the busiest recent stretch, sitting next to pull requests from five human contributors who showed up in the same weeks.

If you have used a 27-billion-parameter open-weight model as a coding agent, your first reaction is correct skepticism. A model that size is a coin flip on a non-trivial issue. It drifts. It writes tests that do not test anything. It declares victory on code that does not compile.

That is all true, and it is also beside the point. We never bet on the model. We bet on the harness around it. This post is the evidence for that bet, including the parts where it failed.

The setup: a weak model, a strict harness, heterogeneous hardware

The agentic coder is a component of LLMKube called Foreman. Its design premise is one sentence: trust the harness, not the model.

The model is whatever local coder we have loaded. Over these five weeks that was mostly a dense 27B (Qwopus-27B-Coder) on an AMD Strix Halo mini-PC over Vulkan, and a 35B mixture-of-experts (Qwen3.6-35B-A3B) on an Apple Silicon Mac over Metal. A second, different model on a Mac Studio acts as the reviewer. None of them is a frontier model. None of them is close.

The harness is where the work went. Around every run sits a stack of deterministic checks, each of which can reject the model's output regardless of how confident the model is:

A fast in-workspace gate: gofmt, go vet, go build, golangci-lint, and the scoped unit tests. If any fail, the failure is fed back to the coder for up to three fix attempts. No PR opens on a red gate.
A scope-drift guard: if the diff touches a subsystem the issue does not imply, the run is rejected rather than approved. A confidently-wrong change to the wrong package never reaches a PR.
A bite check: every new test is run against the pre-fix baseline. If a test passes without the fix, it does not actually test the fix, and the run is rejected. This is the single most common failure mode of LLM-written tests, and it is now a gate, not a hope.
An issueAsk check: the reviewer has to demonstrate, against the actual fetched issue body, that it understood what was asked. A reviewer that confabulates a plausible-but-wrong summary is demoted, not trusted.
A separate reviewer model on separate hardware, so the thing judging the work is not the thing that produced it.
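
A minimal sketch of the fast gate and its three-attempt feedback loop, written in Python for brevity. It is an illustration, not Foreman's code: the names `run_gate`, `gate_with_retries`, and the `fix` callback are hypothetical, and the command arguments (including `./...` standing in for the scoped test set) are assumptions. Only the tool list and the three-attempt limit come from the list above.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

Result = Tuple[bool, str]                    # (passed, combined output)
Check = Tuple[str, Callable[[str], Result]]  # (name, function of the workspace dir)


def run_command(cmd: List[str], cwd: str) -> Result:
    """Run one gate command; a nonzero exit code counts as a failure."""
    proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr


def gofmt_check(cwd: str) -> Result:
    """gofmt -l exits 0 even when files need formatting, so inspect its output."""
    ok, output = run_command(["gofmt", "-l", "."], cwd)
    return ok and output.strip() == "", output


# Hypothetical command set; "./..." is a placeholder for the scoped tests.
DEFAULT_CHECKS: List[Check] = [
    ("gofmt", gofmt_check),
    ("go vet", lambda cwd: run_command(["go", "vet", "./..."], cwd)),
    ("go build", lambda cwd: run_command(["go", "build", "./..."], cwd)),
    ("golangci-lint", lambda cwd: run_command(["golangci-lint", "run"], cwd)),
    ("unit tests", lambda cwd: run_command(["go", "test", "./..."], cwd)),
]


def run_gate(checks: List[Check], cwd: str) -> Optional[str]:
    """Return None when every check passes, else the first failure's report."""
    for name, check in checks:
        ok, output = check(cwd)
        if not ok:
            return f"{name} failed:\n{output}"
    return None


def gate_with_retries(checks: List[Check], cwd: str,
                      fix: Callable[[str], None],
                      max_fix_attempts: int = 3) -> bool:
    """Feed each failure back to the coder via fix(); True means green gate."""
    failure = run_gate(checks, cwd)
    attempts = 0
    while failure is not None and attempts < max_fix_attempts:
        fix(failure)          # the coder edits the workspace in response
        attempts += 1
        failure = run_gate(checks, cwd)
    return failure is None    # a PR may open only when this is True
```

The point of the shape is that the model never gets a vote: the return value depends only on exit codes and tool output.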

The coder is stochastic. Every one of those rails is deterministic. That asymmetry is the entire product.
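
To make "deterministic" concrete, here is a sketch of the bite check and the scope-drift guard as pure functions. Everything named here is hypothetical: we assume the issue has already been mapped to a set of allowed path prefixes, and that a `passes_on_baseline` callback can run a single test against the pre-fix checkout. The verdict strings are placeholders, not Foreman's actual output.

```python
from typing import Callable, Iterable, List


def toothless_tests(new_tests: Iterable[str],
                    passes_on_baseline: Callable[[str], bool]) -> List[str]:
    """New tests that already pass WITHOUT the fix, so they do not test it."""
    return [t for t in new_tests if passes_on_baseline(t)]


def out_of_scope(changed_files: Iterable[str],
                 allowed_prefixes: Iterable[str]) -> List[str]:
    """Changed files that fall outside every subsystem the issue implies."""
    prefixes = tuple(allowed_prefixes)
    return [f for f in changed_files if not f.startswith(prefixes)]


def rail_verdict(new_tests: Iterable[str],
                 passes_on_baseline: Callable[[str], bool],
                 changed_files: Iterable[str],
                 allowed_prefixes: Iterable[str]) -> str:
    """Same inputs always give the same verdict; model confidence is not an input."""
    drift = out_of_scope(changed_files, allowed_prefixes)
    if drift:
        return "REJECT: scope drift in " + ", ".join(drift)
    weak = toothless_tests(new_tests, passes_on_baseline)
    if weak:
        return "REJECT: tests pass without the fix: " + ", ".join(weak)
    return "PASS"
```

A test that passes on the baseline is rejected even if it also passes after the fix, which is exactly the case a human skimming a green check mark would miss.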

The numbers

Five weeks. The verifiable shape of it:

41 merged PRs authored by Foreman, all from foreman/issue-* branches, May 21 to June 25.
39 of them in June, as the harness matured and we trusted it with more.
31 of the 41 carried no human commit at all: Foreman is the only author on the branch. The other ten took a small human touch-up before merge, the same hand-finishing this post is honest about.
Across the full five weeks they were about 20% of everything merged (201 PRs in total); in the most active recent stretch, closer to half. They sat alongside five human contributors (adebrie, arychj, eleboucher, joryirving, matiasinsaurralde) working the same repo in the same weeks.
$0.00 in API spend. A representative overnight batch of six issues cost roughly six cents of electricity versus an estimated eighteen to thirty cents of equivalent cloud-API tokens. The pennies are not the point. The shape is: zero per-call cost, constant and repeatable, and nothing routed through someone else's data center.
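
The arithmetic behind those figures, as a quick check. Every input is a number quoted in the list above; the per-issue split simply divides the representative batch evenly across its six issues, which is an assumption.

```python
foreman_prs = 41
total_merged = 201
no_human_commit = 31

share_of_merged = foreman_prs / total_merged      # about a fifth
touched_up = foreman_prs - no_human_commit        # the "other ten"
untouched_share = no_human_commit / foreman_prs

# Representative overnight batch: six issues.
batch_issues = 6
electricity_cents = 6
cloud_cents_low, cloud_cents_high = 18, 30

electricity_per_issue = electricity_cents / batch_issues
cloud_per_issue = (cloud_cents_low / batch_issues, cloud_cents_high / batch_issues)

print(f"{share_of_merged:.1%}")   # 20.4%
print(touched_up)                 # 10
print(f"{untouched_share:.1%}")   # 75.6%
print(electricity_per_issue)      # 1.0
print(cloud_per_issue)            # (3.0, 5.0)
```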

The work was real maintenance, not toy issues: CLI flags, controller reconciliation fixes, metrics plumbing, test-coverage slices, a supply-chain CI scan, observability spans. The kind of backlog that is too important to ignore and too unglamorous to prioritize. Exactly the kind a tireless overnight coworker should take.

Three times it went wrong (and what caught it)

A case study that only reports wins is marketing. Here is where the model was not good enough, in detail, because the failures are the argument.

Issue #731 took six runs to converge. A single feature-plus-tests task. The mixture-of-experts coder kept getting partway and stalling. We drove it forward by tightening the harness one layer at a time: a budget guard, then an edit-streak forcing function, then a test tier. Each layer fixed one failure mode and revealed the next. The rails always contained it. It never shipped a broken change. But a 3-billion-active-parameter mixture-of-experts could not nail that task autonomously. What finally cleared it was not a sharper prompt or a human rescue but a different model: a dense 27B coder, on the same AMD hardware, later landed the same issue cleanly and autonomously, gate-verified, in about forty minutes. That is the thesis from another angle. The harness is the constant. The model is a dial you turn when the one you have is not enough.

Issue #813 is still not done. It is a harness change that needs a hermetic git-fixture test. Two autonomous attempts, including a second with a sharpened prompt that explicitly demanded a fixture and a fast fallback, both came back INCOMPLETE: the first hung for 180 seconds on a test that did real I/O instead of using a fixture; the second still failed the gate. After two honest tries, we marked it a human hand-finish. The harness did its job by refusing to land a failing change. The model did not do the job at all.

A run produced a false GO, and CI caught what the in-workspace gate could not. We routed one coder loop onto an in-cluster agent that, it turned out, had no Go toolchain installed. The fast gate runs in the coder's own workspace, so with no Go it silently no-op'd, and the run reported a confident GO on code with a backwards test assertion and a formatting violation. The model was sure. The local gate was blind. The full CI suite caught both immediately, we fixed them by hand, and we filed the toolchain gap as a tracked issue. The lesson is the thesis restated: a harness is only as good as its coverage, and the moment one rail goes dark, you find out exactly how much you were leaning on it.
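
One way to keep a rail from going dark silently is to fail closed: treat a missing toolchain as its own verdict instead of an empty, passing gate. The sketch below is an assumption about how such a guard could look, not a description of how Foreman closed the gap. The tool list, function names, and verdict strings are all hypothetical; `shutil.which` is the standard-library lookup for an executable on `PATH`.

```python
import shutil
from typing import Callable, List, Optional, Sequence

# Hypothetical list of binaries the fast gate depends on.
REQUIRED_TOOLS = ("go", "gofmt", "golangci-lint")


def missing_tools(required: Sequence[str] = REQUIRED_TOOLS,
                  which: Callable[[str], Optional[str]] = shutil.which) -> List[str]:
    """Names of required executables that cannot be found on PATH."""
    return [tool for tool in required if which(tool) is None]


def gate_verdict(run_checks: Callable[[], bool],
                 required: Sequence[str] = REQUIRED_TOOLS,
                 which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Refuse to report GO when the gate could not actually have run."""
    missing = missing_tools(required, which)
    if missing:
        # Fail closed: no toolchain means no evidence, not a pass.
        return "BLOCKED: missing " + ", ".join(missing)
    return "GO" if run_checks() else "NO-GO"
```

With a guard like this, the agent without Go would have reported BLOCKED, and the backwards assertion would never have been labeled GO in the first place.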

None of these three shipped a broken PR. That is the whole claim. The model is unreliable; the system is not.

Why "harness, not model" is the right bet for weak models

There is a tidy intuition under all of this. Generating a correct change is hard and open-ended. Verifying one is narrow and mechanical: does it compile, do the tests bite, did the diff stay in scope, did the reviewer actually read the issue. The recent verifier literature makes the same point more formally, that a stack of weak, independent verifiers can close most of the gap to an oracle, and that the weaker your generator is, the more load-bearing your verifiers become.
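
A back-of-the-envelope version of that intuition. It assumes the verifiers fail independently, which real rails only approximate, and the catch rates used in the usage note are illustrative rather than measured. The function names are ours.

```python
from math import prod
from typing import Iterable


def slip_through(catch_rates: Iterable[float]) -> float:
    """Chance a bad change passes every verifier, assuming independence."""
    return prod(1.0 - p for p in catch_rates)


def bad_and_passed(generator_error_rate: float,
                   catch_rates: Iterable[float]) -> float:
    """Chance a run both produces a bad change and clears the whole stack."""
    return generator_error_rate * slip_through(catch_rates)
```

For a coin-flip generator and four verifiers that each catch only half of the bad changes, `bad_and_passed(0.5, [0.5] * 4)` is 0.5 × 0.5⁴ = 0.03125, against 0.5 with no verifiers at all. The weaker the generator, the more of the final number is set by the stack.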

A frontier cloud model is good enough that you can get away with a thin harness. A local 27B is not, which is precisely why a local 27B forces you to build the harness you should have built anyway. We did not set out to prove a research point. We set out to fix our own backlog without paying for or trusting a cloud API, and the harness is what fell out of taking that constraint seriously.

The cloud's other problem: the bill stopped being predictable

There is a second reason to run the coder on hardware you own, and over 2025 and 2026 it stopped being hypothetical.

The flat-rate era of cloud AI coding ended in public. OpenAI's CEO admitted in early 2025 that the company was losing money on its $200-a-month ChatGPT Pro plan because people were using it far more than the company expected. Cursor apologized and issued refunds after a June 2025 repricing left users burning a month of credits in a single agentic session. GitHub ended flat-rate Copilot billing on June 1, 2026, after its own product chief called the prior premium-request model "no longer sustainable"; developers posting their own bills projected typical agentic workflows costing several times more.

The cause is structural, not a botched rollout. A human types for eight hours and stops. An agent has no natural ceiling: point it at a backlog and it consumes tokens until you tell it not to. Usage-metered pricing meets unbounded consumption, and the bill stops being a line you can budget.

At enterprise scale that gets vivid. Uber rolled an AI coding assistant out to its engineering organization and burned through its entire 2026 AI-tools budget in four months, with its COO openly questioning whether the spend was tied to features the company actually shipped. Microsoft, by separate reporting, began canceling Claude Code licenses across a division and steering engineers to a flat-rate tool over the per-seat-plus-tokens math. Even Meta capped internal token budgets after costs approached the billions and its CTO pushed back on the "tokenmaxxing" culture, writing that "token usage alone is not a measure of impact of any kind."

On-prem inverts the whole model. The hardware is a one-time capital cost; every run after that carries a zero marginal token bill. Our 41 PRs cost the same in API dollars whether the number is 41 or 4,100, which is to say nothing. That is not a discount, it is a different axis. To be honest about it, the hardware, the power, and the operations are real total cost of ownership. The claim is narrower and it is the one that matters: the marginal cost of the next agentic run, the thing that blew up Uber's budget, is gone.

Why this matters if you cannot use the cloud at all

For most teams this is a cost-and-control story. For some teams it is the only story there is.

GitHub Copilot and Amazon Q do not run on-premises. For an organization on GitHub Enterprise Server behind an air gap, in defense, in regulated finance, in healthcare with code that touches protected data, the dominant agentic coding tools are not a policy fight, they are architecturally unavailable. Sending source code to a third party's inference endpoint is the thing the compliance regime exists to prevent.

A coder that runs entirely on hardware you own, talks only to a model you host, and emits an auditable record of every gate it passed is a different kind of object. It is agentic coding for the rooms that cloud agents cannot enter. That is the same constraint that makes LLMKube exist at all, applied to the act of building LLMKube.

That auditability is not aspirational. This week we shipped a durable, exportable audit record for every Foreman run, capturing which model and endpoint served it, the verdict, and which rails fired, surviving long enough to be a real compliance trail. The harness now writes down what it checked.
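
For a sense of what such a record might contain, here is a sketch built from the four things named above: model, endpoint, verdict, and rails fired. The class name, field names, and JSON-lines encoding are assumptions for illustration; the real schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass(frozen=True)
class RunAuditRecord:
    """Hypothetical shape of one exported audit entry."""
    issue: int                 # issue number the run worked on
    model: str                 # which model served the run
    endpoint: str              # which inference endpoint served it
    verdict: str               # e.g. "GO" or "INCOMPLETE"
    rails_fired: List[str] = field(default_factory=list)


def to_json_line(record: RunAuditRecord) -> str:
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


# Placeholder values only; none of these describe a real run.
example = RunAuditRecord(
    issue=0,
    model="example-coder",
    endpoint="http://inference.example.invalid/v1",
    verdict="INCOMPLETE",
    rails_fired=["fast-gate"],
)
print(to_json_line(example))
```

One line per run, with sorted keys, keeps the export easy to diff and easy to append to.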

The crowd that already owns the hardware

None of this is news to one group: the people who have been running local models in their own home labs for years.

That community is not fringe anymore. r/LocalLLaMA crossed 757,000 members as of June 2026, up more than 267,000 in a single year, and the growth tracks the arrival of open-weight coding models that are genuinely good. These are not toys you settle for. Devstral Small 2, a 24B model, scores 68% on SWE-bench Verified and runs in about 15GB of VRAM, a single RTX 4090 or a 32GB Mac, by its own model card. Qwen's coder models run on the same class of hardware. Capability first; the price is a consequence.

When Ethereum's Vitalik Buterin published his own fully local inference stack in April 2026, running a 35B model on a single laptop GPU, his reason was not the bill. It was not wanting to "take ten steps backward" on privacy just as the tools got good. That is the same instinct underneath the enterprise compliance story: when inference runs on hardware you own, no prompt leaves the machine, no terms of service govern what gets logged, and the ten-thousandth run costs exactly what the first one did.

LLMKube and Foreman are that instinct taken to production. The same hardware a hobbyist already has, plus the operator that schedules it across a fleet and the harness that makes a coin-flip model trustworthy enough to leave pointed at a real repository overnight. We are not going to tell you it is push-button. Running your own inference is a real operational surface, and this crowd knows that better than anyone. We are telling you it is worth it, and that the gate is the thing that turns "a model on my 4090" into "a coworker I can actually hand the backlog to."

What is next, honestly

The most useful thing we can publish next is a number we do not have yet: the harness uplift on a standard benchmark. Not our resolved rate versus a frontier model, a race we would lose, but the same local model's resolved rate with the rails on versus off, per hardware tier, with the false-GO rate alongside it. The delta is the product. We will run it and publish it, good or bad.
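
The measurement itself is simple to state. The sketch below is one way to compute it from per-run results; the `Run` fields and the definition of the false-GO rate (reported GOs that the benchmark's own tests do not confirm) are our assumptions, and no data is included because we do not have it yet.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Run:
    rails_on: bool      # was the harness enforcing its gates?
    reported_go: bool   # did the run claim success?
    resolved: bool      # did the benchmark's own tests confirm the fix?


def _rate(hits: int, total: int) -> float:
    """hits / total, or 0.0 when there is nothing to count."""
    return hits / total if total else 0.0


def resolved_rate(runs: List[Run], rails_on: bool) -> float:
    """Share of runs that reported GO and were confirmed by the benchmark."""
    group = [r for r in runs if r.rails_on == rails_on]
    return _rate(sum(r.reported_go and r.resolved for r in group), len(group))


def false_go_rate(runs: List[Run], rails_on: bool) -> float:
    """Share of reported GOs that the benchmark did not confirm."""
    gos = [r for r in runs if r.rails_on == rails_on and r.reported_go]
    return _rate(sum(not r.resolved for r in gos), len(gos))


def harness_uplift(runs: Iterable[Run]) -> float:
    """Resolved rate with rails on minus resolved rate with rails off."""
    runs = list(runs)
    return resolved_rate(runs, True) - resolved_rate(runs, False)
```

Running the same functions once per hardware tier gives the per-tier breakdown.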

Until then, the honest framing is the one we actually operate under. Foreman is a tireless coworker for the routine and the well-scoped, with a human triage queue for everything it declines. It is not a sprint that finishes itself overnight. It is a backlog that gets quietly smaller while the machines work and the gate refuses to lie.

The model will keep being a coin flip. We are going to keep not betting on it.

LLMKube is Apache 2.0 and runs on NVIDIA, Apple Silicon, and AMD. Whether you are a regulated team that cannot send code to the cloud, an organization watching its Copilot bill go usage-based, or someone with a spare GPU and a backlog that never shrinks, the bet is the same: own the hardware, trust the harness. The repo, including every one of those 41 PRs, is on GitHub, and we are in Discord. If you are running local models on your own hardware, we would like to hear what you are building.