The thing I built was a Go binary that copied markdown files into a repository. I called it an installer. The README called it an installer. The CLI verb was init. For the several months in early 2026, that framing held. Then I sat down to plan the 1.0 release, started typing, and could not write the next paragraph without using the word framework. Not in the marketing-blurb sense. In the load-bearing sense: a small runtime, a set of named extension points, conventional file layouts, a resolver that picked a winner among competing sources, a lockfile, a migration runner, per-backend adapters. The thing I had been calling an installer for four months had been doing framework work the whole time, and I just had not noticed.
This post is about that moment of recognition and what came after. It is about the refactor that turned an installer-shaped framework into a framework-shaped framework. The most surprising part was not the rewrite. It was how little of the concepts had to change. The names were already right. The directories needed to move. The runtime needed a physical line down the middle between framework code and content. But the abstractions I had been using (guides, corpus, sensors, actions, playbooks, adapters) held up under the new framing without revision. The work I had done naming things over the previous months turned out to be the work that made the refactor cheap.
I want to walk through how that happened, what tipped me off, and the one wrong turn the refactor almost made before backing out of it. The story has a tidy ending; the lesson does not, and I will get to that.
The shape was there before the word
Keystone started as keystone init — a binary you ran in a fresh repo and it laid down a harness/ directory full of markdown. Rules for the agent. A small lifecycle. A few sensors that ran your existing lint and test commands. Nothing especially framework-ish. The 0.1 README said installer four times.
Then I started adding things. Each addition felt like a feature, not an architectural shift. The commit log reads like a feature log:
- 0.2: install-time options. Pick your agent. Add a target.
- 0.3: restructured the harness into corpus, guides, sensors, flywheels. Named taxonomy for the first time.
- 0.4: a kind taxonomy on guides and sensors. Different kinds, different load behavior.
- 0.5: a migrate command for forward-compatible upgrades. The harness had a schema now, and the schema had a version.
- 0.10: policy plugins. Org-level rules that projects pulled in.
- 0.11: the policy cascade. Org → Team → Project. Strict and non-strict layers.
- 0.12: playbooks. Ordered chains of actions.
- 0.13: sensors as a tier-aware policy kind.
Read that list once and it looks like product growth. Read it twice and the shape gets harder to miss. By 0.13 there was a runtime with conventions, plugins, a cascade, a lockfile, migrations, and per-agent rendering. The README still said installer.
The thing about a shape forming under a name is that the name is the last thing to change. Each individual commit looks like a small feature; the architectural mass piles up underneath and nobody is watching the mass. I was not, anyway. I was writing the next feature.
When I opened a fresh document called PLAN-10.md and tried to summarize where things stood, the first sentence I wrote was: Convert Keystone from a harness installer with org policy plugins into a harness framework. I stared at that for a while. The conversion was not a conversion. It was an admission.
What an installer was pretending not to be
Here is what makes the installer label dishonest in retrospect. An installer drops files. It is done. The user takes it from there. The framework that Keystone had become did all of this on top:
- It defined named extension points. Guides, corpus, sensors, actions, playbooks, adapters per agent. Each one had a path convention, a frontmatter contract, and a load behavior. That is an API, not a directory listing.
- It resolved conflicts across sources. The same /.md could exist in the project, in a team plugin, in an org plugin. Exactly one won. The resolution order mattered, and the ordering rules were stable across versions. That is a runtime, not a copy step.
- It tracked drift. A lockfile pinned per-source SHAs and per-file hashes. The binary would notice when plugin files had been edited under it. That is integrity enforcement, not file-laying.
- It migrated old layouts forward. keystone migrate would walk an installed harness, apply numbered transforms, and bring it to the current schema. That is a schema migrator, with all the implications of having a schema.
- It rendered the same harness into multiple targets. Claude Code’s CLAUDE.md shape, Codex’s AGENTS.md, Cursor’s .cursor/rules/. The adapter layer translated one source into many. That is a code generator, not a paste-once-and-done.
Take any one of those in isolation and you can argue it was a feature an installer happened to include. Take them together and the argument falls over. Installers do not carry adapter layers. They do not have lockfiles. They do not have migration runners. They certainly do not have a cascade resolver with strict and non-strict overrides.
The give-away, when I went looking, was the layout of the source tree. Go files at the repo root next to a harness/ directory full of markdown that the binary embedded. No physical boundary. A single Go module mixing the runtime with the content the runtime shipped. If you asked me to point at “the framework” and “the content” in the 0.x tree, I would have had to do it with words, not directories. That is a tell.
The vocabulary did the heavy lifting
Here is the part of the story that surprised me most. I was bracing for a rewrite — the kind where you sit down with a blank internal/framework/ directory and start naming concepts from scratch, then translate the old code into them, then translate again because the first names were wrong. That is the usual cost of recognizing a shape late.
It did not happen. The abstractions held.
When I sat down to write the ports-and-adapters layer of the 1.0 plan, I had a table to draft. One row per port. For each port, a name, a path convention, an activation rule. I wrote it expecting to discover gaps. Here is what I found instead.
Every port already had a name. Every name was already in use. Every path was already conventional. Guide meant a rule loaded on every turn at guides//.md. Corpus meant reasoning loaded on demand at corpus//.md. Sensor meant an automated check at sensors/.md. Action meant a single unit of lifecycle work. Playbook meant an ordered chain of actions. Adapter meant the per-agent binding at adapters//…. The table I sat down to invent had been the truth of the code since half-way to 1.0. I was writing it down, not designing it.
Writing it down still mattered. The contract was implicit until that table existed. New ports would have slipped in by accident the same way old ones had. But the design work had already happened, one commit at a time, while I thought I was just shipping features.
That is the angle worth chewing on. The names I had given things while building features carried more architectural weight than any single commit they appeared in. Guide and corpus drew the line between rules-loaded-every-turn and reasoning-loaded-on-demand. Sensor drew the line between an automated check and an aspirational rule. Action and playbook drew the line between an atomic unit of work and an ordered chain of them. Every one of those lines was a load-bearing distinction in the runtime, and I had named them all before I knew they were load-bearing.
The corollary is the part to hold onto: when the refactor came, I did not have to invent a single new concept. I had to relocate the code that implemented the concepts. The concepts themselves stayed put. Naming had front-loaded the design work. The refactor was a directory move on top of a vocabulary that was already correct.
This is the underrated payoff of taking names seriously while building. If you have named the abstractions well in version 0.3, you can refactor the runtime in version 1.0 without renaming anything. If you have named them badly, version 1.0 starts with a six-month tax to fix the words before you can fix the code. The cheap version of a hard refactor is the one where the user-facing vocabulary is already correct and only the implementation moves.
It also changes the social cost of the refactor. A user who learned guide and corpus in 0.3 still knows what those words mean in 1.0. The blog post they wrote about the harness in 0.5 is still accurate. The wiki page in their team’s Notion has not gone stale. Refactor-without-renames is the kind a user does not feel.
Writing it down made it real
Once I admitted what was happening, the next move was to make the framing physical. Not “we say it is a framework now.” That is marketing. The framework had to look like a framework when you opened the source tree, and behave like one when you read the code.
The plan went through six phases. Each had a small handful of commits, none too clever, each one a small structural improvement. Order mattered, because some moves enabled others.
The phased refactor, in order. Each arrow is a phase the next one depends on.
A few things about the order are worth pulling out.
Phase 0 was the cheapest and the most important. Before any code moved, I wrote eight architecture decision records and one port contract per abstraction. Each ADR was a single page: context, decision, consequences, alternatives considered. Each port contract was a single page: path convention, required frontmatter, cascade behavior, an example, the command that scaffolds it. The total page count was around twenty. Writing them took two days. They turned out to be the most useful pages in the whole project, because every subsequent phase referenced them. When a phase started, I re-read the relevant ADR; when a phase finished, I re-read it again to check we had stayed inside the lines.
Phase 1 was the physical line down the middle. Every Go file under the repo root got moved into internal/framework/. The CLI entrypoint moved to cmd/keystone/. The template tree relocated to internal/framework/scaffold/templates/. After Phase 1, you could point at the framework in the directory tree without using words. That is a small thing that turns out to be a big thing. Two months later, when a contributor asks “is this framework behavior or content?”, the answer is a path, not a paragraph.
Phases 2 through 5 were small structural improvements stacking on top of that line. JSON-only config (Phase 2), because YAML had drifted across half a dozen schemas during 0.x. Vendored read-only plugins (Phase 3), because the plugin model had been “edit the markdown in harness/policies/” and that turned out to be the wrong default. Projects would silently diverge from upstream and nobody would notice. Conventions, generators, and a doctor command (Phase 4), because once you have named ports you can write generators that scaffold an adapter for each port. Per-port token budgets (Phase 5), because once you have named ports you can also count tokens per port and tell the user when one is bloated.
Phase 6 was the documentation phase. An upgrade guide for 0.x users. A compatibility doc spelling out what 1.x promised to keep stable. Nothing structural changed in Phase 6. The point was to make a contract that the next year’s worth of changes had to respect.
Each phase landed in a separate set of commits. Each commit was small enough to read in one sitting. None of them were a “big refactor commit.” That is the texture of a refactor that respects the work already there: many small surgical moves, each one preserving behavior, with the architecture emerging from the sum.
The plugin draft we threw away
I want to talk about a wrong turn, because every refactor has one, and the ones that do not get talked about tend to be the ones that bite later.
The first draft of the 1.0 plan made the built-in defaults (universal guides, lifecycle playbook, default sensors, per-agent adapters) into first-class plugins shipped embedded in the binary. The plan read:
The universal engineering corpus/guides, the lifecycle actions, the task playbook, and the default sensors all become first-class policy plugins — same shape as user-installed plugins, loaded by the same engine.
The promise was symmetry. Built-ins and external policy would travel the same pipeline. One loader, one cascade, no special cases. It was clean. It was elegant. It also had a giant cost buried in the elegance: the moment defaults are loaded as plugins, editing a default stops being “edit a markdown file in your repo” and becomes “fork the embedded plugin or override it from a higher layer.”
That breaks the Rails-style ergonomics I wanted at the project layer. The whole point of conventions over configuration is that the conventional file is just sitting there in the repo, your repo, your git, ready to be edited like any other file. If the default lives inside the binary and only appears in the consumer’s repo as a shadow that overrides it, you have turned a one-line edit into a four-step debugging session. Where is this rule actually coming from? Is the override winning? Why is my edit not taking effect? I have seen that pattern in tools that load defaults from inside their distribution, and the answer is always the same: it is a constant low-level tax on every user, paid forever.
I caught it because I sat with the plan for a day before starting any code. Two of the ADRs (number five, Conventions, not plugins, and number two, Framework / client boundary) ended up being the place where the wrong turn got walked back. The decision was: defaults are scaffolded into the consumer’s harness// on keystone init, from embedded templates. From that moment on, defaults are project content. The user edits them as markdown files in their own git. There is no override mechanism for defaults, because there is nothing to override. There is just one file sitting on their disk.
Plugins still exist. They do one job: share policy across projects. Read-only, vendored, hash-verified, drift-reset on the next run. They are not the mechanism for shipping defaults, and the mechanism for shipping defaults is not the mechanism for sharing policy. Two concerns, two mechanisms. The earlier symmetry was buying elegance with the user’s debugging time.
The lesson I want to draw from this is narrower than “avoid premature symmetry.” It is: when two mechanisms produce a surface that looks identical to the user, you should still ask whether the editing model is identical. If the user edits both the same way, symmetry is paying its rent. If the user edits one of them and merely reads the other, you have two concerns that happen to share a shape, and unifying them costs you the editing UX of the one the user actually edits.
I had to throw away two days of plan-writing to walk that back. It was worth it. The walkback is what made the 1.0 surface usable.
Code moved, concepts did not
When 1.0 shipped, I went back through the changelog to count. The framework had moved every Go file into a new location. It had dropped YAML. It had rewritten the cascade resolver. It had added a vendored plugin model, a doctor command, a budget command, a port-level scaffold generator, and per-agent adapter regeneration. The repo layout looked nothing like 0.x.
The user-facing vocabulary had not changed.
A 0.x user who never read the 1.0 plan could open the new docs and find every word they already knew. Guide still meant rules loaded every turn. Corpus still meant reasoning loaded on demand. Sensor still meant an automated check. Action and playbook still meant what they meant. The only new word at 1.0 was port, and port was a name for something the user already understood without the name. Namely, the named slot a piece of content lives in.
This is the payoff I underestimated when I was naming things in 0.2 and 0.3. The cost of naming a thing well at the start is that you have to actually sit with it for a few minutes and ask whether the name says what the thing does. The benefit is that later, when you refactor the runtime, you do not have to renegotiate the contract with every user who learned the old word. The user’s mental model does not change. Only the implementation does.
A refactor that does not rename anything is the cheapest kind for a user. They keep their docs, their wiki, their training onboarding. They keep the words they say out loud when describing the product to a teammate. The internals can move freely as long as the names stay fixed.
How shapes form under names
Step back from Keystone for a second. The pattern is broader than one project.
Tools grow into frameworks. They almost always do, if they live long enough. The path looks like this: a small utility solves a small problem. It picks up a configuration file. The config file grows extension points. The extension points need naming conventions. The naming conventions need a resolver. The resolver needs ordering rules. The ordering rules need a way to opt out, then a way to opt back in. Suddenly there is a runtime, and the runtime has a contract with everything plugged into it, and the contract is a framework whether you call it one or not.
The interesting question is when to notice. Too early and you are over-designing: naming abstractions before you have three concrete uses for them, building cascade rules for a single tier, writing port contracts for one port. Too late and the runtime has accumulated implicit contracts that nobody wrote down, and the cost of writing them down is paid in surprise regressions when you try to clean up.
The way I think about it now: the moment you find yourself adding a resolver (anything that picks a winner among multiple sources of the same concept) you are in framework territory. Lockfiles and migrations are also strong signals. Per-target rendering (the adapter layer) is a hard signal. Any one of those, taken alone, is not enough. Two of them together is. Three of them together and the question is not if you are building a framework, it is whether the framework is going to be honest about itself in its README.
If you only catch it at three, the way I did, you are not in trouble. You are in the spot where naming the shape costs you a few days of writing and a few weeks of refactoring, and the user-facing surface comes out untouched on the other side. If you catch it at six or seven signals deep, the cost is higher, because by then there are real users with real assumptions and the cleanup runs into compatibility costs.
Three things to do this week
If you are reading this and recognizing your own tool in the description, here is what I would do this week, in order.
Write the port table. Open a new document. List every named extension point in the tool. For each one, write the path convention, the activation rule, and the shape of the file (or config) that goes there. Do not stop until the table is exhaustive. The table is the easiest possible draft of the framework contract. If it takes an hour, you were already a framework and your vocabulary was holding the design up; if it takes a week and you keep discovering anonymous concepts, you have found the work that needs naming first. Either way, this is the highest-value hour you can spend on the project this week.
Draw the physical boundary. Look at your source tree. Can you point at “framework code” and “content the framework happens to ship” using a path, or do you have to use words? If it is words, move directories until it is a path. This is the single most clarifying refactor for a maturing tool, and the tests will catch you if you break anything along the way. One day’s work for a small project, a week for a medium one. From the moment it lands, every conversation about “where does this go?” gets shorter.
Write ADRs for the decisions you have already made. Not the future decisions, but the ones already locked in by the code. Three to ten ADRs of one page each. Context, decision, consequences, alternatives considered. Future you will thank present you the next time someone asks why X is the way it is. The act of writing them often surfaces a wrong turn you can still back out of cheaply, exactly like the plugin draft I walked back. Better to find that during ADR-writing than during a refactor PR.
After those three, you will have a clear picture of what you have, where it lives, and why. The work that comes next is the work that actually changes things — the renaming, if any, the directory moves, the new scaffolders, the migration path for existing users. That work is real and it takes time. But you can scope it, plan it, and ship it in phases that do not break users, because the contract you wrote down in the first three steps is the contract every subsequent phase has to honor.
The thing I keep coming back to from the Keystone refactor is how much of the work had been done already by past versions of me when I named guide and corpus and sensor and action in different commits weeks apart. They were not designing a framework. They were each adding one feature and giving it a name. The framework emerged from the sum of those names. The 1.0 refactor was the moment I looked up and noticed.
If you have been adding features and naming them well, your framework is already there. It is just waiting for you to write it down.
Keystone
Keytsone is now 2.1.1! It is open source and MIT-licensed. I am very open to feedback, so if you have any, please create a discussion or an issue on GitHub. It has a real chance of landing.
A lot landed between the 1.0 cut and 2.1.1, and all of it sits on top of the framework/client split 1.0 introduced. The shape of the project didn’t change. The surfaces around it did.
2.0 was the big one, shipped June 17. A new primitive taxonomy ran through the harness end to end. An in-binary MCP server moved keystone onto the same wire its agents already speak, so there’s no separate process to babysit. A localhost dashboard surfaced live state for the first time. An eval suite came with it. The disk layout moved alongside all of that, and the upgrade was still one command: keystone migrate.
2.1.0 followed the next day. The old patches mechanism retired in favor of a versioned migrations subsystem with paired Up/Down transforms under migrations//, run with keystone migrate up | down | status. The plugin→policy rename finished across Go source, JSON schemas, and docs in the same release. Installs that hadn’t migrated yet started warning and continuing instead of breaking, so the upgrade path stays soft.
2.1.1 reshaped the dashboard. It’s now an HTMX SPA with a single
swap target, real back/forward navigation, and fragment responses keyed off HX-Request. The 14 single-purpose nav links collapsed into five sections: Observability, Harness, Sources, Flywheels, Quality. SSE topic narrowing means a widget only refreshes when its own path changes. A per-session audit log lands at .keystone/state/audit/session--.jsonl, opened with O_CREATE|O_EXCL so nothing ever overwrites a prior run. Cmd+K (Ctrl+K on Windows and Linux) opens a search popover from any page.If you’re still on 1.x, the upgrade is the same as the install (except Brew, which has an upgrade subcommand. It carries you the whole way to 2.1.1. To bring a project harness up-to-date with the new core framework structure, run keystone migrate up. This will patch core files only and will never touch user-edited files.













