Your Fuzzer Is Only as Smart as Its Oracle

A migration my schema tool generated passed every check I had. The final schema matched the target exactly — convergence, green. Then I looked at the plan it took to get there: DROP TABLE; CREATE TABLE. On a table with data. The destination was right; the path would have erased a production database.

The test was green because I was checking the wrong thing. I had an oracle for where it ended up and none for how it got there.

That gap is most of what this post is about. I build developer tools — SDKs, a compiler, a declarative schema-management system — the kind of software where one wrong edge case ships to everyone downstream. And I spent a long time trying to get coding agents to test this stuff for me. Claude Code, Codex — with enough prompting and the right skills you can get something that looks like a test suite. What I could never get was coverage that followed the real dev flow without quietly skipping the case that mattered. It's convincing right up until you check whether it checked anything.

What finally worked better was semantic fuzzing: constrained, deterministic random generation, run against properties instead of a reference output, with the agent writing the generators and domain rules rather than playing tester. It catches a lot. And the more I run the harness, the more it improves.

But the interesting part isn't the fuzzing. AI made building the harness cheap: generators, adapters, the long tail of domain rules. It did not make the harness run itself correctly, and it did not write the part that decides what "correct" means when you have no second implementation to diff against. That part was always the real work, and it still is.

So this is a post about oracles wearing a fuzzing costume. As usual: notes on where my thinking has drifted, not advice.

Where I've landed for now (and expect to revise):

Randomness doesn't find bugs. The oracle does; randomness just walks it there.

So the value isn't the generator. It's the law you check.

The laws worth having need no reference implementation — they're algebraic relations the tool must satisfy by its own logic.

AI dropped the cost of the harness. The scarce resource moved from headcount to the skill of seeing what the law is.

"Semantic" needs pinning down

"I fuzz it" says almost nothing on its own. At one end of the dial is crash fuzzing — uniform noise at a parser, asking only "did it fall over?" At the other is reference-model testing — diffing every output against a full second implementation you now have to build and trust. Semantic fuzzing sits in between, and the whole game is what you do with the middle.

Three properties make it work. Generation is distribution-constrained (biased toward inputs with real structure — schemas and migration sequences that could plausibly exist — not uniform noise), deterministic (every case carries a seed, so a failure replays exactly), and shrinkable (a failing case minimizes itself to the smallest thing that still breaks).
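To make those three properties concrete, here is a minimal Python sketch. It is not the harness described in this post: the operation names, the weights, and the `toy_fails` predicate are all hypothetical stand-ins, and the shrinker is a simple greedy one rather than what a property-based testing library would do.

```python
import random
from typing import Callable, List

# Hypothetical mutation vocabulary. The weights bias generation toward
# operations that are common in plausible migration histories.
OPS = ["add_column", "add_index", "rename_column", "drop_column"]
WEIGHTS = [5, 3, 1, 1]


def generate_case(seed: int, max_len: int = 8) -> List[str]:
    """Deterministic: the same seed always yields the same mutation sequence."""
    rng = random.Random(seed)
    length = rng.randint(1, max_len)
    return rng.choices(OPS, weights=WEIGHTS, k=length)


def shrink(case: List[str], fails: Callable[[List[str]], bool]) -> List[str]:
    """Greedily drop one step at a time for as long as the case still fails."""
    current = list(case)
    progress = True
    while progress:
        progress = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if candidate and fails(candidate):
                current = candidate
                progress = True
                break
    return current


def toy_fails(case: List[str]) -> bool:
    """Toy bug: pretend the failure needs a rename followed later by a drop."""
    if "rename_column" not in case:
        return False
    return "drop_column" in case[case.index("rename_column"):]
```

A failing case reported with its seed can be regenerated with `generate_case(seed)`, and `shrink` reduces it to the shortest sequence this greedy pass can find that still trips the predicate.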

That's property-based testing's family tree, and I'd rather say so than pretend it's new. The goal isn't maximum randomness — it's productive randomness: reproducible, minimizable, aimed where bugs live. The one thing it refuses to do is what reference-model testing does — build a second implementation to check the answer. It checks relations the answer must obey instead, which is the move the rest of this post turns on.

The oracle is the whole game

A fuzzer is never smarter than its oracle.

The generator is just search. It produces inputs; it has no idea what's wrong with any of them. The thing that decides "this is a bug" is the oracle, and whatever the oracle can't recognize, the fuzzer cannot find, however many cases you burn.

Point a billion inputs at a crash oracle and you get crashes. You won't get a migration that converges to the right state via a catastrophic path, or a rollback that silently drops a constraint. Those don't crash. They're wrong semantically, and a crash oracle is blind to exactly that.

The trap here is reaching for a reference implementation — "I'll build a correct model and diff against it." But a full reference model of a schema engine is a second schema engine, with its own bugs to maintain. You doubled the work and got a thing you also can't trust.

The way out: stop asking for a reference and ask for laws instead, relations the tool must satisfy regardless of the "right answer."

| Law | Relation checked | Reference needed? | Axis |
| --- | --- | --- | --- |
| convergence | `apply(spec)` → introspect → residual drift `== 0` | no | correctness |
| idempotency | `apply(spec)` twice → 2nd run is a no-op | no | correctness |
| rollback | `base → target → base` → back to base, exactly | no | correctness |
| safety | destructive change → must surface hazard / require approval | no | safety |

None needs the correct schema in advance — each is true by the meaning of the operation. convergence doesn't ask "is this schema right?", it asks "did you reach the state you claimed?" That's the metamorphic move: don't verify the answer, verify a relation the answer must obey.
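The three correctness laws can be sketched as plain predicates. This is a minimal illustration in Python, assuming a hypothetical `ToyEngine` with `apply` and `introspect` methods and a toy schema model (table name to set of column names). A real harness would wrap the actual tool where the toy engine sits.

```python
import copy
from typing import Dict, List, Set

Schema = Dict[str, Set[str]]  # toy model: table name -> column names


class ToyEngine:
    """Hypothetical stand-in for the tool under test."""

    def __init__(self) -> None:
        self.state: Schema = {}

    def apply(self, spec: Schema) -> List[str]:
        """Move the live state to `spec` and return the statements executed."""
        plan: List[str] = []
        for table in sorted(set(self.state) - set(spec)):
            plan.append(f"DROP TABLE {table}")
        for table in sorted(spec):
            if table not in self.state:
                plan.append(f"CREATE TABLE {table}")
                continue
            have, want = self.state[table], spec[table]
            plan += [f"ALTER TABLE {table} ADD COLUMN {c}" for c in sorted(want - have)]
            plan += [f"ALTER TABLE {table} DROP COLUMN {c}" for c in sorted(have - want)]
        self.state = copy.deepcopy(spec)
        return plan

    def introspect(self) -> Schema:
        return copy.deepcopy(self.state)


def law_convergence(engine: ToyEngine, spec: Schema) -> bool:
    engine.apply(spec)
    return engine.introspect() == spec  # residual drift == 0


def law_idempotency(engine: ToyEngine, spec: Schema) -> bool:
    engine.apply(spec)
    return engine.apply(spec) == []  # second run is a no-op


def law_rollback(engine: ToyEngine, base: Schema, target: Schema) -> bool:
    engine.apply(base)
    before = engine.introspect()
    engine.apply(target)
    engine.apply(base)
    return engine.introspect() == before  # back to base, exactly
```

None of the three predicates knows what a "good" schema looks like. Each one only compares the engine against its own claims, which is why any generated `spec` can be fed to them.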

Note safety as a peer, not a footnote. "Reached the right state" and "took a safe path" are different axes — the DROP TABLE; CREATE TABLE from the opening satisfies convergence and still erases your data. The oracle must check the path, not just the destination. This is the law that would have caught my green-but-catastrophic migration.
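A path-level check can be sketched like this. It is a minimal Python illustration under two assumptions: the plan is available as a list of SQL strings, and the tool reports which step indices it surfaced as hazards (the `flagged` set, a hypothetical interface). The three regex patterns are illustrative only; a real rule set would be dialect-specific and much longer.

```python
import re
from typing import List, Set

# Assumed hazard patterns, for illustration only.
DESTRUCTIVE = [
    re.compile(r"^\s*DROP\s+TABLE\b", re.IGNORECASE),
    re.compile(r"^\s*ALTER\s+TABLE\b.*\bDROP\s+COLUMN\b", re.IGNORECASE),
    re.compile(r"^\s*TRUNCATE\b", re.IGNORECASE),
]


def destructive_steps(plan: List[str]) -> List[int]:
    """Indices of plan steps that match any destructive pattern."""
    return [i for i, stmt in enumerate(plan) if any(p.search(stmt) for p in DESTRUCTIVE)]


def law_safety(plan: List[str], flagged: Set[int]) -> bool:
    """Every destructive step must be among the hazards the tool surfaced."""
    return all(i in flagged for i in destructive_steps(plan))
```

Run against the opening example, `law_safety(["DROP TABLE users", "CREATE TABLE users (id int)"], set())` is false even though the final schema matches the target: the law inspects the plan, not the end state.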

A worked example: schemas, extensions, a Docker loop

One generator feeds a pipeline; every stage is a place to hang a law.

 generate schema + mutation sequence
            │
            ▼
   IR / shadow DB                ── convergence, idempotency  (in-process)
            │
            ▼
   diff + online-migration plan  ── safety, plan honesty      (in-process)
            │
            ▼
   apply to Docker Postgres/MySQL── locks, extensions, races  (Docker)
            │
            ▼
   introspect + review           ── drift == 0

Run that across thousands of seeds and you're not testing "does ADD COLUMN work." You're hitting combinations a human suite never reaches — especially extension combinations. Base PostgreSQL/MySQL is the easy part; bugs hide where citext meets a generated column meets a partial index meets an old trigger. Enumerating those by hand used to need a team. Generated, it's a distribution you tune.
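"A distribution you tune" could look something like the following. This is a hedged Python sketch: the feature names are taken from the paragraph above, but the inclusion probabilities and both function names are invented for illustration.

```python
import random
from typing import Dict, FrozenSet, Iterable

# Hypothetical weights: the probability that each feature appears in a
# generated schema. Tuning these shifts where the search spends its time.
FEATURE_WEIGHTS: Dict[str, float] = {
    "citext": 0.3,
    "generated_column": 0.4,
    "partial_index": 0.4,
    "legacy_trigger": 0.2,
}


def sample_features(seed: int, weights: Dict[str, float] = FEATURE_WEIGHTS) -> FrozenSet[str]:
    """Pick a feature combination deterministically from a seed."""
    rng = random.Random(seed)
    # Sorting keeps the draw order stable regardless of dict insertion order.
    return frozenset(name for name, p in sorted(weights.items()) if rng.random() < p)


def combo_coverage(seeds: Iterable[int], weights: Dict[str, float] = FEATURE_WEIGHTS) -> Dict[FrozenSet[str], int]:
    """Count how often each feature combination shows up across a seed range."""
    counts: Dict[FrozenSet[str], int] = {}
    for seed in seeds:
        combo = sample_features(seed, weights)
        counts[combo] = counts.get(combo, 0) + 1
    return counts
```

Looking at `combo_coverage(range(10_000))` is one way to see which combinations the current weights rarely reach, and to raise a weight when a rare combination is where the bugs have been.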

But be honest about which layer answers:

| Layer | Runtime | Answers | Can't answer |
| --- | --- | --- | --- |
| In-process | PGlite / `node:sqlite` | convergence, idempotency, real SQL exec | multi-connection, lock fidelity, full extension catalog |
| Docker | real Postgres / MySQL | locks, isolation, extensions, version matrix | provider parity, hosted-backend quirks |
| Live | real cloud / hosted | IAM/KMS enforcement, audit immutability | (bulk testing: too slow/expensive) |

The same generated case fans out: cheap laws run in-process by the thousand; expensive laws that need a real engine run in Docker, sampled. The layer is chosen per property, not per pipeline. Putting a lock-contention law in the in-process tier wouldn't make it fast — it would make it lie.
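Per-property routing can be sketched as a small table plus a deterministic sampler. In this Python illustration the law names beyond those already discussed, the layer labels, and the sample rates are all assumptions chosen for the example, loosely following the layer table above.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Law:
    name: str
    layer: str          # cheapest layer that can honestly answer this law
    sample_rate: float  # fraction of generated cases that run it


# Assumed routing; the rates are illustrative, not measured.
LAWS = [
    Law("convergence", "in_process", 1.0),
    Law("idempotency", "in_process", 1.0),
    Law("safety", "in_process", 1.0),
    Law("lock_contention", "docker", 0.05),
    Law("extension_matrix", "docker", 0.05),
    Law("iam_enforcement", "live", 0.001),
]


def route(seed: int, laws: List[Law] = LAWS) -> Dict[str, List[str]]:
    """Decide, deterministically per seed, which laws run on which layer."""
    plan: Dict[str, List[str]] = {}
    for law in laws:
        # Seeding per (case, law) keeps sampling stable when laws are added.
        rng = random.Random(f"{seed}:{law.name}")
        if rng.random() < law.sample_rate:
            plan.setdefault(law.layer, []).append(law.name)
    return plan
```

The point of the structure is that `layer` is a field on the law, not on the pipeline: moving `lock_contention` to `in_process` would be a one-word change, and the table makes that kind of dishonest shortcut visible in review.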

The flywheel is real, but it has a catch

"Run it more, it gets better" is true, but not because green accumulates. A suite at 100% pass proves nothing about whether it can catch anything. The flywheel only turns toward value if two things happen:

Every failure minimizes and re-enters as a regression case. Each real bug becomes a permanent, tiny, deterministic test. The corpus grows in the shape of your mistakes.
You periodically test the oracle itself. Inject a known fault (delete a hazard rule, invert a guard, drop a constraint in the introspector) and confirm some law kills it.

fault injected ──▶ does a law fail?
                     ├─ yes → oracle bites. trust the green.
                     └─ no  → green is decorative. fix the oracle.

That second step is the one I see skipped most, and the one that keeps the whole thing from becoming theater. An oracle you never test is a claim, not a check. Honest flywheel: minimize failures into the corpus, and verify the verifier. Do that and it genuinely compounds.
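Verifying the verifier can be sketched as a small mutation loop. This is a Python illustration with hypothetical names throughout: `healthy_tool` stands in for the real hazard detector, `mutant_missing_drop_rule` is the same detector with one rule deliberately deleted, and the law is a simplified keyword check.

```python
from typing import Callable, Dict, List, Set

Plan = List[str]
Tool = Callable[[Plan], Set[int]]  # returns indices of steps it flagged as hazards
LawFn = Callable[[Plan, Tool], bool]


def healthy_tool(plan: Plan) -> Set[int]:
    return {i for i, s in enumerate(plan) if s.startswith(("DROP TABLE", "TRUNCATE"))}


def mutant_missing_drop_rule(plan: Plan) -> Set[int]:
    # Injected fault: the DROP TABLE hazard rule has been deleted.
    return {i for i, s in enumerate(plan) if s.startswith("TRUNCATE")}


def law_flags_destructive(plan: Plan, tool: Tool) -> bool:
    """Every destructive step must be flagged by the tool."""
    destructive = {i for i, s in enumerate(plan) if "DROP" in s or "TRUNCATE" in s}
    return destructive <= tool(plan)


def surviving_mutants(mutants: Dict[str, Tool], laws: List[LawFn], corpus: List[Plan]) -> List[str]:
    """Return the injected faults that no law caught on any corpus case."""
    survivors = []
    for name, mutant in mutants.items():
        killed = any(not law(plan, mutant) for law in laws for plan in corpus)
        if not killed:
            survivors.append(name)
    return survivors
```

An empty result from `surviving_mutants` means every injected fault was caught by at least one law. Any name that comes back is a fault the suite would have shipped with a green run.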

What got cheap (and what didn't)

A few years ago this pipeline was a staffing decision: generators, shadow DB, migration planner, Docker orchestration, per-dialect adapters — a team and a quarter. The barrier wasn't ideas; it was implementation labor, and that priced most people out of serious simulation testing.

That wall came down, and I can point to where. A per-dialect adapter — the layer that maps my IR onto one database's quirks — used to be a multi-day slog of reading docs and discovering edge cases by getting burned. The last one I added came out in an afternoon: I described the IR and the target's introspection format, the agent drafted the adapter and a first pass of its quirk-rules, and I spent my time reviewing rather than typing. Parsing is a commodity, local execution is cheap, and the long tail of boilerplate writes itself faster than I can spec it.

But notice exactly what got cheap. Building the harness did. Knowing which relations are actually invariant did not. Routing each law to a layer that can honestly answer it did not. Checking that the checker still bites did not. The agent wrote the adapter; it could not tell me whether safety belonged on the same axis as convergence, or whether my green meant anything. The scarce skill went from "can you build the harness" to "can you see what the law is." The implementation got democratized. The judgment didn't.

Where it breaks

Local simulation kills a startling fraction of bugs for almost no money. It does not replace a remote environment:

Provider parity — the real cloud API does things no emulator reproduces.
Real enforcement — IAM denies, KMS policies, VPC isolation, audit-log immutability: only provable against the real thing.
Hosted-backend quirks — managed Postgres, pooler modes, hosted auth schemas. My local Docker Postgres happily ran a migration that deadlocked on a managed instance behind a transaction-mode pooler, because the pooler changes how sessions and locks interleave — behavior the local container simply doesn't have.
Genuine concurrency at scale — race probes catch a lot; they don't certify production load.

These belong on a live tier that runs rarely and serves as a calibration anchor, not as bulk testing. Local says the logic is right; only live says the platform agrees.

Objections I'd raise myself

"This is just property-based testing." The generation and shrinking, largely yes. The contribution is the framing: the oracle is the artifact, the laws need no reference, and safety is a first-class axis.
"Why not just have the agent write and run the tests?" I tried, hard. The agent will produce a plausible suite, but it tends to test the happy path it just wrote and skip the case that actually bites — and you can't tell from the green. A fuzz harness inverts that: the agent supplies generators and laws (things I can review), and generation finds the cases neither of us thought of. The agent is good at proposing rules, unreliable at being exhaustive.
"AI-written oracles are subtly unsound." This is the failure mode I fear most: a hole in the oracle reads as safety. So: borrow battle-tested algorithm cores, keep the bespoke part to domain rules I can eyeball in review, and never let the harness downgrade a real bug to "expected." The verifier is the last thing you let the AI write unsupervised.
"Green still lies." Until you mutation-test the oracle. That's not polish; it's what makes green mean anything.
"This grows into an unmaintainable matrix." Only if every law runs on every layer. The counter-discipline: route each property to the cheapest honest layer, and retire laws that stop catching anything.

Where this is going

I came in through a practical door — semantic fuzzing catches a lot of bugs cheaply — and walked out somewhere narrower: the random part was never the point. The oracle was, and AI made everything except the oracle cheap enough to finally see that.

In the AI era, rigor is what pays off: it compresses a vague situation into a few hard facts an agent can act on. Here that takes its sharpest form. Rigor is what turns blind randomness into a bug-finding machine: the law does the finding; the noise just does the walking.

The implementation barrier is down. Knowing what "correct" even means for the thing you built is the whole job now.

Which brings me back to that green migration with DROP TABLE hiding in its path. The fuzzer missed it, but not because the randomness was weak. It missed it because I hadn't yet written the law that calls erasing your data a bug. The day I did, generation found the case in minutes; it had been walking past it the whole time, waiting for me to say it mattered. That's the job now: not running more cases, but learning to name the failures worth catching, one law at a time.