Everyone Was Talking About AI Agents. We Were Asking Who Was Responsible.

The Question Nobody's Asking

The advice is consistent enough to count as consensus:

Configure your AI like you'd configure a skilled teammate.

Set up structured context, encode your preferences, close feedback loops, delegate progressively. Eugene Yan's recent piece on working with AI is the best formulation of this I've read. Worth reading in full. If you're short on time:

Context as infrastructure — Organize your workspace so the model can navigate it. Treat each session like onboarding a new hire: CLAUDE.md, INDEX.md, suggested reading order.
Taste as configuration — Encode how you want the model to behave, push back, and teach. Frequent workflows become skill files that load on demand.
Verification for autonomy — Shift verification left. Let the model run evals, inspect browser output, read its own errors. You can't delegate what you can't verify.
Scaling via delegation — Move from line-by-line instructions to end-to-end specs. Parallel sessions, git worktrees, progressively larger task chunks.
Closing the loop — Mine session transcripts to update CLAUDE.md and skills. Make corrections inside the session so the transcript captures the before-and-after.

But there's a question the framework doesn't address.
It's not a criticism. It's a scope boundary, and the boundary matters more as the stakes get higher.

What happens when the session ends? Not just this session.
What happens when the operator who built all that context leaves the project?
When a new engineer opens the codebase cold and the AI boots with no history of what came before?
When the rules that worked so well turn out to be personal preferences that lived in one operator's workflow, not documented failures encoded into the project itself?

Without a structure designed to survive operator changes, the answer is: the model knows what it can read from the current files. Every incident that shaped the previous operator's judgment resets. Every hard-won rule that came from a real failure is invisible to the new session.

That's the gap MICA (Memory Invocation & Context Archive) fills. But it's important to be precise about what kind of gap it is.

Not a Scale Problem — a Threat Model Problem

Eugene explicitly notes in his piece that his principles extend beyond individual use: team norms, agent harness design, organizational infrastructure. A well-committed CLAUDE.md at repo level can outlive any single operator. He makes this point himself.

So the distinction isn't scale, and it isn't storage. It's origin.

Eugene’s configuration encodes preferences, which represent the accumulated judgments of a skilled operator refined across sessions.

MICA's Design Invariants (DIs) encode incidents instead.

Each DI is a binding rule the model must operate within: a constraint extracted from a specific failure, not a preference. Every critical DI must carry a binding.origin_episode field that records that failure.

The rule exists not because someone prefers a certain behavior, but because a specific failure occurred, was documented, and was encoded as a constraint. An operator who has never touched the project can read a DI and understand not just what the rule is, but why it exists and what the consequences were when it wasn't followed.

That information can't come from preference accumulation. It comes from incident history.

Eugene's framework defends against an operator who hasn't thought carefully about their setup. MICA's framework defends against an operator — or a model — that doesn't know what this project has already paid to learn.

Why We Don't Use Agents

That threat model difference leads directly to the agent question, and to the choice that most teams in this space make differently from us.

Eugene recommends parallel agents, git worktrees, and progressive delegation to larger work chunks. We don't. The reason isn't technical immaturity, though the empirical record is worth noting: Gartner predicts that over 40% of agentic AI projects will be canceled by the end of 2027; a 2025 enterprise survey found that only 14% of agent pilots had successfully scaled; and the FSE 2025 "Agentless" paper showed that a simple three-phase pipeline (localization, repair, patch validation) was competitive with complex agent orchestration at a fraction of the cost. But none of that is the reason.

The reason is a single question:

When an agent gets something wrong, who is responsible?

In compliance-sensitive domains, such as financial signal generation, citable scientific archives, and legal audit trails, that question must have a person as its answer.

Here is what MICA's Package Conformance Tests (PCT) produce in a typical deployment, running eleven deterministic checks at session start before the model acts:

PCT-001 [PASS] mica.yaml found
PCT-006 [WARN] mica_spec 0.2.6 is 2 version(s) behind canonical 0.2.8 -- consider upgrading
PCT-010 [PASS] all 6 critical DIs have binding
PCT-010 [WARN] doctrinal binding (no episode code, version ref, or date): ['DI-001', 'DI-002', 'DI-004']
              -- ground origin_episode in a real incident
PCT-009 [PASS] CLOSED CONTRACT

Deterministic. Inspectable.

A person sees this output and decides to proceed. The model activates only after that decision. However, if you add an orchestrator agent before PCT, three new failure modes appear:

First, the agent might see a WARN alongside a CLOSED CONTRACT and continue anyway.
Second, the agent might judge the PCT-010 WARN as purely informational and deprioritize it.
Third, PCT might not run at all because the agent skips it.

Governance that is probabilistic is not governance. It's a suggestion. The moment you introduce a stochastic layer before the verification check, the check becomes contingent on the model's disposition — which shifts by session, by context, by what else is in the window.

Eugene's framework contains a related principle: verification must precede delegation, and effective delegation requires defining success criteria so you can verify the outcome. His implementation is configuration.

MICA's implementation takes a different cut: don't delegate the verification itself. The human runs PCT. The human reads the result. The model then operates within the confirmed state.

Agent + MICA is technically possible. But routing PCT through an agent changes what MICA is.

Instead of a gate running before the model acts, it becomes a document that the agent reads if it decides to. Think of the difference between a hard-failing linter that blocks a CI build and a README that documents the same rule: one stops the problem from shipping, the other gets skipped when someone is in a hurry.

A gate is structural. A document is advisory. Our domains require a gate. Here's what that gate looks like in practice.
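A minimal sketch of that ordering in Python, assuming PCT output lines follow the "PCT-NNN [STATUS] message" shape shown in the sample above. The names parse_pct, gate, and the confirm callback are hypothetical and not part of MICA. The sketch only illustrates the structure: every result is surfaced, and nothing proceeds without an explicit human decision.

```python
import re
from typing import Callable, NamedTuple


class PctResult(NamedTuple):
    check_id: str
    status: str
    message: str


# Assumed line shape, taken from the sample output: "PCT-010 [WARN] message"
_LINE = re.compile(r"^(PCT-\d{3}) \[(PASS|WARN|FAIL)\] (.*)$")


def parse_pct(output: str) -> list[PctResult]:
    """Parse PCT output; indented continuation lines extend the previous message."""
    results: list[PctResult] = []
    for line in output.splitlines():
        match = _LINE.match(line)
        if match:
            results.append(PctResult(*match.groups()))
        elif results and line.strip():
            last = results[-1]
            results[-1] = last._replace(message=f"{last.message} {line.strip()}")
    return results


def gate(output: str, confirm: Callable[[list[PctResult]], bool]) -> bool:
    """Return True only when the human-supplied callback approves the full result list."""
    results = parse_pct(output)
    if not results:
        return False  # PCT did not run or printed nothing: fail closed
    return bool(confirm(results))
```

In an interactive session, confirm could simply print each result and read a yes or no from the operator. What matters is that it is not a model call, so a WARN cannot be quietly deprioritized on the way to the model.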

MICA Mapped to Eugene's Principles

The mapping is worth making explicit, because MICA isn't an alternative to Eugene's framework. It's an implementation of it, extended from individual to institutional context:

# mica.yaml

# ── Eugene: Context as infrastructure ─────────────────────────────────
# mica.yaml declares what loads, in what order, before anything runs.
name: alecta-stock
mica_spec: "0.2.8"
mode: memory_injection

# ── Eugene: Taste as configuration ────────────────────────────────────
# Not personal taste — incident-grounded invariants.
# Each DI traces to a real event in binding.origin_episode.
di_policy:
  namespace_mode: sequential
  critical_binding_required: true   # PCT-010 escalates to FAIL if unbound
  max_archive_age_days: 180         # PCT-012: WARN when archive goes stale

layers:
  - name: archive
    path: alecta-stock.mica.archive.json
  - name: playbook
    path: MEMORY_PLAYBOOK.md

# ── Eugene: Verification for autonomy ─────────────────────────────────
# Verification happens here, before delegation begins.
# PCT runs deterministically. Human reads output. Model activates after.

# ── Eugene: Closing the loop ───────────────────────────────────────────
# binding.origin_episode on every DI is the closed loop — into a
# traceable incident record, not a corrected preference file.

# ── Eugene: Scaling via delegation ────────────────────────────────────
# COMPACT_MODE: no mica.yaml at all. Archive + playbook carry
# governance directly. Delegation scales to minimal footprint.

Four of the five principles map cleanly. The fifth — Closing the loop — is where the most interesting failure appeared. In v0.2.7, the loop looked closed. Running PCT against production deployments showed it wasn't.

v0.2.8: The Loop Has to Actually Close

Two critical DIs, same project, both returned PCT-010 [PASS] under v0.2.7.

{
  "id": "DI-001",
  "label": "astock-data-integrity",
  "severity": "critical",
  "binding": {
    "origin_episode": "Enforcement of absolute data integrity to prevent financial risk.",
    "violation_count": 0
  }
}

{
  "id": "DI-006",
  "label": "astock-output-schema-completeness",
  "severity": "critical",
  "binding": {
    "origin_episode": "EXP-OS-1 (v0.8.6): outputSchema used Zod .strip() — unknown fields silently dropped before scoring. valuation.per and valuation.pbr lost in three separate live runs before detection.",
    "violation_count": 3,
    "last_triggered": "2026-04-02"
  }
}

DI-001's origin_episode restates the label. DI-006's origin_episode is a record: a version, a named experiment, a specific failure, a count of how many times it recurred before anyone caught it.

A model loading DI-001 learns that data integrity matters. A model loading DI-006 learns that this specific failure happened three times before it was found, and here is exactly what caused it.

v0.2.8 now distinguishes between them:

PCT-010 [PASS] all 6 critical DIs have binding
PCT-010 [WARN] doctrinal binding (no episode code, version ref, or date): ['DI-001', 'DI-002', 'DI-004']
              -- ground origin_episode in a real incident

CLOSED CONTRACT holds. But three of the rules are declarations, not lessons. The loop isn't closed yet. This is more than a v0.2.8 feature: it is the threat model playing out at the data level.

The incident-grounded binding is what survives operator changes. The doctrinal binding is what disappears when the person who understood the intent stops being the person running the sessions. Three deployments show where this has mattered.
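The doctrinal check described in this section can be sketched in a few lines of Python. This is an illustration under assumptions, not the validator's actual code: the three regular expressions are guesses at what "episode code, version ref, or date" might look like, and the function names are hypothetical.

```python
import re

# Heuristic patterns; the real validator's rules are not shown in this article.
EPISODE_CODE = re.compile(r"\b[A-Z]{2,}(?:-[A-Z0-9]+)*\b")  # e.g. EXP-OS-1
VERSION_REF = re.compile(r"\bv\d+\.\d+(?:\.\d+)?\b")         # e.g. v0.8.6
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")              # e.g. 2026-04-02


def is_doctrinal(origin_episode: str) -> bool:
    """True when the text carries no episode code, version reference, or date."""
    patterns = (EPISODE_CODE, VERSION_REF, ISO_DATE)
    return not any(p.search(origin_episode) for p in patterns)


def doctrinal_ids(invariants: list[dict]) -> list[str]:
    """Collect the IDs of critical DIs whose binding looks doctrinal."""
    return [
        di["id"]
        for di in invariants
        if di.get("severity") == "critical"
        and is_doctrinal(di.get("binding", {}).get("origin_episode", ""))
    ]
```

Run against the two records above, a check like this flags DI-001 and leaves DI-006 alone. It matches naming conventions only, so it says nothing about whether the cited episode actually happened.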

Three Deployments

Different domains, different failure modes. The first two are about what the gate stopped. The third is about what governance looks like when you deliberately minimize it — and what it means that the institutional memory persisted anyway.

1. Alecta-Stock (securities)

The Zod failure that produced DI-006 was invisible in the moment and obvious in retrospect. The pipeline used Zod to validate its output schema, and Zod object schemas strip unrecognized keys by default (the .strip() behavior), silently and by design. When valuation.per and valuation.pbr were added to the archive definition but not to the Zod schema, the model produced complete scoring objects, the schema stripped the valuation fields before they reached the scoring stage, and the output looked valid. No error was raised. Three live scoring runs were affected before anyone noticed that the two fields were consistently absent.
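To make the mechanism concrete, here is a small Python stand-in for the two parsing behaviors. It is not Zod's API and not the pipeline's code: the field names ticker and score and all values are invented for illustration, and only the valuation field comes from the incident described above.

```python
# Hypothetical schema: valuation was added upstream but never added here.
ALLOWED_FIELDS = {"ticker", "score"}


def parse_strip(payload: dict) -> dict:
    """Mimic strip-style parsing: unknown keys are dropped without any error."""
    return {k: v for k, v in payload.items() if k in ALLOWED_FIELDS}


def parse_strict(payload: dict) -> dict:
    """Mimic strict-style parsing: unknown keys raise instead of disappearing."""
    unknown = sorted(set(payload) - ALLOWED_FIELDS)
    if unknown:
        raise ValueError(f"unrecognized keys: {unknown}")
    return dict(payload)


# Invented example payload; the model's output is complete.
model_output = {
    "ticker": "XYZ",
    "score": 0.8,
    "valuation": {"per": 12.3, "pbr": 1.1},
}

# The stripped result still looks like a valid object, but valuation is gone.
stripped = parse_strip(model_output)
```

With the strip-style parser the loss is silent. With the strict-style parser the same payload raises on the first run, which is the difference between three affected runs and none.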

This is precisely where preference-based configuration falls short. A CLAUDE.md entry saying "always verify Zod schema completeness" is a reminder. A new operator won't know why it's there. A distracted session will skip it.

In contrast, DI-006 is a record showing that this specific failure happened three times at a specific version before it was caught. It loads structurally at session start, meaning there is no session where it is optional.

DI-002 covers the complementary failure: any execution path reaching NO_DECISION must fail closed, rather than returning a neutral placeholder that downstream systems treat as a valid result. Through this framework, the model begins each session knowing the failure modes of this specific project. Not as warnings, but as history.
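The fail-closed rule in DI-002 can be sketched as follows. This is a hedged illustration only: the Decision values other than NO_DECISION, the NoDecisionError class, and the finalize function are hypothetical names, not taken from the Alecta-Stock codebase.

```python
from enum import Enum


class Decision(Enum):
    BUY = "BUY"                  # hypothetical signal value
    SELL = "SELL"                # hypothetical signal value
    NO_DECISION = "NO_DECISION"  # the state named in DI-002


class NoDecisionError(RuntimeError):
    """Raised so that NO_DECISION can never travel downstream as a result."""


def finalize(decision: Decision) -> str:
    """Fail closed: raise on NO_DECISION rather than emit a neutral placeholder."""
    if decision is Decision.NO_DECISION:
        raise NoDecisionError("execution path reached NO_DECISION; refusing to emit a signal")
    return decision.value
```

The design choice is that the absence of a decision is an exception the caller must handle, not a value a downstream system can mistake for a valid result.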

2. Flamehaven Audit Reports (biomedical)

Covered in the previous article in this series: 56 EQA records (physics/math reproductions), 34 BAV experiments (protein-folding validation). The archive is citable. Labels become downstream citations.

The failure was EQA framing drift. Across 51 records, a "PASS" label had been applied to results that were numerically correct but had not been verified through actual engine threshold evaluation — only through manual review. Seven records had gone through the real process. Forty-four had not. No one had noticed because the label looked the same in both cases.

DI-EQA-001 now encodes the precision lock: PASS badge only from real threshold evaluation, not from manual review or narrative description. Domain-namespaced DI IDs, formalized in v0.2.7, keep EQA-specific rules identifiable separately from BAV-specific ones, which have their own failure history.

The gate doesn't verify that the science is correct. It verifies that the process that produced the label matches what the DI requires. That distinction matters because the failures that poison archives are usually not fabrications. They're process mislabelings. Data checked the wrong way, labeled as checked the right way.
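A rough sketch of what "the label follows the process" means in code, assuming each record notes how it was verified. The EqaRecord fields, the verification values, and the UNVERIFIED label are all hypothetical; the article only establishes that PASS must come from real threshold evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EqaRecord:
    record_id: str
    numerically_correct: bool
    verification: str  # hypothetical values: "engine_threshold" or "manual_review"


def badge(record: EqaRecord) -> str:
    """Grant PASS only when the engine threshold evaluation ran and passed."""
    if record.verification != "engine_threshold":
        # Hypothetical label for results that never went through the engine,
        # even if the numbers happen to be right.
        return "UNVERIFIED"
    return "PASS" if record.numerically_correct else "FAIL"
```

A numerically correct result that was only reviewed by hand gets a different label from one that went through the engine, so the two cases no longer look the same downstream.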

3. Flamehaven Code Audit Standard (site operations)

CAS is a confidentiality audit system, and its MICA deployment has no mica.yaml. This is intentional. COMPACT_MODE — formalized in v0.2.7 — is a deliberate minimum-footprint deployment. PCT-001 fails (mica.yaml not found), and the output correctly identifies the package as LEGACY. Not defective. Not non-compliant. Operating at minimum footprint by decision, not by accident.

The distinction between COMPACT_MODE and pre-migration LEGACY_MODE matters in practice: both produce pct=LEGACY at runtime, but one is a decision and one is a migration target. Before v0.2.7, there was no way for the system to tell them apart.

What this case demonstrates: even without the full conformance stack, the incident history persists. The archive and playbook still carry the record of what happened and why the rules exist. Simplify the gate, and the institutional memory doesn't go with it.

What MICA Cannot Block — and Why That Matters More Than What It Can

An honest accounting of limits is part of what makes a gate trustworthy. A gate that claims to catch everything is a gate you should trust less.

Plausible fabrications within valid ranges pass every structural check. A protein pTM score of 0.74 when the actual value is 0.61 has valid format; the number is wrong. No pattern check reaches this. It requires someone who can re-run the underlying computation and compare.

Structural compliance with false content is subtler. v0.2.8's doctrinal WARN fires when origin_episode contains no episode code, version reference, or date. But an origin_episode of "EXP: general integrity principle" passes the check: it looks like an episode code without pointing to one. The validator detects naming conventions. It cannot verify that the narrative behind a code is accurate. A binding can look grounded without being grounded.

Mid-session governance drift is the hardest to close. PCT runs at session start. A model that loads the DIs, acknowledges the constraints, and violates one in step 8 of a 12-step task is not caught by PCT.

The gate is at the session boundary. What happens within the session still requires oversight from someone who understands what the model is doing.

Correct computation with wrong interpretation sits entirely outside structural checking. "This confirms X" versus "this is consistent with X" cannot be separated by a YAML validator. The data may be real, the structure valid, and the conclusion wrong. Catching this requires domain expertise.

Understanding these four categories is what keeps the gate honest. The gate is a filter for failures that are cheap to prevent structurally, so that human attention is available for the failures that aren't.

Eugene's framework closes a loop by encoding what the operator has learned. MICA closes a loop by recording what the project has paid for.

Both loops matter. The gate enforces the discipline behind the second: it ensures the operator has read the record before the model begins. What happens after that is still on the operator.