TL;DR
- I tracked 6 months of my own AI coding sessions in React Native. In my logs, 42% of AI-generated diffs contained at least one hallucinated import, fake API, or duplicate component.
- Token costs were the second tax. Re-loading project context every session cost roughly $135/month per developer at the model pricing I was using.
- Better prompts didn't fix either problem. The AI didn't need smarter instructions: it needed memory and a map.
- I built U-AMOS (Universal AI Memory Operating System): a 3-tier memory bank, a context map, a rule priority system that splits "what to do" from "how to do it," a 7-point anti-hallucination checklist, and a plan/act workflow that runs before any code is generated.
- After deploying U-AMOS across my own projects over a 3-month tracking period: hallucinations dropped from 42% to 3%. Token costs dropped from $180/month to $18/month. Feature velocity increased roughly 5x. These are my internal numbers: I'll note where external research reports similar magnitudes.
- The framework is open and documented. U-AMOS 2.0 also ships pre-configured inside AI Mobile Launcher for anyone who doesn't want to build it from scratch.
A note on the numbers
Everything in this article that is quantified (the 42%, the $135/month, the 91% reduction) comes from 6 months of my own session logs across my React Native projects. I tracked hallucinations manually, counted tokens via API usage dashboards, and measured debugging time against my own estimates. These are not controlled experiments.
What I can say is that the direction of the results matches what external research is starting to report. Memory-system papers are showing 40-60% accuracy improvements and 60-90% token reductions when you introduce structured memory into LLM workflows. Mem0's Claude Code integration reports roughly 90% lower token usage with persistent memory vs full-context prompting. The order of magnitude is consistent. The exact numbers are mine.
The moment I stopped pretending it was working
It was a Tuesday in October. I was building a feature for my app, and I asked Claude Code to add Redux Toolkit state management for user accounts. It generated something that looked correct. I committed it.
Twenty minutes later, the build failed.
The AI had imported useRouter from next/router. In a React Native project. That hook doesn't exist on mobile. It was a 30-second fix, but it wasn't the first time. It was the fourth time that week.
I started keeping a log. Every wrong thing the AI generated, I wrote down. After a month, I had the data from my own sessions:
- 42% of AI-generated diffs had at least one hallucinated import, function, or component
- 25% of the components it created already existed in the codebase under a different name
- I was spending roughly 4 hours a week debugging things the AI had invented
- I was using Cursor much more than Claude Code at the time, and Cursor's analytics dashboard confirmed parts of this picture
The frustrating part was that I knew the AI wasn't getting worse. I was paying for the best models. The prompts were detailed. The context windows were huge.
The problem wasn't the model. The problem was that I was treating it like a senior developer when it was behaving like a junior with no memory of the project and no map of the codebase.
I had experimented with rules and a memory bank before, but the AI always struggled to grasp the whole context, and I had to remind it far too often.
The token tax nobody talks about
While I was tracking hallucinations, I also started tracking token usage. The numbers were uncomfortable.
Every session, I was loading the same context: project structure, architecture decisions, naming conventions, what components already existed. The AI had no memory between sessions, so I kept re-explaining everything. Worse, when I didn't re-explain, the AI would explore: running directory listings, opening files at random, building up its own picture of the codebase by trial and error.
That exploration is where the worst of the token bleeding happens. Asking "where is the authentication logic?" can trigger 25,000 tokens of blind navigation through folders before the AI finds it.
The math, at the model pricing I was using at the time:
- Session 1: Re-load + explore project structure ≈ 50,000 tokens
- Session 2: Re-load + explore project structure ≈ 50,000 tokens
- Session 3: Re-load + explore project structure ≈ 50,000 tokens
- Daily total: 150,000 tokens
- Monthly total: ~4.5 million tokens over ~30 days of work
- Monthly cost: ~$135/month per developer (based on ~$30 per million tokens, prompt + completion)
That's the invisible tax. Even when the AI was generating correct code, I was paying to give it the same context every time, plus paying for it to wander around the repo finding things it should already know about.
I remember creating an architecture.md file to hold the context I kept repeating, and then a review_best_practices.md with rules covering the mistakes the AI kept making.
Then came the Claude Code best practices. I tried the obvious approaches first: longer CLAUDE.md files, more detailed system prompts, better instructions on what to remember.
None of it worked sustainably. The AI would hold context for a session or two, then drift. Because the problem wasn't the prompt. It was the architecture.
The reframe that changed everything
The shift came when I stopped thinking of AI as a developer and started thinking of it as a system that needed memory built for it, and a map handed to it. I remember watching an interview with Thomas Dohmke, where he said one of the best practices is to look at it as a colleague, not a tool.
A junior dev with no memory of your project would also generate hallucinated imports. Would also recreate components that already existed. Would also waste hours wandering through unfamiliar code looking for the right file. The AI wasnβt broken. The relationship was broken. I was asking it to behave like it had context it didnβt have.
A lot of content I've seen treats this as a prompting problem. Write a better system prompt. Use a longer context window. Be more specific in your instructions.
My experience, and increasingly what I see from teams who've shipped real production AI-assisted codebases, is that prompts plateau. Durable context compounds. The teams getting consistent AI output aren't writing better prompts: they're building memory systems that load the right context at the right time and update automatically when something changes.
You can read my article on prompt engineering approaches here:
Essential Guide of Prompt Engineering for Software Engineers (Malik CHOHRA, 17 November 2025)
That's what I built. I called it U-AMOS.
What U-AMOS actually is
U-AMOS, the Universal AI Memory Operating System, is a framework for managing AI-assisted development. It has five components, each solving a specific failure mode I'd logged.
┌────────────────────────┐
│      Memory Bank       │
│   (Cold / Warm / Hot)  │
└───────────┬────────────┘
            │
┌───────────▼────────────┐
│      Context Map       │
│    (Index / Lookup)    │
└───────────┬────────────┘
            │
┌───────────▼────────────┐
│       Plan Mode        │
│   (before execution)   │
└───────────┬────────────┘
            │
┌───────────▼────────────┐
│    Validation Layer    │
│  (7-point checklist)   │
└───────────┬────────────┘
            │
┌───────────▼────────────┐
│    Code Generation     │
└───────────┬────────────┘
            │
┌───────────▼────────────┐
│    Progress Logging    │
│   (.memory updates)    │
└───────────┬────────────┘
            │
  ◄──────── FEEDBACK LOOP ────────
1. The Memory Bank: three tiers, loaded on demand
Not all context is equally important for every task. So I tiered it.
Cold tier (project identity: loads rarely, ~10% of sessions):
- 00-description.md: what we're building, in 500 words
- 01-brief.md: non-negotiable constraints
- 10-product.md: feature specs
Warm tier (architecture: loads on demand, ~30% of sessions):
- 20-system.md: how the system works
- 30-tech.md: stack and dependencies
- 60-decisions.md: why we chose what we chose
- 70-knowledge.md: lessons learned
Hot tier (current state: loads every session, 100%):
- 40-active.md: what we're working on right now (max 500 words)
- 50-progress.md: what shipped recently
The hot tier is small (~2,000 tokens) and always loads. The warm tier loads when the task touches architecture (~5,000 tokens). The cold tier almost never loads during development: it's the onboarding layer. A new developer (or a new AI agent starting a session) reads the cold tier once and understands the project without hunting through the entire repo.
The result: 2,000-10,000 tokens per session instead of 50,000. That assumes you're maintaining the files actively: see the hygiene section below.
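Laid out on disk, that's a nine-file .memory/ folder plus the context map at the project root. The tier labels in the comments below just restate the loading behaviour described above:
.memory/
├── 00-description.md   # cold: what we're building
├── 01-brief.md         # cold: non-negotiable constraints
├── 10-product.md       # cold: feature specs
├── 20-system.md        # warm: how the system works
├── 30-tech.md          # warm: stack and dependencies
├── 40-active.md        # hot:  current focus (max 500 words)
├── 50-progress.md      # hot:  what shipped recently
├── 60-decisions.md     # warm: why we chose what we chose
└── 70-knowledge.md     # warm: lessons learned
context_map.md          # root-level lookup index (next section)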
2. The Context Map: the exploration killer
This is the piece that does the most work for the lowest cost.
context_map.md is a single 500-token lookup file at the root of the project. It indexes everything: every feature, every service, every core UI component, with the entry path next to each one.
# Context Map
## Features (14)
| Feature | Entry Point | Purpose |
|----------------|----------------------------------|--------------------|
| auth | src/features/auth/index.ts | Authentication |
| onboarding | src/features/onboarding/index.ts | User onboarding |
| todos | src/features/todos/index.ts | Todo management |
## Services (15)
| Service | Path | Responsibility |
|----------------|----------------------------------|--------------------|
| logger | src/services/logging/logger.ts | Centralized logs |
| analytics | src/services/analytics/... | Firebase analytics |
## UI Components (40+)
| Category | Components |
|----------------|----------------------------------|
| Buttons | Button, IconButton, FAB |
| Forms | Input, ControlledInput, Switch |
When the AI starts a session and needs to know "where does authentication live?", it reads one 500-token file instead of running directory listings, opening five files to compare them, and burning 25,000 tokens building its own mental model of the repo.
In my own logs, this single file removed roughly 60% of the per-session token consumption that wasn't already covered by the memory bank. The math: 500 tokens replaces 25,000. That's a 50x reduction on the most expensive part of every session: discovery.
3. The Rule Priority System: three tiers, with generators separate from rules
The same logic applies to coding rules.
Critical rules (always load, ~4,000 tokens):
- Meta-rules and session protocol
- Anti-hallucination checklist
- Common violations (no inline styles, no console.log, no hardcoded strings, no API keys)
Important rules (task-specific, ~2,000 tokens each):
- Design system patterns: loads if working on UI
- State management rules: loads if working on state
- i18n patterns: loads if adding translations
- Navigation patterns: loads if adding routes
Recommended rules (load if relevant):
- Performance optimizations
- Testing patterns
- Security and platform-specific privacy rules
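If you're rebuilding this yourself, one way to make the conditional loading explicit is a small index file the session reads alongside the context map. This is a sketch, not a standardized format; the file name and pack names are illustrative and simply mirror the tiers above:
# rules/index.md
| Tier        | Rule pack                                   | Load when                       |
|-------------|---------------------------------------------|---------------------------------|
| Critical    | meta, anti-hallucination, common-violations | always                          |
| Important   | design-system                               | task touches UI                 |
| Important   | state-management                            | task touches stores or reducers |
| Important   | i18n-patterns                               | task adds or edits user strings |
| Important   | navigation-patterns                         | task adds routes or screens     |
| Recommended | performance, testing, security              | only if the plan calls for them |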
The other architectural distinction that mattered: I separated generators from rules. They look similar but they solve different problems.
Generators answer what to do. Step-by-step implementation guides for recurring tasks: "add a new language," "add a new screen," "add a paywall." They're workflow documents: copy this template, register here, run this script.
I include these generators in my AI React Native boilerplate (https://aimobilelauncher.com/), where they are documented; you can check the code for the different generators there.
Rules answer how to do it well. Code quality patterns and constraints: this is what good styling looks like; this is what the wrong import path looks like.
When you mix the two (when your "how to add a language" doc also tries to explain every i18n best practice), the AI gets overwhelmed and follows neither cleanly. Splitting them means the AI reads the generator to know the steps, then reads the matching rule pack to write the code correctly. Two clean reads. No drift.
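To make the split concrete, here is a sketch of what a generator can look like, using "add a new language" as the example. The paths and the script name are illustrative, not the exact files shipped in the boilerplate:
# Generator: Add a new language
1. Copy src/i18n/locales/en.json to src/i18n/locales/<lang>.json
2. Register the new locale in src/i18n/index.ts
3. Add the language to the settings screen picker
4. Run the missing-translations check script before committing
Note: for how to write the strings themselves (no hardcoded text, key naming, pluralization), load the i18n rule pack. This generator only covers the steps.
The generator stays short and imperative; everything about quality lives in the rule pack it points to.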
4. Concrete examples beat abstract rules
This is a philosophical point, but it's the reason U-AMOS rules actually work.
Most rule documents read like this: "Use proper styling conventions. Avoid inline styles where possible."
Rules in U-AMOS read like this:
## Styling
### ❌ WRONG: inline styles
<View style={{ marginTop: 20, padding: 16 }}>
### ✅ CORRECT: Restyle props
<Box marginTop="xl" padding="lg" />
### Exception: unsupported properties
<Box marginTop="xl" style={{ opacity: 0.5 }}>
(opacity is not a Restyle prop, inline is acceptable here)
LLMs don't generalize abstract principles well. They pattern-match. If you show them what wrong looks like next to what right looks like, they reliably produce the right pattern. If you tell them to "follow good practices," they produce whatever the training data nudged them toward last time.
Every rule pack in U-AMOS is built this way. ❌ wrong → ✅ correct → exception (if any). No paragraphs of theory. No abstract guidelines. Just visual diffs. This is the single biggest determinant of whether a rule actually changes the AI's output or gets ignored.
5. The 7-Point Anti-Hallucination Checklist
Before any code is generated, the AI verifies:
- Does the file I'm editing exist?
- Did I check the component inventory before creating something new?
- Did I check the service registry?
- Is the import path correct?
- Does the function I'm calling actually exist in that file?
- Am I using the project's i18n pattern, not hardcoded strings?
- Am I using the project's logger, not console.log?
If any answer is no, the AI stops and verifies before continuing.
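In practice the checklist lives in the always-loaded critical rules and is phrased as a hard gate, not a suggestion. A sketch of that phrasing (the exact wording here is illustrative, not an official Claude Code or Cursor feature):
## Pre-generation gate (MANDATORY)
Before writing or editing any code:
1. Run the 7-point checklist against your plan.
2. If any answer is "no" or "unsure", STOP. Read context_map.md
   and the relevant inventory file before continuing.
3. Only generate code once every answer is "yes".
Never invent an import path. If a symbol is not in a file you
have actually read this session, assume it does not exist.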
The first week I deployed this, my hallucination rate in my own sessions dropped from 42% to under 5%. Not because the model improved. Because I made verification mandatory before generation.
Each of these checks is manually crafted.
6. Plan/Act Mode: no code without a plan
This is the piece I added after the initial U-AMOS deployment, and it might be the highest-leverage addition.
Before touching more than one file, the AI must:
- Read .memory/40-active.md (current focus)
- Draft an implementation plan in plain markdown
- Wait for my confirmation
- Execute only after approval
- Log what it actually shipped back into .memory/50-progress.md
This sounds slow. It's actually faster, because you catch architectural mistakes at the plan stage instead of the debugging stage. Tweag's Agentic Coding Handbook and Lullabot's memory bank guide both document the same pattern. It's becoming standard practice in teams using agentic coding seriously.
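The plan itself doesn't need to be elaborate. Something like the sketch below is enough; the feature and file paths are made up for illustration, and the headings are just one possible shape:
# Plan: add account deletion flow
## Context read
.memory/40-active.md, context_map.md, navigation and state rule packs
## Files to touch
- src/features/settings/DeleteAccountScreen.tsx (new)
- src/features/settings/index.ts (register the screen)
## Steps
1. Add the screen and route (no new navigation pattern)
2. Call the existing auth service: do not create a new one
3. Add i18n keys for all user-facing strings
## Out of scope
- Data export, account recovery
Once approved, the same plan becomes the basis for the 50-progress.md entry after the work ships.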
What changed after U-AMOS
I tracked the same metrics for 3 months after deploying U-AMOS across my own projects.
- Hallucinations (from my logs): 42% → 3% (93% reduction)
- Tokens per session (average): 48,000 → 4,200 (91% reduction)
- Token cost (at my model tier): ~$180/month → ~$18/month
- Time debugging AI errors: 4 hours/week → 20 minutes/week
- Duplicate components created: 23 in the 3 months before → 0 in the 3 months after
- Feature velocity: roughly 5x faster on features I tracked end-to-end
I also started tracking which rule packs loaded most often and which hallucination types were still slipping through. That observability layer is what tells you where the system needs a new rule file vs where the AI needs better examples.
Memory hygiene: pruning, plus living rules
The mistake I see in most memory bank setups is treating the files as append-only. They're not. They need pruning.
My current hygiene routine:
- 40-active.md updates at the start of every work session (what's the actual focus today)
- 50-progress.md gets a new entry after every shipped feature; old entries archive monthly
- 70-knowledge.md gets pruned weekly; if a lesson is now in a rule file, it gets removed from the knowledge doc
- 20-system.md only updates when architecture actually changes
- If the AI proposes changes to any memory file, it does it as a plan diff I review; it never writes to memory silently
There's one more file that prevents documentation rot: updated_rules.md. It's a changelog for rule exceptions.
When the team makes a real exception to a rule (for example, "we never use inline styles, EXCEPT for the opacity prop because Restyle doesn't support it"), that exception goes in updated_rules.md with a date and a reason. Not into the main rule file.
# Updated Rules (Living Document)
## 2025-12-20: Inline styles exception
**Original rule**: NO inline styles ever
**Updated rule**: NO inline styles EXCEPT for single properties not supported by Restyle (opacity)
**Why**: Restyle doesn't support the opacity prop
**Example**: ✅ <Box marginTop="xl" style={{ opacity: 0.5 }} />
Why this matters: rules become outdated quickly, and rewriting them every time creates drift. The living rules file lets the AI always check the latest guidance without losing the original logic. Exceptions are explicit and dated. Historical context is preserved. The main rule files stay clean.
The 2,000-10,000 token figure holds only if you maintain all of this. If you let the files grow unchecked, you'll hit 50,000 tokens again within two months. The context window isn't the bottleneck: your maintenance habits are.
What still doesn't work, and what's on the roadmap
This isn't a finished system. Three things still fail or are incomplete:
Long sessions. Context degrades over multi-hour conversations. I re-attach memory bank files every 30-40 messages. A better solution is probably an MCP server that handles re-injection automatically, but I haven't built it.
Performance edge cases. The AI generates working code that sometimes re-renders too aggressively. Architecture rules help, but don't eliminate this. I'm addressing it by writing performance rules for Expo apps; I started from Expo's official guidance, but on its own it isn't enough, and it needs significant adaptation to the project architecture.
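As a sketch of what one of those rules looks like in the same wrong/correct format (the component names here are made up; the underlying advice is the standard FlatList memoization pattern):
## Lists
### ❌ WRONG: new callback and style object on every render
<FlatList
  data={todos}
  renderItem={({ item }) => <TodoRow todo={item} style={{ padding: 16 }} />}
/>
### ✅ CORRECT: stable renderItem, memoized row, spacing from the design system
const renderTodo = useCallback(({ item }) => <TodoRow todo={item} />, []);
<FlatList data={todos} renderItem={renderTodo} />
(TodoRow is wrapped in React.memo; spacing comes from Restyle props, not inline styles)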
Cross-project memory. U-AMOS handles per-project memory. The next layer, preferences and patterns that follow you across every project you touch, is what tools like Mem0's MCP integration and Claude Code's own auto-memory system are starting to solve. If you find yourself re-teaching the same conventions in every new repo, cross-project memory is the fix. I'm watching this space closely.
How to set up U-AMOS yourself
I have created an initialization prompt for the system. I've tested it on some of my projects and it worked well. It doesn't ship with many rules yet, but you can customize that part.
You can check it here: link
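If you'd rather wire it up by hand, the minimal version of everything described above is:
1. Create a .memory/ folder with the nine memory bank files (start with 40-active.md and 50-progress.md; backfill the cold and warm tiers as you go).
2. Write context_map.md at the project root: every feature, service, and core UI component with its entry path.
3. Write your critical rules (always loaded) as ❌ wrong / ✅ correct / exception examples, then add task-specific packs as you need them.
4. Put the 7-point checklist and the plan/act protocol into your CLAUDE.md or Cursor rules entry point.
5. Add an empty updated_rules.md for your first dated exception.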
Related work worth reading
U-AMOS didn't emerge from a vacuum. These are the guides I've found most aligned with the same pattern:
- Tweag's Agentic Coding Handbook: memory bank system and plan/act mode, well documented
- Mem0's Claude Code integration: if you want cross-project memory on top of U-AMOS, this is the current best path
- Anthropic's Claude Code best practices: the official guidance on CLAUDE.md structure, memory, and tool use
The pattern is converging across all of these. Structured memory, tiered loading, mandatory verification before generation, plan-before-execute. U-AMOS is my implementation of that pattern for React Native specifically, with the anti-hallucination rules, the context map, and the mobile-specific constraints built in.
Or, if you want it pre-configured
I built AI Mobile Launcher as the productized version of U-AMOS for React Native.
It ships with:
- The full 9-file memory bank, pre-structured for a new project
- A pre-built context map of every feature, service, and UI component
- All critical, important, and recommended rule packs, written as visual diffs, not paragraphs
- The split between generators (workflows) and rules (patterns), already in place
- Pre-built component and service inventories
- Cursor and Claude Code entry points configured with plan/act mode
- Generators for common features (onboarding, paywalls, i18n, design system)
- The 7-point anti-hallucination checklist, embedded in every entry point
- A starter updated_rules.md ready for your first exception
The Lite tier is free on GitHub. U-AMOS 2.0 ships fully configured in the Starter tier. If you're starting a new React Native project and want the memory system running from day one without the setup work, that's the fastest path. aimobilelauncher.com
If you're adding U-AMOS to an existing project, the steps above are enough to get started. The framework isn't magic: it's the result of 6 months of failed sessions, logged and analyzed, until the AI stopped fighting me and started shipping with me.
What I want you to take from this
The content I see most often on AI coding frames this as a prompting problem. Use a better system prompt. Be more specific. Add more examples to your instructions.
My experience over 6 months of tracking my own sessions is that prompts hit a ceiling. Once you've written a clear, specific prompt, the next 10 iterations give you marginal gains. Memory and structure compound differently: every lesson added to the memory bank improves every future session. Every entry in the context map saves another exploration loop. Every rule written as a visual diff prevents an entire category of hallucination permanently.
The AI isn't a developer you prompt. It's a system you build context for. Build the memory. Hand it the map. Show it what wrong looks like next to what right looks like. Stop paying to re-explain the same architecture every day.
U-AMOS is how I did it. The principles work without my specific files. The files work better with the principles. Either way: fix the memory and the map first, then build the product.
I write Code Meet AI weekly: AI in mobile development, real tradeoffs, what's actually working in production. Next issue: agent-first mobile architecture and why most "AI features" in apps are just bolted-on chatbots pretending to be product. https://codemeetai.substack.com/














