3 Tools That Make AI Suck Less at Coding

I've completely updated my workflow, and I now use AI coding agents for everything: writing code, testing, and generally being more productive. However, AI isn't perfect. After a long back-and-forth session I'll often have working code that carries a few code smells. Duplicated code, dead code, and overly long functions are the common ones. One way to fix that is to spell out all of these problems in an AGENTS.md file, and that works. But that's not always possible, depending on the codebase and team I'm working with.

To that end, I went looking for tools that could help me detect and fix these issues. After some research I found three that work almost as a pipeline: one finds the issues, one reviews the changes before they ever reach a pull request, and one helps me evaluate my agents. I use Kiro for most of my day-to-day work, so I wired all of these into that loop, but they run with whatever agent you already have.

If you'd rather watch than read, the companion video has the full walkthrough.

This post covers:

Finding dead code and duplication with Fallow
Reviewing the AI's diff with CodeRabbit before you open a PR
Measuring whether the output improves with Kiln

Prerequisites

Node.js 20 or newer, so you have npx
A JavaScript or TypeScript project, ideally one with some AI-generated code in it
A coding agent like Kiro, Claude Code, or Cursor for the review-loop parts
Git, for the diff-based workflows

Find the mess: Fallow

The first problem, and one I see often, is that AI can generate a mess of code. It will duplicate a block, leave an orphaned file behind, and grow a function to a thousand lines without ever flagging it, because each individual edit looked reasonable in isolation. You need a tool that reads the whole codebase at once.

Fallow is the one I reached for. It's static codebase intelligence for JS and TS, and it works a lot like ESLint, except it's built around these common AI code issues. You can run it with:

npx fallow

That gives you a full report. The sections I care about most are dead code, duplication, and complexity. Dead code finds files that aren't reachable from any entry point, exports that nothing imports, and dependencies you've stopped using. When I ran this on my vibe-coded chess app (see video), it found dozens of these problems. Duplication shows you the exact line ranges that got copied between two places. Complexity scores each function on how branchy and how hard to test it is, then ranks your worst offenders so you know where to start.
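To make "not reachable from any entry point" concrete, here's a minimal sketch of the idea in TypeScript. This is not how Fallow is implemented; the `ImportGraph` type, the `findUnreachableFiles` function, and the file names are all made up for illustration. It walks a toy import graph from the entry points and reports every file it never visits.

```typescript
// Illustrative only: a toy import graph, not Fallow's real data model.
// Each key is a file; each value lists the files it imports.
type ImportGraph = Record<string, string[]>;

// Walk the graph from the entry points and return every file never visited.
function findUnreachableFiles(graph: ImportGraph, entryPoints: string[]): string[] {
  const visited = new Set<string>();
  const stack = [...entryPoints];

  while (stack.length > 0) {
    const file = stack.pop()!;
    if (visited.has(file)) continue;
    visited.add(file);
    // Files missing from the graph are treated as having no imports.
    for (const imported of graph[file] ?? []) {
      stack.push(imported);
    }
  }

  return Object.keys(graph).filter((file) => !visited.has(file)).sort();
}

// Hypothetical project: old-ai.ts is never imported, and legacy.ts is only
// imported by old-ai.ts, so both are dead.
const graph: ImportGraph = {
  "src/main.ts": ["src/board.ts", "src/moves.ts"],
  "src/board.ts": ["src/moves.ts"],
  "src/moves.ts": [],
  "src/old-ai.ts": ["src/legacy.ts"],
  "src/legacy.ts": [],
};

const unreachableFiles = findUnreachableFiles(graph, ["src/main.ts"]);
console.log(unreachableFiles); // ["src/legacy.ts", "src/old-ai.ts"]
```

Notice that legacy.ts does have an importer, but that importer is itself an orphan. That's the kind of thing you miss when you only look at one file at a time.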

There's a fix command for most things:

npx fallow fix

If you'd rather have an agent do the fixing, you can use the JSON output and the skill. Run it with --json and you get structured findings an agent can read and act on. Fallow also ships an agent skill you can install:

npx skills add fallow-rs/fallow

Once that's in place, I can tell Kiro to build a feature and then run Fallow on its own work to clean up after itself. It catches the duplicate block it just wrote and removes it before I even see it. You can also drop fallow audit into CI so it compares your branch against main and only flags what your change introduced.
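If you want to do something with the --json output yourself, a small script is enough to get started. The sketch below is hedged on purpose: I'm not documenting Fallow's JSON schema here, so it treats the report as opaque and assumes only that findings are grouped into top-level arrays. The function name `countFindingsByCategory` and the sample input are made up, so check the real output before relying on any key names.

```typescript
// Assumption: the report is a JSON object whose findings are grouped into
// top-level arrays. No key names are hard-coded; whatever the tool emits
// is counted as-is.
function countFindingsByCategory(rawJson: string): Record<string, number> {
  const report: unknown = JSON.parse(rawJson);
  const counts: Record<string, number> = {};

  if (typeof report !== "object" || report === null || Array.isArray(report)) {
    return counts; // Not a plain object: nothing to group.
  }

  for (const [key, value] of Object.entries(report)) {
    if (Array.isArray(value)) {
      counts[key] = value.length;
    }
  }
  return counts;
}

// Made-up sample input, NOT real Fallow output.
const sample = JSON.stringify({
  exampleCategoryA: [{ file: "a.ts" }, { file: "b.ts" }],
  exampleCategoryB: [],
  meta: { note: "non-array fields are ignored" },
});

const findingCounts = countFindingsByCategory(sample);
console.log(findingCounts); // { exampleCategoryA: 2, exampleCategoryB: 0 }
```

In practice you'd feed it the text produced by npx fallow --json instead of the sample string. A per-category count like this is a cheap way to see whether a session left the codebase better or worse than it found it.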

Bonus: Knip

If Fallow is more than you need, Knip does the dead-code half on its own and has been around longer. It finds unused files, dependencies, and exports in JS and TS projects in one command:

npx knip

I've found Knip to be a little slower than Fallow. YMMV.

Review the diff: CodeRabbit

Finding dead code is great, but what about reviews?

CodeRabbit puts a review step back in. It's AI code review that runs on your diff, and the version I like runs locally from the CLI so it happens before the PR exists. Install it with one line:

curl -fsSL https://cli.coderabbit.ai/install.sh | sh

Then point it at your uncommitted changes:

coderabbit review --plain

It reads the diff and comes back with specific findings. It also hands you a fix prompt you can paste straight back into your agent, so the loop closes quickly.

Jack Herrington made the point in his video that lower-end models get noticeably more useful once CodeRabbit is in the loop, because the review catches what the cheaper model missed. That makes sense to me. The agent writes, CodeRabbit reviews, the agent fixes.
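The write, review, fix cycle is simple enough to sketch. Everything here is hypothetical: `reviewLoop`, the `Review` and `Fix` types, and the stubbed findings are mine, not CodeRabbit's API. In a real setup, review would shell out to something like coderabbit review --plain and fix would hand the findings to your agent. The one design choice worth copying is the review budget, so the loop can't spin forever.

```typescript
// Hypothetical interfaces, not a real CodeRabbit or agent API.
type Review = () => string[];             // returns the current list of findings
type Fix = (findings: string[]) => void;  // asks the agent to address them

// Review, fix, and re-review until the review is clean or the budget runs out.
function reviewLoop(review: Review, fix: Fix, maxReviews: number): { reviews: number; clean: boolean } {
  for (let round = 1; round <= maxReviews; round++) {
    const findings = review();
    if (findings.length === 0) {
      return { reviews: round, clean: true };
    }
    // Findings from the final review are left for a human.
    if (round < maxReviews) {
      fix(findings);
    }
  }
  return { reviews: maxReviews, clean: false };
}

// Stubbed demo: the "agent" resolves one finding per round.
const pending = ["duplicate helper in utils.ts", "unused export formatMove"];
const demoResult = reviewLoop(
  () => [...pending],
  () => { pending.shift(); },
  5,
);
console.log(demoResult); // { reviews: 3, clean: true }
```

If the loop comes back with clean set to false, that's your cue to read the remaining findings yourself rather than give the agent another round.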

Measure the improvement: Kiln

The first two tools clean up individual changes. This last one answers a bigger question. When I tweak a prompt, swap a model, or add a skill, is my agent's output actually better, or does it just feel better because the last three runs went well?

Kiln is the tool I use to stop guessing. It's built for creating and evaluating AI systems, so you set up datasets, run evals, and compare results across changes.

For my use case, I connected it to some output an LLM was producing and ran an eval on it to make sure the output was what I expected. When I added the Fallow skill, I could iterate on the system prompt and strings. It's the same instinct behind writing tests for your code, pointed at the thing generating the code.
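Here's the kind of comparison I mean, as a minimal sketch. This is not Kiln's API or data format; the `EvalResult` shape, `passRate`, `compareRuns`, and the sample rows are all made up to show the idea. You run two configurations against the same dataset, compare pass rates, and list the items that used to pass and now fail.

```typescript
// Hypothetical eval record: one row per dataset item for a given configuration.
interface EvalResult {
  itemId: string;
  passed: boolean;
}

// Fraction of items that passed; an empty run counts as 0.
function passRate(results: EvalResult[]): number {
  if (results.length === 0) return 0;
  const passed = results.filter((r) => r.passed).length;
  return passed / results.length;
}

// Compare two configurations on the same items and list regressions,
// meaning items that passed in the baseline but fail in the candidate.
function compareRuns(baseline: EvalResult[], candidate: EvalResult[]) {
  const baselineById = new Map<string, boolean>(
    baseline.map((r): [string, boolean] => [r.itemId, r.passed]),
  );
  const regressions = candidate
    .filter((r) => baselineById.get(r.itemId) === true && !r.passed)
    .map((r) => r.itemId);

  return {
    baselineRate: passRate(baseline),
    candidateRate: passRate(candidate),
    regressions,
  };
}

// Made-up results for four dataset items.
const baselineRun: EvalResult[] = [
  { itemId: "a", passed: true },
  { itemId: "b", passed: false },
  { itemId: "c", passed: true },
  { itemId: "d", passed: false },
];
const candidateRun: EvalResult[] = [
  { itemId: "a", passed: true },
  { itemId: "b", passed: true },
  { itemId: "c", passed: false },
  { itemId: "d", passed: true },
];

const comparison = compareRuns(baselineRun, candidateRun);
console.log(comparison); // { baselineRate: 0.5, candidateRate: 0.75, regressions: ["c"] }
```

The candidate scores higher overall but still broke item c. That per-item view is what a gut feeling from the last three runs can't give you.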

This is the least flashy of the three and the one I'd skip first if you're in a hurry. But if you're serious about tuning a workflow you'll use every day, measuring beats vibes every time.

How to pick

You don't have to use all three, so here are some guidelines for each:

If your agent leaves dead code and duplication everywhere, start with Fallow, or Knip if you want something smaller.
If you need more code reviews, add CodeRabbit and let it review before you do.
If you're tuning prompts, models, or skills and want proof you're improving, set up Kiln.

Keep in mind these tools don't replace knowing what good code looks like. They raise the floor on what your AI ships so the version that lands in your editor is closer to something you'd have written yourself. That matters because AI made writing code fast, and it made reviewing code the bottleneck. These three move that bottleneck.

Conclusion

AI compresses how long it takes to generate code and stretches how long it takes to trust it. Fallow and Knip find the mess it leaves behind, CodeRabbit reviews the change before it slips past you, and Kiln tells you whether any of your tuning is working.

If you want to see these running inside a real agent loop, I walk through all of them in the companion video.

Resources: