This article was originally published on aicoderscope.com
TL;DR: Codestral 2 went Apache 2.0 on April 8, 2026, which makes it the cheapest legally-clean-to-self-host coding model worth wiring into your editor. At $0.30/M input via Mistral's API it slots into Cursor Chat, Cline, and Continue.dev in about ten minutes. Its real edge is fill-in-the-middle autocomplete, not agentic reasoning — so pick it for tab completion and privacy, not for multi-step Cline runs.
| | Codestral 2 | DeepSeek V4-Flash | Gemini 3.5 Flash |
|---|---|---|---|
| Best for | FIM autocomplete + self-host | Agentic Cline work, cheapest | Balanced cloud agent |
| Price (input / output per M) | $0.30 / $0.90 | $0.14 / $0.435 | $1.50 / ~$6 |
| License | Apache 2.0 (self-host free) | MIT (self-host free) | Proprietary (API only) |
| Context window | 256K | 1M | 1M |
| Params | 22B dense | MoE (cloud) | proprietary |
| The catch | Weaker at multi-step agentic tasks | Thinking mode breaks Cline if left on | No self-host, no FIM endpoint |
Honest take: If you want the best inline autocomplete you can legally run on your own GPU, Codestral 2 is the pick — wire it into Continue.dev's FIM slot. If you want a chat/agent backend for Cline, DeepSeek V4-Flash is both cheaper and stronger. Don't use Codestral 2 for heavy agent loops just because it's open.
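To make the price gap in the table concrete, here is a minimal sketch that turns the per-million rates into a bill. The prices are copied from the table (Gemini's output price is listed there as approximate); the 50M-input / 10M-output monthly workload is a hypothetical figure chosen only for illustration.

```python
# Prices in USD per million tokens (input, output), taken from the table above.
# The Gemini output price is approximate ("~$6" in the table).
PRICES = {
    "Codestral 2": (0.30, 0.90),
    "DeepSeek V4-Flash": (0.14, 0.435),
    "Gemini 3.5 Flash": (1.50, 6.00),
}


def api_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the API cost in USD for the given token volumes."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 50M input and 10M output tokens per month.
    for name in PRICES:
        print(f"{name}: ${api_cost(name, 50_000_000, 10_000_000):.2f}")
```

Swap in your own token counts; agent loops tend to be input-heavy, so the input rate usually dominates.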
What actually changed in April 2026
Codestral has existed since May 2024, but the version that matters is Codestral 2, released April 8, 2026. The headline isn't a benchmark bump — it's the license. The original Codestral shipped under the Mistral Non-Production License, which barred commercial use in your product. Codestral 2 is Apache 2.0. That single change is why it's worth a fresh look: you can now self-host it inside a commercial product, ship it on a private server, or run it on a workstation GPU without a lawyer in the loop.
The model itself is a 22-billion-parameter dense transformer (not a mixture-of-experts), with a 256K-token context window and support for 80+ languages. Mistral reports 86.6% on HumanEval and 91.2% on MBPP, with native fill-in-the-middle (FIM) training — the thing that makes inline autocomplete feel native rather than bolted on.
The "dense, not MoE" detail matters more than it looks. A 22B dense model has predictable VRAM and throughput. You're not juggling 384 experts like Kimi K2.7 or a 671B sparse stack like DeepSeek's flagship. At Q4_K_M the weights are roughly 12 to 13 GB (in line with the 12 GB download Ollama reports), so it fits on a single 16 GB card with room for only a modest context window. (For the full 256K context you'll need far more; that's a server-class ask, not a laptop one. The runaihome.com local coding LLM guide has the VRAM math by GPU tier.)
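For a back-of-envelope check on weight size, the arithmetic is just parameters times bits per weight. The sketch below assumes the article's 22B parameter count; the bits-per-weight values are my assumptions (mixed-precision quants do not use a single fixed width), and it ignores KV cache and runtime overhead entirely.

```python
def weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# 22B parameters is from the article; the bit widths below are assumptions.
for label, bits in [("FP16", 16.0), ("8-bit", 8.0), ("~4.5-bit quant", 4.5)]:
    print(f"{label}: {weight_size_gb(22, bits):.1f} GB")
```

At an assumed 4.5 bits per weight this lands near the 12 GB download size shown in the Ollama output later in the article. Whatever is left on the card after the weights is what you have for context.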
Two ways to run it
You have two paths, and they map to different goals:
- Mistral API (api.mistral.ai): fastest, zero hardware, $0.30/M in. Use this if you just want a cheap, capable chat/edit backend and don't care where the tokens go.
- Self-hosted via Ollama or vLLM: slower on consumer hardware, but the code never leaves your machine. This is the Apache-2.0 payoff. Use it for client code under NDA or air-gapped work.
Pull the local copy first if you want to test offline:
$ ollama pull codestral
pulling manifest
pulling 0bbfda8e64c1... 100% ▕████████████████▏ 12 GB
pulling f5 db17... 100% ▕████████████████▏ 559 B
success
$ ollama run codestral "write a Python function that returns the nth Fibonacci number iteratively"
def fib(n: int) -> int:
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
Tested with Ollama 0.12.x on June 19, 2026. On a single RTX 4090 the Q4_K_M build runs around 45–55 tokens/sec for short completions, which is fine for chat and edits but noticeably slower than a cloud call for long agent loops.
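If you want to measure tokens/sec on your own card rather than trust the figure above, Ollama's local HTTP API returns timing fields with each response. This is a minimal sketch, assuming Ollama is listening on its default port (11434) and that the `codestral` tag from the pull above is installed; `eval_count` and `eval_duration` (nanoseconds) are the fields I'd expect in a non-streaming `/api/generate` reply, so check your version's output if the keys are missing.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_payload(prompt: str, model: str = "codestral") -> bytes:
    """Encode a non-streaming generate request for Ollama."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")


def tokens_per_second(response: dict) -> float:
    """Compute generation speed from Ollama's timing fields.

    eval_count is the number of generated tokens; eval_duration is in nanoseconds.
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(prompt: str, model: str = "codestral") -> dict:
    """Send the request to a running Ollama server and return the parsed JSON."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as reply:
        return json.load(reply)
```

With the server running, `tokens_per_second(generate("write a function that reverses a string"))` gives you a number to compare against the 45–55 tokens/sec range. Run it a few times, because the first call includes model load time.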
If you're going cloud, grab a key from console.mistral.ai and smoke-test it:
$ curl -s https://api.mistral.ai/v1/chat/completions \
-H "Authorization: Bearer $MISTRAL_API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"codestral-latest","messages":[{"role":"user","content":"say ok"}]}' \
| python3 -c "import sys,json;print(json.load(sys.stdin)['choices'][0]['message']['content'])"
ok
codestral-latest is the rolling alias; pin the dated version if you want reproducibility.
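The same smoke test can be done from Python with only the standard library, which is handy if you want to script it. This sketch mirrors the curl call above; the function names are mine, and the `model` argument is where you would put a pinned version instead of the rolling alias.

```python
import json
import os
import urllib.request

CHAT_URL = "https://api.mistral.ai/v1/chat/completions"


def build_chat_request(prompt: str, model: str, api_key: str) -> urllib.request.Request:
    """Build the same POST request the curl example sends."""
    body = json.dumps({"model": model, "messages": [{"role": "user", "content": prompt}]})
    return urllib.request.Request(
        CHAT_URL,
        data=body.encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
    )


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat response."""
    return response["choices"][0]["message"]["content"]


def chat(prompt: str, model: str = "codestral-latest") -> str:
    """Call the API using the key in MISTRAL_API_KEY and return the reply text."""
    request = build_chat_request(prompt, model, os.environ["MISTRAL_API_KEY"])
    with urllib.request.urlopen(request, timeout=60) as reply:
        return extract_reply(json.load(reply))
```

Calling `chat("say ok")` with the key exported should return the same short reply as the curl test.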
Wiring it into Cline
Cline takes any OpenAI-compatible endpoint, so the Mistral API drops straight in.
- Open the Cline panel → Settings (gear icon).
- API Provider: choose OpenAI Compatible.
- Base URL: https://api.mistral.ai/v1
- API Key: your Mistral key.
- Model ID: codestral-latest
- Save, then start a task.
That's the whole setup. Where it gets interesting is what to use it for. Codestral 2 is a code-specialist, not a generalist agent. On a single "edit this function" task it's excellent. On a 12-step Cline plan — read three files, run a test, parse the failure, patch, re-run — it loses the thread sooner than DeepSeek V4-Flash or Gemini 3.5 Flash. If your Cline workflow is mostly "apply this focused change," Codestral 2 is great and cheap. If it's "figure out why the integration test flakes and fix it," reach for DeepSeek V4-Flash instead.
One practical note: unlike DeepSeek V4-Flash, Codestral 2 has no separate "thinking mode" to disable, so you skip the tool-call loop trap that bites Cline users on reasoning models. It just answers.
Wiring it into Cursor (and the Tab caveat)
Cursor lets you override the OpenAI base URL, which routes Chat and Cmd-K through Codestral 2:
- Settings → Models.
- Scroll to OpenAI API Key, expand the override.
- Base URL: https://api.mistral.ai/v1
- Paste your Mistral key, click Verify.
- Add a custom model named codestral-latest and enable it.
Here's the catch every Cursor power user hits: the custom endpoint powers Chat and Cmd-K, but not Tab. Cursor's Tab autocomplete runs on Cursor's own proprietary models and cannot be repointed at an external API. So routing Cursor through Codestral 2 gets you a cheaper chat/edit backend, but your inline gray-text completion is still Cursor's. This is the same limitation that applies to every external backend in Cursor — see the Cursor + Ollama setup guide for the full breakdown.
That limitation is exactly why, if autocomplete is what you care about, Continue.dev is the better host for Codestral 2 — because Continue can use the dedicated FIM endpoint.
Continue.dev: the FIM setup, and the bug that quietly breaks it
This is where Codestral 2 earns its keep. Continue.dev lets you assign a model to the autocomplete role and point it at Mistral's dedicated FIM endpoint, which is a different host from the chat API:
FIM completions → https://codestral.mistral.ai/v1/fim/completions
Chat completions → https://api.mistral.ai/v1/chat/completions
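What separates the two endpoints is the shape of the request. A FIM call carries the code before the cursor and the code after it, while a chat call carries messages. Here is a minimal sketch of building a FIM body; the field names (`prompt`, `suffix`, `max_tokens`) reflect my understanding of Mistral's FIM API and should be checked against the API reference, and the helper names are hypothetical.

```python
import json

FIM_URL = "https://codestral.mistral.ai/v1/fim/completions"


def split_at_cursor(source: str, cursor: int) -> tuple[str, str]:
    """Split a buffer into the text before and after the cursor offset."""
    return source[:cursor], source[cursor:]


def build_fim_body(source: str, cursor: int, model: str = "codestral-latest",
                   max_tokens: int = 64) -> dict:
    """Build a FIM request body: prefix goes in 'prompt', the rest in 'suffix'."""
    prefix, suffix = split_at_cursor(source, cursor)
    return {"model": model, "prompt": prefix, "suffix": suffix, "max_tokens": max_tokens}


# Example buffer with the cursor sitting right after "return ".
buffer = "def add(a, b):\n    return \n\nprint(add(1, 2))\n"
cursor = buffer.index("return ") + len("return ")
print(json.dumps(build_fim_body(buffer, cursor), indent=2))
```

Because the model sees `print(add(1, 2))` in the suffix, it has the context to complete the return expression instead of guessing from the prefix alone.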
In your Continue config (~/.continue/config.yaml in the current YAML format), the autocomplete model looks like this:
models:
  - name: Codestral FIM
    provider: mistral
    model: codestral-latest
    apiKey: YOUR_MISTRAL_KEY
    apiBase: https://codestral.mistral.ai/v1
    roles:
      - autocomplete
    autocompleteOptions:
      maxPromptTokens: 1024
      debounceDelay: 250
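A quick sanity check on a config like this is to confirm that every model with the autocomplete role points at the FIM host. The sketch below assumes the YAML has already been parsed into a plain dict (with whatever YAML library you use); the function name is hypothetical. It is deliberately conservative: a model with no `apiBase` at all is flagged, even though a provider default might be fine, and it only inspects the config, not what the extension actually sends.

```python
from urllib.parse import urlparse

FIM_HOST = "codestral.mistral.ai"  # host of the FIM endpoint named in this article


def misrouted_autocomplete_models(config: dict) -> list[str]:
    """Return names of autocomplete models whose apiBase is not the FIM host."""
    flagged = []
    for model in config.get("models", []):
        if "autocomplete" not in model.get("roles", []):
            continue  # chat/edit models are allowed to use the chat host
        host = urlparse(model.get("apiBase", "")).hostname
        if host != FIM_HOST:
            flagged.append(model.get("name", "<unnamed>"))
    return flagged


# Dict form of the config above, plus a deliberately wrong second entry.
example = {
    "models": [
        {"name": "Codestral FIM", "apiBase": "https://codestral.mistral.ai/v1",
         "roles": ["autocomplete"]},
        {"name": "Wrong host", "apiBase": "https://api.mistral.ai/v1",
         "roles": ["autocomplete"]},
    ]
}
print(misrouted_autocomplete_models(example))
```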
The problem: completions feel dumb and slow
Here's the real-world snag. Several Continue users (tracked in continuedev/continue issue #7178) found that autocomplete was hitting …/v1/chat/completions instead of …/v1/fim/completions. The symptoms: completions arrive late, ignore the code after your cursor, and sometimes spit out a markdown code fence into your editor. That's the chat endpoint pretending to do autocomplete — it only sees the prefix, never the suffix, so it can't do











