This article was originally published on aicoderscope.com
TL;DR: Google's June 5 Gemma 4 QAT release ships 4-bit checkpoints that need roughly 72% less memory than the bf16 weights, which means the 12B fits in about 7GB of VRAM, the 26B-A4B MoE in about 15GB, and the dense 31B in about 18GB, all on hardware you might already own. For inline chat and FIM autocomplete in Continue.dev, the 12B is the sweet spot. For agentic file-editing in Cline, you need 26B-A4B or 31B and Ollama 0.22.1+, or the tool calls silently fail.
| | Gemma 4 12B (dense) | Gemma 4 26B-A4B (MoE) | Gemma 4 31B (dense) |
|---|---|---|---|
| 4-bit QAT VRAM | ~7 GB | ~15 GB | ~18 GB |
| Best for | Continue.dev chat + autocomplete | Cline agentic edits on a 16GB card | Cline on 24GB, hardest reasoning |
| Context window | 256K | 256K | 256K |
| The catch | Weak on long multi-step agent loops | Needs Ollama 0.22.1+ for tool calls | 24GB GPU for comfortable context |
Honest take: If you have a single 8GB–16GB GPU and want local AI coding that actually feels useful, run Gemma 4 12B QAT as your Continue.dev chat-and-autocomplete model and stop there. Cline's agentic loop wants the 26B-A4B on a 16GB+ card — but verify your Ollama version first, because the tool-calling parser was broken until 0.22.1 and that single fact wastes more afternoons than any config typo.
Google DeepMind shipped quantization-aware training (QAT) checkpoints for the whole Gemma 4 family on June 5, 2026 — two days after the Gemma 4 12B base model itself landed. The headline is memory: QAT bakes the quantization into training instead of bolting it on afterward, so the 4-bit checkpoints keep near-original quality while using roughly one-third the VRAM of the bf16 weights. For local AI coding, that is the number that matters. A model that needed a 24GB card last month now runs on a 16GB laptop, and the 12B drops onto an 8GB GPU with headroom to spare.
This guide is about turning that into a working setup: which size to pick for which tool, the exact Ollama tags, the context-window dial you have to set, and the one tool-calling bug that makes Cline look broken when it isn't.
What "QAT" actually buys you
Standard post-training quantization (PTQ) takes a finished bf16 model and rounds the weights down to 4-bit. It works, but accuracy slips — and on code, where one wrong token breaks a build, that slip is expensive. QAT runs the quantization math during training, so the model learns weights that survive the rounding. Google reports the QAT 4-bit checkpoints land closer to full precision than naive PTQ, at about 72% less memory.
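To make the rounding step concrete, here is a minimal sketch of symmetric 4-bit round-to-nearest quantization, the kind of operation PTQ applies once after training and QAT simulates during training. It is an illustration under simplifying assumptions (one scale for the whole tensor, plain Python lists, and a hypothetical helper named `fake_quant`), not Google's actual QAT recipe.

```python
def fake_quant(weights, bits=4):
    """Quantize to signed `bits`-bit integers, then map back to floats.

    Illustrative only: a single scale for the whole list (per-tensor,
    symmetric). Real quantizers typically use per-block or per-channel scales.
    """
    qmax = 2 ** (bits - 1) - 1           # 7 for 4-bit
    qmin = -qmax - 1                     # -8 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        return list(weights)             # all zeros: nothing to round
    quantized = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in quantized]


weights = [7.0, -3.0, 1.2, 0.0]
restored = fake_quant(weights)
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(restored)  # values after the 4-bit round trip
print(errors)    # per-weight rounding error
```

PTQ applies this rounding once to finished weights and accepts whatever error results. QAT runs the same rounding in the forward pass while training, so the optimizer can settle on weights that lose less when rounded.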
The concrete savings, from Google's own figures: the dense 31B at 16-bit is roughly 60GB; the 4-bit QAT checkpoint lands in the 17–19GB range. That is the difference between needing two 3090s and needing one. Here is how the family shakes out for a coding box:
| Model | Type | Active params | 4-bit QAT VRAM | Realistic GPU |
|---|---|---|---|---|
| Gemma 4 E2B | dense | 2B | <1 GB (text-only) | Any laptop / iGPU |
| Gemma 4 E4B | dense | 4B | ~3 GB | 6GB GPU |
| Gemma 4 12B | dense | 12B | ~7 GB | 8GB GPU (RTX 4060) |
| Gemma 4 26B-A4B | MoE | 4B of 26B | ~15 GB | 16GB GPU / 16GB Mac |
| Gemma 4 31B | dense | 31B | ~18 GB | 24GB GPU (RTX 3090/4090) |
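The figures in the table can be sanity-checked with back-of-envelope arithmetic: raw weight storage is parameters times bits per weight, divided by eight. The sketch below assumes 1 GB = 1e9 bytes and counts weights only; the helper name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Raw weight storage in GB (1 GB = 1e9 bytes), ignoring all overhead."""
    return params_billions * bits_per_weight / 8


for name, params in [("12B", 12), ("26B-A4B", 26), ("31B", 31)]:
    bf16 = weight_memory_gb(params, 16)
    int4 = weight_memory_gb(params, 4)
    print(f"{name}: bf16 ~{bf16:.1f} GB, 4-bit ~{int4:.1f} GB")
```

The raw 4-bit numbers come out somewhat below the table's VRAM column. The gap is presumably quantization scales, any layers kept at higher precision, and runtime buffers, so treat the arithmetic as a lower bound rather than a prediction.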
The 26B-A4B is the interesting one. It is a Mixture-of-Experts model: 26B total parameters but only ~4B active per token. So it loads like a 15GB model but runs at the speed of a 4B model while reasoning with the breadth of something much larger. On a 16GB card it is the best agentic-coding value in the lineup right now.
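The "loads like one size, runs like another" point can be put in numbers. This is a rough sketch that assumes memory scales with total parameters and per-token work scales with active parameters; it ignores attention cost, router overhead, and memory bandwidth, and `moe_profile` is a hypothetical helper.

```python
def moe_profile(total_params_b, active_params_b, bits=4):
    """Return (raw weight memory in GB, fraction of weights used per token)."""
    memory_gb = total_params_b * bits / 8
    active_fraction = active_params_b / total_params_b
    return memory_gb, active_fraction


memory_gb, fraction = moe_profile(total_params_b=26, active_params_b=4)
dense_vs_moe = 31 / 4  # dense 31B touches every weight on every token

print(f"26B-A4B raw 4-bit weights: ~{memory_gb:.1f} GB")
print(f"weights used per token: {fraction:.0%}")
print(f"dense 31B touches ~{dense_vs_moe:.2f}x more weights per token")
```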
All sizes carry a 256K context window except E2B/E4B (128K). Every variant handles text and images; E2B, E4B, and 12B also do video and audio natively — irrelevant for coding, but it explains why the base downloads are larger than a text-only model of the same size.
Grab the right GGUF — and skip the naive Q4_0
There is a quality trap in the file naming. The naive Q4_0 conversion of Gemma 4 degrades accuracy more than it should, even though the file is larger than smarter quants. The community fix is Unsloth's UD-Q4_K_XL dynamic GGUFs, which apply different bit-widths to different layers and recover most of the lost accuracy. Google also publishes official q4_0-gguf and w4a16-ct checkpoints on Hugging Face, but for local runs the Unsloth dynamic quants are the safer default.
In practice:
- 12B is published by Ollama directly under a stock -it-qat tag, so you can just pull it.
- E4B, 26B-A4B, and 31B are distributed as Unsloth dynamic GGUFs; pull them from Hugging Face or point Ollama at the GGUF.
# 12B QAT — the easy path, straight from Ollama's library
ollama pull gemma4:12b-it-qat
# 26B-A4B and 31B — Unsloth dynamic GGUF (recommended quant)
# (download the UD-Q4_K_XL file from huggingface.co/unsloth, then:)
ollama create gemma4-26b-qat -f Modelfile
A minimal Modelfile for the larger Unsloth GGUFs looks like this:
FROM ./gemma-4-26B-it-qat-UD-Q4_K_XL.gguf
PARAMETER num_ctx 16384
PARAMETER temperature 0.7
PARAMETER top_p 0.95
One non-negotiable: use Ollama 0.22.1 or newer. The 0.22.1 release ships a rewritten Gemma 4 renderer that finally handles the model's explicit thinking mode and tool calling locally. Earlier builds (through 0.20.1) had a broken tool-call parser — the model would emit a valid function call and Ollama would hand back plain text. You will not see an error; the agent just acts like a chatbot. Check before you debug anything else:
$ ollama --version
ollama version is 0.22.1
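If you want to script that check, the sketch below parses the version line shown above and compares it to the 0.22.1 minimum. It also includes a small test for the failure mode described earlier: a chat response that should carry structured tool calls but arrives as plain text. The function names are hypothetical, and the response shape (`message.tool_calls`) is assumed to follow Ollama's chat API.

```python
import re

MIN_VERSION = (0, 22, 1)  # minimum cited in this article


def parse_ollama_version(text):
    """Extract (major, minor, patch) from `ollama --version` output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in: {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(text, minimum=MIN_VERSION):
    """True if the reported version is at least `minimum`."""
    return parse_ollama_version(text) >= minimum


def has_tool_calls(chat_response):
    """True if a chat response dict carries structured tool calls.

    Assumes the shape {"message": {"tool_calls": [...]}}.
    """
    message = chat_response.get("message", {})
    return bool(message.get("tool_calls"))


print(meets_minimum("ollama version is 0.22.1"))
print(meets_minimum("ollama version is 0.20.1"))
```

To feed it real output, capture the text of `ollama --version` with `subprocess.run(["ollama", "--version"], capture_output=True, text=True).stdout`. If `has_tool_calls` returns False on a prompt that clearly asked for a tool, suspect the Ollama version before the agent config.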
Continue.dev: the 12B is your daily driver
Continue.dev gives you Copilot-style chat, inline edits, and tab-autocomplete pointed at a local model — and the 12B QAT is the right size for all three on an 8GB–16GB machine. Continue added native Gemma 4 model support in June, so the config is straightforward.
Two roles matter here, and they want different treatment. The chat/edit role is where the 12B shines: it is smart enough to explain a function, refactor a block, or write a test, and at ~21 tokens/second on an RTX 4060 (community-measured, llama.cpp) it is fast enough to feel interactive. The autocomplete role uses fill-in-the-middle (FIM): Continue sends the prefix and suffix of your file and asks the model to predict the middle. Gemma 4 can do this, but a 12B is heavier than you want firing on every keystroke — many people pair it with a small dedicated FIM model and keep the 12B for chat.
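A fill-in-the-middle request boils down to splitting the buffer at the cursor. The sketch below builds a request body in the shape Ollama's generate endpoint accepts, with `prompt` as the prefix and `suffix` as the text after the cursor. It is not Continue's implementation, `fim_payload` is a hypothetical helper, and whether a particular Gemma 4 tag's template supports `suffix` is an assumption you would need to verify.

```python
import json


def fim_payload(buffer, cursor, model="gemma4:e4b-it-qat"):
    """Build a fill-in-the-middle request body from a buffer and cursor offset."""
    return {
        "model": model,
        "prompt": buffer[:cursor],   # everything before the cursor
        "suffix": buffer[cursor:],   # everything after the cursor
        "stream": False,
    }


source = "def add(a, b):\n    return \n\nprint(add(1, 2))\n"
cursor = source.index("return ") + len("return ")  # cursor sits after "return "
payload = fim_payload(source, cursor)
print(json.dumps(payload, indent=2))
```

The model only has to produce the missing middle (here, something like `a + b`), which is why a small model is usually enough for this role.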
A working config.yaml that splits the roles:
models:
  - name: Gemma 4 12B (chat)
    provider: ollama
    model: gemma4:12b-it-qat
    roles:
      - chat
      - edit
    defaultCompletionOptions:
      contextLength: 16384
  - name: Gemma 4 E4B (autocomplete)
    provider: ollama
    model: gemma4:e4b-it-qat
    roles:
      - autocomplete
The contextLength line is the dial people forget. Ollama defaults to a small context (historically 2K–4K depending on build), and if you do not raise num_ctx on the model and set contextLength in Continue to match, your "256K context" model will silently truncate your file at a few thousand tokens. Set both. For a coding workload, 16K is a sane floor; push to 32K only if your VRAM allows, because every extra token of KV cache eats memory on top of the ~7GB the weights already use.
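To see why context length competes with the weights for VRAM, here is the standard KV-cache estimate for a transformer with full attention: two vectors (key and value) per layer per token. The architecture numbers below are hypothetical placeholders, not Gemma 4's published configuration, so read the output as a shape of the curve rather than a measurement.

```python
def kv_cache_gib(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache size in GiB for a transformer with full attention.

    The factor of 2 is one key vector plus one value vector
    per layer per token; bytes_per_value=2 assumes a 16-bit cache.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token / 2**30


# Hypothetical architecture numbers, NOT Gemma 4's actual config.
SHAPE = dict(layers=48, kv_heads=8, head_dim=128)

for ctx in (4096, 16384, 32768):
    print(f"{ctx:>6} tokens: {kv_cache_gib(ctx, **SHAPE):.2f} GiB")
```

The cache grows linearly, so doubling the context doubles its footprint. Models that use sliding-window attention on some layers, or runners that quantize the KV cache, will land below this estimate.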
If you want the full walkthrough of Continue + Ollama wiring, the Continue.dev + Ollama local setup guide covers the VS Code and JetBrains paths, and the Continue.dev + LM Studio guide covers the GGUF-via-LM-Studio route if you prefer that runner.
Cline: agentic editing needs the bigger model and the right Ollama
Cline is a different animal. It does not just suggest — it runs an agentic loop: read files, plan, write edits, run a terminal command, check the result, iterate. That loop lives or dies on tool calling. The model has to emit structured function calls reliably, dozens of times per task, without drifting into "here'











