A mid-range gaming GPU from 2023 ships with 12GB of VRAM and enough tensor cores to run a 7-billion-parameter language model faster than you can read. Most of that silicon sits idle while you type prompts into a browser tab and pay a monthly fee to rent compute on someone else's server. Off Grid AI Desktop is a free, open-source app that runs chat, image generation, and voice directly on your Windows PC.
GitHub → Free, open source, runs offline.
It is Electron and React on the outside. Inside it bundles llama.cpp, stable-diffusion.cpp, whisper.cpp, and an open-weight text-to-speech model. Nothing routes through a server we own. No account. No telemetry.
The Off Grid AI Desktop chat, running a local model fully on-device.
What You Need
Windows acceleration runs on CUDA for NVIDIA cards or Vulkan for AMD and Intel. CPU fallback works too, just slower.
| Tier | GPU / RAM | What runs well |
|---|---|---|
| Minimum | No discrete GPU, 16GB system RAM | Small chat models (3-4B, Q4), STT, slow image gen on CPU |
| Recommended | NVIDIA RTX with 8GB+ VRAM, 16GB RAM | 7-8B chat, SDXL-Lightning images, voice, all responsive |
| Comfortable | NVIDIA RTX with 12-16GB VRAM, 32GB RAM | Larger chat models, full SDXL, fast generation |
You need Windows 10 or 11 (64-bit) and roughly 10-20GB of free disk for the app plus your first models. Models are downloaded on demand, so the floor depends on what you pull.
What Off Grid AI Desktop Can Do
You get a private AI studio. Three things you would normally pay three separate subscriptions for, all on your own machine.
Chat with a local model. A bundled llama-server runs quantized GGUF models like Gemma and Qwen. You control temperature and context window. It behaves like ChatGPT, except the conversation never leaves your PC.
Browse and download models. A built-in Hugging Face browser lets you pick any compatible GGUF and pull it through a download manager. No command line, no manual file shuffling.
Generate images on-device. stable-diffusion.cpp runs SDXL, SDXL-Lightning for fast few-step output, SD 1.5 and 2.1, and the 2026 Z-Image-Turbo. You get text-to-image and image-to-image, a live per-step preview, progress with ETA, a cancel button, a lightbox, and style presets.
Talk to it and let it talk back. whisper.cpp turns your mic into text in the composer. Kokoro-82M, an open-weight multilingual model, reads replies aloud per message or in an auto-speak voice mode.
You also get a meeting recorder, screen-capture-to-memory, document chat with cited sources, and a live HTML/SVG/Mermaid canvas. All local. This article is the broad tour.
Which Models to Use
Pick the model to fit your VRAM, not the other way around.
| Your hardware | Chat model | Image model |
|---|---|---|
| CPU only / 8GB VRAM | Gemma or Qwen 3-4B, Q4_K | SD 1.5, or SDXL-Lightning |
| RTX 8-12GB VRAM | 7-8B, Q4_K or Q8_0 | SDXL-Lightning, Z-Image-Turbo |
| RTX 12GB+ VRAM | 8B+ at higher quant | Full SDXL |
Quantization is the lever. A Q4_K model is roughly half the size of Q8_0 and fits more comfortably in tight VRAM, at a small quality cost. Z-Image-Turbo and SDXL-Lightning reach a usable image in far fewer steps than standard SDXL, which matters most on smaller cards.
How Hardware Acceleration Works
A language model is mostly matrix multiplication. Your GPU does that work in parallel across thousands of cores. On NVIDIA hardware, llama.cpp offloads those layers through CUDA. On AMD and Intel GPUs it uses Vulkan. If a model is too large for VRAM, layers spill to system RAM and run slower but still run.
Quantization is what makes this fit on consumer hardware. Full-precision weights are 16 bits each. Quantized GGUF models pack them down to 4 or 8 bits, shrinking a 7B model from around 14GB to 4-8GB. That is the difference between a model that fits your card and one that does not.
Keeping It Fast
A few habits keep generation responsive.
Watch your context window. Every token of history is recomputed, so a 32K context costs far more than 4K. Keep it as small as the task allows.
Match quantization to VRAM. If a model swaps to system RAM, you will feel it. Drop to Q4_K before you drop the model.
For images, start with a few-step model. SDXL-Lightning and Z-Image-Turbo give you a finished image while standard SDXL is still warming up. Use full SDXL when you want maximum quality and have the headroom.
Privacy: Stronger Than a Cloud Subscription
A cloud AI service sees every prompt, every image you generate, and every document you upload. It keeps logs. It trains on some of it. It requires an account tied to your identity.
Off Grid AI Desktop sends none of that anywhere. Inference happens on your CPU and GPU. There is no account and no telemetry. The code is AGPL-3.0, so you can read exactly what it does. Unplug your network and it keeps working.
Getting Started
- Open the repo: github.com/off-grid-ai/desktop.
- Download the Windows build, or clone and build from source.
- Launch the app. It detects CUDA or Vulkan and falls back to CPU.
- Open the Models browser, pick a chat model that fits your VRAM, and download it.
- Start a chat, generate an image, or hit the mic. Everything runs locally from here.
What's Coming
- Cross-device sync, so your chats and memory follow you between machines, still end-to-end private.
- More bundled models as open weights improve.
- Unified search across chats, documents, and captured memory.
Honest roadmap. We build the local spine first.
The built-in model browser. Download text, vision, image, and voice models from Hugging Face, sized to your machine.
FAQ
Q: Is it really free?
Yes. The app is free and open source under AGPL-3.0. There is no account, no metered usage, no API key.
Q: Does it work offline?
Yes. Once a model is downloaded, you can disconnect from the internet entirely. Inference is local.
Q: Do I need an NVIDIA GPU?
No. NVIDIA cards use CUDA and are fastest. AMD and Intel GPUs use Vulkan. A modern CPU runs smaller models without any GPU.
Q: How much RAM or VRAM do I need?
8GB of VRAM runs a 7-8B chat model and fast image generation well. 16GB of system RAM is a reasonable floor for CPU-only use with smaller models.
Q: Which models can I run?
Any compatible GGUF for chat (Gemma, Qwen, and more) and SDXL, SDXL-Lightning, SD 1.5, SD 2.1, or Z-Image-Turbo for images. Browse and download them from inside the app.
Q: Is my data private?
Yes. Nothing leaves your PC. No telemetry, no account, and the source is open for you to verify.
Run AI on hardware you already own. github.com/off-grid-ai/desktop















