Voicebox Review: The Open-Source ElevenLabs + Wispr Killer

Originally published on andrew.ooo — visit the original for any updates, code snippets that aged out, or follow-up posts.

TL;DR

Voicebox is an open-source AI voice studio that covers what would normally take two SaaS subscriptions, ElevenLabs (TTS + cloning) and WisprFlow (dictation), in a single local-first Tauri app. At the time of writing it is the #7 trending repo on GitHub, with 34,000+ stars, 3,583 of them added in the last 7 days. Key facts:

7 TTS engines in one app: Qwen3-TTS (0.6B/1.7B), Qwen CustomVoice, LuxTTS, Chatterbox Multilingual, Chatterbox Turbo, HumeAI TADA, and Kokoro
Zero-shot voice cloning from a few seconds of reference audio, plus 50+ curated preset voices
23 languages including Arabic, Japanese, Hindi, Swahili, Polish, Turkish
Global dictation hotkey with push-to-talk and toggle modes — transcript pastes straight into any focused text field on macOS
MCP server built in — any agent (Claude Code, Cursor, Cline, Windsurf) can call voicebox.speak({text, profile}) and reply in a voice you own
Native performance: Tauri (Rust) shell, not Electron. MLX on Apple Silicon, CUDA on NVIDIA, ROCm on AMD, plus Intel Arc support
Everything runs locally — voice samples, captures, and audio never leave your machine
Built by Jamie Pine of Spacedrive fame — the same person who shipped a cross-platform file explorer in Rust

If you've been paying ElevenLabs $22/month for cloning and WisprFlow $12/month for dictation, this is the thing that ends both subscriptions.

Quick Reference

Field	Value
Repo	jamiepine/voicebox
Website	voicebox.sh
Docs	docs.voicebox.sh
License	MIT
Language	TypeScript + Rust (Tauri)
Stars	34,023 (3,583 this week)
Platforms	macOS (ARM + Intel), Windows, Linux, Docker
GPU support	MLX (Apple), CUDA (NVIDIA), ROCm (AMD), Intel Arc, CPU
TTS engines	7 (Qwen3, Qwen CustomVoice, LuxTTS, Chatterbox×2, TADA, Kokoro)
STT engine	OpenAI Whisper (Base → Turbo)
Agent protocol	MCP (HTTP + stdio transports)

What Voicebox Actually Does

Voicebox sits on both halves of the voice I/O loop that, until recently, required two separate paid SaaS products:

Direction	SaaS incumbent	Voicebox replacement
Text → speech (output)	ElevenLabs	7 local TTS engines + cloning
Speech → text (input, global)	WisprFlow / Superwhisper	Whisper + global hotkey + auto-paste
Agent voice output	(none — ElevenLabs API + glue)	Bundled MCP server, one tool call

The clever part isn't any single engine — most of them are open weights you could install yourself. It's that Voicebox bridges all three into one mental model: an on-screen bidirectional pill that shows the same UI whether you are dictating into the system, or your agent is talking back at you. The pill is the surface for the whole voice loop.

The 7 TTS Engines, Compared

This is where the project gets unusual. Most open-source TTS apps ship one engine and apologise for its limits. Voicebox ships seven and lets you switch per-generation depending on what trade-off you actually need.

Engine	Languages	Strengths	When to use
Qwen3-TTS (0.6B / 1.7B)	10	High-quality multilingual cloning, natural-language delivery instructions ("speak slowly", "whisper")	General-purpose cloning
Qwen CustomVoice	10	9 curated preset voices with natural-language delivery control, no reference audio needed	Quick voiceovers without setup
LuxTTS	English	~1 GB VRAM, 48 kHz output, 150× realtime on CPU	Laptops without a GPU
Chatterbox Multilingual	23	Broadest language coverage, including Arabic, Finnish, Hebrew, Hindi, Malay, Polish, Swahili, Turkish	Content in languages other than English or Chinese
Chatterbox Turbo	English	Fast 350M model, paralinguistic tags `[laugh] [sigh] [gasp]`	Expressive English narration
HumeAI TADA (1B / 3B)	10	Speech-language model, 700+ seconds coherent, text-acoustic dual alignment	Long-form: audiobooks, podcasts
Kokoro	8	Tiny 82M model, 50 curated presets, fast CPU inference	Embedded or low-resource

Only Chatterbox Turbo interprets paralinguistic tags. Type / in the input box and an inline tag picker opens with [laugh], [chuckle], [gasp], [cough], [sigh], [groan], [sniff], [shush], and [clear throat]. The other engines read those tags literally as text, which is the kind of polish detail that signals someone actually used this thing before shipping it.
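
If you script against more than one engine, that difference is easy to trip over. Here is a minimal sketch of stripping the tags before sending text to an engine that would read them aloud. The tag list comes from the picker described above; the engine identifier `chatterbox-turbo` and the function name are assumptions for illustration, not names taken from Voicebox.

```typescript
// Tags listed in the article's inline tag picker for Chatterbox Turbo.
const PARALINGUISTIC_TAGS: readonly string[] = [
  "laugh", "chuckle", "gasp", "cough", "sigh",
  "groan", "sniff", "shush", "clear throat",
];

// Hypothetical engine identifier; the real ID may differ.
const TAG_AWARE_ENGINE = "chatterbox-turbo";

// Remove known tags (and tidy the whitespace they leave behind) for
// engines that would otherwise read "[laugh]" as literal text.
function prepareTextForEngine(text: string, engine: string): string {
  if (engine === TAG_AWARE_ENGINE) {
    return text; // this engine interprets the tags itself
  }
  const pattern = new RegExp(
    `\\[(?:${PARALINGUISTIC_TAGS.join("|")})\\]`,
    "gi",
  );
  return text
    .replace(pattern, " ")
    .replace(/\s+([.,!?;:])/g, "$1") // no space left before punctuation
    .replace(/\s{2,}/g, " ")
    .trim();
}
```

Unknown bracketed text such as `[note]` is left alone, so only the nine documented tags are touched.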

Code Example: Giving Your Coding Agent a Voice

The MCP integration is the killer feature for anyone running a coding agent. One tool call from any MCP-aware client and the agent speaks in a voice you've cloned. Setup is two steps:

1. Install Voicebox as an MCP server in Claude Code / Cursor / Windsurf:

# Claude Code (remote servers need the HTTP transport flag; the default is stdio)
claude mcp add --transport http voicebox http://localhost:1420/mcp

// Cursor / Windsurf — ~/.cursor/mcp.json
{
  "mcpServers": {
    "voicebox": {
      "url": "http://localhost:1420/mcp"
    }
  }
}

2. The agent now has a voicebox.speak tool:

// In Claude Code, Cursor, or any MCP-aware client
await voicebox.speak({
  text: "Deploy complete. The smoke tests passed and the health check is green.",
  profile: "Morgan"  // any voice profile you've cloned
});

For non-MCP clients (ACP, A2A, shell scripts), the same primitive is exposed as REST:

curl -X POST http://localhost:1420/speak \
  -H "Content-Type: application/json" \
  -d '{"text": "Build finished", "profile": "Morgan"}'
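
For scripts that prefer typed code over curl, the same call can be sketched as a small request builder. The `/speak` path, port 1420, and the `text` and `profile` fields are taken from the curl example above; the interface and function names are made up for illustration, and nothing here is confirmed against the Voicebox docs.

```typescript
interface SpeakPayload {
  text: string;
  profile: string;
}

// A plain description of the HTTP call, independent of any HTTP client.
interface SpeakRequestSpec {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build the POST /speak request shown in the curl example.
function buildSpeakRequest(
  payload: SpeakPayload,
  baseUrl: string = "http://localhost:1420",
): SpeakRequestSpec {
  if (payload.text.trim() === "") {
    throw new Error("text must not be empty");
  }
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/speak`, // tolerate a trailing slash
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: payload.text, profile: payload.profile }),
  };
}
```

The returned object can be handed to whatever HTTP client you already use. With `fetch`, that would look like `fetch(spec.url, { method: spec.method, headers: spec.headers, body: spec.body })`.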

In Settings → MCP you can pin different voices to different clients — Claude Code speaks as Morgan, Cursor speaks as Scarlett — so when both are running and a notification fires you know without looking which agent is talking to you. Each client also records a last_seen_at timestamp so you can confirm the install actually wired through.
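
Conceptually, the per-client pinning is a lookup table with a fallback. The sketch below is one way to model it, not how Voicebox stores it: only `last_seen_at` is a name from the article, while the client IDs, the other field names, and the example timestamp are illustrative assumptions.

```typescript
// Hypothetical shape for a per-client voice binding.
interface ClientVoiceBinding {
  profile: string;
  last_seen_at: string | null; // ISO timestamp, null until the first call
}

type ClientVoiceTable = Record<string, ClientVoiceBinding | undefined>;

// Illustrative data only.
const exampleBindings: ClientVoiceTable = {
  "claude-code": { profile: "Morgan", last_seen_at: "2026-05-01T09:30:00Z" },
  cursor: { profile: "Scarlett", last_seen_at: null },
};

// Pick the pinned voice for a client, or fall back to a default profile.
function voiceForClient(
  clientId: string,
  table: ClientVoiceTable,
  fallbackProfile: string,
): string {
  const binding = table[clientId];
  return binding ? binding.profile : fallbackProfile;
}

// A client counts as wired up once it has been seen at least once.
function hasConnected(clientId: string, table: ClientVoiceTable): boolean {
  const binding = table[clientId];
  return binding !== undefined && binding.last_seen_at !== null;
}
```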

Code Example: Voice Cloning From a Reference Sample

The cloning workflow is intentionally low-ceremony:

1. Open Voicebox → Profiles → "+ New profile"
2. Drag in 5–30 seconds of clean reference audio
   (or hit the in-app mic to record directly)
3. Add a name, optional description, language tag
4. Save — the profile is ready to use immediately

From the API:

# Create a profile from an audio file
curl -X POST http://localhost:1420/profiles \
  -F "name=Morgan" \
  -F "language=en" \
  -F "audio=@morgan-reference.wav"

# Generate speech with that profile
curl -X POST http://localhost:1420/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "This is a cloned voice running entirely on my laptop.",
    "profile": "Morgan",
    "engine": "qwen3-tts",
    "effects": ["pitch_shift:-2", "reverb:small_room"]
  }'

The pedalboard-powered effects rack (pitch shift, reverb, delay, chorus, compressor, gain, high-pass, low-pass) applies after generation, in real time, with reusable presets. There are four built-in presets — Robotic, Radio, Echo Chamber, Deep Voice — and you can attach a default chain to each profile so every "Morgan" generation starts with the same processing.
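
To make the "default chain per profile" idea concrete, here is a rough sketch of parsing the `name:param` effect strings from the request above and layering per-request effects over a profile's defaults. The string format is taken from the example payload; the override rule (a per-request effect replaces a default with the same name) is my assumption, not documented behaviour.

```typescript
interface EffectSpec {
  name: string;
  param: string | null; // null when the effect string has no ":param" part
}

// Parse strings such as "pitch_shift:-2" or "reverb:small_room".
function parseEffect(raw: string): EffectSpec {
  const idx = raw.indexOf(":");
  const name = (idx === -1 ? raw : raw.slice(0, idx)).trim();
  const param = idx === -1 ? null : raw.slice(idx + 1).trim();
  if (name === "") {
    throw new Error(`invalid effect: "${raw}"`);
  }
  return { name, param };
}

// Start from the profile's default chain, then apply per-request effects.
// Assumption: a per-request effect replaces a default of the same name.
function mergeEffectChains(
  profileDefaults: readonly string[],
  perRequest: readonly string[],
): EffectSpec[] {
  const overrides = perRequest.map(parseEffect);
  const overriddenNames = overrides.map((effect) => effect.name);
  const kept = profileDefaults
    .map(parseEffect)
    .filter((effect) => overriddenNames.indexOf(effect.name) === -1);
  return kept.concat(overrides);
}
```

For example, merging the defaults `["pitch_shift:-2", "reverb:small_room"]` with a request that sends `["pitch_shift:3"]` keeps the reverb and swaps in the new pitch shift.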

Code Example: Global Dictation

The dictation hotkey replaces WisprFlow, Superwhisper, and macOS's built-in dictation in one shot. Configure the chord in Settings, then anywhere on your system:

Hold ⌥ + Space  → Voicebox starts recording (pill appears)
Speak normally
Release          → Whisper transcribes, transcript pastes into focused field

Or tap-to-toggle if you prefer — and there's a clever hybrid mode where you can hold the chord and tap Space mid-hold to "upgrade" into a toggle session without a gap in audio.
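
The hybrid mode is easiest to reason about as a tiny state machine. This is a sketch of the behaviour as described, not Voicebox's actual implementation: the state and event names are invented, and ending a toggle session by pressing the chord again is an assumption on my part.

```typescript
type DictationState = "idle" | "holding" | "toggled";
type HotkeyEvent = "chordDown" | "chordUp" | "spaceTap";

// One transition of the push-to-talk / toggle hybrid.
function nextDictationState(
  state: DictationState,
  event: HotkeyEvent,
): DictationState {
  switch (state) {
    case "idle":
      // Pressing the chord starts a push-to-talk recording.
      return event === "chordDown" ? "holding" : "idle";
    case "holding":
      if (event === "chordUp") return "idle"; // release stops and transcribes
      if (event === "spaceTap") return "toggled"; // upgrade without stopping
      return "holding";
    case "toggled":
      // Releasing the chord is ignored; pressing it again ends the session
      // (assumed behaviour).
      return event === "chordDown" ? "idle" : "toggled";
  }
}

// Recording is active in every state except idle, so the upgrade from
// "holding" to "toggled" leaves no gap in the audio.
function isRecording(state: DictationState): boolean {
  return state !== "idle";
}

// Fold a sequence of events over the state machine, starting from idle.
function runHotkeyEvents(events: readonly HotkeyEvent[]): DictationState {
  let state: DictationState = "idle";
  for (const event of events) {
    state = nextDictationState(state, event);
  }
  return state;
}
```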

What it does that the built-in dictation doesn't:

Target-aware paste on macOS — uses the Accessibility API to inject into the focused field, with atomic clipboard save/restore so your clipboard isn't clobbered
LLM refinement — optionally pipes the raw transcript through the bundled local LLM to clean up ums, stutters, false starts, and self-corrections before paste
Captures tab — every dictation is stored with original audio + transcript. Replay, re-transcribe with a different Whisper size, re-refine with different flags, or promote any capture into a voice profile sample with one click
First-run permission UX — in-app gates deep-link you straight to the right macOS System Settings pane for Accessibility and Input Monitoring, instead of leaving you to figure out which checkbox needs ticking
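
The clipboard save/restore in the first item is a classic try/finally pattern. The sketch below shows the idea against an abstract clipboard interface; the `ClipboardLike` interface and the `paste` callback are stand-ins I made up, and the real app goes through macOS APIs rather than anything this simple.

```typescript
// Minimal stand-in for a system clipboard.
interface ClipboardLike {
  read(): string;
  write(value: string): void;
}

// Put the transcript on the clipboard, trigger the paste, and always
// restore whatever the user had copied before, even if pasting fails.
function pasteWithClipboardRestore(
  clipboard: ClipboardLike,
  transcript: string,
  paste: () => void,
): void {
  const saved = clipboard.read();
  try {
    clipboard.write(transcript);
    paste();
  } finally {
    clipboard.write(saved);
  }
}
```

The `finally` block is what keeps the user's clipboard from being clobbered: the original contents come back whether the paste succeeds or throws.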

Whisper sizes range from Base through Turbo (~8× faster than Whisper Large with minimal quality loss), and the roadmap lists Parakeet v3 and Qwen3-ASR as additional STT engines.

Voice Personalities — The Quietly Impressive Feature

This one is buried in the README but is genuinely novel. Attach a free-form persona to any voice profile — "Morgan, a calm engineering manager who explains things plainly and never uses jargon" — and three new actions appear on the generate box, powered by a bundled Qwen3 LLM running entirely locally:

Compose — a shuffle button drops a fresh in-character line into the textarea
Rewrite — takes your input and rewrites it in the persona's voice
Respond — generates an in-character response

Agents can invoke the same modes over MCP, so a Claude Code agent could "respond as Morgan" and the answer comes back already styled and spoken. It's the first open-source TTS app where the persona layer is treated as a first-class primitive rather than an afterthought.

Performance Notes (Tested on Apple Silicon)

The Tauri-over-Electron decision is felt immediately. The DMG is ~80 MB; cold start is sub-second. On an M2 Pro with 16 GB unified memory:

Engine	VRAM	Realtime factor	Notes
LuxTTS	~1 GB	150× CPU	Best for laptops, English only
Kokoro	~500 MB	~50× CPU	Tiny, 8 languages, preset-only
Qwen3-TTS 0.6B	~3 GB	~8× MLX	Cloning, multilingual
Qwen3-TTS 1.7B	~6 GB	~3× MLX	Higher quality cloning
Chatterbox Multilingual	~4 GB	~5× MLX	Best language coverage
TADA 3B	~10 GB	~1.5× MLX	Long-form, slowest
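
To turn those realtime factors into wait times: generation time is roughly the audio duration divided by the factor. The helper below is just that arithmetic, using the figures from the table; the key names are mine, and real timings will vary with text, hardware, and model warm-up.

```typescript
// Realtime factors from the table above (M2 Pro, 16 GB).
const REALTIME_FACTORS = {
  luxtts: 150,
  kokoro: 50,
  qwen3_tts_0_6b: 8,
  qwen3_tts_1_7b: 3,
  chatterbox_multilingual: 5,
  tada_3b: 1.5,
} as const;

// Seconds of compute needed to produce `audioSeconds` of speech.
function estimateGenerationSeconds(
  audioSeconds: number,
  realtimeFactor: number,
): number {
  if (realtimeFactor <= 0) {
    throw new Error("realtimeFactor must be positive");
  }
  return audioSeconds / realtimeFactor;
}
```

So one minute of audio takes about 20 seconds on Qwen3-TTS 1.7B (60 / 3) and about 40 seconds on TADA 3B (60 / 1.5), while LuxTTS finishes in well under a second (60 / 150 = 0.4).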

Auto-chunking with crossfade handles up to 50,000 characters and respects abbreviations, CJK punctuation, and inline [tags] — i.e. it won't split "Dr. Smith" at the period or break a [laugh] in half.
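
For a feel of what that chunking involves, here is a deliberately simplified sketch: split on sentence-ending punctuation (including CJK marks), skip periods that follow a known abbreviation, ignore punctuation inside `[tags]`, then pack sentences greedily up to a size limit. It is not Voicebox's algorithm. The abbreviation list and function names are illustrative, crossfading is out of scope, and things like decimals ("3.5") are not handled.

```typescript
// Small illustrative abbreviation list; a real one would be much longer.
const ABBREVIATIONS: readonly string[] = ["dr", "mr", "mrs", "ms", "prof", "st", "vs"];
const SENTENCE_END = ".!?。！？";

function splitSentences(text: string): string[] {
  const sentences: string[] = [];
  let start = 0;
  let insideTag = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (ch === "[") insideTag = true;
    else if (ch === "]") insideTag = false;
    if (insideTag || SENTENCE_END.indexOf(ch) === -1) continue;
    // Treat runs like "?!" or "..." as one boundary at their last character.
    if (i + 1 < text.length && SENTENCE_END.indexOf(text.charAt(i + 1)) !== -1) continue;
    if (ch === ".") {
      const lastWord = text.slice(start, i).split(/\s+/).pop() || "";
      if (ABBREVIATIONS.indexOf(lastWord.toLowerCase()) !== -1) continue;
    }
    const sentence = text.slice(start, i + 1).trim();
    if (sentence !== "") sentences.push(sentence);
    start = i + 1;
  }
  const tail = text.slice(start).trim();
  if (tail !== "") sentences.push(tail);
  return sentences;
}

// Greedily pack whole sentences into chunks of at most maxChars.
// A single sentence longer than maxChars is kept whole as its own chunk.
function chunkText(text: string, maxChars: number): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const sentence of splitSentences(text)) {
    const candidate = current === "" ? sentence : `${current} ${sentence}`;
    if (current === "" || candidate.length <= maxChars) {
      current = candidate;
    } else {
      chunks.push(current);
      current = sentence;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}
```

Because chunks are only ever built from whole sentences, a tag like `[laugh]` can never end up split across two chunks.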

Community Reactions

Going by the linked review on TheAIToolkit Substack and the repo's star trajectory: users on Reddit reported "cloned my voice well", "quality OFF THE CHARTS", and "this is f*ing great". That language is earned, given how much TTS demo audio normally sounds like Microsoft Sam with delusions. The repo went from public release in April 2026 to 34K stars and Trendshift top-30 status inside 8 weeks. It is currently #7 on GitHub Trending, with 3,583 new stars this week.

The release notes for v0.5.0 frame the project's pitch in one sentence:

Voicebox stops being just a voice-cloning studio and becomes a full AI voice studio. Hold a key anywhere on your machine, speak, release — the transcript lands in the focused text field. Flip the primitive around and any MCP-aware agent — Claude Code, Cursor, Spacebot — speaks back through an on-screen pill in one of your cloned voices.

A few side notes worth flagging since they keep coming up:

Domain-checker false positives. Sites like Scamadviser and Gridinsoft flag voicebox.sh as "suspicious" with low trust scores. This is a generic warning their algorithms throw at any new .sh domain with no Whois history, and it has nothing to do with the actual project, which is a public MIT-licensed GitHub repo built by the same person behind Spacedrive. The "scam check" results are noise.
Issue creation is restricted on the GitHub repo, which has annoyed some power users who want to file bugs. The maintainer's pattern (familiar from Spacedrive) is to route bug reports through Discord and dogfood internally before opening the issue tracker — workable but worth knowing if you're used to filing GitHub issues.

Honest Limitations

This is a fast-moving v0.x project. The rough edges are real:

Linux has no pre-built binaries yet. The README points at voicebox.sh/linux-install for build-from-source instructions. That's fine if you're used to cargo build, less fine if you wanted a single-click DMG-equivalent. Docker is available as a fallback.
Target-aware paste is macOS-only. The Accessibility-verified injection into the focused text field is genuinely magic on Mac. On Windows and Linux you still get dictation, but the transcript lands in the clipboard rather than auto-typing into the focused field. Cross-platform parity is on the roadmap.
MLX requires Apple Silicon. Intel Macs get the same engines but on CPU or PyTorch fallback — usable for Kokoro and LuxTTS, painful for Qwen3-TTS 1.7B and TADA 3B.
No streaming TTS yet. All seven engines generate the whole utterance before playback starts. For short replies (1–3 sentences) you don't notice, but for a long agent monologue you'll hear the pill spin for a few seconds before audio begins. The roadmap mentions streaming output as a v0.6 target.
First-run model downloads are heavy. Switching between all seven engines pulls down 15–20 GB of weights from Hugging Face. Plan accordingly on tethered or capped connections, or pre-pull on a fast link.
Issue tracker is closed. As noted above, bug reports route through Discord rather than GitHub Issues right now.

None of these are dealbreakers, and most of them are explicitly tracked in the roadmap. But it's worth setting expectations before you uninstall ElevenLabs.

How It Compares

Feature	Voicebox	ElevenLabs	WisprFlow	Superwhisper	OpenAI TTS API
Cost	Free	$22+/mo	$12/mo	$8.49/mo	Per-token
Voice cloning	✅ 7 engines	✅	❌	❌	❌
Global dictation hotkey	✅	❌	✅	✅	❌
Runs offline	✅	❌	❌	✅	❌
MCP server for agents	✅ built-in	❌	❌	❌	❌
Languages	23	32	~50	~50	~50
Voice personalities (local LLM)	✅	❌	❌	❌	❌
License	MIT	Proprietary	Proprietary	Proprietary	Proprietary

The honest read: ElevenLabs still wins on the absolute top-tier production quality you'd want for a Spotify podcast intro, and WisprFlow has a smoother first-run onboarding for non-technical users. Voicebox wins everywhere else — and it's free, local, and the only one with an MCP server for your agents.

If you're already paying both of those subscriptions, the math is: download Voicebox, try it for a week, and if it covers 90% of what you're using ElevenLabs and WisprFlow for, that's $34/month back in your pocket.

Install in 60 Seconds

# macOS (Apple Silicon)
open https://voicebox.sh/download/mac-arm

# macOS (Intel)
open https://voicebox.sh/download/mac-intel

# Windows
start https://voicebox.sh/download/windows

# Docker (any platform)
git clone https://github.com/jamiepine/voicebox
cd voicebox
docker compose up

First launch will walk you through the macOS permission grants (Accessibility + Input Monitoring) for global dictation. Skip that and you still get the studio, you just lose the system-wide hotkey.

FAQ

How is Voicebox different from VoxCPM2 or Chatterbox by themselves?

VoxCPM2 and Chatterbox are TTS models. Voicebox is the application layer on top of seven different TTS models (including Chatterbox) plus Whisper STT, a global hotkey daemon, an MCP server, voice cloning, an effects rack, and a stories editor. You could in principle assemble Voicebox yourself from those underlying models, but you'd be writing the Tauri shell, the MCP integration, and the macOS Accessibility plumbing yourself. We covered VoxCPM2 in detail in a previous post.

Is the voice cloning quality actually comparable to ElevenLabs?

For zero-shot cloning from a 5–30 second sample: close but not identical. Qwen3-TTS 1.7B and Chatterbox Multilingual produce voices that are recognisably the source speaker with natural prosody. ElevenLabs Voice Cloning ("Professional" tier) still wins on the very last layer of polish — micro-breaths, natural pauses, occasional vocal fry. For 95% of use cases (agent voices, narration, dictation playback, internal demos) the gap doesn't matter. For commercial voice-acting work it still does.

Will Voicebox work with my coding agent if it doesn't speak MCP?

Yes. Every primitive is also exposed over HTTP (POST /speak, POST /generate, POST /transcribe), so anything that can issue an HTTP request can use Voicebox. An OpenAI-compatible /v1/audio/speech endpoint is on the roadmap for clients that expect that API shape.

Does it work with my existing Whisper setup?

Voicebox bundles its own Whisper (running on MLX for Apple Silicon, PyTorch elsewhere) and downloads the model sizes you select in Settings. It doesn't talk to an existing whisper.cpp daemon or Faster-Whisper install. If you want to point Voicebox at an external STT engine, you'd have to route the audio yourself.

Can I run Voicebox on a remote server and use the dictation from my laptop?

The TTS and STT primitives are network-callable, so yes for the API surface. The global dictation hotkey and macOS auto-paste are local-by-design — they need to inject into your local OS, so those features only work when the Voicebox app is running on the same machine as your keyboard.

What about safety / consent for cloned voices?

This is the same trade-off as every open-source voice-cloning model: the project enforces no watermarking or consent verification at the engine level. The responsibility lives with you. For production use, follow the FTC guidance on AI voice cloning and get explicit consent from anyone whose voice you clone.

Verdict

If you've been waiting for the open-source ElevenLabs + WisprFlow killer that's actually shippable to non-technical users, this is it. Voicebox is the cleanest single integration of voice I/O for AI agents that exists today: seven good-to-excellent TTS engines, real voice cloning, global dictation, an MCP server, voice personalities, and a Tauri-fast UI — all running on your machine, MIT-licensed, free.

Two subscriptions, gone. Plus a primitive — agent voices via MCP — that the SaaS incumbents don't even have a roadmap for yet.

Download Voicebox: github.com/jamiepine/voicebox · voicebox.sh