75 points · 21 comments · by GreenGames (Author)

Top Comments

Aurornis · Apr 20
This is a Claude-code generated repo that implements some ideas from research papers. If you follow this space, every paper release spawns tens or hundreds of vibecoded repos like this that get spammed to Reddit, Hacker News, and other sites.

It's generally best to overlook the vibecoded repos and go closer to the source for up-to-date information. In this case, z-lab already showed Qwen3.5-27B with DFlash last month: https://huggingface.co/z-lab/Qwen3.5-27B-DFlash

This repo is an example of what you get if you point Claude Code at the upstream repo and have it iterate with some other objective (loading GGUF). They also included DDTree in there somewhere.

You also need to look closely at the claims. A classic trick in these repos is to cherry-pick numbers that make the work look extraordinary until you start reading the details. From my quick read, this repo is using Q4 quantization on the KV cache, which does not produce good results. Someone who reads everything in detail might find more tricks. This is par for the course with these demo repos, because the goal is to impress casual viewers with big numbers.

I'm trying to find where the 207 tok/s figure comes from, but it only appears in the headline claim. If you read deeper, the real numbers are half that or less.

There are also several (possibly vibecoded, I haven't checked) draft PRs and forks to use these techniques on upstream llama.cpp that would be much more useful for experimenting. One example I picked at random: https://github.com/ggml-org/llama.cpp/pull/22105

dirtikiti · Apr 20
"Local AI should be a default, not a privilege: private data, no per-token bill, no vendor lock-in. The hardware to run capable models already sits on desks. The software to run those chips well doesn't."

So figure out how to run it on Vulkan instead of requiring the user to be locked into expensive CUDA cards.

lostmsu · Apr 20
No, you did not. You got 207 tok/s on an RTX 3090 with speculative decoding, which, generally speaking, is not the same quality as serving the model without it.

Greedy-only decoding is even worse. There's a reason every public model comes with suggested sampling parameters: when you don't use them, output tends to degrade severely. In your case, simply running a 14B model on the same hardware with the tools you compare against would probably be both faster and produce higher-quality output.

SilverElfin · Apr 20
Why did they focus on that particular graphics card rather than others, or the common laptops developers actually use?
GreenGames · Apr 20
We built a standalone C++/ggml speculative decoder for Qwen3.5-27B Q4_K_M with a DFlash block-diffusion draft.

207.6 tok/s peak (5.46x over AR); the HE 10-prompt bench averages 129.5 tok/s at DDTree budget=22 on a single RTX 3090 (24 GB): 3.43x over autoregressive and 2.8x over the best public SGLang AWQ number.

TL;DR
- Peak 207.6 tok/s DFlash vs 38.0 tok/s AR (5.46x). HE bench: 129.5 tok/s mean at DDTree budget=22.
- 3.43x over autoregressive Q4_K_M baseline (37.78 tok/s).
- 2.8x vs SGLang AWQ reference (46.6 tok/s) on the same RTX 3090.
- 128K context fits on 24 GB: Q4_0 KV + rolling 4096-slot target feature buffer. 134.78 tok/s at ctx=131072.
- Only ggml, never linking libllama: ~2000 LOC C++/CUDA in libdflash27b.a around ggml_gated_delta_net.

Why the experiment exists

Qwen3.5-27B is a hybrid model: every 4th layer is full softmax attention, the rest (48 of 64) are Gated DeltaNet, so an SSM state cache sits alongside the KV cache. That combo doesn't have a good single-3090 decode path today:
- llama.cpp has the GGUF loader and ggml_gated_delta_net, but no DFlash speculative decoding.
- vLLM / SGLang ship z-lab's DFlash integration, but only in BF16 (54 GB, doesn't fit in 24 GB).
- The AWQ target on SGLang runs plain AR at 46.6 tok/s, but can't host a BF16 draft + DDTree state in 24 GB.
- z-lab's reference benchmarks run BF16 on a B200, 54+ GB class.

We wanted the fastest single-3090 decode on a 24 GB card. The answer: port only the graph glue to ggml, keep the existing DeltaNet kernel, run a DFlash block-diffusion draft with a DDTree verifier, and compress KV to Q4_0 for long context.

From autoregressive to DDTree

Same 10-prompt HE bench, n_gen=256, Q4_K_M target, BF16 draft. AL = average accept length. The DDTree paper reports +35-42% over chain DFlash on pure-attention Qwen3 variants; on our hybrid Q4_K_M/RTX 3090 combo we see +15% over chain. The gap comes from Q4 quantization flattening the draft softmax, partially patched with a chain pre-seed in build_ddtree. We're draft-ceiling bound, not verify-memory bound: a bigger tree won't help, only a better draft will.

Key wins
- f16 intermediate cache: half the bandwidth, +5% at the same tree budget. Bit-identical to AR at 40 tokens.
- Persist-write kernel (ggml_gated_delta_net_tree_persist): skips a 9 ms ggml_cpy per step, +11%.
- target_feat compaction after sibling accept: unlocked real tree rescue on 9/10 prompts.
- extract_draft_topk reverse bug: sort_heap + cmp_greater already produces descending order; an extra std::reverse was sending the worst candidate to the tree root. One-line fix.
- verify_logits_buf overflow: sized vocab*q_len, but DDTree reads vocab*(budget+1) past budget 15. Silent memory corruption. One-line size fix.

128K context on 24 GB

Flash attention in ggml-cuda supports Q4_0 K+V natively, so KV compression is just a ggml_cpy with the F32->Q4_0 quantizer on write: 8x over f16. Combined with a rolling 4096-slot target_feat ring, target_feat shrinks from 6.6 GB to 0.2 GB at 128K. Tradeoff: Q4_0 KV costs ~3% quality on HE (AL 8.56 -> 8.33) at short context, but it's dramatically better at long ones, and the only thing that lets 128K fit in 24 GB.

Prefill

Short prompts (<=2048 tok) use PREFILL_UBATCH=16, matching the DFlash block size. Long prompts (>2048 tok) auto-bump to PREFILL_UBATCH=192. 13K-token prefill: 40.9 s -> 15.07 s (2.7x, ~913 tok/s).

What comes next
- Daemon mode: keep the model resident; first-token latency drops from 10 s to milliseconds.
- Temperature / top-k sampling in verify. Currently greedy-only.
- Q5_K_M / Q6_K: better quants should recover most of the ~30-point accept gap vs BF16.
- Full llama.cpp integration: qwen35 arch, llama-speculative-dflash.cpp wiring.
- Metal/Vulkan: not planned. CUDA only; anyone who wants Metal can fork.

As soon as Qwen3.6-27B comes out, we'll do the same for it. Repo in the first comment (open source, MIT).


Source: github.com · Posted by GreenGames, April 20, 2026 at 06:46 PM

