88 points | 20 comments | by nnx

Top Comments

freakynit · Apr 21
Open access for the next 5 hours (Ternary-Bonsai-8B-Q2_0.gguf, running on RTX 3090), or until the server crashes or this spot instance gets taken away :) =>

https://uklkyvetsjf7qt-80.proxy.runpod.net

    ./build/bin/llama-server \
     -m ../Ternary-Bonsai-8B-Q2_0.gguf \
     -ngl 999 \
     --flash-attn on \
     --host 0.0.0.0 \
     --port 80 \
     --ctx-size 65500 \
     --batch-size 512 \
     --ubatch-size 512 \
     --parallel 5 \
     --cont-batching \
     --threads 8 \
     --threads-batch 8 \
     --cache-type-k q8_0 \
     --cache-type-v q8_0 \
     --log-colors on
The llama.cpp build is a fork: https://github.com/PrismML-Eng/llama.cpp.git

The server can serve 5 parallel requests, with each request capped at around 13K tokens.

A few benchmarks I ran:

1. Input: 1001 tokens, ttfs: 0.3 s, output: 1618 tokens at ~140 t/s

2. Input: 9708 tokens, ttfs: 2.4 s, output: 2562 tokens at ~106 t/s

VRAM usage was consistently at ~7 GiB.

> https://huggingface.co/prism-ml/Ternary-Bonsai-8B-gguf/resol...
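Those ttfs numbers imply the prefill throughput, too. A rough sanity check (my own arithmetic, not from the comment), assuming time-to-first-token is dominated by prompt processing:

```python
# Approximate prefill throughput from the ttfs numbers quoted above.
# Assumption: ttfs is dominated by prompt processing, so
# prompt_tokens / ttfs is roughly the prefill speed.
def prefill_tps(prompt_tokens: int, ttfs_s: float) -> float:
    return prompt_tokens / ttfs_s

print(round(prefill_tps(1001, 0.3)))  # run 1: ~3337 t/s prefill
print(round(prefill_tps(9708, 2.4)))  # run 2: ~4045 t/s prefill
```

So prefill runs an order of magnitude faster than the ~106-140 t/s decode speeds reported, which is the usual pattern for batched prompt processing.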

armanj · Apr 21
I did a quick benchmark & compared it with Qwen3.5: https://github.com/ArmanJR/PrismML-Bonsai-vs-Qwen3.5-Benchma...

In my results, accuracy-wise Ternary-Bonsai-8B is on par with Qwen3.5-4B. But in accuracy per byte, Bonsai is the clear winner:

- Ternary-Bonsai-1.7B achieved 65.1% from 462 MiB, beating Qwen3.5-0.8B by 12 points while being ~5% smaller on disk.
- Ternary-Bonsai-4B is the accuracy-per-byte winner above 1 GiB: 83.0% from only 1.1 GiB, within 2 points of Qwen3.5-4B at 40% of the weight size.

They show strong promise on edge devices and wherever disk space is limited. I think this lab is worth watching.
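The accuracy-per-byte comparison can be checked from the numbers quoted above. A minimal sketch (the Qwen3.5-4B size and score here are inferred from the "40% of the weight size" and "within 2 points" phrasing, so treat them as approximations):

```python
def points_per_gib(accuracy_pct: float, size_gib: float) -> float:
    # Benchmark points per GiB of weights on disk.
    return accuracy_pct / size_gib

bonsai_4b = points_per_gib(83.0, 1.1)   # stated: 83.0% from 1.1 GiB
qwen_4b = points_per_gib(85.0, 2.75)    # inferred: ~85% from ~2.75 GiB

print(round(bonsai_4b, 1))  # ~75.5 points/GiB
print(round(qwen_4b, 1))    # ~30.9 points/GiB
```

Under those assumptions, the ternary model delivers well over twice the accuracy per GiB of the 16-bit baseline.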

usernametaken29 · Apr 21
I think it’s exciting to live in this quirky universe where we have simply accepted that our hardware does weird and nonlinear stuff, that this powers some math, and that’s why your transform function works. Many people thought quantisation was not viable to the extent we now see, but we clearly underestimated the effect of hardware on the actual nonlinearity of the models. Cool to see this pushed to the limits.
yodon · Apr 21
So excited to see this - the big advantage of 1.58 bits is there are no multiplications at inference time, so you can run them on radically simpler and cheaper hardware.
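The no-multiplication point is easy to see in code. A toy sketch of a ternary matrix-vector product (plain Python for clarity; real kernels pack the trits into bits): each weight in {-1, 0, +1} either adds the activation, subtracts it, or skips it.

```python
def ternary_matvec(W, x):
    # W: rows of ternary weights, each entry in {-1, 0, +1}
    # x: activation vector
    # Every weight adds, subtracts, or skips an activation --
    # the inner loop contains no multiplications at all.
    out = []
    for row in W:
        acc = 0.0
        for w, xj in zip(row, x):
            if w == 1:
                acc += xj
            elif w == -1:
                acc -= xj
        out.append(acc)
    return out

W = [[1, 0, -1], [0, 1, 1]]
x = [2.0, 3.0, 5.0]
print(ternary_matvec(W, x))  # matches W @ x: [-3.0, 8.0]
```

That is why 1.58-bit inference maps well onto simple adder-heavy hardware: the expensive multiply-accumulate units of a conventional matmul reduce to sign-controlled additions.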
Animats · Apr 21
This makes sense. The 1-bit model implies needing 2x as many neurons, because you need an extra level to invert. But the ternary model still has a sign, just really low resolution.

(I've been reading the MMLU-Redux questions for electrical engineering. They're very funny. Fifty years ago they might have been relevant. The references to the Intel 8085 date this to the mid-1970s. Moving coil meters were still a big thing back then. Ward-Leonard drives still drove some elevators and naval guns. This is supposed to be the hand-curated version of the questions. Where do they get this stuff? Old exams?)

[1] https://github.com/aryopg/mmlu-redux/blob/main/outputs/multi...

WatchDog · Apr 21
All of their benchmarks are against 16-bit models, right?

Why aren't they comparing to 2/3/4 bit quants?

mchusma · Apr 21
Ever since I saw the first one of these one-bit models made by Microsoft, I thought this was a fascinating route. I assume that in practice, this is less helpful than it seems, just because there's every economic incentive in the world for the big AI labs to produce small, powerful, fast models. None of them seem to be using this technique, so it's interesting, but I suspect it's not quite working.

I also have yet to see any of these at a larger scale. For example, can you try one of these at 100 billion parameters?

gbgarbeb · Apr 21
When do we get 1100B Kimi K2.6 in 160 GB of memory at 1.125 bpw?
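The arithmetic behind that wish checks out, at least for the raw weights (my own back-of-envelope, ignoring KV cache and activation overhead):

```python
# Would 1100B parameters at 1.125 bits per weight fit in 160 GB?
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    # Total weight storage in GiB.
    return n_params * bits_per_weight / 8 / 2**30

size = weight_gib(1100e9, 1.125)
print(round(size, 1))  # ~144.1 GiB of weights; 160 GB is ~149 GiB
```

So the weights alone would just fit, with only a few GiB of headroom for KV cache and runtime buffers.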
Source: prismml.com | Author: nnx | Posted: April 18, 2026 at 02:51 AM

