Top Comments
https://uklkyvetsjf7qt-80.proxy.runpod.net
./build/bin/llama-server \
-m ../Ternary-Bonsai-8B-Q2_0.gguf \
-ngl 999 \
--flash-attn on \
--host 0.0.0.0 \
--port 80 \
--ctx-size 65500 \
--batch-size 512 \
--ubatch-size 512 \
--parallel 5 \
--cont-batching \
--threads 8 \
--threads-batch 8 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--log-colors on
# llama.cpp is forked one: https://github.com/PrismML-Eng/llama.cpp.git# The server can serve 5 parallel request, with each request capped at around `13K` tokens...
# A bit of of benchmarks I did:
# 1. Input: 1001 tokens, ttfs: 0.3 second, outputs: 1618 tokens ~140t/s
# 2. Input: 9708 tokens, ttfs: 2.4 second, outputs: 2562 tokens at ~106t/s
# Vram usage was consistently at ~7GiB.
> https://huggingface.co/prism-ml/Ternary-Bonsai-8B-gguf/resol...
in my results, accuracy-wise Ternary-Bonsai-8B is on par with Qwen3.5-4B. But in accuracy-per-byte, bonsai is the clear winner:
=> Ternary-Bonsai-1.7B achieved 65.1% from 462 MiB, beating Qwen3.5-0.8B by 12 points while being ~5% smaller on disk. => Ternary-Bonsai-4B is the accuracy-per-byte winner above 1 GiB. 83.0% from only 1.1 GiB, within 2 points of Qwen3.5-4B at 40% of the weight size.
they show strong promise on edge devices and where disk space is limited. I think this lab is worth watching.
(I've been reading the MMLU-Redux questions for electrical engineering. They're very funny. Fifty years ago they might have been relevant. The references to the Intel 8085 date this to the mid-1970s. Moving coil meters were still a big thing back then. Ward-Leonard drives still drove some elevators and naval guns. This is supposed to be the hand-curated version of the questions. Where do they get this stuff? Old exams?)
[1] https://github.com/aryopg/mmlu-redux/blob/main/outputs/multi...
Why aren't they comparing to 2/3/4 bit quants?
I also have yet to see any of these at a larger scale. For example, can you try one of these at 100 billion parameters?