Overfitted a 900KB Transformer to Compress a 100MB CSV into 7MB

spidy__

Author

Top Comments

whacked_new Jun 26

Somewhat related is stavros's method to compress 500KB to something like 50 bytes https://www.stavros.io/posts/compressing-images-with-stable-...

main drawback is that it's not lossless ;-)

but this is great. I hope this actually becomes a format that wraps the weights and transformer module (maybe this can be NAS-optimized too?). Maybe it would even work for video?

It's like calling gzip, but instead of a compression level you choose a Kolmogorov complexity level

userbinator Jun 26

Fabrice Bellard may have been the first to do this, 7 years ago: https://news.ycombinator.com/item?id=27244004

SubiculumCode Jun 26

What do those compress to with conventional approaches? For comparison.

I am curious. A classic machine learning ensemble approach is to overfit a collection of small models then bag them (e.g. voting) allowing the models to generalize.

I'm sure someone's tried to overfit a bunch of transformers for compression like this, then bag them to see how well it does?
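One way to picture the ensemble idea in compression terms is to average the next-byte probabilities of several predictors and measure how many bits an ideal arithmetic coder would spend. The sketch below is only an illustration under loud assumptions: `Order0`, `Order1`, and `ideal_bits` are hypothetical names, the two models are simple adaptive byte counters standing in for small overfitted transformers, and the CSV-like data is synthetic. It is not the method used in the posted project.

```python
import math
from collections import defaultdict


class Order0:
    """Hypothetical adaptive byte-frequency model with add-one smoothing."""

    def __init__(self):
        self.counts = [1] * 256
        self.total = 256

    def prob(self, context, symbol):
        return self.counts[symbol] / self.total

    def update(self, context, symbol):
        self.counts[symbol] += 1
        self.total += 1


class Order1:
    """Hypothetical adaptive model conditioned on the previous byte."""

    def __init__(self):
        self.counts = defaultdict(lambda: [1] * 256)
        self.totals = defaultdict(lambda: 256)

    def prob(self, context, symbol):
        return self.counts[context][symbol] / self.totals[context]

    def update(self, context, symbol):
        self.counts[context][symbol] += 1
        self.totals[context] += 1


def ideal_bits(data, models):
    """Sum of -log2(p) per byte, where p is the average of the models' predictions."""
    bits = 0.0
    prev = 0  # context used before the first byte
    for b in data:
        p = sum(m.prob(prev, b) for m in models) / len(models)
        bits += -math.log2(p)
        for m in models:
            m.update(prev, b)
        prev = b
    return bits


# Synthetic CSV-like bytes (an assumption, not the dataset from the post).
data = b"id,fare,tip\n" + b"".join(
    b"%d,%d.50,%d.00\n" % (i, i % 40, i % 5) for i in range(2000)
)

bits_o0 = ideal_bits(data, [Order0()])
bits_o1 = ideal_bits(data, [Order1()])
bits_mix = ideal_bits(data, [Order0(), Order1()])

for name, bits in [("order-0", bits_o0), ("order-1", bits_o1), ("averaged", bits_mix)]:
    print(f"{name}: {bits / 8:.0f} bytes (ideal)")
```

Because the averaged probability is never less than half of the better model's probability, plain averaging costs at most one extra bit per byte over the best single model. Compressors in the PAQ family go further and learn the mixing weights (context mixing) rather than averaging uniformly.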

jmspring Jun 26

The model is the important part; a Huffman code, adaptive Huffman, or other sort of encoder would do much better on a dataset when driven by that model. You also need the model to decode. On a dataset of sufficient size, the cost of embedding the model can be offset by the benefit of its memorization of the file.

A non-general compression algorithm (model - I don't mean a distinct llm, but "modeling data") targeted at a specific dataset will always do better than a general algorithm.

The reason I said the "encoder" doesn't matter: arithmetic coding, for the data it is presented, will beat Huffman/adaptive Huffman every day, but the model is where the real "compression" comes into play.

I've implemented enough "coders" over the years, including arithmetic for both commercial and research purposes (was a student of Glen Langdon).
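The gap between Huffman and arithmetic coding described above shows up most clearly on a skewed distribution, where Huffman must spend at least one whole bit per symbol while an arithmetic coder can approach the entropy. The following is a minimal sketch, assuming a made-up four-symbol distribution; `huffman_code_lengths` is a hypothetical helper written for this illustration.

```python
import heapq
import math


def huffman_code_lengths(probs):
    """Return the Huffman code length in bits for each symbol index."""
    # Heap entries: (probability, unique tie-breaker, symbol indices in the subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    next_id = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        # Every symbol under the merged node gets one more bit.
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, next_id, s1 + s2))
        next_id += 1
    return lengths


# Assumed example distribution: one symbol dominates, as a good model often predicts.
probs = [0.90, 0.05, 0.03, 0.02]

lengths = huffman_code_lengths(probs)
huffman_bits = sum(p * n for p, n in zip(probs, lengths))
entropy_bits = -sum(p * math.log2(p) for p in probs)

print("Huffman code lengths:", lengths)
print(f"Huffman average: {huffman_bits:.3f} bits/symbol")
print(f"Entropy (arithmetic coding limit): {entropy_bits:.3f} bits/symbol")
```

With these probabilities the Huffman code averages 1.15 bits per symbol against an entropy of roughly 0.62 bits. Both numbers come entirely from `probs`, which is the commenter's point: a sharper model shrinks both, and the coder only decides how close to the entropy you get.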

wildstrawberry Jun 26

Three questions:

1. How much was AI used to generate documentation for this project?

2. The 100MB CSV data sources are not provided in the repo so it doesn't seem possible to reproduce your results. The enwik9 dataset says it is a "slice" of the larger data set, and there are many NYC taxi trip record datasets that exist. Can you provide the datasets used to generate your results?

3. I am surprised to see performance comparisons only between your transformer and WinZip. What were your results when comparing your transformer to more modern approaches like LZMA2 (level 9), BZIP2 and ZPAQ (max effort)?
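For the baselines asked about in question 3, Python's standard library is enough to get DEFLATE, bzip2, and LZMA2 (via the `.xz` container) numbers. This is a rough sketch only: `baseline_sizes` is a hypothetical helper, the sample bytes are synthetic because the thread does not provide the original CSV, and ZPAQ is not covered since it has no standard-library binding.

```python
import bz2
import lzma
import zlib


def baseline_sizes(data: bytes) -> dict:
    """Compressed size in bytes under several general-purpose compressors."""
    return {
        "raw": len(data),
        "zlib-9": len(zlib.compress(data, 9)),
        "bz2-9": len(bz2.compress(data, compresslevel=9)),
        # The default .xz container uses the LZMA2 filter.
        "xz-9e": len(lzma.compress(data, preset=9 | lzma.PRESET_EXTREME)),
    }


# Synthetic stand-in data (an assumption); replace with the bytes of the real CSV.
sample = b"id,fare,tip\n" + b"".join(
    b"%d,%d.50,%d.00\n" % (i, i % 40, i % 5) for i in range(5000)
)

sizes = baseline_sizes(sample)
for name, size in sizes.items():
    print(f"{name:>7}: {size} bytes")
```

Running the same function on the actual 100MB file would give numbers directly comparable to the 7MB figure in the title, provided the model's weights are counted in that total.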

purple-leafy Jun 26

Dumb question: can you train a model to predict the next byte of ANOTHER MODEL?

So apply this same logic to compressing a bigger model within a smaller model

I know this is absolutely regarded, but humour me please

jxmorris12 Jun 26

Lo and behold, a nice arithmetic coding implementation that wasn't written by an LLM! A sight for sore eyes – a treat, even. Looks like it was written by someone else though.

Check it out: https://github.com/samyak112/pym-particles/blob/main/arithme...

7373737373 Jun 23

What does it compress the full 1GB file to? http://prize.hutter1.net/

Source

news.ycombinator.com

Author

spidy__

Posted

June 23, 2026 at 01:11 PM
