Overfitted a 900KB Transformer to Compress a 100MB CSV into 7MB
Top Comments
main drawback is that it's not lossless ;-)
but this is great. I hope this actually becomes a format that wraps the weights and the transformer module (maybe it could be NAS-optimized too?). Maybe it would even work for video?
It's like calling gzip, but instead of a compression level you choose a Kolmogorov complexity level
I am curious. A classic machine learning ensemble approach is to overfit a collection of small models then bag them (e.g. voting) allowing the models to generalize.
I'm sure someone has tried overfitting a bunch of transformers for compression like this and then bagging them. How well does that do?
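A toy sketch of what "bagging" could mean for a compressor: average the symbol probabilities predicted by several small models and measure the resulting code length. Everything here is an assumption for illustration (order-0 frequency models instead of transformers, the helper names, the synthetic data); it is not the linked project's method. The one thing it does show reliably is that the averaged model never costs more bits than the average of the individual models, which follows from Jensen's inequality.

```python
import math
import random
from collections import Counter

def fit_order0(sample, alphabet):
    """Hypothetical tiny 'model': smoothed symbol frequencies from a sample."""
    counts = Counter(sample)
    total = len(sample) + len(alphabet)  # add-one (Laplace) smoothing
    return {s: (counts[s] + 1) / total for s in alphabet}

def code_length_bits(data, prob):
    """Ideal code length an entropy coder would approach under this model."""
    return -sum(math.log2(prob[s]) for s in data)

def bagged_model(models):
    """Bagging by averaging the predicted probabilities of all models."""
    alphabet = models[0].keys()
    return {s: sum(m[s] for m in models) / len(models) for s in alphabet}

rng = random.Random(0)
data = "".join(rng.choices("abcd", weights=[8, 4, 2, 1], k=2000))
alphabet = sorted(set(data))

# Each model is fit on its own small bootstrap sample of the data.
models = [fit_order0(rng.choices(data, k=50), alphabet) for _ in range(10)]

individual = [code_length_bits(data, m) for m in models]
bagged = code_length_bits(data, bagged_model(models))

print(f"mean individual: {sum(individual) / len(individual):.0f} bits")
print(f"bagged:          {bagged:.0f} bits")
```

The catch for compression is that the decoder needs every model in the ensemble, so their combined size counts against the output.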
A non-general compression algorithm targeted at a specific dataset will always do better on that dataset than a general algorithm (by "model" I don't mean a distinct LLM, but "modeling data").
The reason I said the "encoder" doesn't matter: arithmetic coding, for the data it is presented, will beat Huffman/adaptive Huffman every day, but the model is where the real "compression" comes into play.
I've implemented enough "coders" over the years, including arithmetic for both commercial and research purposes (was a student of Glen Langdon).
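To put rough numbers on the coder-versus-model point, here is a minimal sketch that compares a Huffman code with the ideal code length an arithmetic coder approaches, first under an order-0 model and then under an order-1 model. The function names and the toy input are assumptions for illustration, and the order-1 figure ignores the cost of the first symbol and of transmitting the model itself.

```python
import heapq
import math
from collections import Counter

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def huffman_bits(data):
    """Total payload bits when coding data with a static Huffman code."""
    freqs = Counter(data)
    lengths = huffman_code_lengths(freqs)
    return sum(freqs[s] * lengths[s] for s in freqs)

def ideal_bits_order0(data):
    """Entropy bound under an order-0 model (symbols treated independently)."""
    freqs = Counter(data)
    n = len(data)
    return -sum(c * math.log2(c / n) for c in freqs.values())

def ideal_bits_order1(data):
    """Entropy bound when the model conditions on the previous symbol."""
    pairs = Counter(zip(data, data[1:]))
    contexts = Counter(data[:-1])
    return -sum(c * math.log2(c / contexts[a]) for (a, _), c in pairs.items())

data = "aaaaaaaaab" * 100  # 1000 symbols, heavily skewed toward "a"
print("huffman:        ", huffman_bits(data), "bits")
print("order-0 ideal:  ", round(ideal_bits_order0(data)), "bits")
print("order-1 ideal:  ", round(ideal_bits_order1(data)), "bits")
```

Huffman cannot spend less than one bit per symbol, so on a two-symbol alphabet it is stuck at 1000 bits here, while a fractional-bit coder is not. The further drop from order-0 to order-1 comes entirely from the model, with the coder unchanged.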
1. How much was AI used to generate documentation for this project?
2. The 100MB CSV data sources are not provided in the repo so it doesn't seem possible to reproduce your results. The enwik9 dataset says it is a "slice" of the larger data set, and there are many NYC taxi trip record datasets that exist. Can you provide the datasets used to generate your results?
3. I am surprised to see performance comparisons only between your transformer and WinZip. What were your results when comparing your transformer to more modern approaches like LZMA2 (level 9), bzip2, and ZPAQ (max effort)?
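For anyone who wants the baselines from point 3, a minimal sketch using only the Python standard library might look like the following. The `baseline_sizes` helper and the synthetic CSV are assumptions for illustration, not the project's data. ZPAQ is not in the standard library, so it is left out and would need its own tool.

```python
import bz2
import lzma
import zlib

def baseline_sizes(data: bytes) -> dict:
    """Compressed size in bytes under a few general-purpose codecs."""
    return {
        "raw": len(data),
        "zlib-9": len(zlib.compress(data, 9)),
        "bz2-9": len(bz2.compress(data, compresslevel=9)),
        # The default .xz container uses the LZMA2 filter.
        "xz-9e": len(lzma.compress(data, preset=9 | lzma.PRESET_EXTREME)),
    }

# Synthetic stand-in for a CSV; swap in the bytes of the real file to compare.
rows = [f"{i},{i % 7},{(i * 37) % 1000 / 10:.1f}\n" for i in range(20000)]
sample = ("id,zone,fare\n" + "".join(rows)).encode("utf-8")

sizes = baseline_sizes(sample)
for name, size in sizes.items():
    print(f"{name:8s} {size:9d} bytes  ({size / sizes['raw']:.1%} of raw)")
```

For a fair comparison with the transformer, the model weights and decoder would have to be counted in its output size as well.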
So apply this same logic to compressing a bigger model within a smaller model
I know this is absolutely regarded, but humour me please
Check it out: https://github.com/samyak112/pym-particles/blob/main/arithme...