KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit
Top Comments
This line from the abstract got me really suspicious. Obviously a compression scheme that incorporates the entire sequence shouldn't get worse compared to a per element one as the length increases.
It is important to note that this paper is PURELY theoretical. I couldn't find much meat on the bone from a quick skim.
The single author, Gregory Magarshak, has only published one paper on arxiv before and appears to be a professor of business / music. I don't plan to give it more of a read hoping for something of value.
I don't understand this. The key and value vectors for any given layer + token are created by the model. By definition, they are exactly equal to the model's prediction of them!
Extreme KV cache compression is easy to get---you can get an infinite compression ratio by just regenerating the key and value vectors on every forward pass. The point of a KV cache is to reduce the amount of repeated computation during generation, though. Compression only helps if you have an efficient decompression algorithm.