Data Compression Explained (2012) (mattmahoney.net)

174 points by mtdewcmu 4 days ago

Bnjoroge 34 minutes ago

I wonder what if anything has changed ever since this article. Is llm-based compression more mainstream?

rurban 13 hours ago

The leader boards are from the pre Fabrice Bellard days, btw. Neural network modeling helped finding better patterns in text.

Also, you could say the same for the related data search problem. How to prepare data, so that it can most efficiently searched. Smallest encoding vs fastest search. Databases are mostly very, very stupid compared to more data-specific tuned algorithms. Like factor 1000 slower and bigger.

TiredOfLife 8 hours ago

https://mattmahoney.net/dc/text.html is the one with Bellard

dang 13 hours ago

Related:

Data Compression Explained (2011) - https://news.ycombinator.com/item?id=40631931 - June 2024 (1 comment)

Data Compression Explained - https://news.ycombinator.com/item?id=5931493 - June 2013 (14 comments)

Data Compression Explained by Matt Mahoney - https://news.ycombinator.com/item?id=1179242 - March 2010 (1 comment)

usernametaken29 11 hours ago

Isn’t the idea of AI precisely to find universal compression from arbitrary input data, at least with LLMs?

adrian_b 8 hours ago

Which is why in this book the title of a paragraph is "Compression is an artificial intelligence problem".

However, I believe that is less useful to think that AI "finds an universal compression", than to think that the training of an AI model has the purpose to find a specific lossy data compression method, which is close to optimal for the input data that constitutes the training data set.

One could consider the training algorithm as a universal lossy data compression method, but this view is not useful in practice, because unlike a traditional lossy data compression algorithm, used e.g. for movie or picture compression, which you can use every day to compress many data sets, even in real time on an incoming data stream, the training of an AI model is a very long and expensive operation, which can be done only infrequently and which makes sense only for special important datasets, from which data will be frequently extracted by querying (i.e. AI inference) for a long time, to make worthwhile the compression, i.e. training, cost.

Moreover, for the best compression results the training of a new improved AI model does not consist only in determining the values of parameters (weights) of a fixed inference algorithm, but the structure of the inference algorithms is also tweaked for each new generation of models.

This is an additional reason that makes impractical to think about training as a universal compression algorithm (instead of a method for searching specific compression algorithms, which work for a given training set), because it is not a fixed algorithm, but a family of algorithms that evolves continuously, at least for now.

hyperpape 6 hours ago

I think this is an analogy that's been taken far too far. The output of intelligence just isn't compression, that's memorization. The role of intelligence is to generate novelty.

It's true that LLMs do something that looks very compression like in their weights, but it is lossy, and it has to be--if you're not lossy, you've overfitted the corpus, and that's bad. Post-training takes this even further, because you're not doing anything that looks like training on a specific corpus, you're exploring in a wider space of text. That text doesn't even concretely exist until you start exploring it.

I'm sure there must be a serious attempt to pursue this analogy that isn't just handwaving, but I haven't seen it.

miki123211 6 hours ago

LLM compression doesn't necessarily have to be lossy.

You can use the fact that LLMs predict P(next token | existing tokens) to losslessly and efficiently compress arbitrary token sequences. This idea is closely related to arithmetic coding.

hyperpape 5 minutes ago

fuglede_ 5 hours ago

3b1b just started a series under this slogan. Only one episode so far but chances are it'll be a good one; https://www.3blue1brown.com/

briansm 10 hours ago

I think so, specifically lossy compression though.

A modern version of the book would include an extra section in the 'Lossy compression' chapter - 'Text' (alongside Images/Video/Audio) that would discuss LLM's.

eru 9 hours ago

No, it's not for lossy compression only.

An LLM can give you a probability distribution for the next token. You can pair that with arithmetic coding to get a lossless compression/decompression algorithm. See https://en.wikipedia.org/wiki/Arithmetic_coding

adrian_b 8 hours ago

eru 9 hours ago

No, LLMs only do this for language. They don't try to do this for arbitrary data.

woadwarrior01 9 hours ago

There are many approaches around this, the simplest being to treat bytes as tokens (cf: Google's ByT5[1]). Also, BLT[2] from Meta and ByteFormer[3] from Apple.

[1]: https://arxiv.org/abs/2105.13626

[2]: https://arxiv.org/abs/2412.09871

[3]: https://arxiv.org/abs/2306.00238

energy123 7 hours ago

Transformers do this for any stream of tokens, those tokens can map to anything you want, and you will get lossy compression. Text produced by humans just happens to be dense, available, and a useful prior, but it is not intrinsically required. See 3D vision transformers for example.

jeremyjh 5 hours ago

agumonkey 4 hours ago

semantic compression, that's how i see it

firesteelrain 4 hours ago

Matt Mahoney is one of the best when it comes to compression. He is retired now.

blastro an hour ago

i was told that middle-out was best

brownpoints 9 hours ago

I say transformers are the best compression systems

wps 12 hours ago

This is the guy who created Zpaq btw. Super interesting but niche backup/archive software.

blobbers 12 hours ago

Matt is a great guy to explain this kind of stuff. He's very helpful.

phrotoma 4 hours ago

Yeah this was way more interesting than I anticipated.

NooneAtAll3 12 hours ago

does anyone have any sources to read about ai-based compression?

I remember hearing a lot about "compression is a lot about prediction", but I don't remember reading any practical result

Karliss 11 hours ago

It can and has been done just not very practical. Having a dozen GB language model just to squeeze out few more percent on plaintext compression which already compresses well and is tiny in comparison of images or video is not worth it outside benchmarks. Even superior traditional conpression algorithms are often not used due to insufficient software support. Multigabyte decompressor as big as rest of your OS installation is not practical to distribute or standardize. It would also take a lot of memory at runtime for decompressing thus shadowing the efficiency gains in everyday use. Only if you have huge archival scale of data it might be worth the gains. But for long term archival fragile formats which depend on huge arbitrary extra knowledge isnt a good idea. I am not quite sure if ai based compression would make it more robust by allowing to fix corruption based on context or make it worse by having single bitflip produce completely opposite but still plausible looking text. At least with traditional compression its usually obvious when corruption causes gibberish. And then you have problem of versioning, you need to have exactly the same version of dozen GB model for decompression as was used for compression. Just one of them is questionable now imagine having to store few dozen of them. Most computers have code for supporting at least half a dozen compression formats, and many of those are parametrized allowing single algorithm to handle multiple varations of the compression scheme, and then many apps bundle their own copies of compression library.

eru 9 hours ago

I mostly agree, however:

> But for long term archival fragile formats which depend on huge arbitrary extra knowledge isnt a good idea.

This doesn't need to be a problem: you can and should layer an error correcting code on top.

sltkr 25 minutes ago

compression = prediction + entropy coding was already an insight from Claude Shannon in the 1950s

Since LLM are inherently token predictors, that makes using them for losless compression almost trivial. For something close to the state of the art see e.g. Fabrice Bellard (of course) ts_zip: https://bellard.org/ts_zip/

I think some of the confusion comes from the fact that there is a pretty big difference between the techniques employed by compressors that optimize compression ratio at the cost of nearly everything else, like ts_zip above, and practical tools that intend to balance compression ratio with limitation on CPU speed / memory, like zstd.

When optimizing for compression ratio, the prediction + entropy coding paradigm dominates. Practical tools, even modern ones like zstd, are mostly based around sliding window compression à la LZ77 (unzip/deflate), with the main selling point of more modern tools being that they scale up to larger window sizes and run really really fast. Some of these (like LZO) don't even have an entropy coding step to save time. zstd has both Huffman coding and FSE: Huffman coding is suboptimal but presumably it's an option because it's faster, and on lower compression levels it's preferable to be fast.

Anyway, the bottom line is: don't get confused between the state of the art in terms of compression ratio, and practical tools. Those are quite different things.