An Ode to Bzip (purplesyringa.moe)

43 points by signa11 3 hours ago

saghm an hour ago

Early on the article mentions that xz and zstd have gotten more popular than bzip, and my admittedly naive understanding is that they're considered to have better tradeoffs in terms of compression time and overall space saved by compression. The performance section heavily discusses encoding performance of gzip and bzip, but unless I'm missing something, the only references to xz or zstd in that section are brief handwaving about the decoding times probably being similar.

My impression is that this article has a lot of technical insight into how bzip compares to gzip, but it fails to actually account for the real cause of the diminished popularity of bzip in favor of the non-gzip alternatives that it admits are the more popular choices in recent years.

fl0ki an hour ago

This seems as good a thread as any to mention that the gzhttp package in klauspost/compress for Go now supports zstd on both server handlers and client transports. Strangely this was added in a patch version instead of a minor version despite both expanding the API surface and changing default behavior.

https://github.com/klauspost/compress/releases/tag/v1.18.4

klauspost 19 minutes ago

About the versioning, glad you spotted it anyway. There isn't as much use of the gzhttp package compared to the other ones, so the bar is a bit higher for that one.

Also making good progress on getting a slimmer version of zstd into the stdlib and improving the stdlib deflate.

hexxagone an hour ago

Notice that bzip3 has close to nothing to do with bzip2. It is a different BWT implementation with a different entropy codec, from a different author (as noted in the GitHub description "better and stronger spiritual successor to BZip2").

Grom_PE 5 minutes ago

PPMd (of 7-zip) would beat Bzip2 in this use case.

pella 37 minutes ago

imho: the future is a specialized compressor optimized for your specific format. ( https://openzl.org/ , ... )

srean 23 minutes ago

That is an interesting link.

Does Gmail use a special codec for storing emails?

elophanto_agent 2 hours ago

bzip2 is the compression algorithm equivalent of that one coworker who does incredible work but nobody ever talks about. meanwhile gzip gets all the credit because it's "good enough"

kergonath an hour ago

Bzip2 is slow. That’s the main issue. Gzip is good enough and much faster. Also, the fact that you cannot get a valid bzip2 file by cat-ing 2 compressed files is not a deal breaker, but it is annoying.

nine_k an hour ago

Gzip is woefully old. Its only redeeming value is that it's already built into some old tools. Otherwise, use zstd, which is better and faster, both at compression and decompression. There's no reason to use gzip in anything new, except for backwards compatibility with something old.

sedatk 27 minutes ago

> the fact that you cannot get a valid bzip2 file by cat-ing 2 compressed files

TIL. Now that's why gzip has a file header! But, tar.gz compresses even better, that's probably why it hasn't caught on.

saidnooneever an hour ago

the catting issue might be more a problem with the bzip program's implementation than with the algorithm (it could expect an array of compressed files). that would only be impossible if the program cannot determine the length of the data from the file header, which again is technically not about the compression algo but rather the file format it's carried in.

that being said, speed is important for compression, so for systems like webservers etc it's an easy sell ofc. very strong point (and smarter implementation in programs) for gzip
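
One way to check the concatenation question without the command-line tools is Python's standard library. This is a minimal sketch, assuming Python 3.3 or later; it exercises Python's `gzip` and `bz2` modules only, so it says nothing definitive about how any particular `bzip2` binary behaves. The sample byte strings are made up for illustration.

```python
import bz2
import gzip

# Hypothetical sample inputs, standing in for two separate files.
part_a = b"hello, "
part_b = b"world\n"

# Compress each part on its own, then join the compressed bytes.
# At the byte level this is what `cat a.gz b.gz > joined.gz` produces.
joined_gz = gzip.compress(part_a) + gzip.compress(part_b)
joined_bz2 = bz2.compress(part_a) + bz2.compress(part_b)

# gzip.decompress reads every member in its input, and
# bz2.decompress (Python 3.3+) reads every concatenated stream.
gz_roundtrip = gzip.decompress(joined_gz)
bz2_roundtrip = bz2.decompress(joined_bz2)

print("gzip concatenation round-trips:", gz_roundtrip == part_a + part_b)
print("bz2 concatenation round-trips:", bz2_roundtrip == part_a + part_b)
```

If both checks pass, the container format itself does not rule out concatenation; whether a given tool accepts such input is a property of that tool.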

stefan_ an hour ago

bzip and gzip are both horrible, terribly slow. Wherever I see "gz" or "bz" I immediately rip that nonsense out for zstd. There is such a thing as a right choice, and zstd is it every time.

joecool1029 an hour ago

Just use zstd unless you absolutely need to save a tiny bit more space. bzip2 and xz are extremely slow to compress.

silisili an hour ago

I'd argue it's more workload dependent, and everything is a tradeoff.

In my own testing of compressing internal generic json blobs, I found brotli a clear winner when comparing space and time.

If I want higher compatibility and fast speeds, I'd probably just reach for gzip.

zstd is good for many use cases, too, perhaps even most...but I think just telling everyone to always use it isn't necessarily the best advice.
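
For anyone who wants to run this kind of comparison on their own data, here is a rough sketch using only Python's standard library. It is not the test described above: the JSON payload is synthetic, `make_sample_blob` and `compare` are hypothetical helper names, and brotli and zstd are left out because they need third-party packages on most Python versions. `zlib` stands in for gzip (same DEFLATE format) and `lzma` for xz.

```python
import bz2
import json
import lzma
import time
import zlib


def make_sample_blob(n_records: int = 2000) -> bytes:
    """Build a synthetic JSON payload (a stand-in for real data)."""
    records = [
        {"id": i, "name": f"user-{i}", "active": i % 3 == 0, "tags": ["a", "b", str(i % 7)]}
        for i in range(n_records)
    ]
    return json.dumps(records).encode("utf-8")


# Standard-library codecs only, each at its default settings.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def compare(data: bytes) -> dict:
    """Return compressed size, ratio, and compression time per codec."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # Guard against a codec silently corrupting the payload.
        assert decompress(packed) == data
        results[name] = {
            "bytes": len(packed),
            "ratio": len(data) / len(packed),
            "seconds": elapsed,
        }
    return results


if __name__ == "__main__":
    blob = make_sample_blob()
    for name, stats in compare(blob).items():
        print(f"{name:5s} {stats['bytes']:8d} bytes  ratio {stats['ratio']:6.2f}  {stats['seconds'] * 1000:8.2f} ms")
```

A single timed run on synthetic data is noisy, so treat the numbers as a starting point; swapping in a real payload and repeating the measurement matters more than the exact harness.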

joecool1029 an hour ago

> If I want higher compatibility and fast speeds, I'd probably just reach for gzip.

It’s slower and compresses less than zstd. gzip should only be reached for as a compatibility option, that’s the only place it wins, it’s everywhere.

EDIT: If you must use it, use the modern implementation, https://www.zlib.net/pigz/

hexxagone an hour ago

In the LZ high compression regime where LZ can compete in terms of ratio, BWT compressors are faster to compress and slower to decompress than LZ codecs. BWT compressors are also more amenable to parallelization (check bsc and kanzi for modern implementations besides bzip3).
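
For readers who have not seen the transform behind "BWT compressors", here is a toy sketch of the Burrows-Wheeler transform and its inverse. It sorts full rotations, which is far too slow for real use, and it is not how bzip2, bzip3, bsc, or kanzi implement it; the function names `bwt` and `inverse_bwt` are made up for this example.

```python
def bwt(data: bytes) -> tuple[bytes, int]:
    """Naive Burrows-Wheeler transform: sort all rotations of the input.

    Returns the last column of the sorted rotation matrix and the row
    index at which the original input appears.
    """
    n = len(data)
    if n == 0:
        return b"", 0
    # rotations[k] is the start offset of the k-th smallest rotation.
    rotations = sorted(range(n), key=lambda i: data[i:] + data[:i])
    last_column = bytes(data[(i - 1) % n] for i in rotations)
    return last_column, rotations.index(0)


def inverse_bwt(last_column: bytes, index: int) -> bytes:
    """Invert the transform by walking the first-to-last column mapping."""
    n = len(last_column)
    # A stable sort of positions by byte value maps each row of the
    # first column to the row of the last column holding the same byte.
    order = sorted(range(n), key=lambda i: last_column[i])
    out = bytearray()
    row = index
    for _ in range(n):
        row = order[row]
        out.append(last_column[row])
    return bytes(out)


if __name__ == "__main__":
    transformed, idx = bwt(b"banana")
    print(transformed, idx)  # b'nnbaaa' 3
    print(inverse_bwt(transformed, idx))  # b'banana'
```

The output groups equal bytes together (`nn`, `aaa`), which is what makes the later entropy-coding stage effective. Production implementations build a suffix array instead of sorting whole rotations.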