Hacker News

by Ryan Harman

Gzip decompression in 250 lines of Rust (iev.ee)

80 points by vismit2000 4 days ago

stgn 4 hours ago

> so i wrote a gzip decompressor from scratch

After skimming through the author's Rust code, it appears to be a fairly straightforward port of puff.c (included in the zlib source): https://github.com/madler/zlib/blob/develop/contrib/puff/puf...

dymk 3 hours ago

This feels like it should have been mentioned in the article.

It makes me wonder if there was some LLM help, based on how similar the fn structure and identifier names are.

f1shy 3 hours ago

> It makes me wonder if there was some LLM help

I would bet there was

fuhsnn 2 hours ago

> This feels like it should have been mentioned in the article.

With an entire section complaining how many lines of code existing implementations are, looks like they did found a good simple implementation to clone in Rust then deliberately not mention it.

bitbasher an hour ago

You could say it was a “puff” piece, eh, eh!?

nayuki 5 hours ago

Just like that author, many years ago, I went through the process of understanding the DEFLATE compression standard and producing a short and concise decompressor for gzip+DEFLATE. Here are the resources I published as a result of that exploration:

* https://www.nayuki.io/page/deflate-specification-v1-3-html

* https://www.nayuki.io/page/simple-deflate-decompressor

* https://github.com/nayuki/Simple-DEFLATE-decompressor

Lerc 3 hours ago

The function

  fn bits(&mut self, need: i32) -> i32 { ....

Put me in mind of one of my early experiments in Rust. It would be interesting to compare a iterator based form that just called .take(need)

I haven't written a lot of Rust, but one thing I did was to write an iterator that took an iterator of bytes as input and provided bits as output. Then used an iterator that gave bytes from a block of memory.

It was mostly as a test to see how much high level abstraction left an imprint on the compiled code.

The dissasembly showed it pulling in 32 bits at a time and shifting out the bits pretty much the same way I would have written in ASM.

I was quite impressed. Although I tested it was working by counting the bits and someone critizised it for not using popcount, so I guess you can't have everything.

kibwen 3 hours ago

> I tested it was working by counting the bits and someone critizised it for not using popcount

PSA: Rust exposes the popcnt intrinsic via the `count_ones` method on integer types: https://doc.rust-lang.org/std/primitive.u32.html#method.coun...

Lerc 3 hours ago

Looks like that was added about 3 years after I wrote my test code.

evmar 2 hours ago

I had a similar experience implementing simd instructions in my emulator, where I needed to break apart a 64-bit value into four eight-bit values, do an operation on each value, then pack it back together. My first implementation did it with all the bit shifts you’d expect, but my second one used two helpers to unpack into an array, map on the array to a second array, and pack the array again. The optimized output was basically the same.

carlos256 3 hours ago

>the only flag we care about is FNAME The specification does not define an encoding for the file name. Different file systems may impose restrictions on certain names, so FNAME should not be used.

MisterTea 5 hours ago

> twenty five thousand lines of pure C not counting CMake files. ...

Keep in mind this is also 31 years of cruft and lord knows what.

Plan 9 gzip is 738 lines total:

  gzip.c 217 lines
  gzip.h 40 lines
  zip.c  398 lines
  zip.h  83 lines

Even the zipfs file server that mounts zip files as file systems is 391 lines.

edit - post a link to said code: https://github.com/9front/9front/tree/front/sys/src/cmd/gzip

> ... (and whenever working with C always keep in mind that C stands for CVE).

Sigh.

bboozzoo 4 hours ago

You forgot to include https://github.com/9front/9front/tree/front/sys/src/libflate which gzip is built around, which brings it closer to 10k lines.

carlos256 3 hours ago

Interesting, the decompressor in Jdeflate is around 4k LoC. https://github.com/Jpn666/jdeflate

tyingq 5 hours ago

His also omits CRC, which is part of the 25k lines, no --fast/--best/etc, missing some output formats, and so on. I'm sure the 25k includes a lot of bloat, but the comparison is odd. Comparing to your list would make much more sense.

kibwen 5 hours ago

I would expect a CRC to add a negligible number of lines of code. The reason that production-grade decompressors are tens of thousands of LOC is likely attributable to extreme manual optimization. For example, I wouldn't be surprised if a measurable fraction of those lines are actually inline assembly.

nayuki 4 hours ago

ack_complete 4 hours ago

tyingq 4 hours ago

fullstop 4 hours ago

gzip also contains a significant amount of compatibility code for different platforms.

xxs 4 hours ago

Crc32 can be written in handful lines of code. Although it'd be better to use the vector instruction set - e.g. AVX when available.

up2isomorphism 4 hours ago

Another dev who doesn’t show respect to what has been done and expect a particular language will do wonders for him. Also I don’t see this is much better in term of readability.

maverwa 3 hours ago

Where do you see the lack of respect? The author wanted to learn how gzip works and chose to implement it in a language they like to do so. As a learning tool, not because the world needs another gzip decompressor.

hybrid_study 4 hours ago

he does mention https://github.com/trifectatechfoundation/zlib-rs not just https://github.com/madler/zlib, but it would be interesting to hear from those developers also

jeffrallen 5 hours ago

But probably without any error checking.

Feels like Rust culture inherited "throw and forget" as an error handling "strategy" from Java

Sigh.

dmit 3 hours ago

Hehe, why "probably"? It says "250 lines" right there in the subject. Surely one can skim the single file of code (https://github.com/ieviev/mini-gzip/blob/main/src/main.rs) and offer criticism that isn't based on hypotheticals?

Anyway, I skimmed the file for you this time, and basically you're either correct or wrong, depending on your definition of "error checking." The code handles error conditions by aborting the process. Seeing as it's a standalone CLI program and not a library meant for reuse, safely shutting down with a meaningful message sounds like fair game to me.

dymk 4 hours ago

This is an educational project. Not something for production. The article even says so!

You can leave the snide comments about “Rust culture” (whatever that is) out next time.

throwaway27448 4 hours ago

Why people ascribe error handling practices to languages is baffling. What language doesn't allow punting error handling until later? Even Haskell has "panic" functionality that fudges the type constraints to allow this.

pdpi 2 hours ago

EAFP is an explicit part of what makes code "pythonic", and the Zen of Python (`import this`) has the lines "Errors should never pass silently. Unless explicitly silenced." Java has checked and unchecked exceptions. Rust has panics and Result<T, E>, and the ? operator.

The way a language's community handles errors and how the language itself handles errors are different things, sure, but they're not independent of each other.

That said, OP's snark against Rust is completely unmerited, and they can take my `impl From<OtherErr> for MyErr` from my cold dead hands.

Hacker News

by Ryan Harman

Gzip decompression in 250 lines of Rust (iev.ee)

stgn 4 hours ago [-]

dymk 3 hours ago [-]

f1shy 3 hours ago [-]

fuhsnn 2 hours ago [-]

bitbasher an hour ago [-]

nayuki 5 hours ago [-]

Lerc 3 hours ago [-]

kibwen 3 hours ago [-]

Lerc 3 hours ago [-]

evmar 2 hours ago [-]

carlos256 3 hours ago [-]

MisterTea 5 hours ago [-]

bboozzoo 4 hours ago [-]

carlos256 3 hours ago [-]

tyingq 5 hours ago [-]

kibwen 5 hours ago [-]

nayuki 4 hours ago [-]

ack_complete 4 hours ago [-]

tyingq 4 hours ago [-]

fullstop 4 hours ago [-]

xxs 4 hours ago [-]

up2isomorphism 4 hours ago [-]

maverwa 3 hours ago [-]

hybrid_study 4 hours ago [-]

jeffrallen 5 hours ago [-]

dmit 3 hours ago [-]

dymk 4 hours ago [-]

throwaway27448 4 hours ago [-]

pdpi 2 hours ago [-]

stgn 4 hours ago

dymk 3 hours ago

f1shy 3 hours ago

fuhsnn 2 hours ago

bitbasher an hour ago

nayuki 5 hours ago

Lerc 3 hours ago

kibwen 3 hours ago

Lerc 3 hours ago

evmar 2 hours ago

carlos256 3 hours ago

MisterTea 5 hours ago

bboozzoo 4 hours ago

carlos256 3 hours ago

tyingq 5 hours ago

kibwen 5 hours ago

nayuki 4 hours ago

ack_complete 4 hours ago

tyingq 4 hours ago

fullstop 4 hours ago

xxs 4 hours ago

up2isomorphism 4 hours ago

maverwa 3 hours ago

hybrid_study 4 hours ago

jeffrallen 5 hours ago

dmit 3 hours ago

dymk 4 hours ago

throwaway27448 4 hours ago

pdpi 2 hours ago