Show HN: Hacker News archive (47M+ items, 11.6GB) as Parquet, updated every 5m (huggingface.co)

112 points by tamnd 4 days ago

xnx an hour ago

The best source for this data used to be ClickHouse (https://play.clickhouse.com/play?user=play#U0VMRUNUIG1heCh0a...), but it hasn't been updated since 2025-12-26.

gkbrk an hour ago

My Hacker News items table in ClickHouse has 47,428,860 items, and it's 5.82 GB compressed and 18.18 GB uncompressed. What makes Parquet compression worse here, when both formats are columnar?
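For a sense of scale, here is a rough back-of-the-envelope comparison using only the figures quoted in this thread and the post title. It assumes "GB" means decimal gigabytes everywhere and reuses the ClickHouse item count for both stores, even though the Parquet archive reports a slightly different count, so treat the per-item numbers as approximate.

```python
# Figures quoted in the thread; GB is assumed to mean 10**9 bytes.
CLICKHOUSE_COMPRESSED_GB = 5.82
CLICKHOUSE_UNCOMPRESSED_GB = 18.18
PARQUET_GB = 11.6          # from the post title
ITEMS = 47_428_860         # ClickHouse count, reused for both (assumption)


def bytes_per_item(size_gb: float, items: int) -> float:
    """Average stored bytes per item for a store of the given size."""
    return size_gb * 1e9 / items


# How much ClickHouse shrinks its own uncompressed data.
clickhouse_ratio = CLICKHOUSE_UNCOMPRESSED_GB / CLICKHOUSE_COMPRESSED_GB

# How much larger the Parquet archive is than the compressed ClickHouse table.
parquet_vs_clickhouse = PARQUET_GB / CLICKHOUSE_COMPRESSED_GB

print(f"ClickHouse compression ratio: {clickhouse_ratio:.2f}x")
print(f"Parquet / ClickHouse size:    {parquet_vs_clickhouse:.2f}x")
print(f"ClickHouse bytes per item:    {bytes_per_item(CLICKHOUSE_COMPRESSED_GB, ITEMS):.0f}")
print(f"Parquet bytes per item:       {bytes_per_item(PARQUET_GB, ITEMS):.0f}")
```

So the Parquet archive is roughly twice the size of the compressed ClickHouse table, which is the gap the question is about.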

0cf8612b2e1e an hour ago

Sorting, compression algorithm and level, and data types can all have an impact. I noted elsewhere that a Boolean is being represented as an integer. That’s one bit vs. 1-4 bytes.

There is also flexibility in what you define as the dataset. Skinnier but more focused tables could save space compared with a wide table that covers everything, which will probably break up compressible runs of data.
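To make the "one bit vs. 1-4 bytes" point concrete, here is a small standard-library sketch that compares the raw size of a flag column stored as integers with the same flags packed eight per byte. The data is synthetic and `pack_bits` is a hypothetical helper, not anything from the archive's tooling. Real columnar formats also apply run-length and dictionary encodings, so the on-disk gap may be much smaller than the raw gap shown here.

```python
import array


def pack_bits(flags):
    """Pack a sequence of 0/1 flags into bytes, eight flags per byte."""
    out = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


N = 1_000_000
# Synthetic "deleted" column: 1 in every 50 rows is flagged (made-up rate).
flags = [1 if i % 50 == 0 else 0 for i in range(N)]

as_int32 = array.array("i", flags)  # C int, typically 4 bytes per value
as_int8 = array.array("b", flags)   # signed char, 1 byte per value
packed = pack_bits(flags)           # 1 bit per value

print("int column (C int):", len(as_int32) * as_int32.itemsize, "bytes")
print("int8 column:       ", len(as_int8) * as_int8.itemsize, "bytes")
print("bit-packed column: ", len(packed), "bytes")
```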

xnx an hour ago

Parquet has a few compression options. Not sure which one they are using.

hirako2000 an hour ago

Plus, Parquet isn't the least wasteful format; native DuckDB, for instance, compacts better. That's not just down to the compression algorithm, for which, as you say, Parquet has three main options.

Imustaskforhelp 2 minutes ago

As someone who made a project analysing Hacker News using ClickHouse, I really feel like this project was made for me (especially the updated-every-5-minutes aspect, which could've helped my project back then too!)

Your project actually helps me out a ton with one of the new Hacker News project ideas that I had put on the back burner.

I had thought of making a ping website where people can just write @Username, and a service detects the mention and sends mail to that user if they have signed up to the service (similar to a service run by someone from the HN community that mails you every time someone responds to your thread directly, but this time as a sort of ping).

[The idea came when I tried to ping someone to show them something relevant and thought, wait a minute, a ping that sends mail might be interesting. I then tried to see if I could use Algolia or another service to hook things up, but hardly any service made much sense back then, so I kept the idea in the back of my mind. This dataset sort of solves it by being updated every 5 minutes.]

Your 5-minute updates really make it possible. I will look at what I can do with that in a few days, but I am seeing a discrepancy in the 5-minute updates: the last one seems to be 16 March in the README. I would love to know more about whether it's really being updated every 5 minutes, because it feels phenomenal if true, and it's exciting to think of the new possibilities it unlocks.
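The @Username ping idea described above could be sketched roughly as follows. This is a minimal illustration, not an existing service: `find_mentions`, `recipients`, and the subscriber set are hypothetical names, and the username pattern (2 to 15 letters, digits, dashes, or underscores) is an assumption about what counts as a valid HN username. It also assumes the comment text has already been converted from HTML to plain text.

```python
import re

# Assumed username shape: 2-15 chars of letters, digits, "_" or "-".
# The lookbehind skips e-mail addresses such as me@example.com.
MENTION_RE = re.compile(r"(?<![\w@])@([A-Za-z0-9_-]{2,15})(?![A-Za-z0-9_-])")


def find_mentions(text):
    """Return mentioned usernames in order of first appearance, without duplicates."""
    seen = []
    for name in MENTION_RE.findall(text):
        if name not in seen:
            seen.append(name)
    return seen


def recipients(comment_text, author, subscribers):
    """Usernames to notify: mentioned, signed up, and not the comment's own author."""
    return [
        name
        for name in find_mentions(comment_text)
        if name in subscribers and name != author
    ]


comment = "Thanks @tamnd and @xnx, you can reach me at me@example.com. cc @xnx"
print(find_mentions(comment))
print(recipients(comment, author="Imustaskforhelp", subscribers={"xnx"}))
```

Running this over each new 5-minute block of comments would bound the notification delay by the update interval plus processing time.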

kshacker 15 minutes ago

Good for demo but every 5 minutes? Why?

Imustaskforhelp a minute ago

It can have some good use cases I can think of. Personally I really appreciate the 5 minute update.

mlhpdx an hour ago

Static web content and dynamic data?

> The archive currently spans from 2006-10 to 2026-03-16 23:55 UTC, with 47,358,772 items committed.

That’s more than 5 minutes ago by a day or two. No big deal, but a little bit depressing this is still how we do things in 2026.
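A quick way to check the gap being pointed out here is to parse the "spans ... to" timestamp from the README and compare it with the current time. The sketch below assumes the timestamp always has the `YYYY-MM-DD HH:MM UTC` shape seen in the quote; the `now` value and the two-interval tolerance are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

UPDATE_INTERVAL = timedelta(minutes=5)  # the advertised update cadence


def parse_readme_timestamp(value):
    """Parse a timestamp like '2026-03-16 23:55 UTC' into an aware datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%d %H:%M UTC")
    return parsed.replace(tzinfo=timezone.utc)


def staleness(last_committed, now):
    """How far the archive lags behind the given moment."""
    return now - last_committed


def is_stale(last_committed, now, tolerance=2 * UPDATE_INTERVAL):
    """True if the lag exceeds the tolerance (two update intervals by default)."""
    return staleness(last_committed, now) > tolerance


last = parse_readme_timestamp("2026-03-16 23:55 UTC")
now = datetime(2026, 3, 18, 12, 0, tzinfo=timezone.utc)  # hypothetical "now"
print("lag:", staleness(last, now), "stale:", is_stale(last, now))
```

Note that a stale README does not by itself prove the data files are stale; the text may simply be regenerated less often than the data.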

xandrius 2 minutes ago

I don't get what you meant with this comment.

alstonite 27 minutes ago

What happened between 2023 and 2024 to cause the usage dropoff?

ghgr 23 minutes ago

I'd say it's less a usage dropoff and more a reversion to the mean after Covid

tehjoker 15 minutes ago

That's a possible hypothesis, but there was also a rising trend prior, it wasn't stable.

lyu07282 23 minutes ago

Please upload to https://academictorrents.com/ as well if possible

tonymet 21 minutes ago

what's the license for HN content?

echelon 15 minutes ago

At this point, you can train on anything without repercussion.

Copyright doesn't seem to matter unless you're an IP cartel or mega cap.

marginalia_nu 10 minutes ago

Laughs nervously in jurisdiction without fair use doctrine

0cf8612b2e1e an hour ago

Under the Known Limitations section:

> deleted and dead are integers. They are stored as 0/1 rather than booleans.

Is there a technical reason to do this? You have the type right there.

lokimoon 20 minutes ago

You are the product

Onavo 2 hours ago

Is it possible to download only a subset, e.g. Show HNs or HN Whoishiring? These are very useful for classroom data science, i.e. a good set of data for students to learn the basics of data cleaning and engineering.

nelsondev 2 hours ago

It’s date-partitioned, so you could download just a date range. It’s also Parquet, so you can download just specific columns with the right client.
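As a rough illustration of the date-range approach, the sketch below lists which monthly files a given range would touch. The path template is hypothetical (check the repository for the real layout), and `months_between` and `partition_paths` are made-up helper names.

```python
from datetime import date

# Hypothetical layout: one Parquet file per month. Not the archive's confirmed paths.
PATH_TEMPLATE = "data/{year:04d}/{year:04d}-{month:02d}.parquet"


def months_between(start, end):
    """Return (year, month) pairs covering start..end inclusive."""
    if start > end:
        raise ValueError("start must not be after end")
    year, month = start.year, start.month
    months = []
    while (year, month) <= (end.year, end.month):
        months.append((year, month))
        month += 1
        if month == 13:
            year, month = year + 1, 1
    return months


def partition_paths(start, end):
    """Paths of the monthly files needed to cover the date range."""
    return [
        PATH_TEMPLATE.format(year=year, month=month)
        for year, month in months_between(start, end)
    ]


for path in partition_paths(date(2023, 11, 15), date(2024, 2, 1)):
    print(path)
```

Each listed file could then be opened with a Parquet reader that supports column projection, so that only the needed columns are fetched. Filtering down to Show HN or Whoishiring posts would still require reading the relevant columns and filtering rows, since the partitioning is by date rather than by post type.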

bstsb 2 hours ago

what’s the license? “do whatever the fuck you want with the data as long as you don’t get caught”? or does that only work for massive corporations

BoredPositron 17 minutes ago

The universal license.

GeoAtreides 2 hours ago

is the legal page a placeholder, do words have no meaning?

https://www.ycombinator.com/legal/

Mods, enforce your license terms, you're playing fast and loose with the law (GDPR/CPRA)

Retr0id an hour ago

Which terms are not being enforced? (not disagreeing I just don't feel like reading a large legal document)

GeoAtreides an hour ago

> By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies

The user content is supposed to be licensed only to Y Combinator and (bleah) its affiliated companies (which are many, all the startups they fund, for example).

ungruntled an hour ago

None that I could see:

> Your submissions to, and comments you make on, the Hacker News site are not Personal Information and are not "HN Information" as defined in this Privacy Policy.

> Other Users: certain actions you take may be visible to other users of the Services.

ryandvm an hour ago

Eh, fuck that agreement. I'm kind of old school in that I believe if you put it on the internet without an auth-wall, people should be allowed to do whatever they want with it. The AI companies seem to agree.

Then again, I'm not the guy that is going to get sued...

Ylpertnodi 28 minutes ago

> I believe if you put it on the internet without an auth-wall, people should be allowed to do whatever they want with it.

I agree. It's the owners of the sites that have to follow rules, not us.

kmeisthax 22 minutes ago

"I'm kind of old school in that I believe if you put grass on the ground without a fence, people should be allowed to do whatever they want with it. The noblemen with a thousand cows seem to agree."

And that, my friends, is how you kill the commons - by ignoring the social context surrounding its maintenance and insisting upon the most punitive ways of avoiding abuse.

hsuduebc2 an hour ago

How is he breaking GDPR here?

andrewmcwatters an hour ago

They already refuse to comply with CPRA, instead electing to replace your username with a random 6(?) character string, prefixed with `_`, if I remember correctly.

I know, because I've been here since maybe 2015 or so, but this account was created in 2019.

So any PII you have mentioned in your comments is permanent on Hacker News.

I would appreciate it if they gave users the ability to remove all of their personal data, but in correspondence and in writing here on Hacker News itself, Dan has suggested that they value the posterity of conversations over the law.

palmotea 2 hours ago

> At midnight UTC, the entire current month is refetched from the source as a single authoritative Parquet file, and today's individual 5-minute blocks are removed from the today/ directory.

Wouldn't that lose deleted/moderated comments?
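A toy model of the quoted behaviour suggests why the answer is probably yes. This is a sketch under stated assumptions, not the archive's actual code: `block_start` and `compact_at_midnight` are hypothetical names, and it assumes the source returns a deleted item with its text blanked out.

```python
from datetime import datetime

BLOCK_MINUTES = 5


def block_start(ts):
    """Floor a timestamp to the start of its 5-minute block."""
    return ts.replace(
        minute=ts.minute - ts.minute % BLOCK_MINUTES, second=0, microsecond=0
    )


def compact_at_midnight(source_now):
    """Model the refetch: the monthly file mirrors the source's current state,
    and the day's 5-minute blocks are discarded."""
    monthly_file = {item_id: dict(item) for item_id, item in source_now.items()}
    today_blocks = {}
    return monthly_file, today_blocks


# During the day, a block captures the comment while it is still visible.
captured_at = datetime(2026, 3, 16, 23, 58, 41)
today_blocks = {
    block_start(captured_at): {1: {"id": 1, "deleted": 0, "text": "hello"}}
}

# Before midnight the comment is deleted at the source (assumed representation).
source_now = {1: {"id": 1, "deleted": 1, "text": None}}

monthly_file, today_blocks = compact_at_midnight(source_now)
print(monthly_file[1], today_blocks)
```

In this model the text captured in the 5-minute block does not survive compaction, because the monthly file is rebuilt from the source rather than merged from the blocks.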

BoredPositron 22 minutes ago

I guess that's the point.