Do you even need a database? (dbpro.app)

127 points by upmostly 7 hours ago

ozgrakkurt 4 hours ago

You need databases if you need any kind of atomicity. Doing atomic writes is extremely fragile if you are just on top of the filesystem.

This is also why many databases have persistence issues and can easily corrupt on-disk data on a crash. RocksDB on Windows a couple of years back is a simple example: it regularly hit corruption issues during development.

dkarl 3 hours ago

Honestly, at this point, if I had a design that required making atomic changes to files, I'd redo the design to use SQLite. The other way around sounds crazy to me.

"Why use spray paint when you can achieve the same effect by ejecting paint from your mouth in a uniform high-velocity mist?" If you happen to have developed that particular weird skill, by all means use it, but if you haven't, don't start now.

That probably sounds soft and lazy. I should learn to use my operating system's filesystem APIs safely. It would make me a better person. But honestly, I think that's a very niche skill these days, and you should consider if you really need it now and if you'll ever benefit from it in the future.

Also, even if you do it right, the people who inherit your code probably won't develop the same skills. They'll tell their boss it's impossibly dangerous to make any changes, and they'll replace it with a database.

duped 35 minutes ago

The problem is that most of the time when you want "atomic changes to files" the only safe API is copy the file, mutate it, then rename. That doesn't factor in concurrent writers or advisory locks.

If that kind of filesystem traffic is unsuitable for your application then you will reinvent journaling or write-ahead logging. And if you want those to be fast you'll implement checkpointing and indexes.

noselasd 37 minutes ago

Yes, the code in the article will at one unlucky point end up with an empty file after a power outage.

At least write to a temp file (in the same filesystem), fsync it, rename it over the original, and fsync the containing directory.
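A minimal sketch of that pattern in Python (POSIX semantics assumed; the function name and path handling are mine, not from the article):

```python
import os

def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` so readers see either the old or the new contents,
    never a partial file. Temp file must be on the same filesystem."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # flush the file contents to disk before renaming
    finally:
        os.close(fd)
    os.rename(tmp, path)  # atomic replacement on POSIX filesystems
    dfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dfd)  # persist the directory entry (i.e. the rename itself)
    finally:
        os.close(dfd)
```

Skip either fsync and a power cut can leave you with the empty file described above.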

goerch 3 hours ago

Nice, so we are already covering the A of ACID. And don't get me started about what OLAP databases like DuckDB can do for out of core workloads.

creatonez 2 hours ago

For the simple case, it isn't necessarily that fragile. Write the entire database to a temp file, then after flushing, move the temp file to overwrite the old file. All Unix filesystems will ensure the move operation is atomic. Lots of "we dump a bunch of JSON to the disk" use cases could be much more stable if they just did this.

Doesn't scale at all, though - all of the data that needs to be self-consistent needs to be part of the same file, so unnecessary writes go through the roof if you're only doing small updates on a giant file. Still gotta handle locking if there is risk of a stray process messing it up. And doing this only handles part of ACID.

jeffffff an hour ago

don't forget to fsync the file before the rename! and you also need to fsync the directory after the rename!

wasabi991011 22 minutes ago

Yes, this is covered in the "When do you actually need a database?" section of the article.

vector_spaces 2 hours ago

I mean, if your atomic unit is a single file and you can tolerate simple consistency models, flat files are perfectly fine. There are many use cases that fit here comfortably where a whole database would be overkill

z3ugma 6 hours ago

At some point, don't you just end up making a low-quality, poorly-tested reinvention of SQLite by doing this and adding features?

freedomben 6 hours ago

Sometimes yes, I've seen it. It even tends to happen on NoSQL databases as well. Three times I've seen apps start on top of DynamoDB and then end up re-implementing relational databases at the application level anyway. Starting with Postgres would have been the right answer for all three. Initial dev went faster, but tech debt and complexity quickly soaked up those gains and left a hard-to-maintain mess.

leafarlua 6 hours ago

This always confuses me because we have decades of SQL and all its issues as well. Hundreds of experienced devs talking about all the issues in SQL and the quirks of queries when your data is not trivial.

One would think that for a startup of sorts, where things change fast and are unpredictable, NoSQL is the correct answer. And when things are stable and the shape of entities is known, going to SQL becomes a natural path.

There are also cases for having both, and cases for graph-oriented databases or even columnar-oriented ones such as DuckDB.

Seems to me, with my very limited experience of course, that everything leads to the same boring fundamental issue: the problem rarely lies in the infrastructure; it's mostly bad design decisions and poor domain knowledge. Realistically, how often is the bottleneck really the type of database rather than the quality of the code and the implementation of the system design?

akdev1l an hour ago

> end up re-implementing relational databases at the application level anyway

This is by design, the idea is that scaling your application layer is easy but scaling your storage/db layer is not

Hence make the storage dumb and have the application do the joins and now your app scales right up

(But tbh I agree a lot of applications don’t reach the scale required to benefit from this)

tshaddox 3 hours ago

I've never used DynamoDB in production, but it always struck me as the type of thing where you'd want to start with a typical relational database, and only transition the critical read/write paths when you get to massive scale and have a very good understanding of your data access patterns.

icedchai 3 hours ago

Same. DynamoDB is almost never a good default choice unless you've thought very carefully about your current and future use cases. That's not to say it's always bad! At previous startups we did some amazing things with Dynamo.

noveltyaccount 6 hours ago

As soon as you need to do a JOIN, you're either rewriting a database or replatforming on SQLite.

pgtan 4 hours ago

Here are two checks using joins, one with sqlite, one with the join builtin of ksh93:

  check_empty_vhosts () {
    # Check which vhost adapter doesn't have any VTD mapped
    start_sqlite
    tosql "SELECT l.vios_name, l.vadapter_name FROM vios_vadapter AS l
           LEFT OUTER JOIN vios_wwn_disk_vadapter_vtd AS r
             USING (vadapter_name, vios_name)
           WHERE r.vadapter_name IS NULL AND
                 r.vios_name IS NULL AND
                 l.vadapter_name LIKE 'vhost%';"
    endsql
    getsql
    stop_sqlite
  }

  check_empty_vhosts_sh () {
    # same as above, but on the shell
    join -v 1 -t , -1 1 -2 1 \
      <(while IFS=, read vio host slot; do
          if [[ $host == vhost* ]]; then
            print ${vio}_$host,$slot
          fi
        done < $VIO_ADAPTER_SLOT | sort -t , -k 1) \
      <(while IFS=, read vio vhost vtd disk; do
          if [[ $vhost == vhost* ]]; then
            print ${vio}_$vhost
          fi
        done < $VIO_VHOST_VTD_DISK | sort -t , -k 1)
  }

goerch 3 hours ago

a) Just heard today: JOINs are bad for performance b) How many columns can (an Excel) table have: no need for JOINs

bachmeier 4 hours ago

Based on what's in the article, it wouldn't take much to move these files to SQLite or any other database in the future.

Edit: I just submitted a link to Joe Armstrong's Minimum Viable Programs article from 2014. If the response to my comment is about the enterprise and imaginary scaling problems, realize that those situations don't apply to some programming problems.

locknitpicker 4 hours ago

> Based on what's in the article, it wouldn't take much to move these files to SQLite or any other database in the future.

Why waste time screwing around with ad-hoc file reads, then?

I mean, what exactly are you buying by rolling your own?

hackingonempty 3 hours ago

Probably more like a low-quality, poorly-tested reinvention of BerkeleyDB.

randyrand 3 hours ago

“You Aren’t Gonna Need It” - one of the most important software principles.

Wait until you actually need it.

dkarl 3 hours ago

I interpret YAGNI to mean that you shouldn't invest extra work and extra code complexity to create capabilities that you don't need.

In this case, I feel like using the filesystem directly is the opposite: doing much more difficult programming and creating more complex code, in order to do less.

It depends on how you weigh the cost of the additional dependency that lets you write simpler code, of course, but I think in this case adding a SQLite dependency is a lower long-term maintenance burden than writing code to make atomic file writes.

The original post isn't about simplicity, though. It's about performance. They claim they achieved better performance by using the filesystem directly, which could (if they really need the extra performance) justify the extra challenge and code complexity.

goerch 2 hours ago

Is this what we do with education in general?

upmostly 3 hours ago

100%.

Premature optimisation I believe that's called.

I've seen it play out many times in engineering over the years.

whalesalad 4 hours ago

Reminds me of the infamous Robert Virding quote:

“Virding's First Rule of Programming: Any sufficiently complicated concurrent program in another language contains an ad hoc informally-specified bug-ridden slow implementation of half of Erlang.”

mrec 2 hours ago

In case you weren't aware, that in itself is riffing on Greenspun's tenth rule:

https://en.wikipedia.org/wiki/Greenspun%27s_tenth_rule

trgn 2 hours ago

im sure, but honestly, i would love to have a db engine that just writes/reads csv or json. does it exist?

banana_giraffe 2 hours ago

DuckDB can do exactly this, once you get the API working in your system, it becomes something simple like

    SELECT * FROM read_csv('example.csv');
Writing generally involves reading into an in-memory database, making whatever changes you want, then something like

    COPY new_table TO 'example.csv' (HEADER true, DELIMITER ',');

herpdyderp 2 hours ago

I wrote a CSV DB engine once! I can't remember why. For fun?

akdev1l an hour ago

SQLite can do it

gorjusborg 6 hours ago

Only if you get there and need it.

z3ugma 6 hours ago

but it's so trivial to adopt SQLite, in almost any app or language...there are sufficient ORMs to do the joins if you don't like working with SQL directly...the B-trees are built in so you don't need to reason about binary search, and your app doesn't have 300% test coverage with fuzzing like SQLite does

you should be squashing bugs related to your business logic, not core data storage. Local data storage on your one horizontally-scaling box is a solved problem using SQLite. Not to mention atomic backups?
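To show how little ceremony that takes, a minimal sketch with Python's stdlib sqlite3 module (the table and values are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # pass a file path for durable storage
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

with conn:  # transaction: commits on success, rolls back on exception
    conn.execute("INSERT INTO users (name) VALUES (?)", ("ada",))

row = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()
print(row[0])  # -> ada
```

Atomicity, durability, and indexing all come for free with those few lines.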

upmostly 6 hours ago

Exactly. And most apps don't get there and therefore don't need it.

koliber 4 hours ago

I love this article as it shows how fast computers really are.

There is one conclusion that I do not agree with. Near the end, the author lists cases where you will outgrow flat files. He then says that "None of these constraints apply to a lot of applications."

One of the constraints is "Multiple processes need to write at the same time." It turns out many early stage products need crons and message queues that execute on a separate worker. These multiple processes often need to write at the same time. You could finagle it so that the main server is the only one writing, but you'd introduce architectural complexity.

So while from the pure scale perspective I agree with the author, if you take a wider perspective, it's best to go with a database. And sqlite is a very sane choice.

If you need scale, cache the most often accessed data in memory and you have the best of both worlds.

My winning combo is sqlite + in-memory cache.
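That combo can be as simple as a memoized read path over a SQLite file (a sketch only; the table, file name, and helpers are hypothetical, and it assumes a single process owns the writes):

```python
import sqlite3
from functools import lru_cache

conn = sqlite3.connect("app.db")  # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS settings (key TEXT PRIMARY KEY, value TEXT)")

@lru_cache(maxsize=1024)
def get_setting(key):
    """Reads hit the in-memory cache first, SQLite on a miss."""
    row = conn.execute("SELECT value FROM settings WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None

def set_setting(key, value):
    """Writes go to SQLite in a transaction, then invalidate the cache."""
    with conn:
        conn.execute("INSERT OR REPLACE INTO settings (key, value) VALUES (?, ?)",
                     (key, value))
    get_setting.cache_clear()
```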

jerf 40 minutes ago

The one that gets me a lot, which is similar in practice to your point, is when I need server redundancy, even if one server is otherwise plenty for my task. As soon as I'm not running in one place, I need network data storage, and that kicks pretty hard in the direction of a network-accessible database. S3 works sometimes, and the recent work on being able to atomically claim files has helped with some of the worst rough edges, but it still doesn't take a lot to disqualify it, at least as the only store.

pseudosavant 3 hours ago

SQLite has become my new go-to when starting any project that needs a DB. The performance is very fast, and if anything is ever successful enough to outgrow SQLite, it wouldn't be that hard to switch it out for Postgres. Not having to maintain/backup/manage a separate database server is cheaper and easier.

koliber 2 hours ago

Backups are super-simple as well.

I'm also a convert.

upmostly 3 hours ago

Seeing the Rust 1M benches was an amazing reminder of how fast stuff really is.

koliber 2 hours ago

The reality is that things will be blazing fast in any language if you save things by PK in HashMaps.

cold_tom 7 minutes ago

you can get surprisingly far with files, but the moment you care about things like concurrent writes or not losing data on a crash, the whole thing changes. at that point you're not choosing speed vs simplicity anymore - you're choosing how much risk you're willing to carry

kabir_daki 6 hours ago

We built a PDF processing tool and faced this exact question early on.

For our use case — merge, split, compress — we went fully stateless. Files are processed in memory and never stored. No database needed at all.

The only time a database becomes necessary is when you need user accounts, history, or async jobs for large files. For simple tools, a database is often just added complexity.

The real question isn't "do you need a database" but "do you need state" — and often the answer is no.

tetha an hour ago

> The real question isn't "do you need a database" but "do you need state" — and often the answer is no.

We have a bunch of these applications and they are a joy to work with.

Funny enough, even if you have a database, if you wonder if you need caches to hold state in your application server, the answer is, kindly, fuck no. Really, really horrible scaling problems and bugs are down that path.

There are use cases to store expensive-to-compute state in varnish (HTTP caching), memcache/redis (expensive, complex data structures like a friendship graph), elasticsearch/opensearch (aggregated, expensive full-text search), but caching SQL results in an application server because the database is "slow" beyond a single transaction brings nothing but pain in the future. I've spent so much energy working around decisions born out of simple bad schema design and tuning...

bevr1337 4 hours ago

> The real question isn't "do you need a database" but "do you need state" — and often the answer is no.

This is a solid takeaway and applies to a lot of domains. Great observation

datadrivenangel an hour ago

But if you have state, and that state needs to persist between program executions, then for the love of resiliency and all that is robust, consider using a SQL database!

agustechbro 16 minutes ago

Not to tear down the article's author - I appreciate the effort to prove something that might be useful in an extreme optimization case, with a limited amount of data and NO NEED to update/write the files. Just a read-only cache.

If you ever need to update a single byte in your data, please USE A PROPER DATABASE. Databases do a lot of fancy things to ensure you are not going to corrupt/break your data on disk, among other safety measures.

forinti 6 hours ago

Many eons ago I wrote a small sales web application in Perl. I couldn't install anything on the ISP's machine, so I used file-backed hashes: one for users, one for orders, another for something else.

As the years went by, I expected the client to move to something better, but he just stuck with it until he died after about 20 years, the family took over and had everything redone (it now runs Wordpress).

The last time I checked, it had hundreds of thousands of orders and still had good performance. The evolution of hardware made this hack keep its performance well past what I had expected it to endure. I'm pretty sure SQLite would be just fine nowadays.

da02 4 hours ago

What type of product or service were they selling?

forinti 4 hours ago

A calendar for cutting your hair according to the phases of the moon.

shafoshaf 5 hours ago

Relational Databases Aren’t Dinosaurs, They’re Sharks. https://www.simplethread.com/relational-databases-arent-dino...

The very small bonus you get on small apps is hardly worth the time you spend redeveloping the wheel.

adrian_b 3 hours ago

Sharks vs. dinosaurs seems indeed an appropriate metaphor.

During the Cretaceous, when dinosaurs were at their peak, sharks had already become very similar to the sharks of today, e.g. there were big sharks that differed very little from today's white sharks and tiger sharks.

Then the dinosaurs disappeared, together with the pterosaurs and the mosasaurs, and were replaced by other animals, but sharks have continued to live until today with little change, because they had already reached an optimized design that was hard to improve.

Besides the sharks, during the Cretaceous there already existed, alongside the dinosaurs, two other groups of big predators that have changed little since then: crocodiles and big constrictor snakes similar to today's pythons.

Therefore all three (sharks, crocodiles and big constrictor snakes) are examples of locally optimal designs that were reached more than 70 million years ago, without needing any major upgrades after that.

ktzar 5 hours ago

Writing your own storage is a great way to understand how databases work (if you do it efficiently, keeping indexes, correct data structures, etc.) and to come to the conclusion that if your intention wasn't just tinkering, you should've used a database from day 1.

throwway120385 2 hours ago

I dunno. Even in embedded systems every time I've started without a database I've eventually come to need something like a database, and in every case I've found myself building essentially an ad-hoc poorly managed database into the application including marshalling/unmarshalling, file management, notification, and so on because each new feature over the top of regular files was just that much easier to add versus switching to a database system.

However the driving motivation for adding a database is not necessarily managing data, but the fact that the database system creates a nice abstraction layer around storing data of relational or non-relational form in non-volatile memory and controlling access to it while other systems are updating it. And because it's a nice abstraction, there are a lot of existing libraries that can take advantage of it in your language of choice without requiring you to completely invent all of that stuff over the top of the filesystem. That has knock-on effects when you're trying to add new functionality or new interaction patterns to an existing system.

And in cases where two or more processes need to communicate using the same data, a database gives you some good abstractions and synchronization primitives that make sense, whereas regular files or IPC require you to invent a lot of that stuff. You could use messaging to communicate updates to data, but now you have two copies of everything, and you have to somehow make the updates atomic so that either copy is consistent at a point in time. Why not use a database?

Knowing what I know today I would start with some kind of database abstraction even if it's not necessarily designed for transactional data, and I would make sure it handled the numerous concerns I have around data sharing, consistency, atomicity, and notification because if I don't have those things I eventually have to invent them to solve the reliability problems I otherwise run in to without them.

jrecursive 3 hours ago

I suggest every developer write a database from scratch at least once, and use it for something real. Or, even better, let somebody else use it for something real. Then you will know "why database".

traderj0e 2 hours ago

My first time was with a Bukkit plugin as a kid. One of my updates broke existing flat JSON files. Someone asked me if it had MySQL support; I didn't know what that was, then realized oh, this is nice.

There are also things besides databases that I'll DIY and then still wonder why so many people use a premade tool for it, like log4j

yubblegum 3 hours ago

It's indeed an amazing design and implementation space to explore. If distributed it is nearly comprehensive in scope. (However, did lol @ your "every developer" - that's being super kind and generous or "developer" is doing heavy lifting here.)

goerch 3 hours ago

Hm, sometimes opening a book could do wonders? But these were the old times...

subhobroto 2 hours ago

I'll do one better.

I suggest every developer learn how to replicate, backup and restore the very database they are excited about, from scratch, at least once. I propose this will teach them what it takes to build a production-ready system and gain some appreciation for other ways of managing state.

jmull 2 hours ago

> Binary search beats SQLite... For a pure ID lookup, you're paying for machinery you're not using.

You'll likely end up quite a chump if you follow this logic.

SQLite has pretty strong durability and consistency mechanisms that their toy disk binary search doesn't have.

(And it is just a toy. It waves away the maintenance of the index, for god's sake, which is almost the entire issue with indexes!)

Typically, people need to change things over time as well, without losing all their data, so backwards compatibility and other aspects of flexibility that sqlite has are likely to matter too.

I think once you move beyond a single file read/written atomically, you might as well go straight to sqlite (or other db) rather than write your own really crappy db.

waldrews 2 hours ago

File systems are nice if you need to do manual or transparent script-based manipulations. Like 'oh hey, I just want to duplicate this entry and hand-modify it, and put these others in an archive.' Or use your OS's access control and network sharing easily with heterogeneous tools accessing the data from multiple machines. Or if you've got a lot of large blobs that aren't going to get modified in place.

What the world needs is a hybrid - database ACID/transaction semantics with the ability to cd/mv/cp file-like objects.

theshrike79 40 minutes ago

I have a vague recollection that 4chan (At least at one point) didn't use any kind of backend database, they just rewrote the static pages with new content and that was it.

That's why it could handle massive traffic with very few issues.

nishagr 4 hours ago

The real question - do you really need to hack around with in-memory maps and files when you could just use a database?

Joeboy 5 hours ago

Don't know if it counts, but my London cinema listings website just uses static json files that I upload every weekend. All of the searching and stuff is done client side. Although I do use sqlite to create the files locally.

Total hosting costs are £0 ($0) other than the domain name.

jmaw an hour ago

While this is certainly cool to see, and I love seeing how fast webservers can go, the counter-question "Do you even need 25,000 RPS and sub-ms latency?" comes to mind.

I don't choose a DB over a flat file for its speed. I choose a DB for the consistent interface and redundancy.

jmaw 2 hours ago

Very interesting, I'd never heard of JSONL before: https://jsonlines.org/

Also notable mention for JSON5 which supports comments!: https://json5.org/
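For anyone else new to it: JSONL is just one JSON object per line, which makes appends cheap and lets each line parse independently. A hypothetical waitlist example (file name and fields are made up):

```python
import json

records = [{"id": 1, "email": "a@example.com"},
           {"id": 2, "email": "b@example.com"}]

# Write: one JSON object per line. A truncated last line after a crash
# loses only that one record, not the whole file.
with open("waitlist.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read: stream line by line instead of loading one giant JSON array.
with open("waitlist.jsonl") as f:
    loaded = [json.loads(line) for line in f]
```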

randusername 6 hours ago

Separate from performance, I feel like databases are a sub-specialty that has its own cognitive load.

I can use databases just fine, but will never be able to make wise decisions about table layouts, ORMs, migrations, backups, scaling.

I don't understand the culture of "oh we need to use this tool because that's what professionals use" when the team doesn't have the knowledge or discipline to do it right and the scale doesn't justify the complexity.

goerch 2 hours ago

Hm, I somewhat understand your point of `making wise decisions`. But doesn't that concern all kinds of software development? For me, it does.

vovanidze 7 hours ago

people wildly underestimate the os page cache and modern nvme drives tbh. disk io today is basically ram speeds from 10 years ago. seeing startups spin up managed postgres + redis clusters + prisma on day 1 just to collect waitlist emails is peak feature vomit.

a jsonl file and a single go binary will literally outlive most startup runways.

also, the irony of a database gui company writing a post about how you dont actually need a database is pretty based.

upmostly 7 hours ago

The irony isn’t lost on us, trust me. We spent a while debating whether to even publish this one.

But yeah, the page cache point is real and massively underappreciated. Modern infrastructure discourse skips past it almost entirely. A warm NVMe-backed file with the OS doing the caching is genuinely fast enough for most early-stage products.

tracker1 3 hours ago

Definitely appreciate the post and the discussion that has come from it... While I'm still inclined to just reach for SQLite as a starting point, the alternative is often worth considering, depending on your needs.

In practice, I almost always separate the auth chain from the service chain(s), so that if auth gets kicked over under a DDoS, at least already-authenticated users stand a chance of still being able to use the apps. I've also designed an auth system essentially abstracted to key/value storage, with adapters for differing databases (including SQLite) for deployments...

Would be interested to see how LevelDB might perform for your testing case, in that it seems to be a decent option for how your example is using data.

vovanidze 6 hours ago

props for actually publishing it tbh. transparent engineering takes are so rare now, usually its just seo fluff.

weve basically been brainwashed to think we need kubernetes and 3 different databases just to serve a few thousand users. gotta burn those startup cloud credits somehow i guess.

mad respect for the honesty though, actually makes me want to check out db pro when i finally outgrow my flat files.

grep_it 5 hours ago

Except that eventually you'll find you lose a write when things go down because the page cache is write behind. So you start issuing fsync calls. Then one day you'll find yourself with a WAL and buffer pool wondering why you didn't just start with sqlite instead.

phillipcarter 2 hours ago

> seeing startups spin up managed postgres + redis clusters + prisma on day 1 just to collect waitlist emails is peak feature vomit.

I'm pretty sure most startups just use a quick and easy CRM that makes this process easy, and that tool will certainly use a database.

rglover 4 hours ago

A few months back I decided to write an embedded db for my firm's internal JS framework. Learned a lot about how/why databases work the way they do. I use stuff like reading memory cached markdown files for static sites, but there are certain things that a database gives you (chief of which for me was query ergonomics—I loved MongoDB's query language but grew too frustrated with the actual runtime) that you'll miss once you move past a trivial data set.

I think a better way to ask this question is "does this application and its constraints necessitate a database? And if so, which database is the correct tool for this context?"

tracker1 3 hours ago

For me, I just wish MongoDB had scaling options closer to how Elastic/Cassandra and other horizontally scalable databases work, in that the data is sharded in a ring with redundancy metrics... as opposed to Mongo, which afaik is still limited to either sharding or replication (or layers of them). FWIW, I wish that RethinkDB had seen more attention and success, and for that matter I might be more inclined to use CockroachDB over Mongo, where I can get some of the scaling features while still having some level of structured data.

ghc 6 hours ago

I'm so old I remember working on databases that were designed to use RAW, not files. I'm betting some databases still do, but probably only for mainframe systems nowadays.

ghc 3 hours ago

> Oracle® Database Platform Guide 10g Release 2 (10.2) for Microsoft Windows Itanium (64-Bit)

Well, I guess that at least confirms Oracle on Itanium (!?) still supported RAW 5 years ago.

I'm guessing everyone's on ASM by now though, if they're still upgrading. I ran into a company not long ago with a huge oracle cluster that still employed physical database admins and logical database admins as separate roles...I would bet they're still paying millions for an out of date version of Oracle and using RAW.

zkmon an hour ago

Sure. Go ahead and use JSONL files and implement every feature of SQL query. Congrats, you just reinvented a database, while trying to prove you don't need database.

oliviergg 2 hours ago

Please… Every few years the pendulum swings. First it was "relational databases are too rigid, just use NoSQL." Then "NoSQL is a mess, just go back to Postgres." Now: "do you even need a database at all, just use flat files." Each wave is partially right. But each wave is about to rediscover, the hard way, exactly why the previous generation made the choices they did. SQLite is the answer to every painful lesson learned, every scar from a long debugging night the last time someone thought "a JSON file is basically a database."

debo_ 2 hours ago

Michael Stonebraker used to write long, scathing critiques of modern data storage/retrieval fads, and how they were forgetting important historical lessons.

They were terrific reads; his writing on object-oriented databases was the most fun technical reading I did in grad school. And I even learned a lot!

goerch 2 hours ago

Yes, but you are probably a bit too polite. And I'm not sure how to do justice to SQLite, Postgres and my new favourite toy DuckDB.

mdp an hour ago

I've been really happy with DuckDB, and I love the fact that it can operate directly on things like JSON and JSONL. It's become my data Swiss Army knife.

827a 4 hours ago

I'm a big fan of using S3 as a database. A lot of apps can get a lot of mileage just doing that for a good chunk of their data; that which just needs lookup by a single field (usually ID, but doesn't have to be).

tracker1 3 hours ago

I worked in an org where a lot of records were denormalized to be used in a search database... since I went through that level of work anyway, I also fed the exports into S3 records for a "just in case" backup. That backup path became really useful in practice, since there was a need for eventually a "pending" version of records, separate from the "published" version.

In practice, the records themselves took no fewer than 30 joins for a flat view of the record data needed for a single view of what could/should have been one somewhat denormalized record. In the early 2010s that meant the main database was often kicked over under load, and it took a lot of effort to add appropriate caching and the search db, which wound up handling most of the load on a smaller server.

swiftcoder 4 hours ago

I feel like someone who works for a DB company ought to mention at least some of the pitfalls in file-based backing stores (data loss due to crashes, file truncation, fsync weirdness, etc)

tracker1 3 hours ago

I'd argue for using LevelDB or similar if I just wanted to store arbitrary data based on a single indexable value like TFA. That said, I'd probably just default to SQLite myself since the access, backup, restore patterns are relatively well known and that you can port/grow your access via service layers that include Turso or Cloudflare D1, etc.

moregrist 3 hours ago

Embedded KV stores like LevelDB are great for what they are, but I’ve often found that I’ll need to add an index to search the data in a different way.

And then another index. And at some point you want to ensure uniqueness or some other constraint.

And then you’re rewriting a half-complete and buggy SQLite. So I’ve come around to defaulting to SQLite/PostgreSQL unless I have a compelling need otherwise. They’re usually the right long-term choice for my needs.
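That slippery slope — a KV store plus one hand-rolled secondary index — might look like this sketch (in-memory dicts standing in for LevelDB; the record shape and the "name" field are just illustrative):

```python
class KVWithNameIndex:
    """A KV store keyed by ID, plus a secondary index for lookup by name.
    Every write must now keep two structures consistent -- the first step
    toward reimplementing what a database gives you for free."""

    def __init__(self):
        self.by_id = {}    # primary store: id -> record
        self.by_name = {}  # secondary index: name -> set of ids

    def put(self, rec_id, record):
        old = self.by_id.get(rec_id)
        if old is not None:
            # Un-index the old name, or the index silently rots.
            self.by_name[old["name"]].discard(rec_id)
        self.by_id[rec_id] = record
        self.by_name.setdefault(record["name"], set()).add(rec_id)

    def find_by_name(self, name):
        return [self.by_id[i] for i in self.by_name.get(name, ())]
```

Uniqueness constraints, crash-safe persistence of both structures, and a third index are each "just one more method" from here, which is exactly the trap.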

tracker1 3 hours ago

Absolutely... I was just bringing it up, as it seems to have in the box support for a lot of what TFA is discussing. I'm generally more inclined to just use SQLite most of the time anyway.

That it's now in the box (node:sqlite) for Deno/TS makes it that much more of an easy button option.

the_inspector 7 hours ago

In many cases not. E.g. for caching with python, diskcache is a good choice. For small amounts of data, a JSON file does the job (you pointed to JSONL as an option). But for larger collections, that should be searchable/processable, postgres is a good choice.

Memory of course, as you wrote, also seems reasonable in many cases.

matja 5 hours ago

If you think files are easier than a database, check out https://danluu.com/file-consistency/

winrid 3 hours ago

My recent project - a replacement for CodeMaster's RaceNet, runs on flat files! https://dirtforever.net/

Just have to use locks to be careful with writes.

I figured I'd migrate it to a database after maybe 10k users or so.
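Locks alone don't cover crashes, though: a plain open-and-overwrite can leave an empty file after a power cut, as others in the thread point out. A common pattern (a sketch, assuming POSIX rename semantics and a temp file on the same filesystem) is write-temp, fsync, rename:

```python
import json
import os
import tempfile

def atomic_write_json(path, data):
    """Write JSON so readers see either the old file or the new one,
    never a torn or empty write."""
    dir_name = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())   # push the file contents to disk
        os.replace(tmp_path, path)  # atomic rename over the original
        # fsync the directory so the rename itself survives a crash
        dir_fd = os.open(dir_name, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The directory fsync is Linux/macOS territory; Windows needs different handling, which is part of why people eventually reach for SQLite.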

mfro 3 hours ago

Neat, what happened to the original system? Last I checked multiplayer was working in DR2.

winrid 2 hours ago

EA is shutting down Clubs. That is the primary motivation here.

Sadly no solution for non-rooted consoles.

goerch 2 hours ago

And crashes you can exclude? Good luck!

winrid 2 hours ago

The security level is the same actually as the codemasters servers.

gavinray 6 hours ago

Not to nitpick, but it would be interesting to see profiling info of the benchmarks

Different languages and stdlib methods can often spend time doing unexpected things that makes what looks like apples-to-apples comparisons not quite equivalent

traderj0e 3 hours ago

The "database" in this article is only a read-only KV-store. Mind that the hard part of a KV store is writing. Still the benchmarks are interesting.

goerch 2 hours ago

Pretty sure the origin should be `dbunpro.app`, no? I'd think the consensus should be: do you even need the file system?

stackskipton 5 hours ago

SRE here. My "Huh, neat" side of my brain is very interested. The SRE side of my brain is screaming "GOD NO, PLEASE NO"

Overhead in any project is understanding it and onboarding new people to it. Keeping to the "mainline" path is key to lowering friction here. All 3 languages have well-supported ORMs that support SQLite.

tracker1 3 hours ago

I'm mostly with you here... it's amazing how many devs don't have a certain amount of baseline knowledge to understand file I/O, let alone thin abstractions for custom data and indexing like TFA's. Then again, most devs also don't understand the impacts of database normalization under load either.

goerch 2 hours ago

Sorry, this I think is a dangerous attitude: for me it is not about onboarding. Every newcomer reading `Huh, neat` is poised to repeat the mistakes of us and our ancestors.

chuckadams 6 hours ago

I need a filesystem that does some database things. We got teased with that with WinFS and Beos's BFS, but it seems the football always gets yanked away, and the mainstream of filesystems always reverts back to the APIs established in the 1980s.

tracker1 3 hours ago

FWIW, you can do some things like this on top of S3 Metadata.

chuckadams 3 hours ago

Transactions are one thing I want the most, and that's not going to happen on S3. Sure, I can reinvent them by hand, but the point is I want that baked in.

FpUser an hour ago

I think this whole article and post is an attention/points-seeking exercise. It is hard to imagine a programmer who would not know the difference between a DBMS and just a bunch of files, and when to use which.

jwitchel 6 hours ago

This is a great, incredibly well-written piece. Nice work showing the under-the-hood build-up of how a DB works. It makes you think.

goerch 2 hours ago

About what?

charcircuit 2 hours ago

>So the question is not whether to use files. You're always using files. The question is whether to use a database's files or your own.

It's the opposite. A file system is a database. And databases can recursively store their data within another database.

jbiason 6 hours ago

Honestly, I have been thinking about the same topic for some time, and I do realize that direct files could be faster.

In my (hypothetical, 'cause I never actually sat down and wrote that) case, I wanted the personal transactions in a month, and I realized I could just keep one single file per month, and read the whole thing at once (also 'cause the application would display the whole month at once).

Filesystems can be considered a key-value (or key-document) database. The funny thing about the example used in the link is that one could simply create a structure like `user/[id]/info.json` and directly access the user by ID instead of scanning some file to find them -- again, just 'cause of the examples used; search by name would be a pain, and that's one point where databases would handle things better.

m6z 6 hours ago

I have found that SQLite can be faster than using text or binary files, confirming their claims here: https://sqlite.org/fasterthanfs.html

JohnMakin 5 hours ago

everyone thinks this is a great idea until they learn about file descriptor limits the hard way

freedomben 6 hours ago

I avoided DBs like the plague early in my career, in favor of serialized formats on disk. I still think there's a lot of merit to that, but at this point in my career I see a lot more use case for sqlite and the relational features it comes with. At the least, I've spent a lot less time chasing down data corruption bugs since changing philosophy.

Now that said, if there's value to the "database" being human readable/editable, json is still well worth a consideration. Dealing with even sqlite is a pain in the ass when you just need to tweak or read something, especially if you're not the dev.

giva 6 hours ago

> Dealing with even sqlite is a pain in the ass when you just need to tweak or read something, especially if you're not the dev.

How? With SQL it's super easy to search, compare, and update data. That's what it’s built for.
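For example, with a stdlib sqlite3 session (the table and column names here are purely illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, plan TEXT)")
conn.executemany("INSERT INTO users (name, plan) VALUES (?, ?)",
                 [("ada", "free"), ("grace", "pro")])

# search: one declarative line instead of a hand-rolled scan
rows = conn.execute("SELECT name FROM users WHERE plan = 'pro'").fetchall()

# update in place, no rewrite-the-whole-file step
conn.execute("UPDATE users SET plan = 'pro' WHERE name = 'ada'")
conn.commit()
```

The same queries work from the `sqlite3` command-line shell against an on-disk file, which covers the "just need to tweak or read something" case.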

freedomben 6 hours ago

Pain in the ass was way too strong, I retract that. Mainly I meant relative. For example `nvim <filename>.json` and then /search for what I want, versus tracking down the sqlite file, opening, examining the schema, figuring out where the most likely place is that I care about, writing a SQL statement to query, etc.

0x457 2 hours ago

> Do you even need a database?

Then proceeds to (poorly) implement database on files.

Sure, a hash map that takes ~400 MB in memory is going to offer you fast lookups. "Some workloads will never reach this size" can be made as an argument, but what are you losing by using SQLite?

What happens when the service shuts down mid-write? Corruption that later results in a (poorly) implemented WAL being added?

SQLite also showed something important: it was consistent in all benchmarks regardless of dataset size.

hnlmorg 2 hours ago

> Every database you have ever used reads and writes to the filesystem, exactly like your code does when it calls open().

Nope. There are non-persistent in-memory databases too.

In fact, a database can be a plethora of things and the stuff they were building is just a subset of a subset (persistent, local relational database)

XorNot 6 hours ago

I've just built myself a useful tool which now really would benefit from a database and I'm deeply regretting not doing that from the get-go.

So my opinion has thoroughly shifted to "start with a database, and if you _really_ don't need one, it'll be obvious."

But you probably do.

srslyTrying2hlp 6 hours ago

I tried doing this with csv files (and for an online solution, Google Sheets)

I ended up just buying a VPS, putting openclaw on it, and letting it Postgres my app.

I feel like this article is outdated since the invention of OpenClaw/Claude Opus level AI Agents. The difficulty is no longer programming.

fifilura 6 hours ago

Isn't this the same case the NoSQL movement made?

tonymet 2 hours ago

If the cloud is just someone else’s hard disks (etc.), then an RDBMS is just someone else’s B-tree.

allknowingfrog 3 hours ago

I've used foreign keys and unique indexes to enforce validity on even the smallest, most disposable toy applications I've ever written. These benchmarks are really interesting, but the idea that performance is the only consideration is kind of silly.

pstuart 2 hours ago

In order to ask this question it's important to understand the lifecycle of the data in question. If it is constantly being updated and requires "liveness" (updates are reflected in queries immediately), the simple answer is: yes, you need a database.

But if you have data that is static or effectively static (data that is updated occasionally or batched), then serving via custom file handling can have its place.

If the records are fixed width and sorted on the key value, then it becomes trivial to do a binary search on the mmapped file. It's about as lightweight as could be asked for.

goerch 2 hours ago

But then you'll get two files to join?

pstuart 2 hours ago

I should have been clear with the assumption baked into that statement: the data in question is in a single file, with fixed size fields and sorted by primary keys. That precludes "looser" datasets, but I believe my point stands for the given context.

rasengan 3 hours ago

Sounds like a good way to waste the only scarce resource: time.

cratermoon 3 hours ago

I worked one place that shoehorned SQL Server into a system to hold a small amount of static data that could easily have been a config file or even (eek) hard-coded.

amw-zero 3 hours ago

I think so, yea.

ForHackernews 6 hours ago

Surprised to see this beating SQLite after previously reading https://sqlite.org/fasterthanfs.html

ethan_smith 4 hours ago

The SQLite "faster than filesystem" page is specifically about reading small blobs where the overhead of individual filesystem calls (open/read/close per blob) exceeds SQLite reading from a single already-open file. Once you're talking about reading one big JSON file sequentially, that overhead disappears and you're just doing a single read - which is basically the best case for the filesystem and the worst case for SQLite (which still has to parse its B-tree, check schemas, etc).

MattRogish 5 hours ago

"Do not cite the deep magic to me witch, I was there when it was written"

If you want to do this for fun or for learning? Absolutely! I did my CS Masters thesis on SQL JOINS and tried building my own new JOIN indexing system (tl;dr: mine wasn't better). Learning is fun! Just don't recommend people build production systems like this.

Is this article trolling? It feels like trolling. I struggle to take an article seriously that conflates databases with database management systems.

A JSON file is a database. A CSV is a database. XML (shudder) is a database. PostgreSQL data files, I guess, are a database (and indexes and transaction logs).

They never actually posit a scenario in which rolling your own DBMS makes sense (the only pro is "hand rolled binary search is faster than SQLite"), and their "When you might need" a DBMS misses all the scenarios, the addition of which would cause the conclusion to round to "just start with SQLite".

It should basically be "if you have an entirely read-only system on a single server/container/whatever" then use JSON files. I won't even argue with that.

Nobody - and I mean nobody - is running a production system processing hundreds of thousands of requests per second off of a single JSON file. I mean, if req/sec is the only consideration, at that point just cache everything to flat HTML files! Node and Typescript and code at all is unnecessary complexity.

PostgreSQL (MySQL, et al) is a DBMS (DataBase Management System). It might sound pedantic but the "MS" part is the thing you're building in code:

concurrency, access controls, backups, transactions (recovery, rollback, committing, etc.), plus the ability to do aggregations, joins, indexing, arbitrary queries, and so on.

These are not just "nice to have" in the vast, vast majority of projects.

"The cases where you'll outgrow flat files:"

Please add "you just want to get shit done and never have to build your own database management system". Which should be just about everybody.

If your app is meaningfully successful - and I mean more than just like a vibe-coded prototype - it will break. It will break in both spectacular ways that wake you up at 2AM and it will break in subtle ways that you won't know about until you realize something terrible has happened and you lost your data.

Didn't we just have this discussion like yesterday (https://ultrathink.art/blog/sqlite-in-production-lessons)?

It feels like we're throwing away 50 years of collective knowledge, skills, and experience because it "is faster" (and in the same breath note that nobody is gonna hit these req/sec.)

I know, it's really, really hard to type `yarn add sqlite3` and then `SELECT * FROM foo WHERE bar='baz'`. You're right, it's so much easier writing your own binary search and indexing logic and reordering files and query language.

Not to mention now you need a AGENTS.md that says "We use our own home-grown database nonsense if you want to query the JSON file in a different way just generate more code." - NOT using standard components that LLMs know backwards-and-forwards? Gonna have a bad time. Enjoy burning your token budget on useless, counter-productive code.

This is madness.

fatih-erikli-cg 6 hours ago

I agree. Databases are useless. You don't even need to load the data into memory. Reading it from disk whenever there is a need to read something should be fine. I don't believe in the case where there are billions of records so the database must be optimized for handling them. That amount of records is most likely something like access logs etc., and I think those should not be stored at all in such a case.

Even if it's Postgres, it is still a file on disk. If there is a need for something like partitioning the data, it is much easier to write the code that partitions the data yourself.

If there is a need to add something with text inputs, checkboxes, etc., a database with its admin tools may be a good thing. If the data is something that gets imported, exported, etc., a database may be a good thing too. But I still don't believe in such cases; in my ten-something years of software development career, something like that never happened.

zeroonetwothree 5 hours ago

Poe’s law in action?

Sharlin 6 hours ago

Not sure if sarcastic…

fatih-erikli-cg 11 minutes ago

It isn't sarcasm. I don't really find a case where a database with its own query language like SQL is needed. It won't be different from storing a JSON file and filtering the content with a for loop; the dev (e.g. me) will be returning JSON from a REST API at the end anyway. A query language may be a good thing if you are working in a team, that's it. SQL indeed isn't a good thing.

bsenftner 6 hours ago

I worked as a software engineer for 30 years before being forced to use a database, and that was for a web site. I've been coding actively, daily, since the '70s. Forever we just wrote proprietary files to disk, and that was the norm, for decades. Many a new developer can't even imagine writing their own proprietary file formats; the idea literally scares them. The engineers produced today are a shadow of what they used to be.

anonymars 6 hours ago

Yeah, it scares me because I'm experienced enough to know all the difficulties involved in keeping durable data consistent, correct, and performant

pythonaut_16 3 hours ago

> we just wrote proprietary files to disk

That alone is a terrible thing. Open formats are so much more user friendly

vlapec 5 hours ago

>The engineers produced today are a shadow of what they used to be.

…and it won’t get better anytime soon.

linuxhansl 3 hours ago

Hmm... Sure, if you do not need a database then do not use a database.

Don't use a sports car to haul furniture or a garbage truck as an ambulance. For the use case and scale mentioned in the article, it's obvious you shouldn't use a database.

Am I missing something? I guess many people are using the tools they are familiar with and rarely question whether they are really applicable. Is that the message?

I think a more interesting question is whether you will need a single source of truth. If you don't you can scale on many small data sets without a database.

I will say this before I shut up with my rant: if you start with a design that scales, you will have an easier time scaling when it is time, without re-engineering your stack. Whether you think you will need to scale depends on your projected growth and the nature of your problem (do you need a single source of truth, etc.).

Edits: Spelling