Things Unix can do atomically (2010) (rcrowley.org)

236 points by onurkanbkrc 16 hours ago

0xbadcafebee 16 hours ago

You can use `ln` atomicity for a simple, portable(ish) locking system: https://gist.github.com/pwillis-els/b01b22f1b967a228c31db3cf...
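For readers who want to see the shape of the idea without opening the link, here is a minimal sketch of a hard-link lock. It is not taken from the gist; the names `acquire_lock` and `release_lock` are made up for illustration, and it assumes the lock file lives on a filesystem that supports hard links.

```python
import os
import tempfile


def acquire_lock(lock_path: str) -> bool:
    """Try to take the lock. Returns True on success, False if it is held."""
    directory = os.path.dirname(lock_path) or "."
    # Create a uniquely named file in the same directory (same filesystem),
    # so that it can be hard-linked to the lock name.
    fd, tmp_path = tempfile.mkstemp(prefix=".lock-", dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))  # record the owner for debugging
        try:
            # link() fails with EEXIST if lock_path already exists; the
            # existence check and the creation happen as one atomic step.
            os.link(tmp_path, lock_path)
            return True
        except FileExistsError:
            return False
    finally:
        # The temporary name is no longer needed either way.
        os.unlink(tmp_path)


def release_lock(lock_path: str) -> None:
    """Drop the lock by removing the lock name."""
    os.unlink(lock_path)
```

A caller would loop on `acquire_lock()` with a sleep, and would need its own policy for stale locks left behind by crashed processes, which this sketch does not handle.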

akoboldfrying 12 hours ago

Really nice explanation of a useful pattern. I was surprised to discover that even the famously broken NFS honours atomicity of hardlink creation.

amstan 14 hours ago

Missing (probably because of the date of the article): `mv --exchange` aka renameat2+RENAME_EXCHANGE. It atomically swaps 2 file paths.

rustybolt 13 hours ago

I tried using this a while back and found it was not widely available. You need coreutils version 9.1 or later for this, many distros do not ship this.

I made https://github.com/rubenvannieuwpoort/atomic-exchange for my usecase.

oguz-ismail2 14 hours ago

Title says Unix, renameat2 is Linux-only.

jasode 13 hours ago

>Title says Unix,

You're misinterpreting the title. The author didn't intend "Unix" to literally mean only the official AT&T/TheOpenGroup UNIX® System to the exclusion of Linux.

The article's first sentence, which says "UNIX-like", makes that clear: >This is a catalog of things UNIX-like/POSIX-compliant operating systems can do atomically,

Further down, he then mentions some Linux specifics : >fcntl(fd, F_GETLK, &lock), fcntl(fd, F_SETLK, &lock), and fcntl(fd, F_SETLKW, &lock) . [...] There is a “mandatory locking” mode but Linux’s implementation is unreliable as it’s subject to a race condition.

shawn_w 11 hours ago

bee_rider 7 hours ago

monibious 11 hours ago

pjmlp 8 hours ago

stephenr 10 hours ago

pjmlp 8 hours ago

Unless they can be guaranteed by the POSIX specification, they are implementation specific and should not be relied upon for portable code.

kccqzy 5 hours ago

Which of these are not guaranteed by the POSIX specification? It’s been a while since I studied it, but if I recall correctly the ones mentioned in the article are guaranteed.

ncruces 12 hours ago

I use several of these to implement alternative SQLite locking protocols.

POSIX file locking semantics really are broken beyond repair: https://news.ycombinator.com/item?id=46542247

nialv7 5 hours ago

The mmap/msync one is incorrect I believe? (Correct me if I am wrong).

msync() syncs content in memory back to _disk_. But multiple processes mapping the same file always see the same content (barring memory consistency, caching, etc.) already. Unless the file is mapped with MAP_PRIVATE.
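A small sketch of that claim, assuming a system with a unified page cache such as Linux: two independent shared mappings of the same file see each other's writes with no msync() call in between. The function name `shared_mapping_demo` is made up, and for brevity both mappings live in one process rather than two.

```python
import mmap
import os
import tempfile


def shared_mapping_demo() -> tuple:
    """Write through one shared mapping, read through another and via pread()."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, mmap.PAGESIZE)
        # On Unix, mmap.mmap() defaults to MAP_SHARED.
        writer = mmap.mmap(fd, mmap.PAGESIZE)
        reader = mmap.mmap(fd, mmap.PAGESIZE)
        writer[:5] = b"hello"               # no msync()/flush() here
        seen_by_mapping = reader[:5]        # second mapping of the same file
        seen_by_read = os.pread(fd, 5, 0)   # ordinary read of the same file
        writer.close()
        reader.close()
        return seen_by_mapping, seen_by_read
    finally:
        os.close(fd)
        os.unlink(path)
```

This says nothing about what survives a crash or power loss; that is the part msync() is concerned with.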

DSMan195276 4 hours ago

Yeah I agree that one isn't very clear, perhaps the idea is to use `msync()` as a barrier to achieve consistent ordering of the writes without having to handle that yourself with more complex primitives. But then, they do mention some of those primitives at the bottom of the article, so it's hard to say what exactly the idea is.

icedchai 3 hours ago

mmap/msync behavior is also very platform specific. On some systems (like AIX, at least older versions), even without msync, memory mapped data is synced back to disk periodically.

I worked on a code base that was portable between Linux, AIX, and some other Unix flavors. mmap/msync was a source of bugs. Just imagine your system running for days, never syncing any data to disk... then someone pulls the plug. Where'd my data go? Even worse, it happened "in production" at a beta site. Fortunately we had a way to recover data from a log.

Igrom 12 hours ago

>fcntl(fd, F_GETLK, &lock), fcntl(fd, F_SETLK, &lock), and fcntl(fd, F_SETLKW, &lock)

There's also `flock`, the CLI utility in util-linux, that allows using flocks in shell scripts.
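For comparison with the shell utility, here is a rough sketch of the same kind of lock taken from a program through the flock(2) call. The helper name `try_flock` is made up; the example assumes a local filesystem.

```python
import fcntl
import os
import tempfile


def try_flock(fd: int) -> bool:
    """Try to take an exclusive flock on fd without blocking."""
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        # Someone else (another open file description) holds the lock.
        return False


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    other = os.open(path, os.O_RDWR)    # a second, independent open()
    print(try_flock(fd))                # first holder takes the lock
    print(try_flock(other))             # second open of the same file is refused
    fcntl.flock(fd, fcntl.LOCK_UN)      # release explicitly
    os.close(other)
    os.close(fd)
    os.unlink(path)
```

The lock belongs to the open file description, so it goes away when the last descriptor referring to it is closed, including when the process exits.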

cachius 12 hours ago

What are flocks in this context? Surely not a number of sheep...

ncruces 12 hours ago

File locks.

pjmlp 8 hours ago

In UNIX/POSIX file locks are advisory, not enforced, it only works if all processes play ball.

zbentley 6 hours ago

Sure, but the discussion is around whether they’re atomic, not whether they’re advisory.

zbentley 6 hours ago

Aren’t flock and POSIX locks backed by totally different systems?

KevinChasse 5 hours ago

Nice catalog. One subtle thing I’ve found in building deterministic, stateless systems is that atomic filesystem and memory operations are the only way to safely compute or persist secrets without locks. Combining rename/link/O_EXCL patterns with ephemeral in-memory buffers ensures that sensitive data is never partially written to disk, which reduces race conditions and side-channel exposure in multi-process workflows.
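As an illustration of the "never partially written" part only, here is a minimal sketch of the usual write-to-temp-then-rename pattern. The function name `atomic_write` is made up; it assumes the temporary file and the destination are in the same directory, and it makes no claim about side channels.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace path with data so readers see the old or the new file, never a mix."""
    directory = os.path.dirname(path) or "."
    # mkstemp() creates the file with O_CREAT | O_EXCL and mode 0600, so the
    # temporary name cannot collide with a file somebody else created.
    fd, tmp_path = tempfile.mkstemp(prefix=".tmp-", dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the contents to disk before renaming
        # rename() atomically swaps the new file into place.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```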

zzo38computer 14 hours ago

Even though it can do some things atomically, it only does so with one file at a time, and race conditions are still possible because it only does one operation at a time (even if you only need one file). Some of these are helpful anyway, such as O_EXCL, but it is still only one thing at a time, which can cause problems in some cases.

What it also does not do is a transaction with multiple objects. That is why I would design an operating system in which you can do a transaction with multiple objects.

ptx 14 hours ago

Windows had APIs for this sort of thing added in Vista, but they're now deprecating it "due to its complexity and various nuances which developers need to consider":

https://learn.microsoft.com/en-us/windows/win32/fileio/about...

Orphis 8 hours ago

In some cases, you can start by using the "at" functions (openat...) to work on a directory tree. If you have your logical "locking" done at the top-level of the tree, it might be a fine option.

In some other cases, I've used a pattern where I used a symlink to folders. The symlink is created, resolved or updated atomically, and all I need is eventual consistency.

That last case was to manage several APT repository indices. The indices were constantly updated to publish new testing or unstable releases of software and machines in the fleet were regularly fetching the repository index. The APT protocol and structure being a bit "dumb" (for better or worse) requires you to fetch files (many of them) in the reverse order they are created, which leads to obvious issues like the signature is updated only after the list of files is updated, or the list of files is created only after the list of packages is created.

Long story short, each update would create a new folder that's consistent, and a symlink points to the last created folder (to atomically replace the folder as it was not possible to swap them), and a small HTTP server would initiate a server side session when the first file is fetched and only return files from the same index list, and everything is eventually consistent, and we never get APT complaining about having signature or hash mismatches. The pivotal component was indeed the atomicity of having a symlink to deal with it, as the Java implementation didn't have access to a more modern "openat" syscall, relative to a specific folder.
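The symlink switch described here can be sketched roughly as follows. This is not the Java implementation from the comment; `switch_symlink` and the directory names in the usage note are made up, and the sketch assumes a single writer updating the link.

```python
import os


def switch_symlink(link_path: str, new_target: str) -> None:
    """Point link_path at new_target; readers see the old or the new target."""
    # A symlink cannot be retargeted in place, so build a new link under a
    # temporary name and rename it over the old one.
    tmp_link = f"{link_path}.tmp.{os.getpid()}"
    os.symlink(new_target, tmp_link)
    try:
        # rename() replaces the link itself (it does not follow it), and the
        # replacement is atomic.
        os.replace(tmp_link, link_path)
    except BaseException:
        os.unlink(tmp_link)
        raise
```

For example, `switch_symlink("current", "index-0002")` would move a hypothetical `current` link from one fully built index folder to the next, and calling it again with the previous folder name rolls back.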

akoboldfrying 13 hours ago

I don't follow, sorry. Are you saying that if we run:

    mv a b
    mv c d
We could observe a state where a and d exist? I would find such "out of order execution" shocking.

If that's not what you're saying, could you give an example of something you want to be able to do but can't?

zbentley 6 hours ago

Depending on metadata cache behavior configuration, if the system is powered off immediately after the first command, then that could indeed happen I think.

As to whether it’s technically possible for it to happen on a system that stays on, I’m not sure, but it’s certainly vanishingly rare and likely requires very specific circumstances—not just a random race condition.

LgWoodenBadger 5 hours ago

jstimpfle 12 hours ago

I don't think that's happening in practice, but 1) it may not be specified and 2) What you say could well be the persisted state after a machine crash or power loss. In particular if those files live in different directories.

You can remedy 2) by doing fsync() on the parent directory in between. I just asked ChatGPT which directory you need to fsync. It says it's both, the source and the target directory. Which "makes sense" and simplifies implementations, but it means the rename operation is atomic only at runtime, not if there's a crash in between. I think you might end up with 0 or 2 entries after a crash if you're unlucky.

If that's true, then for safety maybe one should never rename across directories, but instead do a coordinated link(source, target), fsync(target_dir), unlink(source), fsync(source_dir)
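The sequence proposed in the last paragraph could look roughly like this. It is a sketch of that proposal, not a verified recipe for crash safety; `fsync_dir` and `durable_move` are made-up names, and it assumes Linux, where a directory can be opened and fsync()ed.

```python
import os


def fsync_dir(path: str) -> None:
    """Flush a directory's entries to disk."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def durable_move(source: str, target: str) -> None:
    """Move source to target so that at least one name should exist at every step."""
    # Unlike rename(), link() refuses to overwrite an existing target.
    os.link(source, target)
    fsync_dir(os.path.dirname(target) or ".")   # new name is on disk
    os.unlink(source)
    fsync_dir(os.path.dirname(source) or ".")   # old name is gone on disk
```

The trade-off is that a crash in the middle can leave both names behind, so something has to clean up duplicates on restart.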

jstimpfle 8 hours ago

duped 8 hours ago

All you need for this to occur is for the windows in which the two renames occur to overlap. A system polling to check if a, b, c, and d exist while the renames are happening might find all four of them.

jstimpfle 8 hours ago

devnonymous 11 hours ago

I'm almost certain what the OP meant was if the commands were run concurrently (ie: from 2 different shells or as `mv a b & mv c d`) yes there is a possibility that a and d exist (eg: On a busy system where neither of the 2 commands can be immediately scheduled and eventually the second one ends up being scheduled before the first)

Or to go a level deeper, if you have 2 occurrences of rename(2) from the stdlibc ...

rename('a', 'b'); rename('c', 'd');

...and the compiler decides on out of order execution or optimizing by scheduling on different cpus, you can get a and d existing at the same time.

The reason it won't happen in the example you posted is the shell ensures the atomicity (by not forking the second mv until the wait() on the first returns)

isodude 10 hours ago

sega_sai 15 hours ago

rename() is certainly the easiest to use for any sort of file-system based synchronization.

compressedgas 2 hours ago

As long as you don't run into path races. If you want freedom from those, you need the missing:

  frenameat2(srcdirfd, srcfd, srcname, dstdirfd, dstfd, dstname)

MintPaw 15 hours ago

Not much apparently, although I didn't know about changing symlinks, that could be very useful.

jeffbee 6 hours ago

I wonder why the author left out atomic writes with O_APPEND.

ozgrakkurt 6 hours ago

This requires O_SYNC and O_DIRECT afaik.

Even then it is only some file systems that guarantee it and even then file size updating isn’t atomic afaik.

Not so sure about file size update being atomic in this case but fairly sure about the rest.

Matklad had some writing or video about this.

Also there is a tool called ALICE and authors of that tool have a white paper about this subject.

Also there was a blog post about how badger database fixed some issues around this problem.

jeffbee 6 hours ago

I don't think any part of your post is right. Aside from NFS, there should not be filesystems where this doesn't work. If there are, those are just bugs. The flags you mentioned are not required or relevant. Setting the fd offset to the end of the file atomically is the entire purpose of O_APPEND.
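A small sketch of the behavior being described, assuming a local filesystem: two independent descriptors opened with O_APPEND interleave their writes at the end of the file instead of overwriting each other. The helper name `open_append` is made up.

```python
import os
import tempfile


def open_append(path: str) -> int:
    """Open path for appending, creating it if needed."""
    return os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "app.log")
    a = open_append(path)
    b = open_append(path)   # separate descriptor with its own file offset
    # With O_APPEND, each write() first moves to the current end of file as
    # part of the same atomic step, so b does not overwrite what a wrote.
    os.write(a, b"from a\n")
    os.write(b, b"from b\n")
    os.write(a, b"from a again\n")
    os.close(a)
    os.close(b)
    with open(path, "rb") as f:
        print(f.read())
```

Without O_APPEND, the second descriptor would start writing at offset 0 and clobber the first record.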

ozgrakkurt 5 hours ago

zbentley 6 hours ago

Unsure. Aren’t there filesystems which make O_APPEND less durable than it’s specified to be, which might be interpreted to adversely affect atomicity? Could that be it?

andrewstuart 12 hours ago

Anywhere there is atomic capability you can build a queuing application.
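One way to picture this, as a rough sketch: a directory-based queue where workers claim a job by renaming its file. The function name `claim_job` and the `pending`/`claimed` directory layout are made up, and both directories are assumed to be on the same filesystem.

```python
import os
from typing import Optional


def claim_job(pending_dir: str, claimed_dir: str) -> Optional[str]:
    """Claim the first available job file; return its new path, or None if empty."""
    for name in sorted(os.listdir(pending_dir)):
        src = os.path.join(pending_dir, name)
        dst = os.path.join(claimed_dir, f"{name}.{os.getpid()}")
        try:
            # rename() is atomic: if several workers race for the same file,
            # exactly one rename succeeds and the others get ENOENT.
            os.rename(src, dst)
            return dst
        except FileNotFoundError:
            continue  # another worker claimed it first; try the next job
    return None
```

Producers would want to enqueue the same way, writing the job under a temporary name elsewhere and renaming it into the pending directory, so that workers never see a half-written job.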

ta8903 15 hours ago

Not technically related to atomicity, but I was looking for a way to do arbitrary filesystem operations based on some condition (like adding a file to a directory, and having some operation be performed on it). The usual recommendation for this is to use inotify/watchman, but something about it seems clunky to me. I want to write a virtual filesystem, where you pass it a trigger condition and a function, and it applies the function to all files based on the trigger condition. Does something like this exist?

zbentley 6 hours ago

The challenge with that approach is memory: trigger conditions, if added irresponsibly, can result in unbounded memory and (depending on implementation) potentially linear performance degradation of filesystem operations as well. Unbounded kernel memory growth leads to stability or security risks.

That tradeoff is at the root of why most notify APIs are either approximate (events can be dropped) or rigidly bounded by kernel settings that prevent truly arbitrary numbers of watches. fanotify and some implementations of kqueue are better at efficiently triggering large recursive watches, but that’s still just a mitigation on the underlying memory/performance tradeoffs, not a full solution.

laz 11 hours ago

Sounds half baked. What context does this function run in? Is it an interpreted language or an executable that you provide?

Inotify is the way to shovel these events out of the kernel, then userspace process rules apply. It's maybe not elegant from your pov, but it's simple.

quesera 9 hours ago

I've used FUSE for something similar.

There are sample "drivers" in easily-modified python that are fast enough for casual use.

direwolf20 12 hours ago

are you asking for if statements?

if(condition) {do the thing;}

ta8903 11 hours ago

I know this is trivial to do programmatically, but I was looking for a way this will be handled by the filesystem. For instance, if I have some processes generating log files, and I have a script that converts them to html, I wanted the script to be called every time a log file is updated, without having a daemon running in the background to monitor the directory, just some filesystem mount. This would have made some deployments easier.

Brian_K_White 15 hours ago

incron

ta8903 11 hours ago

Thanks, I didn't find this when I was looking for a solution for my problem. This is pretty much the exact solution for my usecase, though for some reason inotify feels more complicated than some kind of filesystem mount solution for me.

klempner 13 hours ago

This document, being from 2010, is of course missing the C11/C++11 atomics that replaced the need for compiler intrinsics or non-portable inline asm when "operating on virtual memory".

With that said, at least for C and C++, the behavior of (std::)atomic when dealing with interprocess interactions is slightly outside the scope of the standard, but in practice (and at least recommended by the C++ standard) (atomic_)is_lock_free() atomics are generally usable between processes.

senderista 3 hours ago

That's right, atomic operations work just fine for memory shared between processes. I have worked on a commercial product that used this everywhere.

exac 15 hours ago

Sorry, there is zero chance I will ever deploy new code by changing a symlink to point to the new directory.

silisili 5 hours ago

I don't do devops/sysadmin anymore, so this would have been before the age of k8s for everything. But I once interviewed for a company hiring specifically because their deployment process lasted hours, and rollbacks even longer.

In the interview when they were describing this problem, I asked why they didn't just put all of the new release in a new dir, and use symlinks to roll forward and backwards as needed. They kind of froze and looked at each other and all had the same 'aha' moment. I ended up not being interested in taking the job, but they still made sure to thank me for the idea which I thought was nice.

Not that I'm a genius or anything, it's something I'd done previously for years, and I'm sure I learned it from someone else who'd been doing it for years. It's a very valid deployment mechanism IMO, of course depending on your architecture.

sholladay 15 hours ago

Why? What do you prefer to do instead?

gib444 14 hours ago

Anything less than an entire new k8s cluster and switching over is just amateur hour obviously

iberator 15 hours ago

why? it works and it's super clever. Simple command instead of some shit written in JS with docker trash

lloeki 15 hours ago

Ah, the memories of capistrano, complete with zero-downtime unicorn handover

https://github.com/capistrano/capistrano/

10us 14 hours ago

bandrami 14 hours ago

Works pretty well for Nix

atmosx 13 hours ago

Worked pretty well in production systems, serving huge amount of RPS (like ~5-10k/s) running on a LAMP stack monolith in five different geographical regions.

Just git branch (one branch per region because of compliance requirements) -> branch creates "tar.gz" with predefined name -> automated system downloads the new "tar.gz", checks release date, revision, etc. -> new symlink & php (serverles!!!) graceful restart and ka-b00m.

Rollbacks worked by pointing back to the old dir & restart.

Worked like a charm :-)

mananaysiempre 14 hours ago

And for Stow[1] before it, and for its inspiration Depot[2] before even that. It’s an old idea.

[1] https://www.gnu.org/software/stow/

[2] http://ftp.gregor.com/download/dgregor/depot.pdf

bandrami 14 hours ago

gonzus 11 hours ago

Then you are locking yourself out of a pretty much ironclad (and extremely cost-effective) way of managing such things.

1718627440 8 hours ago

Isn't that the standard way to do that? Why wouldn't you?

slopusila 15 hours ago

that's how some phone OSes update the system (by having 2 read only fs)

that's how Chrome updates itself, but without the symlink part

dizhn 13 hours ago

No snapshotting at all? Thinking about it.. The filesystem does not support it I suppose.

LiamPowell 13 hours ago

x4132 14 hours ago

not surprised about the chrome part, but pretty shocked at the phone OS part. I know APFS migration was done in this way, but wouldn't storage considerations for this be massive?

slopusila 14 hours ago

marmarama 13 hours ago

alpb 15 hours ago

Nobody's saying you should deploy code with this, but symlinks are a very common filesystem locking method.