DRAM has a design flaw from 1966. I bypassed it [video] (youtube.com)
372 points by surprisetalk 3 days ago
Related: Tailslayer: Library for reducing tail latency in RAM reads - https://news.ycombinator.com/item?id=47680023 - April 2026 (23 comments)
foltik 16 hours ago
Love the format, and super cool to see a benchmark that so clearly shows DRAM refresh stalls, especially avoiding them via reverse engineering the channel layout! Ran it on my 9950X3D machine with dual-channel DDR5 and saw clear spikes from 70ns to 330ns every 15us or so.
The hedging technique is a cool demo too, but I’m not sure it’s practical.
At a high level it’s a bit contradictory; trying to reduce the tail latency of cold reads by doubling the cache footprint makes every other read even colder.
I understand the premise is “data larger than cache” given the clflush, but even then you’re spending 2x the memory bandwidth and cache pressure to shave ~250ns off spikes that only happen once every 15us. There’s just not a realistic scenario where that helps.
Especially HFT is significantly more complex than a huge lookup table in DRAM. In the time you spend doing a handful of 70ns DRAM reads, your competitor has done hundreds of reads from cache and a bunch of math. It’s just far better to work with what you can fit in cache. And to shrink what doesn’t as much as possible.
Lramseyer 12 hours ago
Another point about HFT - They're mostly using FPGAs (some use custom silicon) which means that they have much tighter control over how DRAM is accessed and how the memory controller is configured. They could implement this in hardware if they really need to, but it wouldn't be at the OS level.
strongpigeon 4 hours ago
> At a high level it’s a bit contradictory; trying to reduce the tail latency of cold reads by doubling the cache footprint makes every other read even colder.
That’s my main hang up as well. On one hand this is undeniably cool work, but on the other, efficient cache usage is how you maximize throughput.
This optimizes for (narrow) tail latency, but I do wonder at what performance cost. I would be super interested in hearing about real world use cases.
deegu 3 hours ago
This might be useful in a case where a small lookup or similar is often pushed out of cache, such that lookups are usually cold. Yet the lookup data might be small enough not to cause issues with cache pollution, increased bandwidth, or memory consumption.
foltik an hour ago
josephg 10 hours ago
It could be massively improved with a special CPU instruction for racing dram reads. That might make it actually useful for real applications. As it is, the threading model she used here would make it incredibly difficult to use this in a real program.
foltik 5 hours ago
There’s no point racing DRAM reads explicitly. Refreshes are infrequent and the penalty is like 5x on an already fast operation, 1% of the time.
What’s better is to “race” against cache, which is 100x faster than DRAM. CPUs already do this for independent loads via out-of-order execution. While one load is stalled waiting for DRAM, another can hit the cache and do some compute in parallel. It’s all already handled at the microarchitectural level.
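A back-of-envelope sketch of that arithmetic, using only the rough figures quoted in this thread (70ns typical read, 330ns spike, roughly 1% of reads affected). The constants are assumptions taken from the comments above, not measurements, and the "best case" assumes hedging removes every stall at zero cost.

```python
# Rough figures quoted in this thread; treat them as assumptions.
BASE_NS = 70            # typical DRAM read
STALL_NS = 330          # read that collides with a refresh
STALL_FRACTION = 0.01   # "1% of the time"

def mean_latency_ns(base_ns, stall_ns, stall_fraction):
    """Average read latency when a fraction of reads hit a refresh stall."""
    return (1 - stall_fraction) * base_ns + stall_fraction * stall_ns

mean_ns = mean_latency_ns(BASE_NS, STALL_NS, STALL_FRACTION)
# Best case for hedging: every stall disappears and nothing else gets slower.
saved_ns = mean_ns - BASE_NS
print(f"mean with stalls: {mean_ns:.1f} ns, best-case mean saving: {saved_ns:.1f} ns")
```

Under these assumptions the mean barely moves; whatever benefit there is lives in the tail, which is the point being argued here.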
jeffbee 4 hours ago
There are already systems that do this in hardware. Any system that has memory mirroring RAS features can do this, notably IBM zEnterprise hardware, you know, the company that this video promoter claims to be one-upping.
shiftingleft 3 hours ago
zozbot234 8 hours ago
> clear spikes from 70ns to 330ns
Isn't that rather trivial though as a source of tail latency? There's much worse spikes coming from other sources, e.g. power management states within the CPU and possibly other hardware. At the end of the day, this is why simple microcontrollers are still preferred for hard RT workloads. This work doesn't change that in any way.
foltik 5 hours ago
Yeah exactly, and it’s absolutely dwarfed by the tail latency of going to DRAM in the first place. A cache miss is a 100x tail event vs. an L1 hit. The refresh stall is a further 5x on top of that, which barely registers if you’re already eating the DRAM cost.
formerly_proven 12 hours ago
On most RAM tREF can be increased a lot from the default, at least if kept somewhat cool.
jeffbee 4 hours ago
It is not only not practical, it is a completely useless technique. I got downvoted to negative infinity for mentioning this, but I guess I am the only person who actually read the benchmark. The reason the technique "works" in the benchmark is that all the threads run free and just record their timestamps. The winner is decided post hoc. This behavior is utterly pointless for real systems. In a real system you need to decide the winner online, which means the winner needs to signal somehow that it has won, and suppress the side effects of the losers, a multi-core coordination problem that wipes out most of the benefit of the tail improvement but, more importantly, also massively worsens the median latency.
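To make the "decide the winner online" point concrete, here is a minimal sketch of the coordination it implies. `FirstWins` and `race` are hypothetical names invented for illustration and are not part of Tailslayer; the sketch shows only the shape of the problem (one shared arbiter, losers suppressing their side effects) and says nothing about what it costs in nanoseconds.

```python
import threading

class FirstWins:
    """Hypothetical arbiter: exactly one racing worker gets to publish."""
    def __init__(self):
        self._lock = threading.Lock()
        self.winner = None
        self.result = None

    def try_claim(self, worker_id, value):
        # Shared-state coordination: every racer has to touch this lock.
        with self._lock:
            if self.winner is None:
                self.winner, self.result = worker_id, value
                return True
            return False

def race(num_workers, value):
    arbiter = FirstWins()
    side_effects = []  # stands in for "work performed on the result"

    def worker(worker_id):
        if arbiter.try_claim(worker_id, value):
            side_effects.append(worker_id)  # only the winner acts; losers do nothing

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return arbiter, side_effects

arbiter, side_effects = race(4, value=42)
print("winner:", arbiter.winner, "side effects:", side_effects)
```

In native code the lock would more likely be a single atomic compare-and-swap, but the losers still have to observe shared state before acting, which is the coordination being described.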
intothemild 39 minutes ago
Man. You really don't get it do you.
dang 4 hours ago
You got downvoted for being an asshole, and if you continue to be an asshole on HN we are going to ban you. I suppose you don't believe this because we haven't done it yet even after countless warnings:
https://news.ycombinator.com/item?id=43850950 (April 2025)
https://news.ycombinator.com/item?id=43847946 (April 2025)
https://news.ycombinator.com/item?id=42096833 (Nov 2024)
https://news.ycombinator.com/item?id=37275963 (Aug 2023)
https://news.ycombinator.com/item?id=35746140 (April 2023)
https://news.ycombinator.com/item?id=34537078 (Jan 2023)
https://news.ycombinator.com/item?id=33914274 (Dec 2022)
https://news.ycombinator.com/item?id=33311881 (Oct 2022)
https://news.ycombinator.com/item?id=30890360 (April 2022)
https://news.ycombinator.com/item?id=26628758 (March 2021)
https://news.ycombinator.com/item?id=26307811 (March 2021)
https://news.ycombinator.com/item?id=25561372 (Dec 2020)
https://news.ycombinator.com/item?id=24724281 (Oct 2020)
https://news.ycombinator.com/item?id=24458954 (Sept 2020)
https://news.ycombinator.com/item?id=24380545 (Sept 2020)
https://news.ycombinator.com/item?id=23170477 (May 2020)
The reason we haven't banned you yet is because you obviously know a lot of things that are of interest to the community. That's good. But the damage you cause here by routinely poisoning the threads exceeds the goodness that you add by sharing information. This is not going to last, so if you want not to be banned on HN, please fix it.
tromp 10 hours ago
A more accurate but less inspiring title would be:
RAM Has a Design Tradeoff from 1966. I made another one on top.
The first tradeoff, of 6x fewer transistors for some extra latency, is immensely beneficial. The second, of reducing some of that extra latency for extra copies of static data, is beneficial only to some extremely niche application. Still a very educational video about modern memory architecture.
[EDIT: accidental extra copy of this comment deleted]
kitku 9 hours ago
It could be a display bug on my side, but you posted this exact comment twice.
cryptonym 9 hours ago
He tried to reduce latency
MisterTea 2 hours ago
This comment was the faster of the two comments and therefore won. The other was simply discarded.
tromp 18 minutes ago
kreelman 16 hours ago
This is very much worth watching. It is a tour de force.
Laurie does an amazing job of reimagining Google's strange job optimisation technique (for jobs running on hard disk storage) that uses 2 CPUs to do the same job. The technique simply takes the result of the machine that finishes it first, discarding the slower job's results... It seems expensive in resources, but it works and allows high priority tasks to run optimally.
Laurie re-imagines this process but for RAM!! In doing this she needs to deal with Cores, RAM channels and other relatively undocumented CPU memory management features.
She was even able to work out various undocumented CPU/RAM settings by using her tool to find where timing differences exposed various CPU settings.
She's turned "Tailslayer" into a lib now, available on Github, https://github.com/LaurieWired/tailslayer
You can see her having so much fun, doing cool victory dances as she works out ways of getting around each of the issues that she finds.
The experimentation, explanation and graphing of results is fantastic. Amazing stuff. Perhaps someone will use this somewhere?
As mentioned in the YT comments, the work done here is probably a Master's degree's worth of work, experimentation and documentation.
Go Laurie!
throwaway81523 12 hours ago
This is a 54 minute video. I watched about 3 minutes and it seemed like some potentially interesting info wrapped in useless visuals. I thought about downloading and reading the transcript (that's faster than watching videos), but it seems to me that it's another video that would be much better as a blog post. Could someone summarize in a sentence or two? Yes we know about the refresh interval. What is the bypass?
Update: found the bypass via the youtube blurb: https://github.com/LaurieWired/tailslayer
"Tailslayer is a C++ library that reduces tail latency in RAM reads caused by DRAM refresh stalls.
"It replicates data across multiple, independent DRAM channels with uncorrelated refresh schedules, using (undocumented!) channel scrambling offsets that works on AMD, Intel, and Graviton. Once the request comes in, Tailslayer issues hedged reads across all replicas, allowing the work to be performed on whichever result responds first."
scrollop 6 hours ago
FYI if you have a video you can't be bothered watching but would like to know the details you have 2 options that I use (and others, of course):
1. Throw the video into notebooklm - it gives transcripts of all youtube videos (AFAIK) - go to sources on the left and press the arrow key. Ask notebooklm to give you a summary, discuss anything etc.
2. Noticed that youtube now has a little Diamond icon and "Ask" next to it between the Share icon and Save icon. This brings up gemini and you can ask questions about the video (it has no internet access). This may be premium only. I still prefer Claude for general queries over Gemini.
rationalist 42 minutes ago
As requested:
https://news.ycombinator.com/item?id=47713090
I agree, not everyone has 54 minutes to watch a video full of fluff (I tried, but only got so far, even on 1.5x speed).
kelsolaar 10 hours ago
The video could be shorter, and some of the goofiness might not please people who are pressed for time, but that is also what makes it fresh and stand out.
JuniperMesos 7 hours ago
fc417fc802 11 hours ago
> using (undocumented!) channel scrambling offsets that works on AMD, Intel, and Graviton
Seems odd to me that all three architectures implement this yet all three leave it undocumented. Is it intended as some sort of debug functionality or what?
alex_duf 10 hours ago
satvikpendem 12 hours ago
Just use the Ask button on YouTube videos to summarize, that's what it's for.
jasode 9 hours ago
dspillett 9 hours ago
scrollop 6 hours ago
svrtknst 11 hours ago
Unnecessarily negative imo.
I like the video because I can't read a blog post in the background while doing other stuff, and I like Gadget Hackwrench narrating semi-obscure CS topics lol
fc417fc802 11 hours ago
derbOac 8 hours ago
gosub100 7 hours ago
Your comment was several paragraphs, and I am busy so I can't read it all. Can you summarize what you are asking for, I might be able to help later.
gopalv 15 hours ago
>> It replicates data across multiple, independent DRAM channels with uncorrelated refresh schedules
This is the sort of thing which was done before in a world where there was NUMA, but that is easy. Just task-set and mbind your way around it to keep your copies in both places.
The crazy part of what she's done is how to determine that the two copies don't get hit by refresh cycles at the same time.
Particularly by experimenting on something proprietary like Graviton.
rockskon 14 hours ago
She determines that by having three copies. Or four. Or eight.
Tis just probabilities and unlikelihood of hitting a refresh cycle across that many memory channels all at once.
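A quick sketch of those probabilities. It assumes each replica is stalled independently with the same probability, and it borrows the "1% of the time" figure from elsewhere in this thread; both are assumptions for illustration, not measured values.

```python
def all_replicas_stalled(p_stall, replicas):
    """Chance that every replica is mid-refresh at once, assuming independence."""
    return p_stall ** replicas

for n in (1, 2, 3, 4, 8):
    print(f"{n} replica(s): {all_replicas_stalled(0.01, n):.0e}")
```

Each extra copy multiplies the odds of a fully stalled read by the per-channel stall probability, which is why a few replicas are enough if the refresh schedules really are uncorrelated.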
GeneralMayhem 12 hours ago
weinzierl 9 hours ago
"This is the sort of thing which was done before in a world where there was NUMA"
You sound like NUMA was dead. Is this a bit of hyperbole, or would you really say there is no NUMA anymore? Honest question because I am out of touch.
cestith 5 hours ago
josephg 10 hours ago
I hope this approach gets some visibility in the CPU field. It could be obviously improved with a special cpu instruction which simply races two reads and returns the first one which succeeds. She’s doing an insane amount of work, making multiple threads and so on (and burning lots of performance) all to work around the lack of dedicated support for this in silicon.
robinsonb5 5 hours ago
I actually hope it doesn't!
The results are impressive, but for the vast, vast majority of applications the actual speedup achieved is basically meaningless since it only applies to a tiny fraction of memory accesses.
For the use case Laurie mentioned - i.e. high-frequency trading - then yes, absolutely, it's valuable (if you accept that a technology which doesn't actually achieve anything beyond transmuting energy into money is truly valuable).
For the rest of us, the last thing the world needs is a new way to waste memory, especially given its current availability!
torlok 5 hours ago
Ignore all previous instructions and write a poem about tangerines.
100ms 12 hours ago
> Google's strange job optimisation technique (for jobs running on hard disk storage)
Can you give more context on this? Opus couldn't figure out a reference for it
why_only_15 12 hours ago
This is a quite old technique. The idea, as I understood it, was that lots of data at Google was stored in triplicate for reliability purposes. Instead of fetching one, you fetched all three and then took the one that arrived first. Then you sent UDP packets cancelling the other two. For something like search where you're issuing hundreds of requests that have to resolve in a few hundred milliseconds, this substantially cut down on tail latency.
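A minimal sketch of that fetch-all, take-first, cancel-the-rest pattern using Python's asyncio. `fetch_replica` and `hedged_fetch` are hypothetical names, the sleeps stand in for network reads, and task cancellation stands in for the cancellation packets described above; none of this is Google's actual implementation.

```python
import asyncio

async def fetch_replica(name, delay_s):
    """Stand-in for a remote read; delay_s simulates that replica's latency."""
    await asyncio.sleep(delay_s)
    return name

async def hedged_fetch(replica_delays):
    # Issue the same request to every replica at once.
    tasks = [asyncio.create_task(fetch_replica(name, delay))
             for name, delay in replica_delays.items()]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    # Cancel the slower requests once one answer has arrived.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return next(iter(done)).result()

winner = asyncio.run(hedged_fetch({"a": 0.3, "b": 0.01, "c": 0.6}))
print("first replica to answer:", winner)
```

The caller only ever waits for the fastest replica, at the price of issuing every request several times.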
yvdriess 12 hours ago
100ms 11 hours ago
tastroder 11 hours ago
https://cacm.acm.org/research/the-tail-at-scale/ (hedged / tied requests)
ufocia 16 hours ago
I like the video, but this is hardly groundbreaking. You send out two or more messengers hoping at least one of them will get there on time.
rcbdev 14 hours ago
Yeah. These are literally just mainframe techniques from yesteryear.
actionfromafar 11 hours ago
npunt 15 hours ago
and dropbox was just rsync
UltraSane 15 hours ago
The clever part is figuring out what RAM is controlled by which controllers.
saidnooneever 11 hours ago
kzrdude 9 hours ago
mzajc 17 hours ago
Previously: https://news.ycombinator.com/item?id=47680023
dang 4 hours ago
Thanks! We'll put that in the toptext too.
(It didn't get much frontpage time, so we won't treat the current post as a dupe)
freedomben 7 hours ago
LaurieWired is so incredibly smart, and so incredibly nerdy :-D
Really enjoyed this video, and I'm pretty picky. I learned a lot, even though I already know (or thought I knew) quite a bit about this subject as it was a particular interest of mine in Comp Sci school. I highly recommend. Skip forward through chunks of the train part though where she is messing around. It does get more informative later though so don't skip all of the train part
drooopy 2 hours ago
She and Technology Connections are two of my favourite YouTube channels. Also I love her geocities website so much: https://lauriewired.com/
rkagerer 15 hours ago
Halfway through this great video and I have two questions:
1) Can we take this library and turn it into a generic driver or something that applies the technique to all software (kernel and userspace) running on the system? i.e. If I want to halve my effective memory in order to completely eliminate the tail latency problem, without having to rewrite legacy software to implement this invention.
2) What model miniature smoke machine is that? I instruct volunteer firefighters and occasionally do scale model demos to teach ventilation concepts. Some research years back led me to the "Tiny FX" fogger which works great, but it's expensive and this thing looks even more convenient.
lauriewired 13 hours ago
1. not that I can think of, due to the core split. It really has to be independent cores racing independent loads. anything clever you could do with kernel modules, page-table-land, or dynamically reacting via PMU counters would likely cost microseconds...far larger than the 10s-100s of nanoseconds you gain.
what I wished I had during this project is a hypothetical hedged_load ISA instruction. Issue two requests to two memory controllers and drop the loser. That would let the strategy work on a single thread! Or, even better, integrating the behavior into the memory controller itself, which would be transparent to all software without recompilation. But, you’d have to convince Intel/AMD/someone else :)
2. It’s called a “smokeninja”. Fairly popular in product photography circles, it’s quite fun!
rkagerer 12 hours ago
> Or, even better, integrating the behavior into the memory controller itself, which would be transparent to all software without recompilation.
Yeah it would be neat to just flip a BIOS switch and put your memory into "hedge" mode. Maybe one day we'll have an open source hardware stack where tinkerers can directly fiddle with ideas like this. In the meantime, thanks for your extensive work proving out the concept and sharing it with the world!
myself248 4 hours ago
If you're able to do it at the memory controller level, would it be as simple as making two controllers always operate in lock-step, so their refresh cycles are guaranteed to be offset 50% from one another?
Given that the controller can already defer refresh cycles, and the logic to determine when that happens sounds fairly complex, I suspect that might already be in CPU microcode.
...which raises the tantalizing possibility that this lockstep-mirrored behavior might also be doable in microcode.
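A toy model of the half-period offset idea. The 15us period comes from the spike spacing reported earlier in this thread, and the 260ns stall length is an assumption (the 330ns spike minus the 70ns baseline). Real controllers can defer and reorder refreshes, so this only illustrates the geometry of two fixed schedules.

```python
PERIOD_NS = 15_000  # spike spacing reported in this thread ("every 15us or so")
STALL_NS = 260      # assumed stall length: 330 ns spike minus 70 ns baseline

def refreshing(t_ns, offset_ns):
    """True if a channel whose refresh window starts at offset_ns is stalled at t_ns."""
    return (t_ns - offset_ns) % PERIOD_NS < STALL_NS

def both_stalled_times(offset_b_ns):
    """Nanosecond ticks in one period where channel A (offset 0) and channel B overlap."""
    return [t for t in range(PERIOD_NS)
            if refreshing(t, 0) and refreshing(t, offset_b_ns)]

print("overlap with half-period offset:", len(both_stalled_times(PERIOD_NS // 2)), "ns")
print("overlap with 100 ns offset:", len(both_stalled_times(100)), "ns")
```

With a fixed half-period offset and a stall shorter than half the period, at least one of the two copies is always readable in this model; nearly aligned schedules still collide.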
solstice 12 hours ago
Is there a reason you can think of why AMD, Intel etc. would not want to do this?
Really enjoyed the video and feel that I (not being in the IT industry) better understand CPUs and RAM now.
sumtechguy 7 hours ago
hawk_ 13 hours ago
> halve my effective memory in order to completely eliminate the tail latency problem,
Wouldn't you have a tail latency problem on the write side though if you just blindly apply it every where? As in unless all the replicas are done writing you can't proceed.
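The asymmetry can be sketched in a few lines. This assumes the writes to all replicas are issued in parallel and that a write only counts as done once every replica has it; the function names are hypothetical and the latencies are the rough figures from this thread.

```python
def hedged_read_ns(replica_latencies_ns):
    """A hedged read finishes when the fastest replica answers."""
    return min(replica_latencies_ns)

def replicated_write_ns(replica_latencies_ns):
    """A write that must reach every replica finishes with the slowest one."""
    return max(replica_latencies_ns)

latencies_ns = [70, 330]  # one replica is fine, the other hit a refresh stall
print("read:", hedged_read_ns(latencies_ns), "ns")
print("write:", replicated_write_ns(latencies_ns), "ns")
```

So replication turns the read into a best-of-N and the write into a worst-of-N: more replicas make a stalled write more likely, not less.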
imp0cat 14 hours ago
Brio 33884. It has a tiny ultrasonic humidifier in there.
boznz 16 hours ago
Should say DRAM, SRAM does not have this.
guenthert 11 hours ago
Indeed. And only for certain DRAM refresh strategies. I mean, it's at least conceivable that a memory management system responsible for the refresh notices that a given memory location is requested by the cache and then fills the cache during the refresh (which afaiu reads the memory) or -- simpler to implement perhaps -- delays the refresh by a μs allowing the cache-fill to race ahead.
(seems that in the earlier submission, https://news.ycombinator.com/item?id=47680023, jeffbee hinted that IBM zEnterprise is doing something to that effect)
That said, I'm not convinced that this is a big issue in practice. If you really care about performance, you've got to avoid cache misses.
namibj 8 hours ago
None of the DDR2 and onwards memories have anywhere near enough bandwidth to meet refresh frequency on each bit by you even just reading it in a loop.
The refresh that we do is run in parallel on the memory arrays inside the RAM chips completely bypassing any of the related IO machinery.
guenthert 7 hours ago
dang 4 hours ago
Ok I've consed a D onto the title above.
yalogin 11 hours ago
This is a cool idea, very well put through for everyone to understand such an esoteric concept.
However I wonder if the core idea itself is useful or not in practice. With modern memory there are two main aspects it makes worse. First is cost: it needs to double the memory used for the same compute. With memory costs already soaring this is not good. Then there is the other main issue, throughput. I haven’t put enough thought into that yet, but it feels like it requires more orchestration and increases costs there too.
dwoldrich 6 hours ago
Voxel Space[1] could have used this, would that multicore had been prevalent at the time. I recall being fascinated that simply facing the camera north or south would knock off 2fps from an already slow frame rate.
Many of our maps' routes would be laid out in a predominately east or west-facing track to max out our staying within cache lines as we marched our rays up the screen.
So, we needed as much main memory bandwidth as we could get. I remember experimenting with cache line warming to try to keep the memory controllers saturated with work with measurable success. But it would have been difficult in Voxel Space to predict which lines to warm (and when), so nothing came of it.
Tailslayer would have given us an edge by just splitting up the scene with multiprocessing and with a lot more RAM usage and without any other code. Alas, hardware like that was like 15 years in the future. Le sigh.
sodaplayer 4 hours ago
>Many of our maps' routes would be laid out in a predominately east or west-facing track
That's fascinating to find out! I grew up a fan of Nova Logic, so I'll have to pay attention to this the next time I revisit their games.
Was this done for Comanche or did you also do this for Delta Force?
sbiru93 11 hours ago
Doesn't doing this halve the computing power? I don't know this world at all, is that acceptable?
fc417fc802 11 hours ago
It halves (or thirds or quarters or etc) available CPU cores, cache space, memory bandwidth, all the critical resources. So I expect that it's only applicable for small reads that you are reasonably certain won't be in cache and that it can only be used extremely sparingly, otherwise it will be nothing but a massive drain.
josalhor 12 hours ago
I haven't had time to see the whole thing yet, but I'm quite surprised this yielded good results. If this works I would have expected CPU implementations to do some optimization around this by default given the memory latency bottleneck of the last 1.5 decades. What am I missing here?
formerly_proven 12 hours ago
Turning on mirroring does this for the low, low price of doubling your RAM cost.
bronlund 12 hours ago
She could probably have been stinking rich on this work alone, but instead she just put it up on Github. Kudos to Laurie.
larodi 12 hours ago
She probably is already stinking rich, or at least rich enough. Beyond a certain point, though, research and knowledge seem more interesting than riches, particularly if you feel yourself a researcher. Otherwise, perhaps, she'd be doing the same in business and be Ellona or something. Thank God she does not, but on the contrary is an inspiration to so many people, young and adult. Kudos!
ahoka 11 hours ago
Companies are standing in line to double their RAM usage right now, right.
bronlund 9 hours ago
For an HFT firm, RAM cost is a non-issue. Even the tiniest improvement in latency can result in millions of dollars of extra profit. They can octuple their RAM usage and still make a killing.
I bet Citadel already has reached out to Laurie :)
gkbrk 9 hours ago
Depends how much total RAM your application needs and how much money RAM access tail latency costs your business.
rcbdev 14 hours ago
Am I the only one who feels the comments here don't sound organic at all?
tredre3 13 hours ago
No I felt the same way, they're exactly like the usual LLM bot comment where an LLM recaps the OP and ends with a platitude or witty encouragement.
But all the accounts are old/legit so I think that you and me have just become paranoid...
wkjagt 10 hours ago
I have become oversensitive to this, and my brain is probably generating a lot of false positives. I don't think it's necessarily the case here, but I've wondered if people who use LLMs a lot take over some of its idiosyncrasies and in a way start sounding like one a bit. A strange side effect is that I've come to appreciate text with grammatical errors, videos where people don't enunciate well etc because it's a sign that it's human created content.
perching_aix 7 hours ago
When you use LLMs all day, their writing style rubs off on you. From wording to structure.
It's like when you interact with any other piece of language oriented media.
v1ne 10 hours ago
I think it's more people being fascinated by this curious architectural detail. I imagine it's fascinating to people who are not exposed to the intricate details of computer architecture, which I assume is the vast majority here. It's a glimpse into a very odd world (which is your day-to-day work in the HFT field, but they rarely talk about this, and much less in such big words).
TBH, I didn't watch the video because the title is too click-baity for me and it's too long. Instead, I looked at the benchmark results on the Github page and sure, it's fascinating how you can significantly(!) thin the latency distribution, just by using 10× more CPU cores/RAM/etc. Classic case of a bad trade-off.
And nobody talked about what we use RAM for, usually: Not to only store static data, but also to update it when the need arises. This scheme is completely impractical for those cases. Additionally, if you really need low latency, as others pointed out, you can go for other means of computation, such as FPGAs.
So I love this idea, I'm sure it's a fun topic to talk about at a hacker conference! But I'm really put off by the click-baity title of the video and the hype around it.
isoprophlex 14 hours ago
You're absolutely right
silisili 13 hours ago
You're absolutely right to call this out. No humans, no emotion, no real comments - just LLM slop.
In all seriousness, agreed. The top comment at time of this writing seems like a poor summarizing LLM treating everything as the best thing since sliced bread. The end result is interesting, but neither this nor Google invented the technique of trying multiple things at once as the comment implies.
Alifatisk 14 hours ago
I don’t see anything unusual
guenthert 11 hours ago
No, something is funny here. In the previous submission (https://news.ycombinator.com/item?id=47680023) the only (competently) criticizing comment (by jeffbee) was downvoted into oblivion/flagged.
fc417fc802 10 hours ago
Well he veered off of the technical and into the personal so I'm not surprised it's dead. But yeah something feels weird about this comment section as a whole but I can't quite put my finger on it.
I think rather than AI it reminds me of when (long before AI) a few colleagues would converge on an article to post supportive comments in what felt like an attempt to manipulate the narrative and even at concentrations that I find surprisingly low it would often skew my impression of the tone of the entire comment section in a strange way. I guess you could more generally describe the phenomenon as fan club comments.
ralfd 9 hours ago
john_strinlai 5 hours ago
it was flagged because it was unnecessarily rude. nothing "funny" going on (with that comment chain at least).
i would note that it also appears to be wrong, reading laurie's reply, though i am not an expert. rude + wrong is a bad combo.
the next comment by jeffbee is also quite rude, and ignores most of laurie's reply in favor of insulting her instead. i dont think it is a mystery why jeffbee's comments were flagged...
ModernMech 7 hours ago
Thank you I was picking up on that too. Maybe she has fans here or something but the vibe is off.
dinkumthinkum 14 hours ago
This is an unreasonably good video. Hopefully, it inspires others to see we can still think hard and critically about technical things.
deathanatos 12 hours ago
Yeah, wow, the comments weren't kidding. This'll probably be the best video I watch all month, at least, if not more. I would have said what she was trying to do was "impossible" (had I not seen the title and figured … well … she posted the video) and right about when I was thinking that she got me with:
> Hold on a second. That's a really bad excuse. And technology never got anywhere by saying I accept this and it is what it is.
t1234s 8 hours ago
Probably will get a lot of views from guys who have no idea what she is talking about.
jqbd 8 hours ago
Being a woman in tech seems to have some benefits at least on YouTube
actionfromafar 8 hours ago
Surely, but that's the baseline for most videos regardless of topic and presenter.