The time the x86 emulator team found code so bad they fixed it during emulation (devblogs.microsoft.com)

440 points by paulmooreparks 13 hours ago

psanchez 11 hours ago

This reminds me of a story from 15 years ago, where I was developing a technology to download games on demand by hooking into the OS calls.

There was a particular game that was superslow when this tech was applied. Original game loading took around 15-20 seconds, whereas once the tech was applied it took easily 3-5 min, even with all data already downloaded.

When I started digging into it, I realized the reason was the game was using something like

   fread(data, 1, 65536, fptr);

instead of

   fread(data, 65536, 1, fptr);

Which basically expanded back in the day to 65k reads of 1 byte for several MB file. Each fread translated to 65k reads of ReadFile Windows API. Since my code was hooking the ReadFile system call, and my hook was heavier than ReadFile itself, the game loading felt really slow. Unusable. It would not have been fun for players.

The easy fix was to swap arguments for certain calls. The longer fix required an internal cache to account for these cases, so that the hooked ReadFile was faster when the data was already on disk.

Funny thing is that as we started rolling out the tech and applying it to more and more games, we realized lots of games did this. We went for the cache fix and games ended up loading faster than before. Honestly, games could have loaded all the data in a couple of seconds by just swapping the args. I'm guessing developers did this on purpose so that games seemed like they were loading a lot of stuff, although you never know.

Taniwha 10 hours ago

I used to be a graphics card/chip architect for Macs in the early/mid 90s - our chips were the fastest, but some programs were resistant because they did stupid stuff: PageMaker invalidated the font cache every time it went through its main loop, Quark with ATM did an n*2 thing every time it wrote text, etc. We had special hardware to accelerate text drawing and it did nothing because the software pissed it away. We considered creating a plugin that fixed all these things, but it would have been hard to maintain; in the end we travelled around to the people who made these apps and talked them through their problems.

To be fair, Excel would erase the places it wanted to write to white, up to 9 times, before it drew any black pixels. We made that very fast! We didn't tell them :-)

At the time 24-bit framebuffers were so slow that before we built graphics acceleration hardware people would switch back to 8-bit to get stuff done, making 24-bit/true colour your daily driver was a big step forward.

nxobject 7 hours ago

Does that make you the first in a long tradition of GPU developers going to blockbuster app devs to say "hey, you should be doing this instead?"

PS – I am looking through the NuBus cards that I have... did you work for SuperMac or RasterOps?

urbandw311er 9 hours ago

This is a horrible and yet not unexpected insight into the internals of Excel

PaulHoule 6 hours ago

I remember when 24 bit color was exotic and aspirational and you had to settle for 16.

xattt 6 hours ago

What would have been the purpose of stupid code like that?

Was it a workaround for things that didn’t fully complete on one iteration, so the devs kept hammering away at it until it worked?

Xirdus 3 hours ago

Reminds me of the "community patch" to GTA Online from a few years ago. The game was plagued by 10+ minute loading times. The situation remained for years and only got worse with time. Some hacker figured out that the game spent 80% of loading time reading the in-game store listing file. The file was tens of megabytes IIRC, and it literally used the Schlemiel the Painter's Algorithm - for each entry, start reading from the beginning byte after byte. The hacker made a tiny patch that made it remember where it found the last entry. This cut the total loading time by 80%, from over 10 minutes to less than 3.
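For readers who haven't met the name, here is a minimal sketch of the Schlemiel the Painter pattern in C. It is not the game's code or the actual patch; the comma-separated list and the function names are made up for illustration. It only counts how many characters get examined when every lookup restarts from the beginning versus when the position of the previous entry is remembered.

```c
#include <assert.h>
#include <stddef.h>

/* Steps past one comma-terminated entry, counting each character examined.
   Returns a pointer to the start of the next entry (or to the final NUL). */
static const char *skip_entry(const char *p, size_t *steps)
{
    while (*p != '\0' && *p != ',') { p++; (*steps)++; }
    if (*p == ',') { p++; (*steps)++; }
    return p;
}

/* Quadratic: to reach entry k, start over at the beginning and skip k entries. */
static size_t steps_rescanning(const char *list, size_t entries)
{
    assert(list != NULL);
    size_t steps = 0;
    for (size_t k = 0; k < entries; k++) {
        const char *p = list;               /* forget everything, start over */
        for (size_t i = 0; i < k; i++)
            p = skip_entry(p, &steps);
        /* entry k would be parsed from p here */
    }
    return steps;
}

/* Linear: remember where the previous entry ended and continue from there. */
static size_t steps_with_cursor(const char *list, size_t entries)
{
    assert(list != NULL);
    size_t steps = 0;
    const char *p = list;                   /* remembered position */
    for (size_t k = 0; k + 1 < entries; k++)
        p = skip_entry(p, &steps);
    return steps;
}
```

With the four-entry list "a,b,c,d" the rescanning version examines 12 characters and the cursor version 6; the gap grows quadratically with the number of entries, which is why it hurts so much on a file of tens of megabytes.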

Edit: removed incorrect information.

exrook 2 hours ago

This is not quite an accurate telling of Rockstar's reaction; they were actually receptive to it and paid out $10k for the discovery. Though it's an understandable mistake given Rockstar's hostile history with the GTA modding scene.

See the original post and discussion for the whole story:

https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times... https://news.ycombinator.com/item?id=26296339

Someone 9 hours ago

> Which basically expanded back in the day to 65k reads of 1 byte for several MB file. Each fread translated to 65k reads of ReadFile Windows API

What software did that that badly? If the code asks for (up to) 65,536 single byte items, why would you split that into 65,536 calls?

Also, that change changes behavior. The old call could read anything from zero to 65,536 bytes; the new one can only report reading zero or 65,536 bytes.

(Reading the source of a few implementations, I think most implementations will fill the output buffer with partial objects if the input doesn’t supply an integral number of them, but the return value of fread cannot signal that to the caller)
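A small sketch of that difference in the return value, assuming a hosted C implementation where tmpfile() works. The helper name fread_result is made up for this example; it writes a file of a given length and reports what fread returns for a given size/count pair.

```c
#include <assert.h>
#include <stdio.h>

/* Writes `len` bytes to a temporary file, rewinds it, and returns what
   fread reports for the given size/count pair.
   Returns (size_t)-1 if the temporary file could not be set up. */
static size_t fread_result(size_t len, size_t size, size_t count)
{
    static char buf[65536];
    assert(size * count <= sizeof buf);     /* keep the read inside buf */

    FILE *f = tmpfile();
    if (!f)
        return (size_t)-1;
    for (size_t i = 0; i < len; i++) {
        if (fputc('x', f) == EOF) {
            fclose(f);
            return (size_t)-1;
        }
    }
    rewind(f);

    /* fread returns the number of complete items read, not bytes. */
    size_t n = fread(buf, size, count, f);
    fclose(f);
    return n;
}
```

For a 100-byte file, fread_result(100, 1, 65536) gives 100, while fread_result(100, 65536, 1) gives 0, so a caller that parses "return value" bytes sees nothing after the swap. Only when the file holds at least 65,536 bytes does the swapped call return 1.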

tom_ 2 hours ago

The standard says that fread calls fgetc multiple times for each object:

> For each object, size calls are made to the fgetc function and the results stored, in the order read, in an array of unsigned char exactly overlaying the object

(wording unchanged since C99)

If the file is unbuffered, depending on how the implementation handles buffering, and how it interprets the standard, then perhaps it does end up hitting a path where there's 1 ReadFile call per byte...

I don't know how most implementations get around this. Presumably it's valid to interpret "calls are made" as "behaving as if calls are made", meaning fread can copy data out of the FILE's buffer directly, or make calls directly to whatever routine fgetc defers to, rather than calling fgetc N times literally. Looks like glibc's fread does this.

micampe 8 hours ago

A long time ago I worked with someone who read 1 byte at a time from a socket because they insisted data was cached so the kernel was going to batch it magically somehow. It took me days to convince them to measure it.

macintux 6 hours ago

I assumed it was a simple mistake: easy to forget what order the two integers are sent.

mort96 7 hours ago

Wait, is that wrong? I always call fread as:

    fread(data, 1, sizeof(buffer), f);

with the rationale that I'm interested in reading sizeof(buffer) individual bytes. The buffer size is incidental, not the size of the items I'm trying to read from the file; "read one item whose size is sizeof(buffer)" seems semantically wrong.

Is this just the case of Windows having a bad stdlib fread implementation 15 years ago or is my thinking here actually wrong?

chadgpt3 7 hours ago

It's not wrong. Guy just wrote a bad implementation of fread and blamed everyone else.

projektfu 5 hours ago

fread should be buffered, but different values may cause buffering at different rates. Perhaps it didn't generate 65535 calls to ReadFile but it generated 16 or 64.

fsfod 7 hours ago

Part of Windows Explorer actually does tons of tiny 4-byte ReadFile calls into its tracking-database-like file when you delete a file. If you are deleting lots of files, this quickly adds up.

pbhjpbhj 3 hours ago

Is this why Windows takes so long to delete things?? Presumably those reads aren't done when using del from a console as that always seems a bit faster.

somenameforme 10 hours ago

Doesn't that break anything relying on the return value? fread gives you the number of objects read as a return. So I think a pretty typical thing would be to fread and then parse that number of characters, and that'd just break?

jcul 9 hours ago

I've seen a lot of code that just assumes fread / fwrite succeeded without bothering to check the return value...

But in this case if the code was calling fread 65536 times in a loop and getting 64KiB each time it wouldn't be good either!

Sounds like the parent comment had to fix this with the internal cache thing to speed up the small freads. I think they meant the easy fix would have been swapping the args in the original / caller code.

koolala 9 hours ago

I think they turned it from a tiny file read to a tiny ram read.

DonHopkins 9 hours ago

The type of programmer who swaps the args to fread tends to be the type of programmer who doesn't bother to check the return value, fortunately.

Edit: mort96: So did you check the return value or not?

lukan 9 hours ago

"I'm guessing developers did this on purpose so that games seemed like they were loading a lot of stuff"

I really hope that was not the case, and would rather put it down to incompetence or to dealing with obscure legacy problems, but the gamer in me gets enraged at the thought that someone would artificially increase loading times.

dfox 4 hours ago

The most important fix in SP1 for Office 2007 was fixing exactly that in Excel. Doing a ridiculous number of 4-byte reads made it basically unusable on network filesystems.

chadgpt3 7 hours ago

Why does your fread do anything other than multiplying the two arguments?

Sesse__ 6 hours ago

The idea of having two arguments to fread() is presumably to be able to do something else than all-or-nothing when there's a short read.

dlcarrier 12 hours ago

SimCity had a read-after-free bug that Microsoft patched in Windows 95. That was a lot easier for customers than having Maxis fix it, which could have required exchanging copies of the game.

oceansky 7 hours ago

There's also the opposite effect, a windows security update broke GTA San Andreas because it relied on undefined behavior.

https://silentsblog.com/2025/04/23/gta-san-andreas-win11-24h...

icase 5 hours ago

in this dark age of agents writing code that gets debugged by other agents, i love reading stuff like this: stories of human intuition fixing human mistakes. thanks for a fascinating read.

Cthulhu_ 10 hours ago

It feels like graphics drivers do / did this a lot too. At the very least they make specific optimizations for specific games, probably by tweaking settings and features that the game developers didn't optimize properly themselves.

kalleboo 9 hours ago

Famously if you renamed Quake 3 to "Quack" 3, it would slow down on the ATI Radeon 8500 https://web.archive.org/web/20091016055550/https://hardocp.c...

rbits an hour ago

Yep. I know the Minecraft optimisation mod Sodium has encountered some issues because Nvidia drivers try to optimise the game in ways that can cause issues for them

SyzygyRhythm 9 hours ago

There are many, many, cases like this, including correctness fixes. One recent example I remember had a shader that computed: x = a / b * b

The optimizer was allowed, but not obligated, to transform that into: x = a

However, in this case, b was sometimes 0. And if so, the unoptimized version computed: x = a / 0 * 0 = Inf * 0 = NaN

So badness ensued if that particular path didn't get optimized, which could happen under various circumstances. We had to add some code to ensure that transformation always happened on that game.
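The arithmetic is easy to reproduce outside a shader. This is a sketch in plain C, assuming IEEE 754 floats and no fast-math style flags; it is not the shader in question, and the function names are made up. The volatile intermediate is only there so that this sketch's own compiler cannot apply the same simplification.

```c
#include <assert.h>
#include <math.h>

/* The expression as written. The volatile intermediate keeps the compiler
   building this sketch from simplifying the division away. */
static float as_written(float a, float b)
{
    volatile float q = a / b;
    return q * b;
}

/* What the algebraic simplification produces: b no longer matters. */
static float simplified(float a, float b)
{
    (void)b;
    return a;
}

/* Documents the two outcomes: identical for b != 0, different for b == 0. */
static void check_examples(void)
{
    assert(as_written(6.0f, 2.0f) == simplified(6.0f, 2.0f));
    assert(isnan(as_written(1.0f, 0.0f)));   /* Inf * 0 */
    assert(simplified(1.0f, 0.0f) == 1.0f);
}
```

So whether the game sees a or NaN depends entirely on whether the optimizer took the shortcut, which is the fragility described above.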

easyThrowaway 10 hours ago

The most interesting part is that IIRC they shipped the entire Windows 3.11 memory allocator to make it work.

I have very little understanding of how allocation works at the OS level, but I'm surprised there are no wrappers like dgVoodoo or dxWrapper specifically for these kinds of issues. There are quite a bunch of old Windows games (Need for Speed 1-4 for a start) that refuse to run on modern OSes due to rather...bold memory management strategies.

rincebrain 9 hours ago

Apparently the recollection of the fix was that they deferred actually freeing memory for a while if they detected it was SimCity running. [1]

[1] - https://www.joelonsoftware.com/2000/05/24/strategy-letter-ii...

DonHopkins 9 hours ago

A story I heard at Sun, which may be apocryphal but was fucking hilarious enough to be a repeatable rumor, was that a release of an early operating system in BETA was determined to be solid and tested and ready to release and ship to customers, so they simply changed the version string from something like "SunOS2.1BETA" to "SunOS2.1FCS" (First Customer Ship), and recompiled. But the change from a 12 character version to an 11 character version threw off the alignment of some important data structures somewhere in the kernel, and the entire OS ran MUCH SLOWER because of 68k unaligned memory accesses!

hodgehog11 12 hours ago

I think we're starting to see more of this sort of thing happening now with Proton and Wine gaining prominence in the Linux community. Some games (Elden Ring comes to mind) have bad enough PC ports when they come out that the compatibility layer can incorporate a hotfix to improve performance, while users of the software on the original platform still had to suffer.

Gigachad 11 hours ago

Fairly sure GPU drivers do the same thing where they include a ton of per game tweaks to make them run faster. It does feel like a fragile way of doing things where an external component that should be agnostic to the software running ends up including a handful of junk trying to fix stuff that should have been fixed by the consumer of the driver.

Guvante 10 hours ago

It goes the other way too, sometimes you trigger some optimization silliness in the driver and the game needs to adapt to avoid it.

zoenolan 10 hours ago

The big one I remember was many applications, not just games, assuming the buffer swap was performed by a blit into the display buffer, not a framebuffer pointer update. They relied on the previous frame's data still being in the back buffer. For those applications you were forced to blit the buffer, not swap the pointer, and take a performance hit.

I also remember a media player being called out by name in the code for doing invalid operations, needing a work around and code to detect it was running just to function.

anilakar 11 hours ago

GPU driver packages are already a huge collection of workarounds for bad game engine coding.

An Nvidia employee once told me that one of the easiest ways to squeeze out a few extra frames on your old machine is to rename the game executable to hl2.exe.

st_goliath 10 hours ago

> GPU driver packages are already a huge collection of workarounds for bad game engine coding.

And of course, browser engines also do the same things for certain websites:

https://github.com/WebKit/WebKit/blob/main/Source/WebCore/pa...

https://github.com/WebKit/WebKit/blob/main/Source/WebCore/pa...

necovek 10 hours ago

I can see how it can modify GPU driver behavior, but I cannot see how it would get you better performance with everything else the same?

What it should do is ensure some things not relevant to Half-Life 2 were not done, thus getting better performance for this game in particular, but there is no guarantee that same optimizations work for other applications or games, so one should not expect an overall improvement.

Unless they are doing some silly things like dropping quality, but that's the "everything else the same" point.

If not, why not have this enabled as default behavior instead?

limflick 11 hours ago

> to rename the game executable to hl2.exe

This seems genuinely unbelievable. Does anyone have a technical explanation for this?

proton_9 11 hours ago

This sounds like a really interesting story, would like to read more on why half life 2 specifically? the game itself was pretty well optimized and ran on really low end hardware even back in the day.

AHTERIX5000 10 hours ago

Yep, someone needs to do the same workarounds Windows drivers do but on Linux and the translation layer is a good spot for them, look at https://github.com/HansKristian-Work/vkd3d-proton/blob/938d7... for example

harrall 11 hours ago

A big portion of GPU driver updates are actually just that, same with Windows updates.

Windows 95 patched a bug in SimCity just to get it to work.

selcuka 11 hours ago

To be fair it is possible that the developer enabled a special "unroll all loops, no matter what" optimisation flag during compilation.

I agree it would be stupid for a compiler to even support such a flag, but those were the 1980s/90s.

ack_complete 2 hours ago

Doesn't require any special flags, just hitting optimizer limits can do it with MSVC.

https://www.reddit.com/r/cpp/comments/1i36ahd/is_this_an_msv...

PhilipRoman 10 hours ago

Right up there with fun, safe math optimizations

kazinator 11 hours ago

> Anyway, my colleague found that there was one program that needed to allocate around 64KB of memory on the stack and initialize it. The standard way of doing this is to perform a stack probe to ensure that 64KB of memory is available, then subtracting 65536 from the stack pointer, and then initializing the memory in a small, tight loop.

Actually, the standard way of allocating 64 kB of memory on the stack is to just assume you can do it, subtract 64k from the stack pointer, and hope for the best.

Most stack allocations in the wild are not checked.

i_don_t_know 9 hours ago

IIRC you have to probe every page of the stack on Windows. You cannot just subtract a value from ESP/RSP. If you don't probe every page in order, you get a page fault or some other exception (I don't remember which one).

justsid 26 minutes ago

How else would the OS know your read/write 16 pages away from the current stack pointer is in fact an attempt to increase the stack and not just really bad pointer arithmetic and a bug? How many pages should the runtime let you skip before it's just a segfault?
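A toy model of why the probing has to go page by page. This is a simulation only, not Windows code: ToyStack, touch and the two alloc functions are invented names, and the single guard page that moves one page per hit is a simplifying assumption about how guard-page stack growth behaves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stack: pages are numbered in the direction the stack grows.
   Pages below `guard` are committed; page `guard` is the guard page. */
typedef struct { size_t guard; } ToyStack;

/* Touching a committed page works. Touching the guard page commits it and
   moves the guard one page further. Touching anything beyond it faults. */
static bool touch(ToyStack *s, size_t page)
{
    assert(s != NULL);
    if (page < s->guard)
        return true;
    if (page == s->guard) {
        s->guard++;
        return true;
    }
    return false;   /* skipped over the guard page: access violation */
}

/* Move the stack pointer by `pages` pages and touch only the far end. */
static bool alloc_unprobed(ToyStack *s, size_t sp_page, size_t pages)
{
    return touch(s, sp_page + pages);
}

/* Same allocation, but probe every page in order on the way there. */
static bool alloc_probed(ToyStack *s, size_t sp_page, size_t pages)
{
    for (size_t i = 1; i <= pages; i++) {
        if (!touch(s, sp_page + i))
            return false;
    }
    return true;
}
```

With one committed page and a 16-page (64 KB at 4 KB per page) allocation, alloc_unprobed fails because it jumps past the guard, while alloc_probed succeeds and leaves the guard 16 pages further along.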

ashdnazg 9 hours ago

I worked on a transpiler from Nand2tetris assembly to WebAssembly, and had some really annoying memory corruption bug that I just couldn't solve.

That is, until I checked the program I used for testing (which I didn't write), and found the following code:

  dealloc(this)
  return this->field

With the original allocator, this worked fine, since the deallocation didn't touch the memory.

My allocator, however, overwrote the field during the deallocation with bookkeeping stuff, which meant the returned value was not what the programmer intended and after a short while the program crashed.

Unlike TFA, I had the luxury of just fixing the test program.
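The effect can be shown without real undefined behaviour by using a toy pool allocator over a static array. Everything here (Obj, pool, the two free functions) is hypothetical and is not my transpiler's allocator; the only point is that one deallocator leaves the memory alone and the other reuses the first word as a free-list link.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int32_t field; int32_t other; } Obj;

static Obj pool[4];              /* backing storage, always valid to read */
static int32_t free_head = -1;   /* index of the first free slot, -1 = none */

/* Deallocator that leaves the object's memory untouched. */
static void free_untouched(Obj *o)
{
    (void)o;
}

/* Deallocator that stores its bookkeeping (a free-list link) in the
   first word of the freed object, overwriting `field`. */
static void free_with_link(Obj *o)
{
    assert(o >= pool && o < pool + 4);
    o->field = free_head;
    free_head = (int32_t)(o - pool);
}

/* The buggy pattern from the test program: deallocate, then read a field. */
static int32_t read_after(void (*dealloc)(Obj *), int32_t value)
{
    Obj *o = &pool[0];
    o->field = value;
    dealloc(o);
    return o->field;
}
```

read_after(free_untouched, 42) still returns 42, which is why the program appeared to work; read_after(free_with_link, 42) returns the old free-list head (-1 on first use) instead.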

wazoox 8 hours ago

IIRC, one of the similar old stories from Raymond Chen is about SimCity 2000, which did a similar trick (free memory, then immediately start using it) that worked just fine under DOS, but was a big no-no starting with Windows 95. The game was so common that Windows had to include a special rule to make it run...

zimmund 2 hours ago

I can't stop thinking about all the unoptimized code we have around. As processors (and memory) over the last 2-3 decades improved faster than we needed to fix the inefficiencies we created, we silently accepted that we don't need efficiency everywhere. So maybe a compiler, an emulator or some critical piece of code was created with this in mind, but the average app or website just wastes resources left and right and prays for the best.

With more and more code being written with AI (which has notoriously inefficient solutions to simple problems), I expect this issue to become more prevalent. I just hope we optimize at the source of the problem (AI and humans using it) and not on platforms (compiler and engine/kernel heuristics)

smallstepforman 2 hours ago

Halve the compute and reduce memory by a factor of 4 and in a decade we’ll have double the performance we have now.

I do old school embedded, the amount of desktop bloat is insane. Any function I really need to refactor, I can reduce size and improve performance. And there are better engineers out there that are more efficient than me.

cranx 5 hours ago

Loop unrolling is a basic compiler optimization and, depending on the machine language and processor instruction set, should be faster once you take into account all the housekeeping required to execute a conditional, jump, move register values, etc. This article is missing the analysis of why. If someone didn’t “like” it and was offended then that seems like an equally silly reason. On the surface, 256K of code to initialize 64K of data does seem silly, but what if it was faster?

ryukoposting 2 hours ago

A few things to consider.

In this case we're talking about a tight initialization loop with probably a single instruction in the body. The HW optimizations necessary to make a loop like this perform equally to the unrolled form are so rudimentary that they're taken for granted on basically any CPU, even 30 years ago. Seriously, we're talking about optimizations I made in an "intro to Verilog" class as an undergrad, and I'm not even a HW engineer.

It also depends how often this code is being hit. Does the code run once while the program loads? Nobody will notice a 2 microsecond improvement in loading times. Does the code run in a timing-sensitive hot path, like a game loop or a GUI rendering thread? Well now optimization matters. But again, consider the HW argument above.

Also remember that, back then, storage wasn't cheap. 256K of code is 18% of a 1.44MB floppy, and 35% of a 720K floppy.

classichasclass 12 hours ago

Betting Alpha was the native architecture in question. It seemed to have the best support.

0xdecrypt 4 hours ago

256 KB of code to zero 64 KB of memory is the kind of optimization that makes you question every life choice that led to it.

jeffbee 12 hours ago

People from Transmeta told me stories about how their translators were full of special case optimizations to fix horrors they discovered in Microsoft Windows itself.

wolfi1 10 hours ago

speaking of which, what became of it?

hbbio 9 hours ago

Acquired by a patent monetization business...

electroglyph 11 hours ago

heh, when Raymond Chen dunks on the MSVC team =)

mkl 3 hours ago

There's no indication it was MSVC, and there are lots of compilers (and used to be more).

ant6n 10 hours ago

Arguably more of an optimization, rather than a fix. Looks like un-unrolling a loop, or better, rolling a loop. Or rolling straight line code?

senfiaj 8 hours ago

Yeah, but after a certain point the win is negligible. Huge code can also increase cache misses which will slow down things.

m1r 12 hours ago

Couldn't they just turn the optimization off for this loop?

MadnessASAP 12 hours ago

They didn't have the code for the offensive program, they were creating the emulator to run it on a different architecture.

McGlockenshire 12 hours ago

> offensive program

Agreed.

notorandit 12 hours ago

Which optimizer replaces a 64k loop with 64k instructions?

Ah, yes. Microsoft's!

selcuka 11 hours ago

There is no indication that the compiler that produced the code was Microsoft's. Actually the article hints otherwise ("[...] whatever compiler was used to compile this code").

pantulis 4 hours ago

I was just curious and checked The Old New Thing archive... yes I've been reading Raymond Chen's stories for as long as I remember but hey, it's been 23 years of delivering consistently solid stories about Windows.

notorandit 12 hours ago

> they fixed it during emulation

It means the fix was applied to run during the emulation loop execution, not that the fix was found and applied while the emulation loop was running.

Which would have made it an emulation code escape.

canucker2016 an hour ago

I was looking through the compiler docs about memory allocation and I found the section about the debug version of the CRT which could fill the allocated memory with a non-zero canary value to help detect uninitialized memory (assuming you weren't calling calloc - which zero-init's allocated memory).

But there wasn't any similar programmatic debugging aid for detecting uninitialized stack memory.

Going further down the rabbit hole, I discovered the _chkstk function.

The MS C compiler would emit a call to _chkstk on function entry to ensure that stack memory had been paged in. But further reading noted that _chkstk was only emitted if the function allocated a lot of stack memory. And there was source code! MS included the assembly language source code for _chkstk in the CRT source code, installed with the compiler.

I needed _chkstk to be emitted for every function, not only for functions that allocated >= 4KB of stack variables.

Curses, foiled again.

Then, while perusing the list of compiler command line switches, I see "/Ge".

  /Ge (Enable Stack Probes)

  Activates stack probes for every function call that requires storage for local variables.

Ahhhhh! The grey storm clouds parted and the sun's rays shone down on me in their warmth.

I had all the pieces I needed to fill uninitialized stack memory with a non-zero canary value so I could make detection of uninitialized stack variables more reliable.

_stkfil was born

Modifying _chkstk was easy. I needed to write to every byte of stack in a stack page instead of reading only 4 bytes and skipping to the next page of stack.

While I was mucking in the bowels of modifying _chkstk, I added a 4-byte global variable to hold my canary value. Let the app override what value to use.

In debug builds, _stkfil helped find a couple of bugs, but soon all the stray uninited stack vars were gone and the code was forgotten.

Then I read about InitAll in https://www.microsoft.com/en-us/msrc/blog/2020/05/solving-un...

  InitAll - Automatic Initialization

  In addition to the previously mentioned approaches, Microsoft is now using a feature known as InitAll which performs automatic compile-time initialization of stack variables.

  This section documents how Windows is using this technology and the rationale for why.

  Current Windows Settings

  The following types are automatically initialized:

  - Scalars (arrays, pointers, floats)
  - Arrays of pointers
  - Structures (plain-old-data structures)

  The following are not automatically initialized:

  - Volatile variables
  - Arrays of anything other than pointers (i.e. array of int, array of structures, etc.)
  - Classes that are not plain-old-data

  For optimized retail builds, the fill pattern is zero. For floats the fill pattern is 0.0.

  For CHK builds or developer builds (i.e. unoptimized retail builds), the fill pattern is 0xE2. For floats the fill pattern is 1.0.

yieldcrv 12 hours ago

> All in all, it took this program 256 kilobytes of code to initialize 64 kilobytes of data.

solidity sweating profusely