The time the x86 emulator team found code so bad they fixed it during emulation

451 points - today at 4:46 AM

Comments

psanchez today at 6:23 AM

This reminds me of a story from 15 years ago, where I was developing a technology to download games on demand by hooking into the OS calls.

There was a particular game that was superslow when this tech was applied. Original game loading took around 15-20 seconds, whereas once the tech was applied it took easily 3-5 min, even with all data already downloaded.

When I started digging into it, I realized the reason was the game was using something like

   fread(data, 1, 65536, fptr);

instead of

   fread(data, 65536, 1, fptr);

Which basically expanded back in the day to 65k reads of 1 byte for several MB file. Each fread translated to 65k reads of ReadFile Windows API. Since my code was hooking on ReadFile system call, and my call was heavier than ReadFile, the game loading felt really slow. Unusable. It would have not been fun for players.

The easy fix was to swap arguments for certain calls. The long fix required to use an internal cache to account for these cases so that the hooked ReadFile was faster when data was already in disk.

Funny thing is that as we started rolling out the tech and applying it to more and more games we realized lots of games did this. We went for the cache fix and games ended up loading faster than before. Honestly, games could have load all the data in a couple of seconds by just swapping the args. I'm guessing developers did this on purpose so that games seemed like they were loading a lot of stuff, although you never know.

dlcarrier today at 5:46 AM

SimCity had a read-after-free bug that Microsoft patched in Windows 95. That was a lot easier for customers than having Maxis fix it, which could have required exchanging copies of the game.

andikleen2 today at 6:07 PM

Dave Jones used to have a series of "Why user space sucks" Linux kernel conference talks with many such examples, usually with dumb and redundant system calls.

However as someone who looks a lot at instruction traces I could probably write on e on why Linux kernel code sucks too. One of my current pet peeves is the way Linux walks bitmasks of CPU bits, which is a reasonably common operation. Due to a chain of unfortunate changes and decisions it currently needs 16+ instructions to find the next bit for something which the x86 instruction set has a single instruction. Of course that is so big that it is even outlined, adding even more overhead.

hodgehog11 today at 5:17 AM

I think we're starting to see more of this sort of thing happening now with Proton and Wine gaining prominence in the Linux community. Some games (Elden Ring comes to mind) have bad enough PC ports when they come out that the compatibility layer can incorporate a hotfix to improve performance, while users of the software on the original platform still had to suffer.

selcuka today at 6:03 AM

To be fair it is possible that the developer enabled a special "unroll all loops, no matter what" optimisation flag during compilation.

I agree it would be stupid for a compiler to even support such a flag, but those were the 1980s/90s.

kazinator today at 6:36 AM

> Anyway, my colleague found that there was one program that needed to allocate around 64KB of memory on the stack and initialize it. The standard way of doing this is to perform a stack probe to ensure that 64KB of memory is available, then subtracting 65536 from the stack pointer, and then initializing the memory in a small, tight loop.

Actually, the standard way of allocating 64 kB of memory on the stack is to just assume you can do it, subtract 64k from the stack pointer, and hope for the best.

Most stack allocations in the wild are not checked.

ashdnazg today at 8:42 AM

I worked on a transpiler from Nand2tetris assembly to WebAssembly, and had some really annoying memory corruption bug that I just couldn't solve.

That is, until I checked the program I used for testing (which I didn't write), and found the following code:

  dealloc(this)
  return this->field

With the original allocator, this worked fine, since the deallocation didn't touch the memory.

My allocator, however, overwrote the field during the deallocation with bookkeeping stuff, which meant the returned value was not what the programmer intended and after a short while the program crashed.

Unlike TFA, I had the luxury of just fixing the test program.

zimmund today at 3:36 PM

I can't stop thinking about all the unoptimized code we have around. As processors (and memory) over the last 2-3 decades improved faster than we needed to fix the inefficiencies we created, we silently accepted that we don't need efficiency everywhere. So maybe a compiler, an emulator or some critical piece of code were created with this in mind, but the average app or website just waste resources left and right and pray for the best.

With more and more code being written with AI (which has notoriously inefficient solutions to simple problems), I expect this issue to become more prevalent. I just hope we optimize at the source of the problem (AI and humans using it) and not on platforms (compiler and engine/kernel heuristics)

cranx today at 12:27 PM

Loop unrolling is a basic compiler optimization and depending on the machine language and processor instruction set should be faster taking into account all the house keeping required to execute a conditional, jump, move register values etc. This article is missing the analysis of why. If someone didn’t “like” it and was offended then that seems like an equally silly reason. On the surface 256k to init less does seem silly, but what if it was faster?

classichasclass today at 5:22 AM

Betting Alpha was the native architecture in question. It seemed to have the best support.

0xdecrypt today at 1:09 PM

256 KB of code to zero 64 KB of memory is the kind of optimization that makes you question every life choice that led to it.

jeffbee today at 5:36 AM

People from Transmeta told me stories about how their translators were full of special case optimizations to fix horrors they discovered in Microsoft Windows itself.

electroglyph today at 5:58 AM

heh, when Raymond Chen dunks on the MSVC team =)

ant6n today at 7:04 AM

Arguably more of an optimization, rather than a fix. Looks like un-unrolling a loop, or better, rolling a loop. Or rolling straight line code?

m1r today at 5:22 AM

Couldn't they just turn the optimization off for this loop?

notorandit today at 5:25 AM

> they fixed it during emulation

It means the fix was applied to run during the emulation loop execution, not that the fix was found and applied while the emulation loop was running.

Which would have made it an emulation code escape.

pantulis today at 1:06 PM

I was just curious and checked The Old New Thing archive... yes I've been reading Raymond Chen's stories for as long as I remember but hey, it's been 23 years of delivering consistently solid stories about Windows.

canucker2016 today at 4:29 PM

I was looking through the compiler docs about memory allocation and I found the section about the debug version of the CRT which could fill the allocated memory with a non-zero canary value to help detect uninitialized memory (assuming you weren't calling calloc - which zero-init's allocated memory).

But there wasn't any similar programmatic debugging aid for detecting uninitialized stack memory.

Going further down the rabbit hole, I discovered the _chkstk function.

The MS C compiler would emit a call to _chkstk on function entry to ensure that stack memory had been paged in. But further reading noted that _chkstk was only emitted if the function allocated a lot of stack memory. And there was source code! MS included the assembly language source code for _chkstk in the CRT source code, installed with compiler.

I needed _chkstk to be emitted for every function not only for functions that allocated >= 4KB of stack variables.

Curses, foiled again.

Then, while perusing the list of compiler command line switches, I see "/Ge".

  /Ge (Enable Stack Probes)

  Activates stack probes for every function call that requires storage for local variables.

Ahhhhh! The grey, storm clouds parted and the sun rays bathed shone down on me in their warmth.

I had all the pieces I needed to fill uninitialized stack memory with a non-zero canary value so I could make detection of uninitialized stack variables more reliable.

_stkfil was born

Modifying _chkstk was easy. I needed to write to every byte of stack in a stack page instead of reading only 4 bytes and skipping to the next page of stack.

While I was mucking in the bowels of modifying _chkstk, I added a 4-byte global variable to hold my canary value. Let the app override what value to use.

In debug builds, _stkfil helped find a couple of bugs, but soon all the stray uninited stack vars were gone and the code was forgotten.

Then I read about InitAll in https://www.microsoft.com/en-us/msrc/blog/2020/05/solving-un...

  InitAll - Automatic Initialization

  In addition to the previously mentioned approaches, Microsoft is now using a feature known as InitAll which performs automatic compile-time initialization of stack variables.

  This section documents how Windows is using this technology and the rationale for why.

  Current Windows Settings

  The following types are automatically initialized:

  - Scalars (arrays, pointers, floats)
  - Arrays of pointers
  - Structures (plain-old-data structures)

  The following are not automatically initialized:

  - Volatile variables
  - Arrays of anything other than pointers (i.e. array of int, array of structures, etc.)
  - Classes that are not plain-old-data

  For optimized retail builds, the fill pattern is zero. For floats the fill pattern is 0.0.

  For CHK builds or developer builds (i.e. unoptimized retail builds), the fill pattern is 0xE2. For floats the fill pattern is 1.0.

yieldcrv today at 5:44 AM

> All in all, it took this program 256 kilobytes of code to initialize 64 kilobytes of data.

solidity sweating profusely

phantasmat today at 5:45 PM

[dead]

deleted today at 7:13 AM

rohitsriram today at 7:03 AM

[flagged]