About Source Tickrates

This is really simple and short, but it paves the way for a bug we encountered in SourceMod. What is a tickrate in Half-Life 2?

A game is like a film; it has “frames,” and the number of frames per second determines the smoothness of the experience. Don’t confuse this with graphical FPS, which, although related, is different. In Half-Life 2, a “frame” is a short burst of time in which all entities, physics, client updates, and netcode are processed. By running frames at specific intervals, a game can simulate real time without using an extraordinary amount of CPU.

The default interval in Half-Life 2 is 15ms. That is, every 15ms, a frame is processed. The engine, in pseudocode, does this:

TIME = 0
WHILE TRUE
   RUN FRAME
   TIME = TIME + INTERVAL
   SLEEP
END WHILE

In this manner, a global time value is updated to keep track of how many intervals have passed. The mod uses this time value to schedule all internal actions. The subtle note is that this time is not related to ‘real time’. Any number of actual clock seconds may pass in between that interval. For example, if you were to halt the game in a debugger, then resume it five minutes later, its internal time would be unaffected. It would pick up where it left off and continue as if nothing happened.

In Half-Life 2, mods are able to set their own tickrates. CS:S doubles the default interval to 30ms. You can also change the default tickrate with the -tickrate command line parameter. Any value over 10 will result in a new tick interval of 1.0/x seconds. For example, a tickrate of 100 gives a 10ms interval, and a tickrate of 66 gives a 15.15ms interval. Sometimes this is advantageous; for example, CS:S has smoother gameplay on higher tickrates (“100 tick servers” are popular), though it consumes more CPU.

Simple reverse calculation: if the tick interval is 30ms, the number of ticks per second is 1/0.03, or about 33.3.

Next week: a misconception about ticking accuracy caused a flaw in both AMX Mod X and SourceMod’s timer systems.

Update 2007/12/09: CyberMind pointed out a typo in my math.

Team Fortress Crashes – Not my fault!

This week a TF2 update started crashing SourceMod servers. It seemed to happen with random combinations of third party plugins, and only after 20-30 seconds. Naturally, everyone pointed their fingers at us. After all, it only crashes when running SourceMod, right?

The call trace was incomprehensible:

The first task was deducing the correct trace. I started off by inspecting the stack; that is, the memory at the ESP register:

Where was 0x0BBD464B located? Looking at the modules list:

The address 0x0BBD464B was in the range of engine.dll. So I opened engine.dll in IDA, which said that the image base was 0x10000000:

Since in Visual Studio engine.dll was loaded at 0x0BB60000, I calculated 0x10000000 + (0x0BBD464B - 0x0BB60000), and got 0x1007464B. Jumping to this address in IDA:

The address on the stack is the return address pushed from the call instruction on the second to last line. The goal was to find what preceded this function in the callstack, so I analyzed this function’s stack usage:

  1. 0x514 is subtracted from ESP.
  2. Four push instructions subtract 0x10 from ESP.
  3. Five push instructions subtract 0x14 from ESP.
  4. 0x14 is added to ESP.

Thus, this function reserves a net 0x524 bytes of stack. So, next I inspected ESP+0x524, or 0x00128F30:

It looked like there was another engine function on the stack (0x0BBD4C40), but more interesting were its parameters. I inspected 0x0012AFF4 and got lucky:

What’s this? It’s an A2S_RULES reply! Now I had enough data to make a guess: the server was crashing trying to reply to a rules query. I started a Team Fortress server on a weird port so that incoming queries would be blocked by my router. After about two minutes, there were no crashes. I started HLSW and queried my server. It crashed instantly.

Finally, the reason was confirmed. Third party plugins weren’t causing the crash. Third party plugins usually add public cvars, which get added to A2S_RULES replies. Valve had probably broken the mechanism which split up large query reply packets, and even a few extra public cvars were pushing it over the edge.

I sent a quick e-mail to Alfred Reynolds who confirmed the bug and said it would be fixed shortly. I’m glad the bug wasn’t on our side, but I wish Valve would release PDBs of their binaries.

(Note: I’ve posted this off-schedule so the next article will be on December 3rd. Enjoy your Thanksgiving!)

Server Query Optimization, Part 2

Last week I described a nasty situation. My program to query all Half-Life 2 servers used 50 threads and 500MB of RAM. Today I’m going to discuss the solution I used toward making it more efficient.

First, I observed sslice’s solution. He had a Python script which spawned two threads: one sent packets, and the other received. His Python script, with far less overhead, also completed in five minutes. I was doing something wrong.

However, I couldn’t just use his solution outright. He was only querying A2S_INFO, which sends one reply, so there was a cheap shortcut: thread 1 pumped a single socket full of packets, and thread 2 polled the same socket for replies. We needed to add multiple successive queries into the fray, which meant storing state information about each IP:Port being handled.

That was the first part of the solution. I took the netcode and moved it into an object, CQueryState, whose member variables were essentially the stack variables for the thread function. Then I converted the code to be state-based. “Flattening” such code was a tedious process.

Pseudo-code of per-thread state machine model:

FUNCTION PROCESS_SERVER
  SEND A
  WAIT 2 SECONDS
  RECV B
  SEND C
  WAIT 2 SECONDS
  RECV D
END

Pseudo-code of object-per-state-machine model:

FUNCTION CQueryState::ProcessState
  IF STATE == SEND_A
    SEND A
    SENT_TIME = CURRENT_TIME
    STATE = RECV_B
  ELSE IF STATE == RECV_B
    IF RECV B
      STATE = SEND_C
    ELSE IF CURRENT_TIME - SENT_TIME >= 1 SECOND
        STATE = GET_NEW_SERVER
    END IF
  ELSE IF STATE == SEND_C
    SEND C
    SENT_TIME = CURRENT_TIME
    STATE = RECV_D
  END IF
END

With the overhead of threads gone, I had to tackle how to actually process these objects. sslice’s model was to have one thread for processing sends, and one thread for processing receives. Porting that to multiple send/receive states felt complicated, and the sending thread spent most of its time sleeping (to avoid overflowing the UDP queue).

The solution I chose was to simulate threading. As each state machine was non-blocking, it was feasible to process a huge number of them at a time, sleep for a short period of time, then process again. The new algorithm became:

  1. Loop through every CQueryState object:
    1. Process the state.
    2. If the state is “done,” push the object into a processing queue, and in its place, put the next server we need to query.
  2. Sleep for 1ms or so. This adds some needed delay time into the system’s packet processing.
  3. Go back to step 1.
  4. Meanwhile, a separate thread processes completed state objects.

I defaulted to 150 state objects. With 1ms of delay in between frames (a frame being a processing of all state objects), querying 35,000 servers produced the following results:

  • Memory usage was reduced from 500MB to about 20MB at most.
  • Completion time was reduced from around 330 seconds to 90 seconds (including statistics computation+uploading). The Half-Life 1 version was likewise reduced to around 150 seconds.
  • Disk usage reduced from ~70MB to 0 bytes.
  • Thread count was reduced from 50 to 1 (State machines get pushed and popped from a separate thread to handle BZ2 decompression of packets).
  • False negatives (detecting alive servers as dead) were reduced from about 4,000 to around 250.
  • The Perl text processor was completely eliminated.

The solution of converting each thread to a “mini-thread,” and simulating each mini-thread every few milliseconds, was an astounding success. I don’t think I’ve ever experienced such improvements in a programming project before; nor will I likely again, since in retrospect, the original design was extremely flawed. I’ve chalked this incident up to “learning by experience.”

Other notes: In the current program, I used one socket per state object. It’s probably feasible to rework this to use one socket, but it’d be a lot more work and memory to map the incoming packets back to viable state objects. The other interesting suggestion on IRC was by OneEyed, who suggested creating N sockets, then select()ing on them to see which ones have changed (and thus will have a viable state transition). It sounds feasible, but I have not tried it yet (nor can I see any huge benefits).

Server Query Optimization, Part 1

For a little more than a year we’ve had a little tool (mostly used by us) which tracks stats. The database structure was rather flimsy, so one day I decided to rewrite it. The database overhaul made huge progress toward future expandability, but the weight of old, sloppy code began to overtake the efficiency of the operation.

Note: If the solution results seem obvious/trivial, you are ahead of me, as my level of netcode experience is only introductory.

The entire operation originally took place with the following steps:

  1. A C++ program was invoked:
  2. The Half-Life 1/2 master was queried. This gave a list of around 35,000-40,000 IP addresses and ports.
  3. 50 pthreads were created (this is Linux-only). Each thread was responsible for taking a slice of the server list. That is, each one queried exactly N/50 servers.
  4. Each query operation waited for a reply using select() and a 2-second timeout. If the server timed out, it was dropped from the list and the next server was read.
  5. Each thread queried its servers one at a time. Information contained basic info (players, mod, operating system, cvar key/value pairs). The information was allocated on the heap and cached in a global array.
  6. Once all threads were complete, the global array was dumped to a giant XML-style text file. The file was around 70MB.
  7. The C++ program terminated.
  8. A perl script read the file into memory, storing all sorts of details in huge hash arrays. That information got computed and thrown into the database.

For Half-Life 2, this process took just over six minutes in total: five minutes to query, and one minute to parse. However, we had a problem with deadlocks from the complexity of the threading model. When I attached GDB and dumped a core file, I was shocked to see the little C++ program using 500MB of memory!

My first intuition was that the latter portion of the process was inefficient. Caching all query results in memory, then dumping them to a file, only to be squeezed through perl, was bad. The statistics could be computed on the fly, discarding and re-using memory efficiently. Then we could eliminate the obese text file and perl script entirely. Conclusion: moving statistics generation into the C++ program would be lightning fast, in-memory, and require no text parsing.

However, after a few quick tests, I found this wasn’t the cause of the memory overusage — it was the thread count alone. Each call to pthread_create added about 12MB of overhead. That might seem small for one or two threads, but with 50, it equaled a massive amount of memory. Unfortunately, using fewer threads meant the query operation took much more time. Below is a graph depicting this (click to enlarge):

After discussing the matter in IRC, sslice and devicenull helped me to see that there were essentially two bottlenecks working together. While it was efficient to try and do fifty things at once, it wasn’t efficient to wait for a timeout. The call to select() meant that each thread would simply idle, ruining overall efficiency. The second bottleneck is that only so many UDP packets can be sent at a time before things start getting mucky. I hadn’t even approached that. Because each thread spent most of its time blocking, the properties of UDP weren’t being fully utilized.

For comparison, sslice had a Python script with only two threads (one for sending packets, one for receiving) that was able to accomplish almost the same task in the same time. However, his model was based on one socket for many connections, something that does not scale well to querying server rules (I’ll explain this further next week).

Next week: the solution, using non-blocking sockets and state machines.

Using the Wrong Return Value

One thing I continually see which bugs me is using the wrong return value from a function. For example, take the following function:

/**
 * Does something.
 *
 * @param error		On failure, a null-terminated error message is printed to this buffer.
 * @param maxlength	Maximum length, in bytes, of the message buffer.
 * @return		True on success, false on failure.
 */
bool DoSomething(char *error, size_t maxlength);

One thing people will start doing is this:

char error[255];
error[0] = '\0';
DoSomething(error, sizeof(error));
if (error[0] != '\0')
{
	PrintErrorMessage(error);
}

This is a subtle logical fallacy. The documentation says that “failure implies a null-terminated string.” That proposition doesn’t imply that “failure implies a string of length greater than 0” or that “success implies a string of length zero.” It simply says that if the function fails, the buffer will contain a valid string.

Of course, eliminating the zeroing of the first byte would make this code completely wrong. As it stands, the code will only be incorrect when DoSomething() returns false and has no error message available.

Perhaps the documentation could be clearer. For example:

/**
 * Does something.
 *
 * @param error		On failure, a null-terminated error message is printed to this buffer.  If 
 *			no error message is available, an empty string is written.  The buffer is 
 *			not modified if DoSomething() returns successfully.
 * @param maxlength	Maximum length, in bytes, of the message buffer.
 * @return		True on success, false on failure.
 */
bool DoSomething(char *error, size_t maxlength);

However, to the astute reader, this is just clarifying the obvious: the function only modifies the input buffer when it says it will, and an empty string is a valid, null-terminated string.

This mistake was very common in AMX Mod X, where return values were often ill-defined, non-uniform, and coupled with strange tags. Another case of this arises through callbacks. SourceMod has a callback that essentially looks like:

public SomethingHappened(Handle:hndl1, Handle:hndl2, const String:error[], any:data);

The rules for determining an error depend on the inputs and values of hndl1 and hndl2. These are well-defined. However, if you don’t read these rules, or simply ignore them, it’s tempting to just check the contents of error. Unfortunately, the API does not specify that checking the error buffer is a valid method of checking for an error, and your assumption could potentially lead to a bug.

Conclusion: If a function has a return value that definitively determines success or failure, use it. Don’t rely on testing values that are merely secondary in the failure process.

Backwards Compatibility Means Bugs Too

Sometimes backwards compatibility means preserving bugs. When working on AMX Mod X 1.8.0, we found a very nasty bug in the code for creating new-menus.

For the unfamiliar, new-menus are a replacement for the original menu code (now dubbed “old-menus”). They were designed to be much simpler to use, though the basics are similar:

  • menu_create() binds a new-menu to a new-menu callback.
  • register_menucmd() binds an old-menu to an old-menu callback.

The two callbacks are very different: new-menu callbacks have three parameters (menu, client, item). Old-menu callbacks have two parameters (client, keypress). They are mutually exclusive.

Yet, there was a serious, accidental bug in AMX Mod X 1.7x — menu_create called register_menucmd on the same callback it used for its own callback. Effectively, your new-menu callback could receive random, invalid data because it was being invoked as an old-menu handler and a new-menu handler. This created almost untraceable and inexplicable bugs in a few plugins that had ventured into new-menu territory.

Needless to say, the code was removed very fast. Then when it came time to beta test 1.8.0, we received two reports that a specific plugin had stopped working. One of its menus was broken. After successfully reproducing the problem, I opened the plugin’s source code and found why.

The plugin was calling menu_create on callback X. However, it never called any of the other relevant new-menu functions. Instead, it used old-menus to actually draw and display the menu. Effectively, the author had replaced register_menucmd with menu_create. He probably intended to rewrite the rest of his code as well, but forgot. After all, his functionality continued to work as normal! So when AMX Mod X 1.8.0 removed this bug, his plugin broke.

Deciding whether to insert backward compatibility shims is always tough. Generally, we always want plugins to work on future versions, no matter what. There is a rough checklist I went through.

  • How many plugins are affected by the change? One.
  • How many people are affected by the change? 100-200. Punishing one author punishes all of these users.
  • How much code exists that is affected by the change? Unknown, assumed minimal.
  • Is the compatibility change a design decision, or a bug fix? How severe? Severe bug fix.
  • Is the break occurring from valid API usage or invalid API usage? If invalid, is it a mistake that is pervasive? Invalid usage localized to one line in one plugin.
  • Is backwards compatibility possible? No, the flaw is too severe to revert in any form.
  • Are any of the affected plugins no longer being maintained? No.

Given the above checklist, we decided to allow for this one plugin to receive reverted functionality. menu_create will call register_menucmd if and only if the plugin matches X and the plugin’s version matches Y. We notified the author saying, “if you release version Y+1, it will break on future AMX Mod X versions, so you will want to fix it.”

The conclusion: backwards compatibility sometimes means preserving bugs. Unfortunately, this gives new meaning to bugs being “undocumented features.” In the end, choosing whether to preserve bugs (or indeed any compatibility) depends on how many users you want to affect and how badly you want to affect them.

Design Decision I Most Regret

I previously mentioned that I’m a stickler for good platform code. Of all the design decisions I’ve ever made, there is one that I regret above all others. To this day, it haunts me mercilessly. It has appeared over and over again in countless lines of code for Half-Life 1, and silently passed to a few Half-Life 2 plugins. What is it?

The [g|s]_et_pdata functions in AMX Mod X allow you to reach into a CBaseEntity’s private data (i.e., its member variables) and pull out values given an offset from the this pointer. What’s the problem? Let’s take a look at how each of the retrieval mechanisms work:

  • 4-byte retrievals are given by a 4-byte aligned offset ((int *)base + offset)
  • Float retrievals are offset by a float aligned offset ((float *)base + offset)

At the time, I probably didn’t understand how addressing or memory really worked. I was also basing the API off years of code that already made these mistakes — learning from an incorrect source doesn’t help. So, what’s the actual problem?

In C/C++, given pointer X and offset Y, the final address is computed as X + Y * sizeof(*X). That means when we do (int *)base, the actual offset becomes offset * sizeof(int), or 4*offset. Not only have we lost byte addressability, but to get unaligned values we have to do various bitwise operations. This type of deep-rooted mistake is easily visible in the following code in the cstrike module:

if ((int)*((int *)pPlayer->pvPrivateData + OFFSET_SHIELD) & HAS_SHIELD)
	return 1;

It turns out HAS_SHIELD is a bitmask to clear unwanted data, when in reality, the offset is wrong and (int *) should be (bool *).

Alas, backwards compatibility prevents us from ever making these API changes, and changing internal usage would then confuse users further.

In summary, the practice of casting to random types is ridiculous and SourceMod’s API summarily kills it off. The entire practice can now be expressed in a function like this:

template <typename T>
void get_data(void *base, size_t offset, T * buffer)
{
   *buffer = *(T *)((char *)base + offset);
}

Pointer Invalidation, or When Caching is Dangerous

One of the most tricky things about C++, as opposed to a garbage-collected language, is the lifetime of a pointer (or the memory it points to). The standard cases are the ones beginners learn pretty quickly:

  • “Forever”: Static storage means global lifetime
  • Short: Local storage means until the end of the current scope block
  • Very short: Objects can be temporarily created and destroyed by parameter passing, return values, or operator overloads.
  • Undefined: Manually allocated and destroyed memory

A more subtle type of pointer invalidation is when storing objects in a parent structure. For example, observe the following code:

    std::vector<int> v;
    v.push_back(1);
 
    int &x0 = v[0];
 
    printf("%d\n", x0);
 
    v.push_back(2);
 
    printf("%d\n", x0);

Running this little code block on gcc-4.1, I get:

[dvander@game020 ~]$ ./test
1
0

What happened here? I took an address (references, of course, are just addresses) to a privately managed and stored object (in this case, the object is an integer). When I changed the structure by adding a new number, it allocated new memory internally and the old pointer suddenly became invalid. It still points to readable memory, but the contents are wrong; in a luckier situation it would have crashed.

What’s not always obvious about this type of mistake is that the vector is not being explicitly reorganized, as would be the case with a call like clear(). Instead, it’s important to recognize that if you’re going to cache a local address to something, you need to fully understand how long that pointer is guaranteed to last (if at all).

This type of error can occur when aggressively optimizing. Imagine if the previous example used string instead of int — suddenly, storing a local copy of the string on the stack involves an internal new[] call, since string needs dynamic memory. To avoid unnecessary allocation, a reference or pointer can be used instead. But if you’re not careful, a simple change will render your cached object unusable — you must update the cached pointers after changes.

There was a very typical error in SourceMod that demonstrated this mistake. SourceMod has a caching system for lump memory allocations. You allocate memory in the cache and the cache returns an index. The index can give you back a temporary pointer. Observe the following pseudo-code:

int index = cache->allocate(sizeof(X));
X *x = (X *)cache->get_pointer(index);
//...code...
x->y_index = cache->allocate(sizeof(Y));

Can you spot the bug? By the time allocate() returns, the pointer to x might already be invalidated. The code must be:

int index = cache->allocate(sizeof(X));
X *x = (X *)cache->get_pointer(index);
//...code...
int y_index = cache->allocate(sizeof(Y));
x = (X *)cache->get_pointer(index);
x->y_index = y_index;

There are many ways to create subtle crash bugs by not updating cached pointers, and they often go undetected if the underlying allocation doesn’t trigger address changes very often. Worse yet, whether that happens or not is often very dependent on the hardware or OS configuration, and serious corruption or crash bugs may live happily undiscovered for long periods of time.

The moral of this story is: Be careful when keeping pointers to where you don’t directly control the memory. Even if you’ve written the underlying data structure, make sure you remember exactly what the pointer lifetime is guaranteed to be.

I’ve Dumped TortoiseSVN

I hate TortoiseSVN. The interface is great and it makes working with SVN a lot easier, but it’s so buggy it’s unbearable. Just a few of my complaints:

  • It’s slow. After a fresh install of Windows XP and TortoiseSVN, it completely locked up Explorer while it attempted to traverse a massive repository I had.
  • It causes weird explorer behavior. In particular, it feels like the screen is getting redrawn way too much.
  • TSVNCache is a horrible idea, or at least horribly coded. I often find that I can’t eject removable media because TSVNCache is holding file locks on random files, even if explorer is closed. Why? I have no idea. The only solution has been to kill it.
  • Sometimes, files randomly won’t have the right overlay for their status. Sometimes refreshing works, sometimes you have to exit (or even kill) explorer. In worse cases, TSVNCache must be killed, and in the worst case, you have to reboot (I’ve only had this happen once).
  • On my Core 2 Duo with 2GB of RAM, TortoiseSVN adds noticeable delay to explorer. It takes a split second longer to draw everything. The bigger (or deeper) your repository gets, the longer the delay seems to be.

So, TSVN is buggy and slow. I now use the command line on Windows instead. Getting this working with SSH keys is a bit of a chore, but here’s what I did:

  • Download the Windows PuTTY installer (I installed to C:\Program Files\PuTTy). Click here to get the exact package I did.
  • Get the latest Subversion package for Windows. I chose the binaries compiled against “Apache 2.2” — the exact package was “svn-win32-1.4.5.zip.”
  • Extract the zip file to somewhere. I chose C:\Tools and renamed the main subversion folder to get C:\Tools\Subversion.
  • Go to Start -> Settings -> Control Panel -> System -> Advanced -> Environment Variables.
  • Edit the “Path” variable under “System variables,” and append the following string to the end, using your subversion’s bin path: ;C:\Tools\Subversion\bin
  • Add a new variable under “User variables.” Call it “SVN_EDITOR” and set it to some text editor you like (mine points to C:\Windows\notepad.exe).
  • Open notepad, and open the file “Application Data/Subversion/config” where “Application Data” is the folder under your user account (i.e. “C:\Documents and Settings\dvander\Application Data\…” or just “C:\Users” on Vista).
  • Under the “[tunnels]” section, find the commented line that looks like:

    # ssh = $SVN_SSH ssh

    Replace it with:

    ssh = C:/Program Files/PuTTy/plink.exe

    Or something similar depending on where you installed plink.exe.

  • Load pageant from the same folder plink is in (C:\Program Files\PuTTy\pageant.exe for me).
  • Load your SSH keys into pageant.

If you don’t use SSH keys, or don’t know what they are, see this article.

You can now run subversion from the command line. Examples:

svn co svn+ssh://[email protected]/svnroot/sourcemm sourcemm
cd sourcemm
svn update
svn commit
...etc...

If it doesn’t work, make sure your environment is updated. For example, start a new cmd session rather than using an old one.

Note: I never had any problems with TortoiseCVS.

The Hell that MOVAPS Hath Wrought

One of the most difficult bugs we tracked down in SourceMod was a seemingly random crash bug. It occurred quite often in CS:S DM and GunGame:SM, but only on Linux. The crash usually looked like this, although the exact callstack and final function varied:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1209907520 (LWP 5436)]
0xb763ed87 in CPhysicsTrace::SweepBoxIVP () from bin/vphysics_i486.so
(gdb) bt
#0  0xb763ed87 in CPhysicsTrace::SweepBoxIVP () from bin/vphysics_i486.so
#1  0xb7214329 in CEngineTrace::ClipRayToVPhysics () from bin/engine_i686.so
#2  0xb7214aad in CEngineTrace::ClipRayToCollideable () from bin/engine_i686.so
#3  0xb72156cc in CEngineTrace::TraceRay () from bin/engine_i686.so

This crash occurred quite often in normal plugins as well. Finally, one day we were able to reproduce it by calling TraceRay() directly. However, it would only crash from a plugin. The exact same code worked fine if the callstack was C/C++. But as soon as the call emanated from the SourcePawn JIT, it crashed. Something extremely subtle was going wrong in the JIT.

After scratching our heads for a while, we decided to disassemble the function in question — CPhysicsTrace::SweepBoxIVP(). Here is the relevant crash area, with the arrow pointing toward the crashed EIP:

   0xb7667d7c :        mov    DWORD PTR [esp+8],edi
   0xb7667d80 :        lea    edi,[esp+0x260]
-> 0xb7667d87 :        movaps XMMWORD PTR [esp+48],xmm0
   0xb7667d8c :        mov    DWORD PTR [esp+0x244],edx

We generated a quick crash and checked ESP in case the stack was corrupted. It wasn’t, and the memory was both readable and writable. So what does the NASM manual say about MOVAPS?

When the source or destination operand is a memory location, it must be aligned on a 16-byte boundary. To move data in and out of memory locations that are not known to be on 16-byte boundaries, use the MOVUPS instruction.

Aha! GCC decided that it was safe to optimize MOVUPS to MOVAPS because it knows all of its functions will be aligned to 16-byte boundaries. This is a good example of where whole-program optimization doesn’t take external libraries into account. I don’t know how GCC determines when to make this optimization, but for all intents and purposes, it’s reasonable.

The SourcePawn JIT, of course, was making no effort to keep the stack 16-byte aligned for GCC. That’s mostly because the JIT is a 1:1 translation of the compiler’s opcodes, which are processor-independent. As a fix, faluco changed the JIT to align the stack before handing control to external functions.

Suddenly, an entire class of bugs disappeared from SourceMod forever. It was a nice feeling, but at least a week of effort was put into tracking it down. The moral of this story is that source-level debugging for “impossible crashes” is usually in vain.