One of the most difficult bugs we tracked down in SourceMod was a seemingly random crash bug. It occurred quite often in CS:S DM and GunGame:SM, but only on Linux. The crash usually looked like this, although the exact callstack and final function varied:
    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread -1209907520 (LWP 5436)]
    0xb763ed87 in CPhysicsTrace::SweepBoxIVP () from bin/vphysics_i486.so
    (gdb) bt
    #0  0xb763ed87 in CPhysicsTrace::SweepBoxIVP () from bin/vphysics_i486.so
    #1  0xb7214329 in CEngineTrace::ClipRayToVPhysics () from bin/engine_i686.so
    #2  0xb7214aad in CEngineTrace::ClipRayToCollideable () from bin/engine_i686.so
    #3  0xb72156cc in CEngineTrace::TraceRay () from bin/engine_i686.so
This crash occurred quite often in normal plugins as well. Finally, one day we were able to reproduce it by calling TraceRay() directly. However, it would only crash from a plugin. The exact same code worked fine if the callstack was C/C++. But as soon as the call emanated from the SourcePawn JIT, it crashed. Something extremely subtle was going wrong in the JIT.
After scratching our heads for a while, we decided to disassemble the function in question — CPhysicsTrace::SweepBoxIVP(). Here is the relevant crash area, with the arrow pointing toward the crashed EIP:
       0xb7667d7c:  mov    DWORD PTR [esp+8],edi
       0xb7667d80:  lea    edi,[esp+0x260]
    -> 0xb7667d87:  movaps XMMWORD PTR [esp+48],xmm0
       0xb7667d8c:  mov    DWORD PTR [esp+0x244],edx
We generated a quick crash and checked ESP in case the stack was corrupted. It wasn’t, and the memory was both readable and writable. So what does the NASM manual say about MOVAPS?
When the source or destination operand is a memory location, it must be aligned on a 16-byte boundary. To move data in and out of memory locations that are not known to be on 16-byte boundaries, use the MOVUPS instruction.
Aha! GCC decided that it was safe to use MOVAPS instead of MOVUPS because it assumes the stack is 16-byte aligned whenever one of its functions is called. That assumption holds as long as every caller was compiled to keep the stack aligned, but it doesn't take external callers into account. I don't know exactly how GCC determines when to make this optimization, but for all intents and purposes, it's reasonable.
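To make the failure condition concrete, here is a minimal sketch in C of the check the CPU effectively performs for `movaps XMMWORD PTR [esp+48],xmm0`. The helper names `is_aligned16` and `movaps_ok` are made up for illustration and are not part of SourceMod or the engine. Since 48 is a multiple of 16, the store is only safe when ESP itself is a multiple of 16.

```c
#include <stdbool.h>
#include <stdint.h>

/* MOVAPS requires its memory operand to sit on a 16-byte boundary. */
bool is_aligned16(uintptr_t addr) {
    return (addr & 0xF) == 0;
}

/* Hypothetical helper: would `movaps [esp+offset], xmm0` be legal
   for the given stack pointer value? */
bool movaps_ok(uintptr_t esp, uintptr_t offset) {
    return is_aligned16(esp + offset);
}
```

With an aligned ESP such as 0xbfffe000, `movaps_ok(esp, 48)` holds; shift ESP by 4 bytes and it no longer does, even though the memory is perfectly readable and writable.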
The SourcePawn JIT, of course, was making no effort to keep the stack 16-byte aligned for GCC. That’s mostly because the JIT is a 1:1 translation of the compiler’s opcodes, which are processor-independent. As a fix, faluco changed the JIT to align the stack before handing control to external functions.
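The arithmetic behind that kind of fix can be sketched in C. This is not the actual JIT change, which emits machine code and isn't shown here. It assumes the callee expects ESP to be a multiple of 16 at the moment of the call instruction, after the arguments have been pushed. The function names `call_padding` and `esp_at_call` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: the callee wants ESP % 16 == 0 at the call instruction,
   i.e. after arg_bytes of arguments have been pushed.
   Returns how many padding bytes to subtract from ESP before pushing. */
uintptr_t call_padding(uintptr_t esp, size_t arg_bytes) {
    return (esp - arg_bytes) & 0xF;
}

/* Stack pointer value at the call instruction once the padding and
   the arguments have both been pushed. */
uintptr_t esp_at_call(uintptr_t esp, size_t arg_bytes) {
    return esp - call_padding(esp, arg_bytes) - arg_bytes;
}
```

For example, with ESP at 0xbfffe00c and 8 bytes of arguments, the padding works out to 4 bytes, and ESP lands on 0xbfffe000 at the call. The caller also has to remove the padding along with the arguments once the call returns.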
Suddenly, an entire class of bugs disappeared from SourceMod forever. It was a nice feeling, but at least a week of effort was put into tracking it down. The moral of this story is that source-level debugging for “impossible crashes” is usually in vain.