Last week I described a nasty situation: my program to query all Half-Life 2 servers used 50 threads and 500MB of RAM. Today I’m going to discuss the solution I used to make it more efficient.
First, I observed sslice’s solution. He had a Python script which spawned two threads: one sent packets, and the other received them. His script completed in the same five minutes, with far less overhead. I was doing something wrong.
However, I couldn’t just use his solution outright. He was only querying A2S_INFO, which sends one reply, so there was a cheap shortcut: thread 1 pumped a single socket full of packets, and thread 2 polled the same socket for replies. I needed to add multiple successive queries into the fray, which meant storing state information about each IP:Port being handled.
That was the first part of the solution. I took the netcode and moved it into an object, CQueryState, whose member variables were essentially the stack variables for the thread function. Then I converted the code to be state-based. “Flattening” such code was a tedious process.
Pseudo-code of per-thread state machine model:
FUNCTION PROCESS_SERVER
    SEND A
    WAIT 2 SECONDS
    RECV B
    SEND C
    WAIT 2 SECONDS
    RECV D
END
Pseudo-code of object-per-state-machine model:
FUNCTION CQueryState::ProcessState
    IF STATE == SEND_A
        SEND A
        SENT_TIME = CURRENT_TIME
        STATE = RECV_B
    ELSE IF STATE == RECV_B
        IF RECV B
            STATE = SEND_C
        ELSE IF CURRENT_TIME - SENT_TIME >= 1 SECOND
            STATE = GET_NEW_SERVER
        END IF
    ELSE IF STATE == SEND_C
        SEND C
        SENT_TIME = CURRENT_TIME
        STATE = RECV_D
    END IF
END
With the overhead of threads gone, I had to tackle how to actually process these objects. sslice’s model was to have one thread for processing sends, and one thread for processing receives. Porting that to multiple send/receive states felt complicated, and the sending thread spent most of its time sleeping (to avoid overflowing the UDP queue).
The solution I chose was to simulate threading. Since each state machine was non-blocking, it was feasible to process a huge number of them in one pass, sleep briefly, then process again. The new algorithm (sketched in code after the list) became:
1. Loop through every CQueryState object:
   - Process the state.
   - If the state is “done,” push the object into a processing queue and, in its place, put the next server we need to query.
2. Sleep for 1ms or so. This adds some needed delay time into the system’s packet processing.
3. Go back to step 1.

Meanwhile, a separate thread processes the completed state objects.
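Assuming the CQueryState sketch above, the pump itself might look something like this. The deque-based queues and the sleep helper are simplified stand-ins; in particular, the real completed queue would need locking, since a separate thread drains it:

#include <deque>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#define SleepMs(ms) Sleep(ms)
#else
#include <unistd.h>
#define SleepMs(ms) usleep((ms) * 1000)
#endif

static const size_t kNumStates = 150;   // default pool size from the post

void RunQueryPump(std::deque<CQueryState *> &pending,    // servers left to query
                  std::deque<CQueryState *> &completed)  // drained by another thread
{
    std::vector<CQueryState *> active;

    // Fill the pool with the first batch of servers.
    while (active.size() < kNumStates && !pending.empty())
    {
        active.push_back(pending.front());
        pending.pop_front();
    }

    while (!active.empty())
    {
        // One "frame": give every state machine a non-blocking tick.
        for (size_t i = 0; i < active.size(); )
        {
            active[i]->ProcessState();
            if (!active[i]->IsDone())
            {
                i++;
                continue;
            }
            completed.push_back(active[i]);      // hand off for processing
            if (!pending.empty())
            {
                active[i] = pending.front();     // next server takes the slot
                pending.pop_front();
                i++;
            }
            else
            {
                active.erase(active.begin() + i);
            }
        }
        SleepMs(1);   // small delay so the UDP stack can keep up
    }
}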
I defaulted to 150 state objects. With 1ms of delay between frames (a frame being one pass over all the state objects), querying 35,000 servers gave the following results:
- Memory usage was reduced from 500MB to about 20MB at most.
- Completion time was reduced from around 330 seconds to 90 seconds (including statistics computation and uploading). The Half-Life 1 version dropped to around 150 seconds.
- Disk usage was reduced from ~70MB to 0 bytes.
- Thread count was reduced from 50 to 1 (completed state machines are pushed to, and popped from, a queue serviced by a separate thread that handles BZ2 decompression of packets).
- False negatives (live servers detected as dead) were reduced from about 4,000 to around 250.
- The Perl text processor was completely eliminated.
The solution of converting each thread to a “mini-thread,” and simulating each mini-thread every millisecond or so, was an astounding success. I don’t think I’ve ever experienced such improvements in a programming project before; nor will I likely again, since in retrospect, the original design was extremely flawed. I’ve chalked this incident up to “learning by experience.”
Other notes: In the current program, I used one socket per state object. It’s probably feasible to rework this to use one socket, but it’d be a lot more work and memory to map the incoming packets back to viable state objects. The other interesting suggestion on IRC was by OneEyed, who suggested creating N sockets, then select()ing on them to see which ones have changed (and thus will have a viable state transition). It sounds feasible, but I have not tried it yet (nor can I see any huge benefits).
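For the record, here is a rough, untested sketch of OneEyed’s idea, assuming one UDP socket per state object, with socks[i] belonging to states[i] (that pairing, and the vectors themselves, are my own scaffolding):

#include <sys/select.h>
#include <vector>

// Tick only the state machines whose sockets are readable right now.
// Note: descriptors must stay below FD_SETSIZE (typically 1024); fine for 150.
void PumpWithSelect(const std::vector<int> &socks,
                    std::vector<CQueryState *> &states)
{
    fd_set readfds;
    FD_ZERO(&readfds);

    int maxfd = -1;
    for (size_t i = 0; i < socks.size(); i++)
    {
        FD_SET(socks[i], &readfds);
        if (socks[i] > maxfd)
            maxfd = socks[i];
    }

    // Wait at most 1ms for any socket to become readable.
    struct timeval tv;
    tv.tv_sec = 0;
    tv.tv_usec = 1000;

    if (select(maxfd + 1, &readfds, NULL, NULL, &tv) > 0)
    {
        for (size_t i = 0; i < states.size(); i++)
        {
            if (FD_ISSET(socks[i], &readfds))
                states[i]->ProcessState();   // a recv transition is ready
        }
    }
}

One wrinkle: select() only reports readable sockets, so the loop would still need a periodic pass over all machines to drive the send states and expire timeouts, which may be part of why the benefits look marginal.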