Server Query Optimization, Part 2

Last week I described a nasty situation. My program to query all Half-Life 2 servers used 50 threads and 500MB of RAM. Today I’m going to discuss the solution I used to make it more efficient.

First, I observed sslice’s solution. He had a Python script which spawned two threads: one sent packets, and the other received them. His Python script, with far less overhead, also completed in five minutes. I was doing something wrong.

However, I couldn’t just use his solution outright. He was only querying A2S_INFO, which sends one reply, so there was a cheap shortcut: thread 1 pumped a single socket full of packets, and thread 2 polled the same socket for replies. I needed to add multiple successive queries into the fray, which meant storing state information about each IP:Port being handled.

That was the first part of the solution. I took the netcode and moved it into an object, CQueryState, whose member variables were essentially the stack variables for the thread function. Then I converted the code to be state-based. “Flattening” such code was a tedious process.

Pseudo-code of the per-thread model:

FUNCTION PROCESS_SERVER
  SEND A
  WAIT 2 SECONDS
  RECV B
  SEND C
  WAIT 2 SECONDS
  RECV D
END

Pseudo-code of object-per-state-machine model:

FUNCTION CQueryState::ProcessState
  IF STATE == SEND_A
    SEND A
    SENT_TIME = CURRENT_TIME
    STATE = RECV_B
  ELSE IF STATE == RECV_B
    IF RECV B
      STATE = SEND_C
    ELSE IF CURRENT_TIME - SENT_TIME >= 2 SECONDS
      STATE = GET_NEW_SERVER
    END IF
  ELSE IF STATE == SEND_C
    SEND C
    SENT_TIME = CURRENT_TIME
    STATE = RECV_D
  END IF
END
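
In real code, that flattening maps naturally onto a switch over a state enum. Here’s a rough C++ sketch of what the flattened version might look like; the enum values, member names, and helpers (SendQuery, TryReceive, IsDone) are my own illustrative guesses, not the actual code:

#include <ctime>

enum QueryStateId { SEND_A, RECV_B, SEND_C, RECV_D, GET_NEW_SERVER };

class CQueryState
{
public:
    void ProcessState();
    bool IsDone() const;   // hypothetical accessor: true once the final reply is in
private:
    bool SendQuery();      // non-blocking sendto() on this object's socket
    bool TryReceive();     // non-blocking recvfrom(); false if nothing has arrived yet
    QueryStateId m_State = SEND_A;
    time_t m_SentTime = 0;
};

void CQueryState::ProcessState()
{
    switch (m_State)
    {
    case SEND_A:
        SendQuery();
        m_SentTime = time(nullptr);
        m_State = RECV_B;
        break;
    case RECV_B:
        if (TryReceive())
            m_State = SEND_C;
        else if (time(nullptr) - m_SentTime >= 2)
            m_State = GET_NEW_SERVER;   // timed out; recycle this slot
        break;
    case SEND_C:
        SendQuery();
        m_SentTime = time(nullptr);
        m_State = RECV_D;
        break;
    default:
        break;
    }
}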

With the overhead of threads gone, I had to tackle how to actually process these objects. sslice’s model used one thread for processing sends and one thread for processing receives. Porting that to multiple send/receive states felt complicated, and the sending thread spent most of its time sleeping (to avoid overflowing the UDP queue).

The solution I chose was to simulate threading. As each state machine was non-blocking, it was feasible to process a huge number of them at a time, sleep for a short period of time, then process again. The new algorithm became:

  1. Loop through every CQueryState object:
    1. Process the state.
    2. If the state is “done,” push the object into a processing queue, and in its place, put the next server we need to query.
  2. Sleep for 1ms or so. This adds some needed delay time into the system’s packet processing.
  3. Go back to step 1.
  4. Meanwhile, a separate thread processes completed state objects.
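
In rough C++ terms, that driver loop might look like the following; the container and helper names (g_States, GetNextServer, PushCompleted) are my own, not the program’s:

#include <vector>
#include <unistd.h>   // usleep()

extern std::vector<CQueryState *> g_States;     // the ~150 live state objects
extern CQueryState *GetNextServer();            // next IP:Port from the master list
extern void PushCompleted(CQueryState *done);   // hands off to the worker thread

void RunFrames()
{
    for (;;)
    {
        // Step 1: process every state machine once per frame.
        for (size_t i = 0; i < g_States.size(); i++)
        {
            g_States[i]->ProcessState();
            if (g_States[i]->IsDone())
            {
                // Queue the finished object for the worker thread
                // and refill the slot with the next server.
                PushCompleted(g_States[i]);
                g_States[i] = GetNextServer();
            }
        }
        // Step 2: ~1ms between frames gives the network stack time to breathe.
        usleep(1000);
    }
}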

I defaulted to 150 state objects. With 1ms of delay between frames (a frame being one pass over all the state objects), querying 35,000 servers gave the following results:

  • Memory usage was reduced from 500MB to about 20MB at most.
  • Completion time was reduced from around 330 seconds to 90 seconds (including statistics computation and uploading). The Half-Life 1 version was reduced to around 150 seconds.
  • Disk usage reduced from ~70MB to 0 bytes.
  • Thread count was reduced from 50 to 1 (completed state machines are pushed to, and popped by, a separate thread that handles BZ2 decompression of packets).
  • False negatives (detecting alive servers as dead) were reduced from about 4,000 to around 250.
  • The Perl text processor was completely eliminated.

The solution of converting each thread to a “mini-thread,” and simulating each mini-thread every millisecond or so, was an astounding success. I don’t think I’ve ever experienced such improvements in a programming project before; nor am I likely to again, since in retrospect, the original design was extremely flawed. I’ve chalked this incident up to “learning by experience.”

Other notes: in the current program, I used one socket per state object. It’s probably feasible to rework this to use a single socket, but it’d take a lot more work and memory to map incoming packets back to their state objects. The other interesting suggestion on IRC came from OneEyed, who suggested creating N sockets, then select()ing on them to see which ones have changed (and thus have a viable state transition). It sounds feasible, but I haven’t tried it yet (nor can I see any huge benefits).
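
For reference, a minimal sketch of OneEyed’s idea, assuming POSIX sockets (note that fd_set caps the descriptor count at FD_SETSIZE, typically 1024):

#include <sys/select.h>

// Poll N sockets at once with a zero timeout; only state machines whose
// socket is readable need to be advanced this frame (timeouts would still
// have to be checked separately).
void PollSockets(CQueryState *states[], int socks[], int n)
{
    fd_set readfds;
    FD_ZERO(&readfds);
    int maxfd = -1;
    for (int i = 0; i < n; i++)
    {
        FD_SET(socks[i], &readfds);
        if (socks[i] > maxfd)
            maxfd = socks[i];
    }
    timeval tv = {0, 0};   // zero timeout: poll, don't block
    if (select(maxfd + 1, &readfds, nullptr, nullptr, &tv) > 0)
    {
        for (int i = 0; i < n; i++)
            if (FD_ISSET(socks[i], &readfds))
                states[i]->ProcessState();   // its recv is guaranteed not to block
    }
}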