NEP Dev Update #8: Networking is hard

by Cuchaz

I thought I was just about done working on the game engine. I was looking forward to adding more content to the game.

Then I remembered that networking is hard. =(

Non-Essential Personnel is a multiplayer game. And not just LAN multiplayer. It's over-the-internet multiplayer. Over-the-internet multiplayer means we have to deal with the horrific badness you find in large networks: latency, packet loss, packet re-ordering, and lag spikes. And Non-Essential Personnel is a real-time multiplayer game too.

If it weren't for that last bit, the real-time part, life would be pretty easy. If all you need to do is make sure your stream of packets gets transported from point A to point B at any leisurely time, then good ol' TCP will serve you well. TCP guarantees the receiving application will see your bits in exactly the same order they were sent, regardless of what massacre the internet performs on your stream. If TCP can't meet those guarantees, it will drop the connection and you can try again later.

The Horde engine (the custom engine I'm writing for Non-Essential Personnel) originally used TCP for network transport because it was super easy, it worked over a LAN, and I didn't need anything better. The trouble is, when you move a real-time application from a LAN to the internet, TCP stops being a good solution. The reason is how TCP meets its delivery guarantees. If the network drops a packet in a TCP stream, the protocol will keep resending it until the recipient gets it. In a real-time application, that means the entire stream is paused until the recipient gets the missing packets. But by the time that happens, we might not even care about those packets anymore, because the state of the game might have changed in the meantime.

For real-time games, this means the first lag spike a player sees will have long-lasting adverse effects even after network connectivity is completely restored. TCP wastes so much effort retransmitting stale data simply because it doesn't know we stopped caring about it.

Thankfully, the solution to this problem is simple.

Instead of re-sending old out-of-date information when packets get dropped, send new up-to-date information instead.

Unfortunately, TCP can't do this.

There's a simpler network protocol that can do this called UDP, but UDP makes no guarantees that your packets will arrive at all. UDP even has a pesky hurdle called the MTU, which means you have to break your messages up into separate packets if they're too large.
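
Just to make the MTU hurdle concrete, here's a minimal sketch of splitting a large message into fragments that fit under an assumed payload budget. The 1200-byte budget and the little fragment header are made up for this post; they're not HDP's actual wire format.

```kotlin
import java.nio.ByteBuffer

// Assumed payload budget, comfortably under typical path MTUs (made-up number).
const val MAX_PAYLOAD = 1200

// Split a big message into fragments small enough to send as individual datagrams.
// Each fragment carries a tiny header (message id, fragment index, fragment count)
// so the receiver can put the pieces back together.
fun fragment(messageId: Int, message: ByteArray): List<ByteArray> {
    val headerSize = 4 + 2 + 2 // messageId + index + count
    val chunkSize = MAX_PAYLOAD - headerSize
    val count = (message.size + chunkSize - 1) / chunkSize
    return (0 until count).map { i ->
        val start = i * chunkSize
        val end = minOf(start + chunkSize, message.size)
        ByteBuffer.allocate(headerSize + (end - start))
            .putInt(messageId)
            .putShort(i.toShort())
            .putShort(count.toShort())
            .put(message, start, end - start)
            .array()
    }
}
```

Reassembly on the other side is just the reverse: buffer fragments by message id until all of the pieces have arrived.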

If TCP is overkill for all traffic, and UDP is underkill for some traffic, it might seem like a good idea to just use both UDP and TCP at the same time. Send the packets you care about over TCP, and the packets you don't over UDP. Best of both worlds, right? The trouble with that is, the flow-control mechanism in TCP actually induces packet loss in a simultaneous UDP stream! Basically, the TCP stream hogs too much of the network connection and fewer UDP packets get through. What we really want is a transport protocol that's somewhere between TCP and UDP in terms of functionality, but I don't know of any widely-available options that actually do that. So everyone working in real-time networking usually just writes their own protocol.

Writing low-level networking protocols just to make a game work on the internet, yay! Just what I always wanted to do! That will definitely let me release my game sooner!

No wonder so few studios make internet multiplayer games...

*sigh*

So, without further ado, the rest of this blog post will explain my transport protocol for the Horde Engine called the Horde Datagram Protocol, or HDP. =)

The basic idea behind HDP is that different types of traffic need different reliability guarantees, and we want to implement all of them together in a single connection. (There's a little code sketch after the list below.)

1. Sometimes you really do want TCP-like behavior. Application control protocols (e.g., login/logout, initialization/cleanup) and sending relatively large messages (i.e., more than the MTU -- like level data) are good applications for totally reliable transport.

2. Sometimes, you really don't care if a packet never gets to the recipient, because you know you'll just send 30 more of those packets in the next second. Messages sent once every game loop iteration are a good candidate for unreliable transport, since you never want to hold up the connection if these get dropped. If you insisted on guaranteeing that all of these packets arrive safely, the recipient would likely never catch up again after a lag spike. It's usually better to extrapolate from old data and wait for the new data to arrive in the next iteration.

3. Sometimes, it's ok if a packet doesn't get there, but you just want to know it got dropped so you can decide what to do about it. This is a good application for state synchronization systems where we don't expect the state to change every game loop iteration. For example, when some piece of state changes on the client, we want to tell the server about the new state, so we send an update message. If that update message gets dropped, we still want the server to know about the new state, but we don't want to just retransmit the old message since the state might have changed again. So, every time we send data to the server, we'll send a new message to make sure we send the most up-to-date information. Player inventory changes are a good candidate for this kind of semi-reliable transport.
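
Here's a rough sketch of what those three options look like from the sender's point of view. The names are made up for this post and don't match the actual Horde code, but the shape is the important part: semi-reliable traffic gets a callback on drop, and the callback sends fresh state instead of the stale bytes.

```kotlin
// The three delivery guarantees, named just for this post.
enum class Delivery {
    RELIABLE,       // retransmit until acknowledged (TCP-like): logins, level data
    UNRELIABLE,     // fire and forget: per-tick snapshots
    NOTIFY_ON_DROP  // no retransmit, but tell the sender when a packet is lost: state sync
}

// A hypothetical outgoing message. For NOTIFY_ON_DROP traffic, onDrop lets the
// application decide what to do about a loss, usually by sending the current
// state rather than the old bytes.
class Outgoing(
    val delivery: Delivery,
    val payload: ByteArray,
    val onDrop: (() -> Unit)? = null
)

// Example: an inventory update that re-sends the latest state if the packet is lost.
fun sendInventoryUpdate(send: (Outgoing) -> Unit, currentInventoryBytes: () -> ByteArray) {
    send(Outgoing(
        delivery = Delivery.NOTIFY_ON_DROP,
        payload = currentInventoryBytes(),
        onDrop = { sendInventoryUpdate(send, currentInventoryBytes) } // fresh state, not the stale payload
    ))
}
```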

And you want all of these different transport options to work without one of them choking out the others. You can only send so much data per second over a network connection, and network protocols generally have no idea what that maximum rate is beforehand, so they estimate it (i.e., do flow control). If these speed estimates are wrong in a multi-stream environment (*cough* TCP *cough*), the protocol can flood the network connection and cause packet loss. TCP can always resend the dropped packets of course, but the other streams will see higher packet loss too. If those are unreliable streams, like UDP, then that traffic is just lost.

To deal with this issue, HDP implements all three of these transport options under one roof. Then all the different streams share flow control, to keep from flooding the network connection and inducing higher-than-needed packet loss for the less reliable transport options.
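
The "one roof" part is easy to picture: every channel, no matter its reliability level, spends from the same send budget, so the reliable traffic can't starve the rest. The token bucket below is just a stand-in for whatever flow-control estimator the connection actually uses, and the numbers are placeholders.

```kotlin
// One send budget shared by every channel on the connection.
// A token bucket standing in for a real flow-control estimator; rates are placeholders.
class SendBudget(private val bytesPerSecond: Double, private val maxBurstBytes: Double) {
    private var tokens = maxBurstBytes
    private var lastRefillNanos = System.nanoTime()

    // Refill based on elapsed time, then ask whether this packet fits the budget.
    // Over budget means: queue it (reliable) or skip it (unreliable), but never flood.
    fun trySpend(packetBytes: Int): Boolean {
        val now = System.nanoTime()
        tokens = minOf(maxBurstBytes, tokens + (now - lastRefillNanos) / 1e9 * bytesPerSecond)
        lastRefillNanos = now
        if (tokens < packetBytes) return false
        tokens -= packetBytes
        return true
    }
}
```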

I won't go into the details of how this is implemented in the Horde Engine, since there's nothing special at all about my protocol. Lots of other people have done things like this before. It's just bits of TCP implemented on top of UDP, with configurable behavior when a packet drop is detected. Apparently rolling your own transport protocol is just a necessary evil of programming real-time applications for unreliable networks. Maybe someday we'll see adoption of a more flexible transport protocol lower in the network stack than the application level, but until then, we're stuck with application-level solutions.
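
For the curious, "configurable behavior when a packet drop is detected" boils down to something like the snippet below, reusing the Delivery and Outgoing types from the earlier sketch. The sequence-number and ack bookkeeping that actually detects the drop isn't shown, and again, this is a sketch for the post, not the real Horde code.

```kotlin
// What the sender does once a packet has been declared lost, per delivery class.
// Uses the Delivery and Outgoing types from the sketch above.
fun onPacketLost(msg: Outgoing, resendRaw: (ByteArray) -> Unit) {
    when (msg.delivery) {
        Delivery.RELIABLE -> resendRaw(msg.payload)      // TCP-like: retransmit the original bytes
        Delivery.UNRELIABLE -> { /* nothing: next tick's packet supersedes it anyway */ }
        Delivery.NOTIFY_ON_DROP -> msg.onDrop?.invoke()  // let the app send fresh, up-to-date state
    }
}
```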

Next time, I really really want to make some more actual content for the game. Hopefully there isn't too much more technical debt lurking in the engine waiting for me to deal with it, and I can get on to making new things. =P