Monday, July 23, 2012

Gevent, Threads, and Benchmarks

In a previous post, I gave an introduction to gevent to show some of the benefits your application might get from using gevent greenlets instead of threads. Some people, however, took issue with my benchmark code, saying that the threaded example was contrived. In this post, I'll try to answer some of the objections.

(It actually turns out that there was a bug in the version of ab I was using to test, as well, so I re-ran the tests from the previous post, too.)

Threads versus Greenlets

Initially, I had proposed a dummy webserver that handled incoming requests by creating a thread and delegating communication to that thread. The code in question is below:

def threads(port):
    s = socket.socket()
    s.bind(('0.0.0.0', port))
    s.listen(500)
    while True:
        cli, addr = s.accept()
        t = threading.Thread(target=handle_request, args=(cli, time.sleep))
        t.daemon = True
        t.start()

When I could get the code above ot actually run the full benchmark (which it didn't often do) it ended up getting around 1300-1400 requests per second. The gevent version looked very similar:

import gevent

def greenlet(port):
    from gevent import socket
    s = socket.socket()
    s.bind(('0.0.0.0', port))
    s.listen(500)
    while True:
        cli, addr = s.accept()
        gevent.spawn(handle_request, cli, gevent.sleep)

This code was able to handle closer to 1600 requests per second. Maybe I should have called it out better, but the fact that the gevent version performed better than the threaded version does point out an important aspect of gevent:

Greenlets are significantly lighter-weight than true threads, particularly when creating them.

However, the folks objected by pointing out that you just don't do that with threads. Nobody does. It's a dumb way to design a server. I agree with all these points, though that wasn't really the point I was going for. One thing I will point out is that:

The reason you don't design threaded servers so they fork a thread each time you get a connection is that threads are expensive to fork, unlike greenlets.

Fixing the benchmark

So anyway, to "fix" the benchmark so it's a little more fair to threads, we'll use a thread pool to create all the threads up-front and then use a Queue.Queue to send work to them. Our server core now looks like this:

def threads(port, N=10):
    s = socket.socket()
    s.bind(('0.0.0.0', port))
    s.listen(500)
    q = Queue()
    for x in xrange(N):
        t = threading.Thread(target=thread_worker, args=(q,))
        t.daemon = True
        t.start()
    print 'Ready and waiting with %d threads on port %d' % (
        N, port)
    while True:
        cli, addr = s.accept()
        q.put(cli)

def thread_worker(q):
    while True:
        sock = q.get()
        handle_request(sock, time.sleep)

If I now run this with a thread pool of 200 threads, I can indeed finish the benchmark (ApacheBench as ab -r -n 2000 -c 200... with around 1300 requests per second (a little less, probably due to the synchronization overhead of the Queue). So updating the benchmark to use a thread pool did not improve the performance. The equivalent gevent code uses gevent.pool.Pool:

def greenlet(port, N=10):
    from gevent.pool import Pool
    from gevent import socket, sleep
    pool = Pool(N)
    s = socket.socket()
    s.bind(('0.0.0.0', port))
    s.listen(500)
    while True:
        cli, addr = s.accept()
        pool.spawn(handle_request, cli, sleep)

Running ab with the same parameters I now get... around 1200-1400 requests per second.

So why use gevent, again?

So yes, if I had designed the benchmark to omit the thread/greenlet creation entirely, threads and greenlets do indeed perform about the same. The big win for greenlets is when your thread pool isn't big enough to handle the concurrent connections.

It turns out that there's a clever denial-of-service attack on web servers known as slowloris that consumes threads from your thread pool quickly. Once your server's threads are all busy handling the slowloris requests, no further work can be done, and you end up with a very lightly loaded but still unresponsive server.

To illustrate this, we can try running our benchmark with the thread pool, but only running 20 threads in the pool, but modifying our request handler to take five seconds to handle a request. We'll go ahead and modify the benchmark line to allow more time for responses as well:

$ ab -n 2000 -c 200 -r -t 60 http://127.0.0.1:...

Now our threaded example ends up timing out connections as it tries to service 200 concurrent connections, each taking five seconds, with only 20 worker threads. If we go back to our naive (un-pooled) gevent example, however, we're able to achieve 47 requests per second, which is close to the theoretical maximum of 50 requests per second, with a very light server load.

The point? A slowloris attack will be able to eat up all the threads in your (finite-sized) thread pool, regardless of how big that pool is. Spawning a greenlet each time you receive a connection means you don't waste (almost) any resources waiting on IO.

Conclusion

There's a good bit more to gevent that I'd like to cover in future posts, but for now the points I'd like to leave you with are the following:

  • You shouldn't be spawning something expensive like a thread for each incoming connection. It eats up various types of server resources.
  • You shouldn't rely on thread pools to protect you from resource exhaustion, because they can fall victim to the slowloris attack.
  • Gevent greenlets are lightweight enough that you can spawn one for each connection, and you don't have to rely on a pool (which can become exhausted in a slowloris type attack).

So what do you think? Have I convinced you? I'd love to hear your reaction in the comments below!

8 comments:

  1. I'm glad to see you post a correction to your earlier thread code because frankly it made little sense.

    Just as some background, when I was doing some of my work investigating the GIL, I did a test involving a 512-thread pool and an EC2 instance that I asked people to obliterate. In that test, I was serving HTTP requests at a rate of about 1200-1500 requests a second, consistent with some of your results here.

    As for a slowloris attack, what would keep me from launching such an attack against greenlets and simply making your machine run out of ports or file descriptors? IMHO, this is a pretty weak justification for using threads over greenlets.

    I think a better case for using greenlets over threads might be applications where you need to have a very large number of persistent TCP connections (e.g., 30000 simultaneous clients). For instance, pushing updates, streaming data, etc. For something like that, you would definitely see a memory win over threads.

    ReplyDelete
    Replies
    1. Thanks for the comment!

      As for the slowloris attack, I suppose it's always possible to make the machine run out of ports or file descriptors, but that's going to take a lot more effort (and be more easily detectable with 32k open connections or so) than spinning up a few dozen simultaneous connections. So while greenlets don't make you completely exempt from slowloris, they make it a good bit more difficult.

      The many-simultaneous-connections case is a great point, though, and gives greenlets a big win over threads.

      One thing I'd be interested in, given your background with dissecting the GIL, is how a mix of compute-bound and IO-bound greenlets perform relative to threads. (Of course, you'd have to sprinkle in a bunch of gevent.sleep(0) calls in the compute-bound code). Do you have any experience comparing greenlet performance versus threads when the GIL is an issue?

      Thanks again for the feedback, David!

      Delete
  2. I haven't done any study of compute-bound greenlets, but since there is no preemption, a compute-bound greenlet would obviously block progress in all of the other ones until the computation was finished or it hit some blocking operation such as a sleep as you describe. You might see a small improvement from reduced thread context-switching although that's not something that I've measured.

    Getting back to the many simultaneous connection issue though, I've long been a skeptic about the greenlet approach over threads--I just couldn't wrap my brain around scenarios where it would make much of a difference. However, I recently listened to someone talk about some problem they were having with implementing some interactive web application. Essentially, they didn't want to write an app where clients constantly polled the server for updates. Instead, they wanted the server to simply push new data out to always connected clients when it became available. However, to do this, the server needed to somehow handle tens of thousands of simultaneous TCP connections.

    That particular scenario is definitely one that seems to tricky to handle with threads (e.g., writing a threaded server to handle that many clients). Sure, you could probably load the server up with a huge amount of memory and with some careful tuning, create 40000 threads. However, it certainly doesn't seem like the most efficient way to do it. Greenlets definitely seem like a more sane way of handling something extreme like that.

    ReplyDelete
    Replies
    1. What I would *hope* to see in the greenlet compute-vs-io test would be that the io threads would suffer less starvation than they do under the same scenario with threads (on a multicore system, and assuming that the compute greenlets were sprinkled with gevent.sleep(0), of course)

      The many simultaneous connection scenario is a great example of using greenlets, and it's actually the approach that gevent-socketio (and the underlying gevent-websockets) library uses - your web request comes in "normally" but you just send & receive data on the socket and don't return a proper "response". A big web chat server, or maybe an IRC or jabber server, would seem to be ideal candidates for greenlets.

      Delete
    2. This is exactly what multicast is for.

      Delete
    3. @JB: Thanks for the comment, but isn't multicast restricted to when you want to send the *same* information to all the clients? Also, multicast doesn't play well with web protocols as far as I know, so building a multicast app that lives in a browser is a non-starter unless you have some sort of local proxy.

      Delete
    4. you could utilize filters on the client to only respond to certain multicasts. but anyways, multicast is for a lan environment only, as far as i know

      Delete
    5. @scape: thanks for the reply. And while you certainly could filter out what you weren't interested in, multicast seems to have a relatively limited niche (as you mention, for instance, LAN-only) compared to TCP sockets.

      Delete