Another WSGI web server benchmark was published. It’s a decent benchmark, despite some criticisms. But it benchmarks what everyone benchmarks: serving up a trivial app really really quickly. This is not very useful to me. Also, performance is not, to me, the most important way to differentiate servers.
In Silver Lining we’re using mod_wsgi. Silver Lining isn’t tied to mod_wsgi (applications can’t really tell), and we may revisit that decision (mostly because of memory concerns), but it is a deliberate choice. mod_wsgi is one of the few multiprocess WSGI servers, and it manages its children (the same way Apache manages all its children). So if a child stops responding, it gets taken out of the pool and killed (brutal efficiency! Or at least brutal terminology). Child processes are also recycled, guarding against memory leaks or other peculiarities. Sometimes these kinds of things are dismissed as just covering up bugs, but (a) production is a lousy time to learn about bugs, (b) it’s like a third tier of garbage collection, and (c) the bugs you are avoiding are often bugs you can’t fix anyway (for instance, if your MySQL driver leaks memory, is that the application developer’s fault?).
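To make the process-management idea concrete, here is a toy sketch (my own, not mod_wsgi's actual implementation) of a parent that recycles a worker after a fixed number of requests and kills it if it stops sending heartbeats; the constants and the heartbeat mechanism are invented for illustration.

```python
# Toy sketch of worker recycling and reaping, in the spirit described above.
import multiprocessing
import queue
import time

MAX_REQUESTS = 1000      # recycle the worker after this many requests
HEARTBEAT_TIMEOUT = 30   # seconds of silence before the worker is killed

def worker(heartbeats):
    for handled in range(MAX_REQUESTS):
        # ... accept and handle one request here (elided) ...
        time.sleep(0.01)
        heartbeats.put(handled)   # tell the parent we are still alive
    # Falling off the end exits the process; the parent starts a fresh one.

def supervise():
    while True:
        heartbeats = multiprocessing.Queue()
        proc = multiprocessing.Process(target=worker, args=(heartbeats,))
        proc.start()
        while proc.is_alive():
            try:
                heartbeats.get(timeout=HEARTBEAT_TIMEOUT)
            except queue.Empty:
                proc.terminate()  # worker wedged: take it out of the pool
                break
        proc.join()
        # Loop around and spawn a replacement (recycled) worker.

if __name__ == '__main__':
    supervise()
```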
I wish there was competition among servers not to see who can tweak their performance for entirely unrealistic situations, but to see who can implement the most fail-safe server. We’re missing good benchmarks. Unfortunately benchmarks are a pain in the butt to write and manage.
But I hope someone writes a benchmark like that. Here are some things I’d like to see benchmarked (a rough sketch of a few of these as WSGI apps follows the list):
- A "realistic" CPU-bound application. for i in xrange(10000000): pass is a reasonable start.
- An application that generates big responses, e.g., "x"*100000.
- An I/O bound application. E.g., one that reads a big file.
- A simply slow application (time.sleep(1)).
- Applications that wedge. while 1: pass perhaps? Or lock = threading.Lock(); lock.acquire(); lock.acquire(). Wedging in C and wedging in Python are different, so a bunch of different kinds of wedging.
- Applications that segfault. ctypes is specially designed for this.
- Applications that leak memory like a sieve, e.g., global_var.extend(['x']*10000).
- Large uploads.
- Slow uploads, like a client that takes 30 seconds to upload 1Mb.
- Also slow downloads.
- In each case it is interesting what happens when something bad happens to just a portion of requests. E.g., if 1% of requests wedge hard. A good container will serve the other 99% of requests properly. A bad container will have its worker pool exhausted and completely stop.
- Mixing and matching these could be interesting. For instance Dave Beazley found some bad GIL results mixing I/O and CPU-bound code.
- Add ideas in the comments and I’ll copy them into this list.
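To make a few of these concrete, here is the rough sketch mentioned above (mine, not from the original post; Python 3 syntax, with arbitrary iteration counts, sizes, and sleep times):

```python
# Sketches of a few of the benchmark scenarios above as WSGI apps.
import threading
import time

def cpu_bound_app(environ, start_response):
    # "Realistic" CPU-bound work: burn cycles in the interpreter.
    total = 0
    for i in range(10000000):
        total += i
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [str(total).encode()]

def big_response_app(environ, start_response):
    # A big response: roughly 100 KB in a single chunk.
    body = b'x' * 100000
    start_response('200 OK', [('Content-Type', 'text/plain'),
                              ('Content-Length', str(len(body)))])
    return [body]

def slow_app(environ, start_response):
    # Simply slow: hold a worker for one second per request.
    time.sleep(1)
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'slow\n']

def leaky_app(environ, start_response, _leak=[]):
    # Leak memory like a sieve: the default-argument list only ever grows.
    _leak.extend(['x'] * 10000)
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'leaked\n']

def deadlock_app(environ, start_response):
    # Wedge in pure Python: acquire the same non-reentrant lock twice.
    lock = threading.Lock()
    lock.acquire()
    lock.acquire()   # blocks forever; this worker never responds
```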
The hardest part of writing this is not the applications (they are simple). One annoyance is wiring up the applications, but handily Nicholas covers that well in his benchmark. You also have to make sure to clean up, as many servers will not exit cleanly from some of the tests. Another nuisance is that some of these require funny clients. These aren’t too hard to write (a sketch of one is below), but you can’t just use ab. Then you have to report the results.
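As an example of such a client, here is a minimal sketch (mine; the host, port, and path are placeholders) that dribbles a 1 MB POST body out over roughly 30 seconds, which ab cannot do:

```python
# A "funny client": uploads ~1 MB over ~30 seconds to simulate a slow link.
import socket
import time

BODY_SIZE = 1024 * 1024
CHUNK = BODY_SIZE // 30        # send about 1/30th of the body per second

def slow_upload(host='127.0.0.1', port=8000, path='/upload'):
    body = b'x' * BODY_SIZE
    sock = socket.create_connection((host, port))
    headers = ('POST {} HTTP/1.1\r\n'
               'Host: {}\r\n'
               'Content-Type: application/octet-stream\r\n'
               'Content-Length: {}\r\n'
               'Connection: close\r\n\r\n').format(path, host, BODY_SIZE)
    sock.sendall(headers.encode())
    for offset in range(0, BODY_SIZE, CHUNK):
        sock.sendall(body[offset:offset + CHUNK])
        time.sleep(1)          # stall between chunks
    print(sock.recv(4096).decode(errors='replace'))
    sock.close()

if __name__ == '__main__':
    slow_upload()
```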
Anyway: I would love it if someone did this, and packaged it as repeatable/runnable code/scripts. I’ll help some, but I can’t lead. I’d really like to see the results, and in my ideal world people writing servers would start using these benchmarks to make their servers more robust.
A couple of other things to add that I can think of right now:
For slow clients, also evaluate the benefits of using nginx as a front end, especially where the web application is hosted via Apache and keep-alive can cause Apache to suffer.
Performance of the wsgi.file_wrapper extension. Yes, I know it is flawed and that some common paradigms defeat it, causing any performance optimisation it supports not to be used, but it would be interesting all the same.
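For reference, a minimal sketch (mine; the file path is a placeholder) of an application that uses the optional wsgi.file_wrapper extension when the server provides it, and falls back to plain chunked iteration otherwise:

```python
# Using the optional wsgi.file_wrapper extension (PEP 333), with a fallback.
import os

BIG_FILE = '/tmp/big.bin'   # placeholder: some large file to serve

def file_app(environ, start_response):
    size = os.path.getsize(BIG_FILE)
    f = open(BIG_FILE, 'rb')
    start_response('200 OK', [('Content-Type', 'application/octet-stream'),
                              ('Content-Length', str(size))])
    wrapper = environ.get('wsgi.file_wrapper')
    if wrapper is not None:
        # Lets the server use an optimised path (e.g. sendfile) if it has one.
        return wrapper(f, 8192)
    # Fallback: read the file in 8 KB chunks.
    return iter(lambda: f.read(8192), b'')
```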
Using nginx is a good solution to a large part of the mixed fast/slow client problem, but when you have multiple requests from nginx to your app server you still need to know how it handles those scenarios; they just happen much more seldom.
(I’ll go so far as to say that it should be assumed that app servers are run behind nginx now.)
Why would concurrent multiple requests be less inclined to happen if nginx is used as a front end? The nginx server, as I understand it, only uses HTTP/1.0 for proxy requests, so it can’t even handle pipelining of requests, nor does it use keep-alives for proxy connections. Thus all requests between nginx and the backend when proxying require a separate socket connection. It would still be quite easy to have concurrent requests between nginx and the backend because, although it may help isolate the backend from slow clients, it isn’t going to help with the fact that processing a request on the backend is still going to take time, during which other requests can/will arrive.
For some reason this is true. I’ve been testing an app by running ab directly against it and against the same app proxied by nginx. The app manages to serve about 50 requests per second and is served by a server spawning a thread per connection (no thread pool). I was monitoring how many threads it uses by running htop in a different console. When running ‘ab -n 1000 -c 10’ directly against it, it consumes a number of threads all the time. In the same test against nginx, the server spawns about five threads initially and after a second settles on just one thread. The resulting performance is very close. So indeed, when proxied by nginx the app server sees much lower concurrency.

This seems plausible to me: Nginx can accept the connection and just hold onto it and proxy it at its discretion. Realistically it seldom makes sense to try to handle more than a small number of requests in parallel, so if Nginx is smart (which it seems to be) then it would queue those incoming requests and forward them serially.
If you are using a hello world app which responds quickly then I will accept it may be the case, but if request handlers are taking on the order of 100ms or more as opposed to 10ms, you would logically have to start seeing overlapping of requests to get any sort of decent throughput when talking to a single process. This highlights again the need not to rely on hello world code for testing, and to look at a greater range of response times for requests within an application to understand the way things interact.
It’s an application that queries a database, converts the data from wiki markup to HTML, parses it with lxml, does a couple of passes on it (merging with a template, automatic typography, syntax highlighting, link fixup, etc.) and then serializes it. It’s not a hello-world app.
Oh, one more thing. Any analysis should also endeavour to actually show how the results are in any way meaningful. Sites aren’t run such that they are at their maximum capacity all the time. There is absolutely no point trying to chase down what may be the best performing solution if your site is never ever going to do anything but idle along most of the time anyway. Yes the results may help in showing how much theoretical headroom you have for doing vertical scaling for a specific solution, but in reality, serious sites are going to be looking at horizontal scaling as well which brings its own class of problems to be solved. All of this also does nothing in relation to the fact that the real performance bottlenecks are going to be in the application and database. Generally the only real concern that is going to come from the WSGI server itself is what additional memory overhead it imposes due to how it is implemented and configured/misconfigured.
I’d simply ask that it be repeatable. If there’s a setup and it’s not too hard to run, then if you don’t like it you can take it (fork it) and try to adjust it for your needs. Probably the least satisfying aspect of benchmarking is how political it gets. Asking that the results be “meaningful” is inviting too much politics.
Stalled downloads. I’ve encountered this (e.g. a PDF file downloads 2 megs… waits 30 seconds, then gives me the last meg). This really screws up some browsers and I’m guessing it’s not so good for the server (something bad is happening to cause this, I assume).
If you’re getting stalled downloads a lot with PDF, particularly dynamically generated ones, you may be running up against bugs involving Acrobat Reader and the HTTP header Accept-Ranges. There is a whole suite of unique issues with Acrobat Reader and streaming PDFs. We’ve even seen very old versions (with browser embed plugins) firing off many dozens of individual requests (with full TCP/SSL handshakes for each connect) for a single dynamic PDF file. (We only noticed it due to massive load spikes from our service having to suddenly generate the same heavy-weight PDF dozens of times per second.)
Hi Ian,
I agree with your points: that benchmarking is hard, that performance is not the most important factor, and also that spawning, upgrading and reaping of workers is probably one of the most interesting aspects of the different servers.
Concerning the benchmark, with an increasing number of servers things get exponentially complex. In my benchmark I have been looking at 14 servers; if I excluded some of them I would surely have gotten comments that it is not complete and that I really ought to check out this or that server as well.
At first I tried to present a more complex web application, more in line with my current project and plans. However, this would quickly rule out all threaded web servers. The following might indicate how things can easily get difficult. I have been thinking about time.sleep(n), but would it be fair to compare an async server such as Gevent or Eventlet with a threaded web server? These async servers have implemented such a sleeping function in greenlet style, while for example mod_wsgi blocks a complete thread. This would effectively limit the RPS rate on a threaded server to threadpool size / n, or you would corrupt the async server by using a blocking call.
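To illustrate the difference being described, here is a minimal sketch (mine, not from the benchmark) of a slow app served by gevent, where monkey patching turns time.sleep into a cooperative yield, whereas under a threaded server the same call pins a worker thread for the full second:

```python
# Under gevent's monkey patching, time.sleep yields to other greenlets;
# under a plain threaded server the same call blocks a worker thread.
from gevent import monkey
monkey.patch_all()   # replaces time.sleep with a cooperative equivalent

import time
from gevent.pywsgi import WSGIServer

def slow_app(environ, start_response):
    time.sleep(1)    # cooperative here; blocking in a threaded server
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'done\n']

if __name__ == '__main__':
    # Many of these requests can sleep concurrently in one process; with a
    # threaded server, concurrency is capped at the thread pool size.
    WSGIServer(('127.0.0.1', 8000), slow_app).serve_forever()
```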
I can imagine that when looking at different settings, such as for example the size of the response, there might be lots of buffer tuning options in uWSGI to specifically tune for that setting (not sure if that’s the case, but uWSGI sure has lots of things to tune). The problem you quickly run into is that you will be tuning your server around a specific problem; how far will you allow this to go and still call it a comparative benchmark?
Another thing to keep in mind is that when publishing your benchmark results it might give some people the idea that you are judging their lifework, which really isn’t the case. The benchmark, if anything, is an objective comparison on a very limited domain which might benefit the performance of some servers. I tried to keep the application simple in the hope of keeping it as unbiased as possible, but even that can already stir up some commotion.
And yes, the hardest part is not setting up the applications but the specifics of how to tune the application server for the specific application on certain hardware with a certain proxy server. It is difficult to control all these variables; this might best be handled by something like the Computer Language Shootout, where some of these variables can be pinned down.
Using time.sleep(n) would block the thread for any WSGI server, not just mod_wsgi. As such, it would also block uWSGI process threads, CherryPy process threads, etc. That is how WSGI is, but it wouldn’t matter, as the whole point with such WSGI servers is that there are other threads and/or processes to handle concurrent requests.
This is why in part you can only really compare synchronous WSGI servers against other synchronous WSGI servers. Similarly you should only compare asynchronous servers against other asynchronous based servers. To try and compare them against each other just doesn’t make a great deal of sense because the programming model is different and one application isn’t necessarily appropriate for the other model or going to be directly portable between them such that a fair comparison can be drawn.
This is why the original Tornado results were arguably flawed, although that didn’t stop a lot of people flocking to it as the latest and greatest thing. The Tornado folks made it even more misleading by comparing a heavyweight Django application against what was, for its asynchronous server, not much more than a hello world. If one compares Tornado to a hello world application running on a WSGI server like you did, there isn’t much difference, and not the huge difference the Tornado folks claimed.
One has to realise that the synchronous and asynchronous models are suited to different tasks. To turn Tornado into a server suitable for properly supporting WSGI applications with concurrent request processing, you need to stick a thread pool on top of it and bridge between the asynchronous and synchronous worlds. This is the only real way to avoid the problem of time.sleep(n) blocking the whole server with a true asynchronous server. Once you stick such a thread pool on top you lose various of the features that make asynchronous systems useful for certain tasks, such as being able to handle a high number of concurrent requests with low resource usage. If you are going to lose features like that, what is the point? You may as well stick with a normal synchronous server.
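As a rough illustration of the bridging idea (using asyncio and concurrent.futures rather than Tornado's own machinery, purely as a sketch), blocking handlers get pushed onto a thread pool so the event loop stays responsive, at the cost of capping slow-request concurrency at the pool size:

```python
# Bridging an async event loop to blocking (synchronous) code via a thread
# pool. Illustrative only; Tornado's actual mechanism differs in detail.
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

POOL = ThreadPoolExecutor(max_workers=10)  # caps blocking concurrency

def blocking_handler():
    time.sleep(1)              # a call that would stall a pure async loop
    return b'done\n'

async def handle_request():
    loop = asyncio.get_running_loop()
    # Run the blocking code in the pool so the loop can keep serving others.
    return await loop.run_in_executor(POOL, blocking_handler)

async def main():
    results = await asyncio.gather(*(handle_request() for _ in range(20)))
    print(len(results), 'responses')

if __name__ == '__main__':
    asyncio.run(main())
```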
We get back to what I originally said to Ian: if one is going to develop a range of benchmarks, there is no point just saying, here is a bunch of benchmarks and some pretty graphs, if one can’t also explain how that is meaningful and actually applicable to what people are doing or need for their specific application, so they can make a reasoned judgement as to what is best for them. All these graphs tend to do at the moment is cause the sheep who are easily impressed to move to whatever system the results give the impression of being better than anything else for that specific test.
Well, testing a server should always go with a context and a rationale behind said test. A benchmark isn’t a test per se; it’s just one aspect of a server that is pushed to the limit. So to me what Nicholas has done is fine, since he explained the goal right in the premise of the article:
> That benchmark looked specifically at the raw socket performance of various frameworks.
What Ian expects is much more, of course, but I’d say that if by 2010 no one has come up with a generic load/performance/etc. test of a web server, maybe it’s because it’s not actually doable in any meaningful way. Testing always depends on the context it takes place in.
Probably the only realistic test that can be performed across various servers is to test static serving performance. Once you hit the application server, it becomes way harder. Personally I’m not finding what Ian proposes to test to be actually helpful. This is just a fancier benchmark, not a real test.
For instance, when testing an intranet application, I would be interested in seeing how an operation that I consider sensitive performs under certain conditions. For example, I would like to know how logging in performs on a Monday morning by simulating 500 virtual users. How many of those can log in at once? How long does it take? Do I have failures? How long does it take for the servers to cool down? Most likely, the HTTP server’s performance won’t impact any of those.
Do you know what your web server does when 1% of requests go into an infinite loop? That’s not hard to test, and it’s both reasonable and easy to test (a sketch of such an app is below). But how many people know what the result is? This represents exactly the kind of situation you can’t test in context, because developers don’t develop expecting to have that problem (it is, after all, a bug). But we all know these things happen.
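A minimal sketch (mine) of that scenario: a WSGI app where roughly one request in a hundred never returns.

```python
# Roughly 1% of requests wedge and never respond; the rest behave normally.
import random
import time

def app(environ, start_response):
    if random.random() < 0.01:
        while True:            # wedge: this worker never comes back
            time.sleep(60)
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'ok\n']
```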
Right now I see primarily microbenchmarks (that benchmark one small isolated aspect of a system) and people who claim no general benchmarking is valid. There is a middle ground. It does not seem that difficult to me to just start to benchmark somewhat more interesting things. Server developers aren’t going to tweak their servers for your one particular application; we need some common understanding of performance because server and application are two separate lines of development. Outside of the process model decision (async/greenlets, threaded, multiprocess), I don’t see anything interesting happening in terms of performance, nor do I expect any server to accomplish anything by using the results of this kind of microbenchmark except perhaps outliers that want to get in line with the rest of the pack.
> Do you know what your web server does when 1% of requests go into an infinite loop?
Where do you get that 1% from? What value does that random percentage have? If a request goes into an infinite loop, most likely the request will time out on one end or the other.
> But we all know these things happen.
If it’s a bug, it should be unit tested, not performance tested.
> This represents exactly the kind of situation you can’t test in context (..)
If it can be shown under a certain load only, then there is a context.
> we need some common understanding of performance
True, but this should not be translated into code. Merely ask people who have good experience in those areas what should be taken into account when discussing server performance. Why the compulsory requirement for some code?
> Server developers aren’t going to tweak their servers for your one particular application
I don’t expect them to but they can’t expect their servers to be used without an application context either (apart from static serving). So they must be aware of the context in which their server will be used.
“All these graphs tend to do at the moment is cause the sheep who are easily impressed to move to whatever system the results give the impression of being better than anything else for that specific test.”
When you see people say “wow, awesome, I’ll really need to try CoolNewWSGIServerX!” in a reddit comment against a set of benchmarks, you might remember:
They aren’t going to try it. They would have just tried it instead of leaving a comment saying they’re going to try it.
Even if they do try it, they probably don’t have a production application anyway. If they had a production application, they’d be more concerned about stability and maintainability than raw speed, because except for a very few cases, the server isn’t their bottleneck; their app is always their bottleneck.
As a result, although it’s good to stay involved and try to correct misstatements and benchmarking misconfigurations, I wouldn’t worry about sheep or flocking or whatever. At the end of the day, people are lazy, and they’re likely to choose the system that has the best production feature-set and the best documentation. Marketing also helps, of course, and positive benchmark results can be seen as a form of marketing. But if a server doesn’t have good docs and a good production feature set, it’s not going to see a lot of use, even if it goes at plaid speed.
Even making the example application a little more complicated would make the results more reasonable, I think. For instance, return a large string, or the first 100 Fibonacci numbers, or something like that. If that hurts threaded servers, so be it (I’m not sure it would be that bad though). Certainly time.sleep() has the particular problem that async servers have monkeypatched that specific function call, though arguably it might not be a bad representation of how a database call could work (if you have a genuine async database driver, or a pure-Python driver based on socket).

As for combinations, there certainly are many; adding Nginx to the mix doubles the combinations, for example. And uWSGI can be embedded in a bunch of environments. And there are at least several distinct ways to configure mod_wsgi, and I don’t particularly know which one is best. Because of this I think getting a testing rig that other people can use and extend is more useful than one presentation of benchmark results. Actually putting together the results would be handy every so often; it also gives people a baseline to compare against (so they can test their new server setup against some other setup that seems good, instead of testing it against all other setups).
I wholeheartedly agree with your approach, and as I wrote you in a separate email, made a feeble attempt at sketching the kind of benchmark you’re talking about (see my site for a short writeup of what I coded or the “Labour” repository in my github for the actual code). Cheers.
“I wish there was competition among servers not to see who can tweak their performance for entirely unrealistic situations, but to see who can implement the most fail-safe server. We’re missing good benchmarks.”
What benchmarks would you like to see that are currently being missed?
Well, it seems that someone is talking less about theory and more about practice: those guys at TrustLeap not only disclosed the source code of their benchmark template but they also applied it to the market leaders.
Maybe that’s because their own Web Application server, G-WAN, is smoking everybody else – whether this is in the kernel or in user-mode.
Things can be simple, after all, when one does not fear comparisons.
I wish there was more competition among servers as well. Nicholas’s benchmark is a start, and benchmarks can only progress from here. We need to keep that in mind.