FreshQL

A Blazingly Fast Webserver

In this post, dear reader, I will explain how I expanded our in-memory storage to support a fast HTTP 1.1 web server that can serve this page. The server delivers over 1.7 million requests per second on a single core; for reference, that is ten times the throughput of nginx. It is single-threaded and uses an in-memory store. In our previous blog post, we explained how the in-memory store works and showed that the datastore is orders of magnitude faster than Redis.

1 CPU Core + 1 performant language = 1.7 Million requests/second.

As a reminder, we implemented the datastore using a hashmap and added a Redis-compatible API. All of this was done using listen, accept, recv, and send. There are no libraries, no madness, just vanilla POSIX interfaces, a while loop, and nothing more.

We will implement the HTTP server using similar principles.

Let’s get started

Let’s start by reminding ourselves how HTTP works. The client opens a TCP connection to the server, transmits the request, and waits for a response.

A request for this page would look like this:

GET /blog_fast_webserver.html HTTP/1.1
Host: freshql.ai
Accept: */*

Once the request has been received, we have to generate a response of the following form:

HTTP/1.1 200 OK
Server: Fresh 1.0
Connection: close
Content-Length: 1104
Content-Type: text/html

<html>
<body>
    <h1>Fast Webserver</h1>
...

The empty line at the end signifies that the request is complete; all we need to do is parse it. This is trivial to build since we already have a parser combinator from parsing Redis requests.

The most interesting aspects of this format are the initial status line and calculating the correct Content-Length, the length in bytes of the body being sent back.
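
As a small illustration, here is roughly how that header could be produced once the body has been rendered. This is a hedged sketch in C, not the server's actual code; the function name and the idea of a pre-rendered body buffer are assumptions for the example, and the header set simply mirrors the example response above.

#include <stddef.h>
#include <stdio.h>

/* Sketch: write the status line and headers for a response whose body has
 * already been rendered elsewhere and is body_len bytes long.
 * Returns the number of header bytes written. */
static int write_response_header(char *out, size_t out_cap, size_t body_len)
{
    return snprintf(out, out_cap,
                    "HTTP/1.1 200 OK\r\n"
                    "Server: Fresh 1.0\r\n"
                    "Connection: close\r\n"
                    "Content-Length: %zu\r\n"
                    "Content-Type: text/html\r\n"
                    "\r\n",
                    body_len);
}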

The overarching architecture of a straightforward web server then becomes the following (a minimal sketch follows the list).

  1. Listen for new connections on a port (we use port 80)
  2. Accept the connections when they come in
  3. Receive data on the newly established connection
  4. Send the result of evaluating the request.
  5. Close the connection and go to 2.
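
As a concrete illustration of this loop, here is a minimal sketch in C using nothing but the POSIX calls mentioned earlier. The port number (8080, to avoid needing root), the canned response, and the thin error handling are assumptions made for the example; the real server evaluates the request against the datastore.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* 1. Listen for new connections on a port. */
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(listen_fd, (struct sockaddr *)&addr, sizeof addr);
    listen(listen_fd, 128);

    /* Canned response for the sketch; the body "ok" is 2 bytes long. */
    const char resp[] =
        "HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nok";

    for (;;) {
        /* 2. Accept the connection when it comes in. */
        int conn = accept(listen_fd, NULL, NULL);
        if (conn < 0)
            continue;

        /* 3. Receive data on the newly established connection. */
        char req[4096];
        recv(conn, req, sizeof req, 0);

        /* 4. Send the result of evaluating the request. */
        send(conn, resp, sizeof resp - 1, 0);

        /* 5. Close the connection and go back to accepting. */
        close(conn);
    }
}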

This works but doesn’t really scale. As you can see, the server has to fully process one request before handling the next one. On top of this, it is setting up and tearing down a connection for every request, which is quite time-consuming. There is a solution for this: HTTP keep-alive.

Keep-Alive

Keep-Alive is a surprisingly simple feature where the client, on its initial request, informs the server that it would like to keep the connection open after the request and reuse it for subsequent requests. This removes the overhead of re-establishing the connection on subsequent requests, which strains both the server and the client.

Keep-Alive was introduced in HTTP 1.0 and is initiated by the client adding the header Connection: Keep-Alive to the request, which changes it to look like this:

GET /blog_fast_webserver.html HTTP/1.1
Host: freshql.ai
Connection: Keep-Alive
Accept: */*

If the server supports and accepts the keep-alive request, it responds by adding keep-alive information to the response:

...
Connection: Keep-Alive
Keep-Alive: timeout=5, max=1000
...

The information tells the client how many requests the server will serve over the connection and how long it will keep an idle connection open.

As you may have noticed, our web server speaks HTTP 1.1, where keep-alive is the standard behaviour. This means that the same logic kicks in even if the client does not explicitly request it.
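
As a sketch of that decision, here is the rule in C. The helper name and the string-based header argument are my own inventions for illustration (the real parser works on offsets into its input buffer); the rule itself is just the HTTP 1.1 default: keep the connection open unless the client explicitly asks to close it.

#include <stdbool.h>
#include <strings.h>

/* Hypothetical helper: decide whether to keep an HTTP/1.1 connection open.
 * connection_header is the value of the Connection header, or NULL if the
 * client did not send one. */
static bool should_keep_alive(const char *connection_header)
{
    if (connection_header == NULL)
        return true;                                 /* HTTP/1.1 default */
    return strcasecmp(connection_header, "close") != 0;
}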

Supporting keep-alive does, however, mean that the connection now remains open, and we need to update the listening loop as follows (see the sketch after the list):

  1. Listen for new connections on a port (we use port 80)
  2. Accept the connections when they come in
  3. Receive data on the newly established connection
  4. Send the result of evaluating the request.
  5. Go back to 3, keeping the connection open (closing only on timeout or when the client disconnects).
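
Compared with the earlier sketch, only the inner part of the loop changes: we keep receiving and responding on the same connection and only close it when the client goes away. A hedged sketch, assuming listen_fd has been set up as before and the response is still canned:

#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch of the keep-alive variant: listen_fd is an already-listening socket,
 * as set up in the previous sketch. */
static void serve_keep_alive(int listen_fd)
{
    static const char resp[] =
        "HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok";

    for (;;) {
        int conn = accept(listen_fd, NULL, NULL);        /* step 2 */
        if (conn < 0)
            continue;

        char req[4096];
        for (;;) {
            ssize_t n = recv(conn, req, sizeof req, 0);  /* step 3 */
            if (n <= 0)
                break;                                   /* client closed, or error */
            /* ...parse the request and build the real response here... */
            send(conn, resp, sizeof resp - 1, 0);        /* step 4 */
        }
        close(conn);                                     /* only close when the client is done */
    }
}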

We now have a server that can handle a long sequence of requests, but only from one client at a time, and as you might suspect, this is not enough for us. We want to support many clients making many requests.

Polling lots of connections

How do we support many open connections at the same time? Could we keep an array of file descriptors and try reading from them in a loop? Yes! Would it be efficient? No! Luckily, POSIX has a function that takes a list of file descriptors and returns a data structure describing which connections are ready and which operations they are ready for; the function is called poll. This means we can build the list of file descriptors, ask the operating system which of them has work for us to do, and then do the work.
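
A minimal sketch of that pattern in C: fds[0] is assumed to be the listening socket, the remaining entries are established connections, and want_write (a made-up parameter for the example) marks connections that have data queued for sending.

#include <poll.h>

/* Sketch: ask the OS which sockets have work for us. fds[0] is the listening
 * socket; fds[1..nfds-1] are established client connections. want_write[i] is
 * non-zero when connection i has data queued in its output buffer. */
void poll_once(struct pollfd *fds, nfds_t nfds, const int *want_write)
{
    fds[0].events = POLLIN;                  /* new connections to accept */
    for (nfds_t i = 1; i < nfds; i++) {
        fds[i].events = POLLIN;              /* always interested in reads */
        if (want_write[i])
            fds[i].events |= POLLOUT;        /* ask for writability only while
                                                there is data to send */
    }

    if (poll(fds, nfds, -1) <= 0)            /* block until something is ready */
        return;

    for (nfds_t i = 0; i < nfds; i++) {
        if (fds[i].revents & POLLIN)  { /* i == 0: accept; i > 0: recv into input buffer */ }
        if (fds[i].revents & POLLOUT) { /* send from this connection's output buffer */ }
    }
}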

In addition to using poll, we need to add input and output buffers per connection so we can move on if we receive a partial request or can only write a partial response. If polling in this way ever becomes a bottleneck, we can switch to event-based interfaces like kqueue or epoll, but since these are operating-system specific, and profiling shows reading and writing to be the main bottlenecks, we will stick with polling for now.

We update the clients to have an input buffer and an output buffer and, once again, change the client flow.

  1. If there is pending data, we receive it on the established connection and append it to the client's input buffer. If the buffer now holds a complete request (that is, it ends with an empty line), we process the request and store the result in the output buffer.
  2. If there is data in the output buffer and the operating system reports that we can write to the socket, we send it.

We do this for every connection that poll reports as ready and keep trucking along. This is not only the core of our server; this is the server.
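
Putting the two steps together, the per-connection handling might look roughly like the sketch below. The client struct, the buffer sizes, and the process_request helper are assumptions for illustration; the real server evaluates the request against the datastore and its templates.

#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

#define BUF_SIZE 65536

/* Hypothetical per-connection state: an input buffer for the request being
 * assembled and an output buffer for the response being drained. */
struct client {
    int    fd;
    char   in[BUF_SIZE];
    size_t in_len;
    char   out[BUF_SIZE];
    size_t out_len, out_sent;
};

/* Hypothetical: evaluate the request in c->in and render the response into c->out. */
void process_request(struct client *c);

/* Step 1: the socket is readable - append to the input buffer and, if it now
 * holds a complete request (terminated by an empty line), process it. */
static void on_readable(struct client *c)
{
    ssize_t n = recv(c->fd, c->in + c->in_len, sizeof c->in - c->in_len - 1, 0);
    if (n <= 0)
        return;                               /* closed or error: real code drops the client */
    c->in_len += (size_t)n;
    c->in[c->in_len] = '\0';                  /* we reserved one byte for this */

    if (strstr(c->in, "\r\n\r\n") != NULL) {  /* empty line: request is complete */
        process_request(c);                   /* fills c->out and c->out_len */
        c->in_len = 0;                        /* reset the scratch pad */
    }
}

/* Step 2: the socket is writable - drain whatever is left of the output buffer. */
static void on_writable(struct client *c)
{
    ssize_t n = send(c->fd, c->out + c->out_sent, c->out_len - c->out_sent, 0);
    if (n > 0)
        c->out_sent += (size_t)n;
    if (c->out_sent == c->out_len)
        c->out_len = c->out_sent = 0;         /* response fully sent */
}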

Let’s back up; why would this be fast?

Why do we expect this to be fast? For two reasons. Firstly, we are minimizing the number of system calls (send and recv), which means we spend less time crossing between kernel and user space. On top of this, the input buffer effectively becomes a scratch pad - a place to store all the data we need to evaluate the request - since we know that once the input is considered a complete request, it will not change while we process it. Once the request has been handled, we can reset the buffer and everything related to the request. This is a big deal: it means that any value we need from the request can be represented as an offset into the input buffer rather than a copied substring, so our memory stays colocated, and even better, we know it is warm in the cache since it was just touched.
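
As a sketch of what "offsets instead of copies" means in practice, here is how the request line could be turned into views straight into the input buffer. The slice type and the function are my own, for illustration only.

#include <stddef.h>
#include <string.h>

/* Hypothetical view type: a value is just a window into the input buffer. */
struct slice { const char *ptr; size_t len; };

/* Extract method and path from "GET /index.html HTTP/1.1\r\n..." without
 * copying: both slices point straight into the connection's input buffer. */
static int parse_request_line(const char *buf, size_t len,
                              struct slice *method, struct slice *path)
{
    const char *end = memchr(buf, '\r', len);
    if (end == NULL)
        return -1;                               /* no complete request line yet */

    const char *sp1 = memchr(buf, ' ', (size_t)(end - buf));
    if (sp1 == NULL)
        return -1;
    const char *sp2 = memchr(sp1 + 1, ' ', (size_t)(end - sp1 - 1));
    if (sp2 == NULL)
        return -1;

    method->ptr = buf;      method->len = (size_t)(sp1 - buf);
    path->ptr   = sp1 + 1;  path->len   = (size_t)(sp2 - sp1 - 1);
    return 0;
}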

But enough talking, let’s do some benchmarking.

First, reader, let me point out the obvious: Any test of an HTTP server is, at best, synthetic. Synthetic in this context means problematic. Keeping this in mind, we will continue.

The tests are done using wrk, a modern HTTP benchmarking tool; they are run to mimic the tests used in the TechEmpower benchmarks.

The command we use for testing is the following:

wrk -t2 -c100 -d60s --latency -s pipeline.lua   http://127.0.0.1:8080/

This command sends requests nonstop for 60 seconds using two client threads, 100 connections, the keep-alive feature discussed above, and HTTP pipelining.

First, we test nginx as a reference. nginx is tested using the following configuration:

worker_processes  1;

events {
    worker_connections  1024;
    use kqueue;
    multi_accept on;
}

http {
    include       mime.types;
    default_type  application/octet-stream;
    
    sendfile       on;
    tcp_nopush     on;
    tcp_nodelay    on;
    
    keepalive_timeout  65;

    open_file_cache max=200000 inactive=20s;
    open_file_cache_valid 30s;
    open_file_cache_min_uses 2;
    open_file_cache_errors on;

    server {
        listen       8080;
        server_name  localhost;

        location / {
            root   html;
            index  index.html index.htm;
        }

        error_page   500 502 503 504  /50x.html;
        location = /50x.html {
            root   html;
        }
    }
    include servers/*;
}

The configuration specifies one worker process and serves up a static HTML page.

We will time nginx both with worker_processes set to 1 and with it set to auto, which makes the entire machine available.

To make the comparison at least somewhat apples-to-apples, we time the FreshQL server in two modes. We start by timing FreshQL serving the same static page as nginx, and then time it serving the dynamic front page of this site. We do this to stress exactly how much faster the FreshQL server is.

Results

Nginx

Firstly, let’s use nginx as the baseline. In this benchmark, nginx is serving the nginx welcome page. [updated based on optimized configuration]

                       Nginx (single)   Nginx (multi)
Requests/second        168,318.44       285,964.01
Transfer rate (MB/s)   136.92           232.63

We see that nginx does about 168k requests per second on a single core and 285k when allowed to use all the machine’s cores. I am surprised that letting nginx use all the cores resulted in less than a doubling of the throughput.

FreshQL

When testing FreshQL, we first test it serving the same static page that nginx served and then serving the freshql.ai front page.

                       FreshQL (simple)   FreshQL (complex)
Requests/second        1,748,610.45       200,979.72
Transfer rate (MB/s)   1,340              3,140

I want to stress once again that FreshQL is single-threaded by design and, according to profiling, is mainly constrained by the OS when serving the simple website.

In short, even when nginx is optimized, a large amount of performance is left on the table. From profiling, we can also see that FreshQL’s main bottleneck is executing send/recv. As a fun note, the top three functions with respect to CPU cycles are send, recv, and ‘sscan’, with ‘sscan’ being used to detect the requested resource in the initial HTTP request line.

In conclusion:

When presented with similar resources, this single-threaded beast delivers more than 10x the throughput of an optimized nginx. If we look at other benchmarks, we see that most high-performance web servers struggle to squeeze 7 million requests/second out of a 28-thread high-end CPU in simple benchmarks; FreshQL gets a quarter of that from a single CPU core. I believe most software is built on the blind faith that “good software is about tradeoffs”, with no real idea of what those tradeoffs actually are.

In this write-up, I have, as you may have noticed, very much glossed over the rendering of the websites. I have referenced how the server has a template system and how this template system generates the pages. I will dive deeper into this templating system at a later time; for now, all I will say is that rendering the templates to the output buffer is currently where about half the CPU time is being spent, and that optimizing the rendering engine should most likely yield a significant speed-up.