Fast Storage
Not too long ago, I had a thought; admittedly not a great thought, but a thought nonetheless. The thought was, dear reader, “Why are systems so slow?” This thought was influenced mainly by listening to too much Jonathan Blow, Casey Muratori, and Alexey Kutepov on Twitch, YouTube, etc., and thinking that those guys were exaggerating. “Software can’t be that bad”, I thought; they are just getting overly excited by their beliefs and running their mouths. But, as so many times before, I, dear reader, was wrong! In this post, I will explain how I implemented a single-threaded, minimal, Redis-compatible in-memory storage engine with performance that outpaces Redis by up to 20x on writes and 5x on reads.
The simple experiment
My initial idea was to create an in-memory storage system, not for any particular purpose, just to see that I could. I needed an interface to interact with and benchmark, so, having worked with Redis in the past, I implemented parts of the Redis communication protocol. I, dear reader, subscribe to the philosophy that you should only implement what you need and guard against the cases you have not yet implemented – this allows you to get to a working solution fast, and you can then add implementation details as you start bumping into the guardrails.
The Redis wire-protocol
At its core, Redis delivers a couple of simple commands – GET, SET, and DEL – which have the semantics you would expect. SET takes a key and a value as its arguments and assigns the value to the key, GET retrieves the value stored under a key, and DEL deletes a key and its value.
The wire protocol is as simple as:
“Open a socket to the server”
“Transmit the command followed by the key and value to the server.”
“Wait for the server to transmit the response to you on the socket.”
Easy peasy. (Redis supports a couple of variations of this protocol – RESP, RESP2, and RESP3 – but they are all variations on the same theme.)
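For example, a SET followed by a GET looks roughly like this on the wire (commands are sent as arrays of bulk strings, with every token terminated by \r\n):

```
Client: *3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n
Server: +OK\r\n
Client: *2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n
Server: $3\r\nbar\r\n
```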
Having watched Tsoding write parsers enough times that I knew exactly how to do it, I got to work. After writing a simple parser for the GET, SET, DEL, SAVE, and LOAD commands, I implemented a simple hashmap, and there we had it – a mini Redis that could do GET, SET, and DEL, and that allowed me to start benchmarking. A rough sketch of the storage side follows; the benchmark results are in the next section.
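This is not the actual FreshQL code, just a minimal, hypothetical sketch of the shape of the thing: a fixed-size, open-addressing hashmap with linear probing that the command parser dispatches into (the capacity, names, and hash function here are placeholders):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch only: a fixed-size open-addressing string table with linear probing. */
#define CAP (1u << 20)                 /* power of two, so we can mask instead of mod */

typedef struct { char *key, *val; } Slot;
static Slot table[CAP];

static uint64_t hash(const char *s) {  /* FNV-1a: good enough for a sketch */
    uint64_t h = 1469598103934665603ull;
    while (*s) { h ^= (unsigned char)*s++; h *= 1099511628211ull; }
    return h;
}

static void cmd_set(const char *key, const char *val) {
    for (uint64_t i = hash(key); ; i++) {
        Slot *s = &table[i & (CAP - 1)];
        if (!s->key) { s->key = strdup(key); s->val = strdup(val); return; }
        if (strcmp(s->key, key) == 0) { free(s->val); s->val = strdup(val); return; }
    }
}

static const char *cmd_get(const char *key) {
    for (uint64_t i = hash(key); ; i++) {
        Slot *s = &table[i & (CAP - 1)];
        if (!s->key) return NULL;                    /* hit an empty slot: not present */
        if (strcmp(s->key, key) == 0) return s->val; /* found it */
    }
}

int main(void) {
    cmd_set("foo", "bar");
    printf("GET foo -> %s\n", cmd_get("foo"));       /* prints: GET foo -> bar */
    return 0;
}
```

GET and SET both boil down to a hash, a probe loop, and a string compare, which is why the per-command cost can be tiny once the socket overhead is amortized away.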
Benchmarking the API
The table below shows the operations per second for the GET and SET commands at different pipeline sizes.
Pipeline | Redis GET | FreshQL GET | Redis SET | FreshQL SET |
---|---|---|---|---|
1 | 173k | 173k | 156k | 172k |
10 | 1,105k | 1,473k | 407k | 1,477k |
100 | 1,904k | 7,374k | 523k | 7,267k |
1000 | 2,411k | 12,300k | 558k | 12,547k |
Or, put another way, FreshQL delivers the following speedups over Redis (FreshQL throughput divided by Redis throughput at the same pipeline size).
Pipeline | Speedup (GET) | Speedup (SET) |
---|---|---|
1 | 1.0x | 1.1x |
10 | 1.33x | 3.6x |
100 | 3.8x | 13.9x |
1000 | 5.1x | 22.4x |
Imagine my surprise when I saw that my implementation delivered over 20x the write throughput and 5x the read throughput for these simple operations. I was stunned, I tell you - Stunned!
How did I measure this? I ran the benchmarks using the redis-benchmark tool distributed with Redis. The command is fairly simple.
```
redis-benchmark -t set,get -P 1000 -r 100000 -n 10000000
```
You may be looking at the redis-benchmark command and thinking, “What does that command even measure?” That is a reasonable thing to ask; let’s break it down.
The command argument (-t) lists the operations we will test; in this case, the GET and SET commands. We generate 10 million random GET and SET commands (-n is the number of commands we will execute). We use 100k distinct keys (this is what the -r range specifies). Finally, we send these commands to the server in batches of 1000 commands (-P specifies the batch size; Redis refers to this as pipelining). These values are picked so that the dataset is large enough that we are actually stressing the hashtable. As for the batch size – the batches need to be large; otherwise, the socket overhead overshadows the actual work.
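To make the pipelining point concrete, here is a rough, hypothetical sketch of what a pipelined client does (it assumes fd is a socket already connected to the server): it queues a whole batch of commands into one buffer, sends them with a single write, and only then reads the replies, so the round-trip cost is paid once per batch rather than once per command.

```c
#include <stdio.h>
#include <unistd.h>

/* Sketch of client-side pipelining: queue n SET commands into one buffer,
   send them with a single write(), then read the replies back in order.
   The buffer is assumed to be large enough for the whole batch. */
static void pipeline_sets(int fd, int n) {
    char buf[1 << 16];
    size_t len = 0;
    for (int i = 0; i < n; i++) {
        char key[32], val[32];
        int klen = snprintf(key, sizeof key, "key:%d", i);
        int vlen = snprintf(val, sizeof val, "val:%d", i);
        /* RESP: an array of three bulk strings: SET <key> <val> */
        len += snprintf(buf + len, sizeof buf - len,
                        "*3\r\n$3\r\nSET\r\n$%d\r\n%s\r\n$%d\r\n%s\r\n",
                        klen, key, vlen, val);
    }
    write(fd, buf, len);                 /* one syscall for the whole batch */

    char reply[1 << 16];
    read(fd, reply, sizeof reply);       /* n "+OK\r\n" replies come back in order;
                                            a real client loops until all have arrived */
}
```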
So what?
Now, you might ask, dear reader, “So what? You implemented a fast parser and a hashtable for storing strings in memory, and it’s fast. So what?” – That is, dear reader, a reasonable question – “So what? We all know hashmaps can be fast”. “He cheated!” you might think. “He cheated, did SIMD tricks, and optimized memory access and cache sizes.” You might even think that I implemented this all in some crazy language to avoid having to interact with the operating system – and you would be entirely correct. I implemented my own integer parser and serializer code. I optimized my hashmap to fit into the four megabytes of CPU cache on most modern machines, and I most certainly wrote all of this in C to avoid fighting a memory management system. I did all these things, and I’m proud of it!
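To give one concrete example of that “cheating”: a hand-rolled integer serializer (a sketch of the idea, not the actual FreshQL code) skips the format-string parsing and locale handling that snprintf pays for on every call, and just emits digits.

```c
#include <stdint.h>

/* Sketch: write the decimal digits of v into buf and return how many bytes
   were written. No format strings, no locale, no error paths - just digits. */
static int u64_to_ascii(uint64_t v, char *buf) {
    char tmp[20];                           /* 2^64 - 1 has 20 decimal digits */
    int n = 0;
    do { tmp[n++] = (char)('0' + v % 10); v /= 10; } while (v);
    for (int i = 0; i < n; i++)             /* digits came out least-significant first */
        buf[i] = tmp[n - 1 - i];
    return n;
}
```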
It meant that I could show how much performance is being left on the table. Most of the software being written today mainly heats the oceans and keeps data centers toasty. I often hear about how tradeoffs are being made: “we are giving up speed for developer ergonomics”, “the code will be read many more times than it is written”, and other statements of the sort. To anyone arguing that tradeoffs are being made, I ask you: what are the tradeoffs? Do you know how much computing power you are leaving on the table? And as for the argument that code is read more than it is written – if your code is any good, it is run many more times than it is read or written. We, dear reader, should have the user in mind: the person waiting patiently for a page to load, waiting for the 20 levels of JavaScript frameworks to decide that today is the day the user is allowed to check their bank balance (my bank’s website spends an unreasonable amount of time loading). All of this is to say that I do not understand why we keep accepting software as slow as it is when doing the right thing is very doable.
What’s next?
I, dear reader, am not done. As you might have noticed, this very page is being served by a web server, and that web server is, it might surprise you to know, a handmade HTTP server. The server uses FreshQL as its internal storage and serves templated pages by fetching data from FreshQL, parsing it, and rendering it into HTML. Once again, it does this using a single core and is currently able to render the front page of this website, with all the templates involved, at about 120k requests per second. For reference, I measured nginx locally and saw about 40k requests per second on the same machine. Soon, you will learn more about this here: A Fast Webserver!