Rendering a Website
Background: My name is Martin, and I am a software engineer and founder of Freshql. Freshql is a stack of simple, efficient software meant to make it easy, fun, and, more than anything, understandable to make software again.
This article gives a technical overview of how the ultra-fast Freshql stack generates web pages. The goal of Freshql is to create pages using minimal CPU and memory resources.
What is the Freshql stack?
This is a small reminder of what the Freshql stack consists of. Freshql is a single-threaded application made up of the following:
- An ultra-fast storage system
- An ultra-fast webserver
- An ultra-fast templating system
The storage system and webserver are outlined in their respective blog posts. In this post, I will focus on how the system comes together. I will show code snippets where I believe they aid understanding.
How it works
Let’s start by looking at a simplified template for generating a website like this one:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<style>
{% render "bundle.css" %}
</style>
<title>🔋 FreshQL</title>
</head>
<body>
{% render "navbar.html" %}
{% assign content = "content.html" %}
{% render content %}
{% render "footer.html" %}
</body>
</html>
The template consists of HTML with tags sprinkled in. In particular, a {% render "key" %} tag refers to a template saved in the underlying storage system, with the key of the entry used as the argument. The render function renders that template in place using the current set of bound variables.
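The template above also shows {% assign content = "content.html" %} followed by {% render content %}, so a render argument can be a bound variable rather than a literal key. A minimal sketch of how such bindings might be kept, assuming a small fixed-size table; the names Bindings, bindings_assign, bindings_lookup and MAX_BINDINGS are hypothetical and not taken from the Freshql source:

```c
#include <stddef.h>
#include <string.h>

#define MAX_BINDINGS 8 // illustrative limit, not FreshQL's

// Hypothetical context: a fixed-size list of name -> value bindings.
typedef struct {
    struct {
        const char *name;
        const char *value;
    } items[MAX_BINDINGS];
    int len;
} Bindings;

// What an assign tag might do: bind a name, or rebind it if it exists.
// Returns 0 on success, -1 when the table is full.
static int bindings_assign(Bindings *b, const char *name, const char *value) {
    for (int i = 0; i < b->len; i++) {
        if (strcmp(b->items[i].name, name) == 0) {
            b->items[i].value = value;
            return 0;
        }
    }
    if (b->len == MAX_BINDINGS) return -1;
    b->items[b->len].name = name;
    b->items[b->len].value = value;
    b->len++;
    return 0;
}

// What a render tag might do first: turn a variable name into the
// storage key it is bound to. Returns NULL when the name is unbound.
static const char *bindings_lookup(const Bindings *b, const char *name) {
    for (int i = 0; i < b->len; i++) {
        if (strcmp(b->items[i].name, name) == 0) return b->items[i].value;
    }
    return NULL;
}
```

The table is fixed-size, so binding a variable never allocates. The key it resolves to would then be looked up in the storage system.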
The parser for the templates is relatively straightforward; it looks for {{ and {% and recursively checks these for the syntax constructs.
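The scanning step can be sketched roughly as follows. This is a minimal illustration, assuming the two openers are {{ for variables and {% for control tags; the names tag_kind and find_next_tag are made up for this example and are not the actual parser.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical marker kinds; not FreshQL's actual names.
typedef enum { TAG_NONE, TAG_VARIABLE, TAG_CONTROL } tag_kind;

// Scan src[0..len) for the next "{{" or "{%".
// Stores the offset of the opener in *pos and returns its kind.
// Everything before *pos is plain character data.
static tag_kind find_next_tag(const char *src, size_t len, size_t *pos) {
    for (size_t i = 0; i + 1 < len; i++) {
        if (src[i] != '{') continue;
        if (src[i + 1] == '{') { *pos = i; return TAG_VARIABLE; }
        if (src[i + 1] == '%') { *pos = i; return TAG_CONTROL; }
    }
    *pos = len; // no opener found: the rest is character data
    return TAG_NONE;
}
```

A parser built on this would emit the text before each opener as character data, parse the tag, and continue from the end of the tag.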
Rendering templates
// One type structure per element type
typedef stringview CDATA_t;
typedef struct {
    expression id;
} CTRL_RENDER_t;
typedef expression VARIABLE_t;
typedef struct {
    expression cond;
    tmpl *consequence;
    tmpl *alternative;
} CTRL_IF_t;
typedef struct {
    stringview variable;
    stringview expr;
} CTRL_ASSIGN_t;
// And the union that creates the entire structure
typedef struct element {
    enum { CDATA,
           CTRL_RENDER,
           CTRL_IF,
           CTRL_ASSIGN,
           VARIABLE } of;
    union {
        CDATA_t CDATA;
        CTRL_RENDER_t CTRL_RENDER;
        CTRL_IF_t CTRL_IF;
        CTRL_ASSIGN_t CTRL_ASSIGN;
        VARIABLE_t VARIABLE;
    } data;
} element;
The elements reference stringviews, expressions, and templates (tmpl). A stringview is a simple struct containing a char * and a length. Templates are defined as below, and expressions follow the same style of logical constructs.
typedef struct tmpl {
    stringview name;
    // element list with static max length
    struct {
        element items[MAX_ELEMENTS];
        int len;
    } elements;
    ...
    // the source code for reference
    const char *source;
} tmpl;
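The stringview type is not shown in this post. A minimal sketch of what such a type could look like, assuming it is a non-owning pointer plus a length; the field names and the helpers sv_slice and sv_equals are assumptions made for illustration:

```c
#include <stddef.h>
#include <string.h>

// Assumed layout: a non-owning pointer plus a length, no terminator needed.
typedef struct {
    const char *ptr;
    size_t len;
} stringview;

// View over a sub-range of an existing buffer; nothing is copied.
static stringview sv_slice(const char *base, size_t start, size_t len) {
    stringview sv = { base + start, len };
    return sv;
}

// Compare a view against a NUL-terminated string.
static int sv_equals(stringview sv, const char *s) {
    return strlen(s) == sv.len && memcmp(sv.ptr, s, sv.len) == 0;
}
```

Because a view only points into the template source, parsing a tag such as {% render "navbar.html" %} can record the key without copying it.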
The data structures mean that processing takes one of two simple patterns. First, the linear pattern:
// rendering a template to an output buffer
int tmpl_render(
    tmpl *t, Context *ctx,
    char *outbuff, int outlen
) {
    int nwritten = 0;
    for (int i = 0; i < t->elements.len; i++) {
        nwritten += tmpl_element_render(
            t->elements.items[i], ctx,
            outbuff + nwritten, outlen - nwritten
        );
    }
    return nwritten;
}
Secondly, there is the recursive pattern for rendering a union.
// recursive fanout to the respective rendering function.
int tmpl_element_render(
    element e, Context *ctx,
    char *outbuff, int outlen
) {
    switch (e.of) {
    case CDATA:
        return render_CDATA(e.data.CDATA, ctx, outbuff, outlen);
    case CTRL_RENDER:
        return render_CTRL_RENDER(e.data.CTRL_RENDER, ctx, outbuff, outlen);
    case CTRL_ASSIGN:
        return render_CTRL_ASSIGN(e.data.CTRL_ASSIGN, ctx, outbuff, outlen);
    case CTRL_IF:
        return render_CTRL_IF(e.data.CTRL_IF, ctx, outbuff, outlen);
    case VARIABLE:
        return render_VARIABLE(e.data.VARIABLE, ctx, outbuff, outlen);
    }
    return 0; // unknown element type: write nothing
}
As a result of this structure, we can add a construct (C) to the templating engine by simply adding a new structure (C_t) and a new render function (render_C). This makes the code extremely extensible; the actual implementation uses X macros to make it even easier to extend.
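For readers unfamiliar with X macros, here is a small sketch of the technique applied to the element kinds. It illustrates the general pattern only and is not the Freshql implementation; the macro names ELEMENT_KINDS, AS_ENUM and AS_NAME and the helper element_kind_name are invented for this example.

```c
#include <stddef.h>

// Single list of element kinds; adding a construct means adding one line here.
#define ELEMENT_KINDS(X) \
    X(CDATA)             \
    X(CTRL_RENDER)       \
    X(CTRL_IF)           \
    X(CTRL_ASSIGN)       \
    X(VARIABLE)

// Expansion 1: the enum tags, followed by a count.
#define AS_ENUM(name) name,
typedef enum { ELEMENT_KINDS(AS_ENUM) ELEMENT_KIND_COUNT } element_kind;

// Expansion 2: a parallel table of printable names.
#define AS_NAME(name) #name,
static const char *const element_kind_names[] = { ELEMENT_KINDS(AS_NAME) };

// Printable name for a kind, or "?" when the value is out of range.
static const char *element_kind_name(element_kind k) {
    if ((unsigned)k >= (unsigned)ELEMENT_KIND_COUNT) return "?";
    return element_kind_names[k];
}
```

The same list could be expanded a third and fourth time to generate the union members and the case labels of the switch, so that a new construct is declared in exactly one place.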
Why is this fast?
There are two main reasons for this. I will go over them one at a time since they are very different in nature.
Firstly: it helps to look at how the switch statement above gets executed. Compiled in the Godbolt Compiler Explorer, the code generates the assembly shown below. While the assembly can be a little complex to read, it shows a table of offsets into the executed code; the switch statement looks up the entry in the table, jumps to the relevant code, and executes from there. This takes 4 instructions, independent of the number of branches in the switch. (The code is meant to be illustrative; in reality, the function calls are inlined.)
tmpl_element_render:
    ...
    # Lookup address of jump table
    lea rsi, [rip + .LJTI0_0]
    # Offset based on enum
    movsxd rcx, dword ptr [rsi + 4*rcx]
    # Complete offset from switch
    add rcx, rsi
    jmp rcx
.LBB0_2:
    ...
    call render_CDATA@PLT
    ret
.LBB0_3:
    ...
    call render_CTRL_RENDER@PLT
    ret
.LBB0_4:
    ...
    call render_CTRL_ASSIGN@PLT
    ret
.LBB0_5:
    ...
    call render_CTRL_IF@PLT
    ret
.LBB0_6:
    ...
    call render_VARIABLE@PLT
    ret
.LJTI0_0:
    # Jump table
    .long .LBB0_2-.LJTI0_0
    .long .LBB0_3-.LJTI0_0
    .long .LBB0_5-.LJTI0_0
    .long .LBB0_4-.LJTI0_0
    .long .LBB0_6-.LJTI0_0
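The same constant-time idea can be written by hand in C as a table of function pointers indexed by the enum. This is only an analogy to the jump table above, not what the compiler emits and not how Freshql is written; the K_ prefixed enum, the on_* handlers and dispatch are stand-ins invented for the example.

```c
#include <stddef.h>

typedef enum {
    K_CDATA, K_CTRL_RENDER, K_CTRL_IF, K_CTRL_ASSIGN, K_VARIABLE, K_COUNT
} kind;

typedef int (*handler_fn)(void);

// Stand-in handlers; each returns a distinct value so the dispatch is observable.
static int on_cdata(void)       { return 10; }
static int on_ctrl_render(void) { return 20; }
static int on_ctrl_if(void)     { return 30; }
static int on_ctrl_assign(void) { return 40; }
static int on_variable(void)    { return 50; }

// One slot per enum value, in enum order, like the entries of a jump table.
static const handler_fn HANDLERS[K_COUNT] = {
    [K_CDATA]       = on_cdata,
    [K_CTRL_RENDER] = on_ctrl_render,
    [K_CTRL_IF]     = on_ctrl_if,
    [K_CTRL_ASSIGN] = on_ctrl_assign,
    [K_VARIABLE]    = on_variable,
};

// Dispatch cost does not grow with the number of kinds:
// one bounds check, one indexed load, one indirect call.
static int dispatch(kind k) {
    if ((unsigned)k >= (unsigned)K_COUNT) return -1;
    return HANDLERS[k]();
}
```

A switch has the advantage that the compiler can inline the handlers, which an indirect call through a table usually prevents.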
Secondly: there are no runtime allocations once the templates are parsed and in memory. The output buffer is also kept static as part of the individual connection to the client, which reduces a call like render_CDATA to a single call to the C library function memcpy. The character data can only originate in a couple of places: either it came from the request itself and hence is sitting in the connected client's input buffer, or it came from the data store and is sitting there. Independent of the source, all that needs to happen is that the character data is copied into the output buffer for the requesting client. Since the server is single-threaded and the database is immutable throughout the render call, there is no need to worry about race conditions or data getting corrupted during the request.
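A minimal sketch of what allocation-free output could look like, assuming each connection owns a fixed-size buffer. The Connection struct, OUT_CAPACITY and conn_append are hypothetical names for illustration; the real render_CDATA takes different arguments, as shown earlier.

```c
#include <stddef.h>
#include <string.h>

#define OUT_CAPACITY 4096 // illustrative size

// Hypothetical per-connection state with a fixed output buffer.
typedef struct {
    char out[OUT_CAPACITY];
    size_t used; // invariant: used <= OUT_CAPACITY
} Connection;

// Copy len bytes of character data to the end of the output buffer.
// No allocation takes place: the bytes either fit or are rejected.
// Returns the number of bytes written, or 0 when they do not fit.
static int conn_append(Connection *c, const char *src, size_t len) {
    if (len > OUT_CAPACITY - c->used) return 0;
    memcpy(c->out + c->used, src, len);
    c->used += len;
    return (int)len;
}
```

The source pointer can refer to the request's input buffer or to the data store alike; the copy does not care where the bytes live.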
Once the render function has been completed, the output buffer is written to the connected client.