Rendering a Website

Background: My name is Martin, and I am a software engineer and founder of Freshql. Freshql is a stack of simple, efficient software meant to make it easy, fun, and, more than anything, understandable to make software again.

This article gives a technical overview of how the ultra-fast Freshql stack generates web pages. The goal of Freshql is to create pages using minimal CPU and memory resources.

What is the Freshql stack?

This is a small reminder of what the Freshql stack consists of. Freshql is a single-threaded application made up of the following:

An ultra-fast storage system.
An ultra-fast webserver.
An ultra-fast templating system.

The storage system and webserver are outlined in their respective blog posts. In this post, I will focus on how the system comes together. I will show code snippets where I believe they aid understanding.

How it works

Let’s start by looking at a simplified template for generating a website like this one:

<!DOCTYPE html>
<html lang="en">
  <head>
    
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
        
    <style>
    {% render "bundle.css" %}
    </style>

    <title>🔋 FreshQL</title>
  </head>
  <body>
    {% render "navbar.html" %}
    
    {% assign content = "content.html" %}
    {% render content %}
    
    {% render "footer.html" %}
  </body>
</html>

The template consists of HTML with tags sprinkled in. In particular, the tag {% render "key" %} refers to a template saved in the underlying storage system, with the key of the entry passed as the argument. The render function renders that template in place using the current set of bound variables.

The parser for the templates is relatively straightforward; it looks for the openers {{ and {% and recursively checks what follows against the syntax constructs.
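To make the scanning step concrete, here is a minimal sketch of how a parser might locate the next tag opener. It assumes the two openers are {{ (variables) and {% (control tags); the names tag_kind and next_tag are hypothetical and not taken from the Freshql source.

```c
#include <stddef.h>

/* Hypothetical token kinds; the names are illustrative only. */
typedef enum { TAG_NONE, TAG_VARIABLE, TAG_CONTROL } tag_kind;

/* Scan src[start..len) for the next "{{" or "{%" opener.
   Returns the offset of the opener, or len when there is none.
   Everything between start and the returned offset is plain character data. */
static size_t next_tag(const char *src, size_t len, size_t start, tag_kind *kind) {
  *kind = TAG_NONE;
  for (size_t i = start; i + 1 < len; i++) {
    if (src[i] != '{') continue;
    if (src[i + 1] == '{') { *kind = TAG_VARIABLE; return i; }
    if (src[i + 1] == '%') { *kind = TAG_CONTROL;  return i; }
  }
  return len;
}
```

A parser built this way would emit one character data element for the text before each opener and then hand the tag body to the matching construct parser.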

Rendering templates

// One type structure per element type
typedef stringview CDATA_t;

typedef struct {
  expression id;
} CTRL_RENDER_t;

typedef expression VARIABLE_t;

typedef struct {
  expression cond;
  tmpl * consequence;
  tmpl * alternative; 
} CTRL_IF_t;

typedef struct {
  stringview variable;
  stringview expr;
} CTRL_ASSIGN_t;

// And the union that creates the entire structure
typedef struct element {
  enum { CDATA, 
         CTRL_RENDER, 
         CTRL_IF, 
         CTRL_ASSIGN, 
         VARIABLE } of;
  union {
    CDATA_t       CDATA;
    CTRL_RENDER_t CTRL_RENDER;
    CTRL_IF_t     CTRL_IF;
    CTRL_ASSIGN_t CTRL_ASSIGN;
    VARIABLE_t    VARIABLE;
  } data;
} element;

The elements reference stringviews, expressions, and templates (tmpl). Stringviews are simple structs containing a char * and a length value. Templates are defined as below, and expressions are built in the same style of logical constructs.
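For illustration, a stringview of the kind described here might look like the following. This is a minimal sketch, not Freshql's actual definition: the field names data and len and the helpers sv_from_cstr and sv_equals are assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: a non-owning view into existing character data. */
typedef struct {
  const char *data; /* not necessarily NUL-terminated */
  size_t len;       /* number of bytes in the view */
} stringview;

/* Build a view over a NUL-terminated string without copying it. */
static stringview sv_from_cstr(const char *s) {
  stringview sv = { s, strlen(s) };
  return sv;
}

/* Compare two views byte by byte; returns 1 when equal, 0 otherwise. */
static int sv_equals(stringview a, stringview b) {
  return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
}
```

Because a view only points at bytes that already live somewhere else, such as the template source or a request buffer, creating one never allocates.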

typedef struct tmpl {
  stringview name;
  
  // element list with static max length
  struct {
    element items[MAX_ELEMENTS];
    int len;
  } elements;

  ...

  // the source code for reference
  const char *source;
} tmpl;

The data structures mean that processing takes one of two simple patterns. First, the linear pattern:

// rendering a template to an output buffer
int tmpl_render(
    tmpl *t, Context *ctx, 
    char * outbuff, int outlen
  ) {
  int nwritten = 0;
  for (int i = 0; i != t->elements.len; i ++) {
    nwritten += tmpl_element_render(
        t->elements.items[i], ctx,
        outbuff + nwritten, outlen - nwritten
    );
  }
  return nwritten;
}

Secondly, there is the recursive pattern for rendering a union.

// recursive fanout to the respective rendering function.
int tmpl_element_render(
    element e, Context *ctx,
    char * outbuff, int outlen
  ) {
  switch(e.of) {
  case CDATA:
    return render_CDATA(e.data.CDATA, ctx,  outbuff, outlen);
  case CTRL_RENDER: 
    return render_CTRL_RENDER(e.data.CTRL_RENDER, ctx,  outbuff, outlen);
  case CTRL_ASSIGN: 
    return render_CTRL_ASSIGN(e.data.CTRL_ASSIGN, ctx,  outbuff, outlen);
  case CTRL_IF:
    return render_CTRL_IF(e.data.CTRL_IF, ctx, outbuff, outlen);
  case VARIABLE:
    return render_VARIABLE(e.data.VARIABLE, ctx, outbuff, outlen);
  }
  // unknown element type: nothing is written
  return 0;
}

As a result of this structure, we can add a construct (C) to the templating engine by simply adding a new structure (C_t) and a new render function (render_C). This makes the code extremely extensible; the actual implementation uses X macros to make extending it even easier.
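For readers unfamiliar with X macros, the sketch below shows the general technique applied to the element kinds from this post. It is not the actual Freshql code: the macro names ELEMENT_KINDS, AS_ENUM, and AS_CASE and the helper element_kind_name are assumptions chosen for illustration.

```c
/* Single list of element kinds; adding a construct means adding one line here. */
#define ELEMENT_KINDS(X) \
  X(CDATA)               \
  X(CTRL_RENDER)         \
  X(CTRL_IF)             \
  X(CTRL_ASSIGN)         \
  X(VARIABLE)

/* First expansion: the list becomes the enum tags. */
#define AS_ENUM(name) name,
typedef enum { ELEMENT_KINDS(AS_ENUM) ELEMENT_KIND_COUNT } element_kind;
#undef AS_ENUM

/* Second expansion: the same list becomes the cases of a switch. */
static const char *element_kind_name(element_kind k) {
  switch (k) {
#define AS_CASE(name) case name: return #name;
    ELEMENT_KINDS(AS_CASE)
#undef AS_CASE
    default: return "UNKNOWN";
  }
}
```

The same list could be expanded again to generate the union members and the dispatch cases that call each render function, so the enum, the union, and the switch cannot drift out of sync.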

Why is this fast?

There are two main reasons for this. I will go over them one at a time since they are very different in nature.

Firstly: it is worth looking at how the switch statement above gets executed. Compiled in the Godbolt Compiler Explorer, the code generates the assembly shown below. While the assembly can be a little complex to read, it shows a table of offsets into the executed code; the switch statement looks up the entry for the enum value in the table, jumps to the relevant code, and executes from there. This is done in 4 instructions, independent of the number of branches in the switch. (The code is meant to be illustrative; in reality, the function calls are inlined.)

tmpl_element_render:
        ...
        # Lookup address of jump table
        lea     rsi, [rip + .LJTI0_0]
        # Offset based on enum
        movsxd  rcx, dword ptr [rsi + 4*rcx]
        # Complete offset from switch
        add     rcx, rsi
        jmp     rcx
.LBB0_2:
        ...
        call    render_CDATA@PLT
        ret
.LBB0_3:
        ... 
        call    render_CTRL_RENDER@PLT
        ret
.LBB0_4:
        ... 
        call    render_CTRL_ASSIGN@PLT
        ret
.LBB0_5:
        ... 
        call    render_CTRL_IF@PLT
        ret
.LBB0_6:
        ... 
        call    render_VARIABLE@PLT
        ret
.LJTI0_0:
        # Jump table
        .long   .LBB0_2-.LJTI0_0
        .long   .LBB0_3-.LJTI0_0
        .long   .LBB0_5-.LJTI0_0
        .long   .LBB0_4-.LJTI0_0
        .long   .LBB0_6-.LJTI0_0

Secondly: there are no runtime allocations once the templates are parsed and in memory. The output buffer is also kept static as part of the individual connection to the client, which reduces a call like render_CDATA to a single call to the C library function memcpy. The character data can only originate in a couple of places: either it came from the request itself and is sitting in the connected client's input buffer, or it came from the data store and is sitting there. Independent of the source, all that needs to happen is that the character data is copied into the output buffer for the requesting client. Since the server is single-threaded and the database is immutable throughout the render call, there is no need to worry about race conditions or data getting corrupted during the request.
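As a rough illustration of the allocation-free copy, a character data renderer could be as small as the sketch below. It is written under assumptions: the stringview field names are hypothetical, the Context argument is left out for brevity, and returning 0 when the data does not fit is my choice for the sketch rather than documented Freshql behavior.

```c
#include <string.h>

/* Hypothetical stringview layout; the field names are assumptions. */
typedef struct {
  const char *data;
  int len;
} stringview;

/* Copy character data into the caller's output buffer.
   Returns the number of bytes written, or 0 if the data does not fit.
   Nothing is allocated; the destination belongs to the connection. */
static int render_cdata_sketch(stringview cdata, char *outbuff, int outlen) {
  if (cdata.len < 0 || cdata.len > outlen) return 0;
  memcpy(outbuff, cdata.data, (size_t)cdata.len);
  return cdata.len;
}
```

The return value matches the convention used by tmpl_render, which advances the output pointer by the number of bytes each element reports.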

Once the render function has been completed, the output buffer is written to the connected client.