Build A Compiler

2022-08-05 3578 words 17 minutes

Contents

Introduction

This post will cover how to build a native compiler for a very simple language. It’ll start small by building a native compiler for a small calculator language. Later installments will add on to this base with more advanced features like arrays, structs, functions, strings, and more. Some code and explanation will be omitted for brevity, but the full working code base can be found here.

This post will use OCaml as the implementation language for the compiler. It should be fairly straight forward to use another language of choice.

Calculator Language

The initial language will be a calculator language. There are two main parts to the language. Operators, and Numbers. An Operator is one of:

+ : addition of numbers 
- : subtraction of numbers
* : multiplication of numbers
/ : division of numbers

and a Number is a non-negative integer.

The language supports nesting of operators and is written using prefix notation. All operators will be wrapped in parentheses to prevent any ambiguity. For example the calculation of: 2 + (2 * 2) would be written as (+ 2 (* 2 2)), or in tree form:

This means evaluation would proceed from the leaves to the root:

Step 1:

   *
  / \   =>  4
 2   2

Step 2:

   +
  / \   => 6
 2   4

Calculator Compiler

In this section we’ll be setting up the skeleton for the rest of the compiler to build on. In later sections we’ll build on this initial compiler so that it can compile a more complicated language.

Lexer

The first step of our compiler will be to turn the string of user input into a format more suitable for evaluation. This means we need to take a string like (+ 2 (* 2 2)) and parse it into something called an Abstract Syntax Tree (AST).

In OCaml our AST will have the following type:

type op = Add
        | Sub
        | Mul
        | Div

type exp = Op of op * exp list
         | Num of int

type ast = exp list

We’ll consider it an error to have parentheses with only one element or just a number, for example: (4), (+), or () would throw an error. Valid input would be something like (+ 1 (* 2 3)) which would become the following AST:

[
  Op (Add, [
    Num 1;
    Op (Mul, [
      Num 2;
      Num 3;
    ])
  ]
]

The first step in typical compiler is to parse the string into tokens, then turn the tokens into our AST. We’ll use the following type for tokens:

type token = LParen
           | RParen
           | Int of int
           | String of string

We’ll be using basic library functions instead of a parser generator or a parser combinator library like Angstrom. It should also be noted that we’re parsing s-expressions, which is already implemented by existing libraries like parsexp.

The basic approach will be purely functional, turning our input string into a list of characters and then turning that list of characters into a list of tokens. Let’s first take a look at the top-level parsing function:

(** Turn a string into a token list *)
let tokenize (input : string) : token list =
    let chars = String.to_list input in

    (* Top-level loop to parse entire string *)
    let rec loop acc = function
        | [] -> acc

        (* skip whitespace *)
        | ' ' :: rest
        | '\t' :: rest
        | '\n' :: rest
        | '\r' :: rest -> loop acc rest

        (* parentheses *)
        | '(' :: rest -> loop (LParen :: acc) rest
        | ')' :: rest -> loop (RParen :: acc) rest

        (* integers *)
        | ('0'..'9' as hd) :: rest -> 
            let num, rest = parse_int (hd :: rest) in
            loop (num :: acc) rest

        (* string *)
        | hd :: rest -> 
            let str, rest = parse_string (hd :: rest) in
            loop (str :: acc) rest
    in

    (* Run the loop *)
    chars
    |> loop []
    |> List.rev

It’s pretty straight forward. We look at each character and make a decision on what to do. If it’s white space we skip it. If it’s a digit, we call a helper function to parse it into an integer. If it’s any other character we assume it’s a string and call another helper function.

The parse_int function looks like:

(* Parse a single integer *)
let parse_int lst = 
    let rec loop acc = function
        | ('0'..'9' as digit) :: rest -> 
            loop (digit :: acc) rest

        | [] as rest
        | rest -> 
            let num = String.of_char_list (List.rev acc) in
            Int (Int.of_string num), rest
    in
    loop [] lst

It follows the same pattern as the top-level loop. It builds an accumulator list of digits and when it gets to a non-digit or the empty string, the accumulator gets transformed back into a string and then Int.of_string turns it into an integer.

The parse_string helper function is very similar:

let string_of_chars chars =
    let str = String.of_char_list (List.rev chars) in
    String str

let parse_string lst =
    let rec loop acc = function
        | [] as rest ->
            string_of_chars acc, rest

        | ('0'..'9' | '(' | ')' | ' ' 
        | '\t' | '\n' | '\r' as hd) :: rest ->
            string_of_chars acc, hd :: rest

        | hd :: rest -> 
            loop (hd :: acc) rest
    in
    loop [] lst

As long as the next character is not a digit, paren, or whitespace it accumulates a list of characters and then turns that accumulator into a string.

Parser

Now we need to turn this list of tokens into an AST. The basic approach is very similar to the tokenize method from the previous section. We’ll consume an expression until we cannot consume it anymore, returning the tail of the list that is unparsed. For any malformed input we’ll throw an exception.

To start, the simplest function is parsing an operator kind:

let parse_op = function
    | "+" -> Add
    | "-" -> Sub
    | "*" -> Mul
    | "/" -> Div
    | _ -> failwith "Invalid op"

Next we will need to have a set of mutually recursive functions. One for parsing an expression, and one for parsing an operator. This is because operators can contain expressions and vice-versa.

let tokens_to_ast (lst : token list) : ast =
    let rec parse_operator (lst : token list) : exp * token list =
        match lst with
        | String s :: rest ->
            let op = parse_op s in

            let rec loop acc = function
                | [] -> failwith "Expected right paren"
                | RParen :: rest -> acc, rest
                | rest ->
                    let exp, rest = parse_exp [] rest in
                    loop (acc @ exp) rest
            in
            let exprs, rest = loop [] rest in
            Op (op, List.rev exprs), rest

        | _ -> failwith "Cannot parse input"

    and parse_exp acc : token list -> exp list * token list = function
        (* End of input, success *)
        | [] -> List.rev acc, []

        (* Start of an operator *)
        | LParen :: (String _ as op) :: rest -> 
            let op, rest = parse_operator (op :: rest) in
            parse_exp (op :: acc) rest

        (* End of operator *)
        | RParen :: rest ->
            acc, RParen :: rest

        (* Integer *)
        | Int i :: rest -> 
            parse_exp (Num i :: acc) rest

        (* Error *)
        | _ -> failwith "Error parsing input"
    in

    parse_exp [] lst |> fst

When an LParen is seen, we know this is the start of an operator, so we call out to the parse_operator method. That method in turn keeps calling parse_exp until it sees an RParen.

Interpreter

Before doing code-gen we should figure out how to calculate a result from an AST. This AST technically supports multiple expressions in a row, but we’ll only return the last result. In later sections when we add side effects and variables, these ignored expressions will be more useful.

The evaluator is very simple. We do a bottom-up traversal where the root becomes our result.

let eval (ast : ast) =
    (* Only return the last expression's result *)
    let last = List.last_exn ast in

    let rec compute : exp -> int = 
        let reduce op exps =
            let nums = List.map exps ~f:compute in
            List.reduce ~f:op nums |> Option.value_exn
        in
        function
        | Op (Add, exps) -> reduce ( + ) exps
        | Op (Sub, exps) -> reduce ( - ) exps
        | Op (Mul, exps) -> reduce ( * ) exps
        | Op (Div, exps) -> reduce ( / ) exps
        | Num num -> num
    in

    compute last

Code Generation

Now that we have an AST and can evaluate the AST to a result, we need to think about how we can translate this into machine instructions. Ideally, we’d figure out a way to translate our eval function directly into machine code that the CPU could run and produce the same result. Unfortunately this part is very machine and OS specific, so a decision will have to be made on the platform target first.

For this example we’ll use x86 64-bit and Linux. This means we’ll need to understand the Executable and Linkable Format (ELF) specification, as well as which CPU instructions from the x86-64 ISA we should use.

Evaluation Method

Before deciding what instructions to use, we need to know what operations need to happen in order to evaluate an AST. Much like our eval code above we’d like to start at the leaves and work through the tree left to right, bottom to top. Let’s do an example:

Input: (+ 1 (* 2 3 4) 5) 

AST:
    +___
   / \   \
  1   *   5
     /|\
    2 3 4

For a stack based approach we could evaluate the tree in the following order:

Code      Stack
===============
push 1    1

push 2    1 2
push 3    1 2 3
mul       1 6
push 4    1 6 4
mul       1 24

add       25

push 5    25 5
add       30

So now we more or less know the order we want to do our operations in, and what kinds of operations are needed from the x86-64 instruction set.

Instructions

There’s a number of operations that are required in this calculator language: addition, subtraction, multiplication, and division. As well as stack operations like push and pop. This means we need to go look at the instruction specification and see if the x86-64 ISA supports these operations, and how they work.

Taking a look at here, and doing some digging, the following instructions can be found: mov add sub imul idiv push pop. These should be enough to get started on the code generation.

We will also need a way to encode these instructions into raw bytes. We’ll do this by creating a function to generate each of the instructions. This will make it easier so we don’t need to deal with raw bytes directly. You can use this website to play around with the various instructions and their decodings/encodings.

Below is an example instruction function. View the code to see all of them.

MOV

type arg = RAX
         | RDX
         | RBX
         | RCX

let mov_r ~src ~dst =
   let reg =
       match src, dst with
       | RAX, RAX -> 0xc0
       | RAX, RBX -> 0xd8
       | RAX, RCX -> 0xc8
       | RAX, RDX -> 0xd0

       | RBX, RAX -> 0xc3
       | RBX, RBX -> 0xdb
       | RBX, RCX -> 0xcb
       | RBX, RDX -> 0xd3

       | RCX, RAX -> 0xc1
       | RCX, RBX -> 0xd9
       | RCX, RCX -> 0xc9
       | RCX, RDX -> 0xd1

       | RDX, RAX -> 0xc2
       | RDX, RBX -> 0xda
       | RDX, RCX -> 0xca
       | RDX, RDX -> 0xd2
   in
   [0x48; 0x89; reg]
;;

ELF

In order to have the OS run our code we need to create what’s called an ELF file. It contains information that tells the operating system how to load and run our code. There’s three main parts to an ELF file. The File Header, Program Header, and Section Header. We won’t be using the section header, so we’ll only talk about the file and program headers. More information can be found on wikipedia.

File Header

The file header contains metadata about the ELF file, like what machine type is expected, where the other headers are located, entry points, etc. Wikipedia has a great table on the ELF page linked above. Especially this image. Most of the fields are constant values, or simple offsets and sizes. You can use the readelf tool to dissect ELF files. The -l -S -h flags are particularly useful for viewing the different headers.

Here’s the output of readelf for the executable we’ll be making:

ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x4010b8
  Start of program headers:          64 (bytes into file)
  Start of section headers:          0 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         2
  Size of section headers:           0 (bytes)
  Number of section headers:         0
  Section header string table index: 0

This is the file header. And here is an annotated hex dump of that header:

Hex Dump
00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000  .ELF............
00000010: 0200 3e00 0100 0000 b810 4000 0000 0000  ..>.......@.....
00000020: 4000 0000 0000 0000 0000 0000 0000 0000  @...............
00000030: 0000 0000 4000 3800 0200 0000 0000 0000  [email protected].........

Explained:
[7f 45 4c 46] ........... Magic constant, used to know it's an ELF file
[02] .................... 2 = 64-bit mode
[01] .................... 1 = Little Endian
[01] .................... 1 = ELF version 1
[00] .................... 0 = System V ABI
[00 00 00 00 00 00 00 00] Always zero, padding
[02 00] ................. Type of file is an executable
[3e 00] ................. Contains AMD64 code (x86-64)
[01 00 00 00] ........... ELF version 1
[b8 10 40 00 00 00 00 00] Code starts at address 0x4010b8
[40 00 00 00 00 00 00 00] Program header start offset
[00 00 00 00 00 00 00 00] Section header start offset
[00 00 00 00] ........... Flags, none
[40 00] ................. The size of this header is 64-bytes
[38 00] ................. The size of a program header is 56-bytes
[02 00] ................. There are two program headers
[00 00] ................. Section headers are 0-bytes in size
[00 00] ................. There are 0 section headers
[00.00] ................. There is no section header table

Program Header

There are two program headers in our ELF file. One for the code, and one for the data. The reason for two is each one has a specific set of flags that control the permissions on the memory. For example we can say that one section has executable permissions, but is read only, and the other section contains writable data. This isn’t required, we could easily throw everything in a section that has all the permissions, but it’s helpful for debugging and security. If we accidentally try overwritting our code with data, we will get a segmentation fault and the program will crash. This is preferrable to execution continuing and returning the wrong result, making it hard to trace the issue.

Below is the dump of the program headers from the readelf -l tool:

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000401000 0x0000000000401000
                 0x0000000000000104 0x0000000000000104  R E    0x1000
  LOAD           0x0000000000000000 0x0000000000801000 0x0000000000801000
                 0x0000000000000104 0x0000000000000104   W     0x1000

They’re pretty similar, the main difference being the flags. We can see one has R E, meaning read-only, executable, and the other W meaning it maps writable memory. An annotated hex dump of the first section is shown below:

Hex Dump
00000040: 0100 0000 0500 0000 0000 0000 0000 0000  ................
00000050: 0010 4000 0000 0000 0010 4000 0000 0000  ..@.......@.....
00000060: de00 0000 0000 0000 de00 0000 0000 0000  ................
00000070: 0010 0000 0000 0000                      ........

Explained:
[01 00 00 00] ........... 1 = Loadable segment
[05 00 00 00] ........... 5 = Read-only + Executable
[00 00 00 00 00 00 00 00] File offset, 0 to load aligned in memory
[00 10 40 00 00 00 00 00] Virtual address to load at
[00 10 40 00 00 00 00 00] Physical address to load at
[04 01 00 00] ........... Size in the file (bytes)
[04 01 00 00] ........... Size in memory (bytes)
[00 00 00 00] ........... Flags
[00 10] ................. Alignment (4096)

Generate Code

To generate code we need to walk the tree and output a list of x86-64 instructions. To make this easier the tree can be normalized so that a post-order traversal over the tree will make instruction generation easy. Normalize in this case means to make it so each operation has only two children, unlike now where the AST allows any number of children. Take the example from above:

Input: (+ 1 (* 2 3 4) 5) 

AST:
    +___
   / \   \
  1   *   5
     /|\
    2 3 4

Normalized:
     +
    / \
   +   5
  / \
 1   *
    / \
   *   4
  / \
2    3

This allows us to operate on the tree left-to-right and keep nested operations in proper order. We would generate the following instruction stream for this tree:

0:  48 c7 c1 01 00 00 00    mov    rcx,0x1
7:  51                      push   rcx
8:  48 c7 c1 02 00 00 00    mov    rcx,0x2
f:  51                      push   rcx
10: 48 c7 c1 03 00 00 00    mov    rcx,0x3
17: 51                      push   rcx
18: 5b                      pop    rbx
19: 59                      pop    rcx
1a: 48 0f af cb             imul   rcx,rbx    -- 2*3
1e: 51                      push   rcx
1f: 48 c7 c1 04 00 00 00    mov    rcx,0x4
26: 51                      push   rcx
27: 5b                      pop    rbx
28: 59                      pop    rcx
29: 48 0f af cb             imul   rcx,rbx    -- 6*4
2d: 51                      push   rcx
2e: 5b                      pop    rbx
2f: 59                      pop    rcx
30: 48 01 d9                add    rcx,rbx    -- 1 + 24
33: 51                      push   rcx
34: 48 c7 c1 05 00 00 00    mov    rcx,0x5
3b: 51                      push   rcx
3c: 5b                      pop    rbx
3d: 59                      pop    rcx
3e: 48 01 d9                add    rcx,rbx    -- 25 + 5
41: 51                      push   rcx
42: 48 c7 c0 01 00 00 00    mov    rax,0x1
49: 5b                      pop    rbx        -- Save result
4a: cd 80                   int    0x80       -- Return

So what does normalizing these op codes look like?

let rec normalize exp =
    match exp with
    | Num _ as n -> n
    | Op (_, []) -> failwith "empty op"
    | Op (Sub, [_]) as op -> op
    | Op (_, [exp]) -> normalize exp
    | Op (op, exps) ->
        List.reduce_exn ~f:(fun l r ->
            let l = normalize l in
            let r = normalize r in
            Op (op, [l; r])
        ) exps

There’s a special case in here for subtraction. If we have subtraction with a single child we will keep it, so we can generate special negation code.

Once the tree has been normalized we can loop over it and generate the instructions:

let open Instructions in
let rec compute (instrs : int list list) =
    let do_op op = function
        | [left; right] ->
            let left = compute [] left in
            let right = compute [] right in
                push RCX
                :: op ~src:RBX ~dst:RCX
                :: pop RCX
                :: pop RBX
                :: right @ left
            | _ -> failwith "Un-normalized exp"
    in

    function
    (* Two's complement negation *)
    | Op (Sub, [exp]) ->
        let v = compute [] exp in
        [push RAX;
        add ~dst:RAX ~src:RBX;
        mov_const ~dst:RBX ~const:1;
        not RAX;
        pop RAX] @ v
    | Op (Add, exps) -> do_op add exps
    | Op (Sub, exps) -> do_op sub exps
    | Op (Mul, exps) -> do_op mul exps
    (* Handle division specially to make sure the
       right registers are used for operands *)
    | Op (Div, [left; right]) ->
        let left = compute [] left in
        let right = compute [] right in
        push RAX
        :: div RCX
        :: pop RAX
        :: pop RCX
        :: right @ left
    | Op (Div, _) -> failwith "Bad div op"
    | Num num ->
        push RCX
        :: mov_const ~dst:RCX ~const:num
        :: instrs

We have the special case here for subtraction where we do the bit-wise negation and add one. As well as division, because it has fixed registers that it operates on, so the code has to be careful about what is put where in order to get the right result.

From here we need to bring our full compiler together and generate an executable from a string input. We do so with the following function:

let compile (input : string) =
    input
    |> tokenize
    |> build_ast
    |> gen_code
    |> write_elf

With this, we have a working compiler. It can take an expression from user input to generating an executable!

Conclusion

This was the first step to making a more advanced compiler. This compiler generates code for a very simple arithmetic language and outputs an ELF file using x86-64 instructions to compute the result. Future installments will add onto this compiler with more advanced features like arrays, structs, functions, strings, and more. Check out the full code base here to read any of the omitted parts.