Build A Compiler
Introduction
This post will cover how to build a native compiler for a very simple language. It’ll start small by building a native compiler for a small calculator language. Later installments will add on to this base with more advanced features like arrays, structs, functions, strings, and more. Some code and explanation will be omitted for brevity, but the full working code base can be found here.
This post will use OCaml as the implementation language for the compiler. It should be fairly straight forward to use another language of choice.
Calculator Language
The initial language will be a calculator language. There are two main parts to the language. Operators, and Numbers. An Operator is one of:
+ : addition of numbers
- : subtraction of numbers
* : multiplication of numbers
/ : division of numbers
and a Number is a non-negative integer.
The language supports nesting of operators and is written using prefix notation. All operators will
be wrapped in parentheses to prevent any ambiguity. For example the calculation of: 2 + (2 * 2)
would be written as (+ 2 (* 2 2))
, or in tree form:
+
/ \
2 *
/ \
2 2
This means evaluation would proceed from the leaves to the root:
Step 1:
*
/ \ => 4
2 2
Step 2:
+
/ \ => 6
2 4
Calculator Compiler
In this section we’ll be setting up the skeleton for the rest of the compiler to build on. In later sections we’ll build on this initial compiler so that it can compile a more complicated language.
Lexer
The first step of our compiler will be to turn the string of user input into a format more suitable for evaluation.
This means we need to take a string like (+ 2 (* 2 2))
and parse it into something called an Abstract Syntax Tree (AST).
In OCaml our AST will have the following type:
type op = Add
| Sub
| Mul
| Div
type exp = Op of op * exp list
| Num of int
type ast = exp list
We’ll consider it an error to have parentheses with only one element or just a number, for example:
(4)
, (+)
, or ()
would throw an error. Valid input would be something like (+ 1 (* 2 3))
which
would become the following AST:
[
Op (Add, [
Num 1;
Op (Mul, [
Num 2;
Num 3;
])
]
]
The first step in typical compiler is to parse the string into tokens, then turn the tokens into our AST. We’ll use the following type for tokens:
type token = LParen
| RParen
| Int of int
| String of string
We’ll be using basic library functions instead of a parser generator or a parser combinator library like Angstrom. It should also be noted that we’re parsing s-expressions, which is already implemented by existing libraries like parsexp.
The basic approach will be purely functional, turning our input string into a list of characters and then turning that list of characters into a list of tokens. Let’s first take a look at the top-level parsing function:
(** Turn a string into a token list *)
let tokenize (input : string) : token list =
let chars = String.to_list input in
(* Top-level loop to parse entire string *)
let rec loop acc = function
| [] -> acc
(* skip whitespace *)
| ' ' :: rest
| '\t' :: rest
| '\n' :: rest
| '\r' :: rest -> loop acc rest
(* parentheses *)
| '(' :: rest -> loop (LParen :: acc) rest
| ')' :: rest -> loop (RParen :: acc) rest
(* integers *)
| ('0'..'9' as hd) :: rest ->
let num, rest = parse_int (hd :: rest) in
loop (num :: acc) rest
(* string *)
| hd :: rest ->
let str, rest = parse_string (hd :: rest) in
loop (str :: acc) rest
in
(* Run the loop *)
chars
|> loop []
|> List.rev
It’s pretty straight forward. We look at each character and make a decision on what to do. If it’s white space we skip it. If it’s a digit, we call a helper function to parse it into an integer. If it’s any other character we assume it’s a string and call another helper function.
The parse_int
function looks like:
(* Parse a single integer *)
let parse_int lst =
let rec loop acc = function
| ('0'..'9' as digit) :: rest ->
loop (digit :: acc) rest
| [] as rest
| rest ->
let num = String.of_char_list (List.rev acc) in
Int (Int.of_string num), rest
in
loop [] lst
It follows the same pattern as the top-level loop. It builds an accumulator list of digits and when it gets to a non-digit or the empty string, the accumulator gets transformed back into a string and then Int.of_string
turns it into an integer.
The parse_string
helper function is very similar:
let string_of_chars chars =
let str = String.of_char_list (List.rev chars) in
String str
let parse_string lst =
let rec loop acc = function
| [] as rest ->
string_of_chars acc, rest
| ('0'..'9' | '(' | ')' | ' '
| '\t' | '\n' | '\r' as hd) :: rest ->
string_of_chars acc, hd :: rest
| hd :: rest ->
loop (hd :: acc) rest
in
loop [] lst
As long as the next character is not a digit, paren, or whitespace it accumulates a list of characters and then turns that accumulator into a string.
Parser
Now we need to turn this list of tokens into an AST. The basic approach is very similar to the tokenize
method from the previous section. We’ll consume an expression until we cannot consume it anymore, returning the tail of the list that is unparsed. For any malformed input we’ll throw an exception.
To start, the simplest function is parsing an operator kind:
let parse_op = function
| "+" -> Add
| "-" -> Sub
| "*" -> Mul
| "/" -> Div
| _ -> failwith "Invalid op"
Next we will need to have a set of mutually recursive functions. One for parsing an expression, and one for parsing an operator. This is because operators can contain expressions and vice-versa.
let tokens_to_ast (lst : token list) : ast =
let rec parse_operator (lst : token list) : exp * token list =
match lst with
| String s :: rest ->
let op = parse_op s in
let rec loop acc = function
| [] -> failwith "Expected right paren"
| RParen :: rest -> acc, rest
| rest ->
let exp, rest = parse_exp [] rest in
loop (acc @ exp) rest
in
let exprs, rest = loop [] rest in
Op (op, List.rev exprs), rest
| _ -> failwith "Cannot parse input"
and parse_exp acc : token list -> exp list * token list = function
(* End of input, success *)
| [] -> List.rev acc, []
(* Start of an operator *)
| LParen :: (String _ as op) :: rest ->
let op, rest = parse_operator (op :: rest) in
parse_exp (op :: acc) rest
(* End of operator *)
| RParen :: rest ->
acc, RParen :: rest
(* Integer *)
| Int i :: rest ->
parse_exp (Num i :: acc) rest
(* Error *)
| _ -> failwith "Error parsing input"
in
parse_exp [] lst |> fst
When an LParen is seen, we know this is the start of an operator, so we call out to the parse_operator
method. That method in turn keeps calling parse_exp
until it sees an RParen.
Interpreter
Before doing code-gen we should figure out how to calculate a result from an AST. This AST technically supports multiple expressions in a row, but we’ll only return the last result. In later sections when we add side effects and variables, these ignored expressions will be more useful.
The evaluator is very simple. We do a bottom-up traversal where the root becomes our result.
let eval (ast : ast) =
(* Only return the last expression's result *)
let last = List.last_exn ast in
let rec compute : exp -> int =
let reduce op exps =
let nums = List.map exps ~f:compute in
List.reduce ~f:op nums |> Option.value_exn
in
function
| Op (Add, exps) -> reduce ( + ) exps
| Op (Sub, exps) -> reduce ( - ) exps
| Op (Mul, exps) -> reduce ( * ) exps
| Op (Div, exps) -> reduce ( / ) exps
| Num num -> num
in
compute last
Code Generation
Now that we have an AST and can evaluate the AST to a result, we need to think about how we can translate this into machine instructions. Ideally, we’d figure out a way to translate our eval
function directly into machine code that the CPU could run and produce the same result. Unfortunately this part is very machine and OS specific, so a decision will have to be made on the platform target first.
For this example we’ll use x86 64-bit and Linux. This means we’ll need to understand the Executable and Linkable Format (ELF) specification, as well as which CPU instructions from the x86-64 ISA we should use.
Evaluation Method
Before deciding what instructions to use, we need to know what operations need to happen in order to evaluate an AST. Much like our eval
code above we’d like to start at the leaves and work through the tree left to right, bottom to top. Let’s do an example:
Input: (+ 1 (* 2 3 4) 5)
AST:
+___
/ \ \
1 * 5
/|\
2 3 4
For a stack based approach we could evaluate the tree in the following order:
Code Stack
===============
push 1 1
push 2 1 2
push 3 1 2 3
mul 1 6
push 4 1 6 4
mul 1 24
add 25
push 5 25 5
add 30
So now we more or less know the order we want to do our operations in, and what kinds of operations are needed from the x86-64 instruction set.
Instructions
There’s a number of operations that are required in this calculator language: addition, subtraction, multiplication, and division. As well as stack operations like push and pop. This means we need to go look at the instruction specification and see if the x86-64 ISA supports these operations, and how they work.
Taking a look at here, and doing some digging, the following instructions can be found: mov
add
sub
imul
idiv
push
pop
. These should be enough to get started on the code generation.
We will also need a way to encode these instructions into raw bytes. We’ll do this by creating a function to generate each of the instructions. This will make it easier so we don’t need to deal with raw bytes directly. You can use this website to play around with the various instructions and their decodings/encodings.
Below is an example instruction function. View the code to see all of them.
MOV
type arg = RAX
| RDX
| RBX
| RCX
let mov_r ~src ~dst =
let reg =
match src, dst with
| RAX, RAX -> 0xc0
| RAX, RBX -> 0xd8
| RAX, RCX -> 0xc8
| RAX, RDX -> 0xd0
| RBX, RAX -> 0xc3
| RBX, RBX -> 0xdb
| RBX, RCX -> 0xcb
| RBX, RDX -> 0xd3
| RCX, RAX -> 0xc1
| RCX, RBX -> 0xd9
| RCX, RCX -> 0xc9
| RCX, RDX -> 0xd1
| RDX, RAX -> 0xc2
| RDX, RBX -> 0xda
| RDX, RCX -> 0xca
| RDX, RDX -> 0xd2
in
[0x48; 0x89; reg]
;;
ELF
In order to have the OS run our code we need to create what’s called an ELF file. It contains information that tells the operating system how to load and run our code. There’s three main parts to an ELF file. The File Header, Program Header, and Section Header. We won’t be using the section header, so we’ll only talk about the file and program headers. More information can be found on wikipedia.
File Header
The file header contains metadata about the ELF file, like what machine type is expected, where the other headers are located, entry points, etc. Wikipedia has a great table on the ELF page linked above. Especially this image. Most of the fields are constant values, or simple offsets and sizes. You can use the readelf tool to dissect ELF files. The -l -S -h
flags are particularly useful for viewing the different headers.
Here’s the output of readelf
for the executable we’ll be making:
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x4010b8
Start of program headers: 64 (bytes into file)
Start of section headers: 0 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 2
Size of section headers: 0 (bytes)
Number of section headers: 0
Section header string table index: 0
This is the file header. And here is an annotated hex dump of that header:
Hex Dump
00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000 .ELF............
00000010: 0200 3e00 0100 0000 b810 4000 0000 0000 ..>.......@.....
00000020: 4000 0000 0000 0000 0000 0000 0000 0000 @...............
00000030: 0000 0000 4000 3800 0200 0000 0000 0000 [email protected].........
Explained:
[7f 45 4c 46] ........... Magic constant, used to know it's an ELF file
[02] .................... 2 = 64-bit mode
[01] .................... 1 = Little Endian
[01] .................... 1 = ELF version 1
[00] .................... 0 = System V ABI
[00 00 00 00 00 00 00 00] Always zero, padding
[02 00] ................. Type of file is an executable
[3e 00] ................. Contains AMD64 code (x86-64)
[01 00 00 00] ........... ELF version 1
[b8 10 40 00 00 00 00 00] Code starts at address 0x4010b8
[40 00 00 00 00 00 00 00] Program header start offset
[00 00 00 00 00 00 00 00] Section header start offset
[00 00 00 00] ........... Flags, none
[40 00] ................. The size of this header is 64-bytes
[38 00] ................. The size of a program header is 56-bytes
[02 00] ................. There are two program headers
[00 00] ................. Section headers are 0-bytes in size
[00 00] ................. There are 0 section headers
[00.00] ................. There is no section header table
Program Header
There are two program headers in our ELF file. One for the code, and one for the data. The reason for two is each one has a specific set of flags that control the permissions on the memory. For example we can say that one section has executable permissions, but is read only, and the other section contains writable data. This isn’t required, we could easily throw everything in a section that has all the permissions, but it’s helpful for debugging and security. If we accidentally try overwritting our code with data, we will get a segmentation fault and the program will crash. This is preferrable to execution continuing and returning the wrong result, making it hard to trace the issue.
Below is the dump of the program headers from the readelf -l
tool:
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000401000 0x0000000000401000
0x0000000000000104 0x0000000000000104 R E 0x1000
LOAD 0x0000000000000000 0x0000000000801000 0x0000000000801000
0x0000000000000104 0x0000000000000104 W 0x1000
They’re pretty similar, the main difference being the flags. We can see one has R E
, meaning read-only, executable, and the other W
meaning it maps writable memory. An annotated hex dump of the first section is shown below:
Hex Dump
00000040: 0100 0000 0500 0000 0000 0000 0000 0000 ................
00000050: 0010 4000 0000 0000 0010 4000 0000 0000 ..@.......@.....
00000060: de00 0000 0000 0000 de00 0000 0000 0000 ................
00000070: 0010 0000 0000 0000 ........
Explained:
[01 00 00 00] ........... 1 = Loadable segment
[05 00 00 00] ........... 5 = Read-only + Executable
[00 00 00 00 00 00 00 00] File offset, 0 to load aligned in memory
[00 10 40 00 00 00 00 00] Virtual address to load at
[00 10 40 00 00 00 00 00] Physical address to load at
[04 01 00 00] ........... Size in the file (bytes)
[04 01 00 00] ........... Size in memory (bytes)
[00 00 00 00] ........... Flags
[00 10] ................. Alignment (4096)
Generate Code
To generate code we need to walk the tree and output a list of x86-64 instructions. To make this easier the tree can be normalized so that a post-order traversal over the tree will make instruction generation easy. Normalize in this case means to make it so each operation has only two children, unlike now where the AST allows any number of children. Take the example from above:
Input: (+ 1 (* 2 3 4) 5)
AST:
+___
/ \ \
1 * 5
/|\
2 3 4
Normalized:
+
/ \
+ 5
/ \
1 *
/ \
* 4
/ \
2 3
This allows us to operate on the tree left-to-right and keep nested operations in proper order. We would generate the following instruction stream for this tree:
0: 48 c7 c1 01 00 00 00 mov rcx,0x1
7: 51 push rcx
8: 48 c7 c1 02 00 00 00 mov rcx,0x2
f: 51 push rcx
10: 48 c7 c1 03 00 00 00 mov rcx,0x3
17: 51 push rcx
18: 5b pop rbx
19: 59 pop rcx
1a: 48 0f af cb imul rcx,rbx -- 2*3
1e: 51 push rcx
1f: 48 c7 c1 04 00 00 00 mov rcx,0x4
26: 51 push rcx
27: 5b pop rbx
28: 59 pop rcx
29: 48 0f af cb imul rcx,rbx -- 6*4
2d: 51 push rcx
2e: 5b pop rbx
2f: 59 pop rcx
30: 48 01 d9 add rcx,rbx -- 1 + 24
33: 51 push rcx
34: 48 c7 c1 05 00 00 00 mov rcx,0x5
3b: 51 push rcx
3c: 5b pop rbx
3d: 59 pop rcx
3e: 48 01 d9 add rcx,rbx -- 25 + 5
41: 51 push rcx
42: 48 c7 c0 01 00 00 00 mov rax,0x1
49: 5b pop rbx -- Save result
4a: cd 80 int 0x80 -- Return
So what does normalizing these op codes look like?
let rec normalize exp =
match exp with
| Num _ as n -> n
| Op (_, []) -> failwith "empty op"
| Op (Sub, [_]) as op -> op
| Op (_, [exp]) -> normalize exp
| Op (op, exps) ->
List.reduce_exn ~f:(fun l r ->
let l = normalize l in
let r = normalize r in
Op (op, [l; r])
) exps
There’s a special case in here for subtraction. If we have subtraction with a single child we will keep it, so we can generate special negation code.
Once the tree has been normalized we can loop over it and generate the instructions:
let open Instructions in
let rec compute (instrs : int list list) =
let do_op op = function
| [left; right] ->
let left = compute [] left in
let right = compute [] right in
push RCX
:: op ~src:RBX ~dst:RCX
:: pop RCX
:: pop RBX
:: right @ left
| _ -> failwith "Un-normalized exp"
in
function
(* Two's complement negation *)
| Op (Sub, [exp]) ->
let v = compute [] exp in
[push RAX;
add ~dst:RAX ~src:RBX;
mov_const ~dst:RBX ~const:1;
not RAX;
pop RAX] @ v
| Op (Add, exps) -> do_op add exps
| Op (Sub, exps) -> do_op sub exps
| Op (Mul, exps) -> do_op mul exps
(* Handle division specially to make sure the
right registers are used for operands *)
| Op (Div, [left; right]) ->
let left = compute [] left in
let right = compute [] right in
push RAX
:: div RCX
:: pop RAX
:: pop RCX
:: right @ left
| Op (Div, _) -> failwith "Bad div op"
| Num num ->
push RCX
:: mov_const ~dst:RCX ~const:num
:: instrs
We have the special case here for subtraction where we do the bit-wise negation and add one. As well as division, because it has fixed registers that it operates on, so the code has to be careful about what is put where in order to get the right result.
From here we need to bring our full compiler together and generate an executable from a string input. We do so with the following function:
let compile (input : string) =
input
|> tokenize
|> build_ast
|> gen_code
|> write_elf
With this, we have a working compiler. It can take an expression from user input to generating an executable!
Conclusion
This was the first step to making a more advanced compiler. This compiler generates code for a very simple arithmetic language and outputs an ELF file using x86-64 instructions to compute the result. Future installments will add onto this compiler with more advanced features like arrays, structs, functions, strings, and more. Check out the full code base here to read any of the omitted parts.