Build A Compiler - Variables

2022-11-14 1413 words 7 minutes

Contents

Note: This is the second part of the series, post 1 can be found here

Introduction

This installment of the build-a-compiler series will cover global variables, and multi-expression programs. Currently the compiler is a glorified calculator, but adding the ability to store calculations for later use will bring us one step closer to a full programming language. The full code for this article is available here, note that it’s under the variables branch.

Storage

Currently the simple calculator language has no way to save computations for later use. This makes it very difficult to write larger computations, or computations that have common sub-expressions. Thankfully this is a solved problem, we can implement variables to save and recall previous computations.

For our implementation of variables, we’ll need to introduce two new concepts to our calculator language, definitions and variable references. A definition will create a new variable name (or assign an existing one) as well as initialize it. Variable references will recall an existing definition. See the program below for an example of two variable definitions, and two varaible references:

        
(define x 10) ; definition 1
(define y 20) ; definition 2
(+ x y)       ; reference x, reference y

Definitions

Definitions will use a new syntax:

(define name expression)

Where name can be any identifier (we’re deferring to Base.Sexp from Jane Street here for what constitutes as a valid identifier), and expression is any expression, except another define, for initializing the variable value. The following are all valid defines:

        
(define x 10)

        
(define x 10)
(define y x)

        
(define x (+ 10 (* 20 30)))

A redefinition of a name will use that new value from that point onward in the program, but leave old references with the old value.

        
(define x 10)
(define y x)  ; y = 10
(define x 20) ; x = 20 - redefined
(+ x y)       ; x + y = 30

Variable References

Variable references will use a defined name in place of an expression to recall that definition’s value. See examples below:

        
(define x 10)
(+ x x)

        
(define x (* 10 20))
(define y (+ x 5))
(+ x y)

Variables should be defined before they are used in an expression.

Multi-line programs

One other addition to the language is that it’s now multi-expression. It will support having one expression per line, in order to allow definitions followed by usages. Only the very last expression will be used as the output of the program. So the following program will have a result of 20, where the computation of 30 before is thrown away:

        
(define x 10)
(define y 20)
(+ x y) ; x + y = 30, thrown away
(+ x x) ; x + x = 20, program result

Implementation

To implement this we will need to do a number of things:

Update the syntax of the language
Update our AST
Reserve storage space for the variable values
Save the computations into storage
Recall the computations

Updating Syntax

To make life simpler, especially now that multiple expressions will be supported, switching over to a more robust parser is useful. Since the real meat of the compiler is turning text into executable bytes, this will let us iterate on the syntax faster. This means we can delete all the tokenizing code, and replace it with Core.Sexp.of_string_many which will take a string and parse it into a list of S-expressions:

        
(* S-expression type *)
type t = Atom of string
       | List of t

The structure of a define will be:

        
List [Atom "define"; Atom name; exp]

Variables need no special syntax, we’ll differentiate them by assuming it’s a variable if it’s an Atom that’s not a number. E.g.

        
Atom "10" (* a number *)
Atom "x"  (* a variable *)

Here’s the excerpt from building the AST:

        
  (* Variable definition *)
  | List [Atom "define"; Atom var; exp] :: rest ->
     let value, _ = parse_exp [] [exp] in
     let exp = Def (var, List.hd_exn value) in
     parse_top_level (exp :: acc) rest

  | List (Atom "define" :: _) :: _ ->
     failwith "Error with define"

Abstract Syntax Tree

The AST will need some updates as well:

        
        
        
    
type op = Add
        | Sub
        | Mul
        | Div

type exp = Op of op * exp list
         | Num of int
         | Def of string * exp (* New - define *)
         | Var of string       (* New - variable reference *)

type ast = exp list

Two new expression types have been added, Def which represents a definition, and Var which represents a variable reference.

Reserve Storage

There also needs to be some location in memory for each variable. To keep things simple each variable will be 8 bytes in size, representing a 64-bit integer. We’ll need to read the program to find all the unique variable definitions and reserve a slot for each one. We say unique, because new definitions with the same name can overwrite the old location without issue.

        
        
        
    
let sizeof (_ : exp) = 8

type location_info = {
    locations : (string, int) Base.Hashtbl.t;
    total_size : int;
}

let allocate_variables (ast : ast) : location_info =
    (* Read the number of top-level defines, and
       allocate a spot for each one of them *)
    let offset = ref Elf.base_data_address in
    let total_size = ref 0 in
    let locations = Hashtbl.Poly.create () in
    List.iter ~f:(function
        | Def (var, exp) ->
            Hashtbl.update locations var ~f:(function
                | Some o -> o
                | None ->
                    let sz = sizeof exp in
                    let off = !offset in
                    offset := !offset + sz;
                    total_size := !total_size + sz;
                    off
            )
        | _ -> ()
    ) ast;
    { locations; total_size = !total_size }
;;

The Elf.base_data_address represents the starting location for our data. Recall that the ELF file defines two sections, one marked as R+X at 0x401000 and one marked as W starting at 0x801000, but they actually reference the same underlying data in the ELF file. This means we can prepend the instruction area with space for the data and adjust our entry point to be after the reserved data location. See the illustration below:

ELF
+-------------------------+
| Data                    |
| 0x0000000000000000      |
+-------------------------+
| Code                    |
| mov RCX, 0xA            |
| push RCX                |
| ...                     |
+-------------------------+

Memory
+-------------------------+
| 0x401000            R+X |
+-------------------------+
|    ~~Header Info~~      |
| 0x0000000000000000      |
| 0x0000000000000000      |
| mov RCX, 0xA            | <- Entry 0x4010b8
| push RCX                |
+~~~~~~~~~~~~~~~~~~~~~~~~~+
|                         |
+~~~~~~~~~~~~~~~~~~~~~~~~~+
| 0x801000              W |
+-------------------------+
|    ~~ Header Info ~~    |
| 0x0000000000000000      | <- Data start 0x8010b0
| 0x0000000000000000      | <- Second variable location
| mov RCX, 0xA            |
| push RCX                |
+-------------------------+

The mapping looks like this because the two sections in our ELF file are specified such that the entire ELF file is mapped to the addresses 0x401000 and 0x801000 with different permissions. A normal executable would have the sections padded and mapped to different areas without this overlapping code and data. We did this for simplicity in our case, to keep the ELF file small and the mappings easy.

Save Computations

With each variable mapped to a location in memory, we need to implement the code-gen that will save a definition. This is pretty straight forward because we have a stack machine, and have already have the SAVE instruction available. Excerpt below:

        
        
        
    
| Def (name, exp) ->
    let v = compute [] exp in
    let loc = Hashtbl.Poly.find_exn locations name in
    [
        save ~src:RCX ~addr:loc;
        pop RCX;
    ]
    @ v

First the initialization expression is ran, and because it’s a stack machine, the result will be saved to the top of the stack. We then pop the top of the stack and use the SAVE instruction to store the value into the location defined by the allocate_variables map.

Recall Computations

Loading saved variables is even easier, we move the value from memory into a register and then push the register.

        
| Var name ->
    let loc = Hashtbl.Poly.find_exn locations name in
    [push RCX;
    load ~dst:RCX ~addr:loc]

Conclusion

These are the high-level changes that were needed to add variable support to the language. It’s a simple concept, needing only two new primitives, DEF, and VAR. These changes add new functionality that was impossible before, and show how the existing AST and code-gen can be modified with new features. Later installments will be more involved, requiring larger changes, and new intermediate representations. Stay tuned for the next installment.

Code available here.