Build a toy compiler from scratch

Goal of this experiment

To compile a program to the Python Virtual Machine using a home brewed compiler

How does this little language look like

fn fib(n)
  if n == 0 or n == 1
    return n
  else {
    previous = 1
    current = 1
    for i in range(2, n) {
      tmp = current + previous
      previous = current
      current = tmp
    }
    return current
  }

A look from above

effigy-an-experiment-writing-a-compiler-overview.png

How does Python do it

From https://devguide.python.org/compiler

  1. Parse source code into a parse tree (Parser/pgen)
  2. Transform the parse tree into an AST (Python/ast.c)
  3. Transform AST into a Control Flow Graph (Python/compile.c)
  4. Emit bytecode based on the Control Flow Graph (Python/compile.c)

How are we going to do it

  1. Parse source code into parse tree with a PEG grammar
  2. Semantic actions transform parse tree into AST
  3. Traverse AST with a second PEG grammar and use its semantic actions to emit code

Text parsing

Take a stream of characters and produce a tree structure

Example

Input

print((fn (x) x + 2)(40))

Output

['Module',
 ['Statement',
  ['Call',
   [['Load', 'print'],
    [['Call',
      [['Lambda',
        [['Params', [['Param', 'x']]],
         ['Code',
          ['Statement',
           ['BinOp',
            ['Load', 'x'],
            '+',
            ['Value', ['Number', 2]]]]]]],
       [['Value', ['Number', 40]]]]]]]]]]

Parsing Expression Grammars

Semantics

  1. Language to describe recursive top-down parsers
  2. Borrow productions from Context Free Grammars
  3. Expression operators borrowed from regexes
  4. Infinite look-ahead via semantic predicates
  5. Lexing & Parsing can happen together
  6. Unsuitable for handling ambiguity, but can describe all deterministic context-free languages

Expressions

sequence e1 e2  
ordered choice e1 / e2  
not predicate !e  
and predicate &e (sugar for !!e)
zero or more e*  
one or more e+ (sugar for ee*)
optional e? (sugar for &ee / !e)

A tiny example

GrammarInput

File  <- Line*
Line  <- Value (',' Value)* '\n'
Value <- (![,\n] .)*

id,name
01,Moe
02,Larry
03,Curly

Output

['File',
 [['Line', [['Value', 'id'], [',', ['Value', 'name']], '\n']],
  ['Line', [['Value', '01'], [',', ['Value', 'Moe']], '\n']],
  ['Line', [['Value', '02'], [',', ['Value', 'Larry']], '\n']],
  ['Line', [['Value', '03'], [',', ['Value', 'Curly']], '\n']]]]

Semantic Actions

Traverse the parse tree and apply user-defined functions to each node after parsing has finished successfully

const parser = peg
 .pegc('./csv.peg')
 .bind({
   File:  ({ action, visit }) => [action, visit()],
   Line:  ({ action, visit }) => [action, visit()],
   Value: ({ action, visit }) => [action, visit().join('')],
 });
parser(input);

A little detour #1

What happens if semantic actions are applied before parsing is finished

A little detour #2

A trade-off of having support for semantic actions embedded within the PEG text

Language Grammar

With a PEG implementation at hand, the next step is to get the language grammar ready to take input text and generate an Abstract Syntax Tree (or AST for short). Take a look at the grammar.

Scope rules

Traverse the AST and map out scope rules

What problem are we solving

Where do we store x in the example below when plus_n is finished executing

fn plus_n(x) fn(y) x + y
plus_two = plus_n(2)
plus_five = plus_n(5)
print(plus_two(2)) # Equals 4
print(plus_five(2)) # Equals 7

Types of variables in Python

  • Global: Module scope & built-ins
  • Local: Created and destroyed within a function
  • Free: Created outside the scope it's used. Has to be kept around after scope that it was declared is gone

Symbol table

Data structure for bookkeeping information about variables. The Python compiler has one, so does the Effigy compiler.

Example

fn plus_n(x) fn(y) x + y

Symbol TableAnnotated AST

[{
  node: 'module',
  children: [{
    node: 'function',
    deref: ['x'],
    children: [{
      node: 'lambda',
      fast: ['y'],
      deref: ['x'],
    }]
  }]
}]

['Module',
 [['Statement',
   ['Function',
    [['ScopeId', 2], 'plus_n',
     ['Params', [['Param', 'x']]],
     ['Code',
      ['Statement',
       ['Lambda',
        [['ScopeId', 1],
         ['Params', [['Param', 'y']]],
         ['Code', ['Statement',
           ['BinOp', ['Load', 'x'], '+', ...

Declaration and assignment

PythonEffigy

def parse(source):
    cursor = 0
    def current():
        return source[cursor]
    def nextc():
        nonlocal cursor
        cursor += 1
    # ...

fn parse(source) {
  let cursor = 0
  fn current() source[cursor]
  fn nextc() cursor = cursor + 1
  # ...
}


Grammars for parsing lists

The traversals on the AST for mapping the scope rules and for emitting code are guided by a PEG grammar with different sets of semantic actions. Take a look at the grammar.

Code generation

Output Format

Structure of a pyc file

Code Objects

PyObject_HEAD
int co_argcount;            /* #arguments, except *args */
int co_posonlyargcount;     /* #positional only arguments */
int co_kwonlyargcount;      /* #keyword only arguments */
int co_nlocals;             /* #local variables */
int co_stacksize;           /* #entries needed for evaluation stack */
int co_firstlineno;         /* first source line number */
PyObject *co_code;          /* instruction opcodes */
PyObject *co_consts;        /* list (constants used) */
PyObject *co_names;         /* list of strings (names used) */
PyObject *co_varnames;      /* tuple of strings (local variable names) */
PyObject *co_freevars;      /* tuple of strings (free variable names) */
PyObject *co_cellvars;      /* tuple of strings (cell variable names) */

The tiniest example

SourceCode Object

fn() 1

{ constants: [{
    constants: [1],
    name: '<lambda>',
    instructions: [
      ['load-const', 0],
      ['return-value'],
    ],
  }, '<lambda>', null],
  instructions: [
    ['load-const', 0],
    ['load-const', 1],
    ['make-function', 0],
    ['pop-top'],
    ['load-const', 2],
    ['return-value']] },

How bytecode is generated

The AST is traversed by a PEG grammar and code is emitted during the semantic action execution.

Assembler API

  • enter: Enter a new scope (create Code Object)
  • leave: Leave scope & return Code Object
  • emit: Add new instruction co_code
  • attr: Setter & Getter for object tables
  • pos: Index of the current instruction
  • ref: Create a new label
  • fix: Fix parameter of already emitted instruction

Example

IfStm: ({ visit, node }) => {
  const [test, body, elsestm] = node[1];
  visit(test.value);        // Visit the test expression
  const lb0 = ref();
  emit('pop-jump-if-false', lb0);
  visit(body.value);        // Visit the body of the statement
  if (elsestm) {
    const lb1 = ref();
    emit('jump-forward', lb1);
    const savedPos = pos();
    fixjabs(lb0);
    visit(elsestm.value);   // Visit the body of `else' branche
    fixjrel(lb1, savedPos);
  } else {
    fixjabs(lb0);
  }
  return true;
},

A new pyc

The top-level Code Object returned by the traversal for emitting code and then it is marshaled and written right after to header into a pyc file.

Thank you

Sorry, your browser does not support SVG.