Using RPython and RPly to build a language interpreter, part 1

Earlier this year, I decided I would try to create my own toy language. I'd been reading about language design, compilers, interpreters, and seriously geeking out over it. Sometimes I can really get excited by an idea that just won't leave me alone, and I'll devour everything I can find that's related. When my interest hadn't dissipated for a few months I decided the best way to get it out would be to bite the bullet and actually do something — write my own language. I wanted to create a grammar for a language and an interpreter that could take in some source code in that language and execute it. In the end, I called it Braid — feel free to check out the source on GitHub.

I went into this project with a keen interest, some vague ideas from my research, but no academic background in language design and pretty much NO IDEA WHAT I WAS DOING!

In this first part I'll talk about the tools I've picked and the lexing and parsing steps. In the second follow-up post I'll talk about bytecode, interpreters, and RPython pitfalls.

RPython and RPly

RPython is a statically typed, compiled subset of Python. It powers Pypy, the "Python interpreter written in Python", and includes a GC and a JIT you can use yourself.

RPython's subset is fiddly to use, and the authors recommend you don't use it to write your everyday Python projects. However, as a means to write a compiler and interpreter it looks attractive on paper — being almost as fully-featured as Python means you get a heap of high-level concepts for free, and in the end, it all compiles via C to a single binary executable! Because of my Python background I decided this was a good choice of language in which to implement my own.

To handle lexing and parsing I picked RPly. (RPython has its own EBNF-based lexer and parser, but the docs are sparse and I got stuck very quickly, so I moved on.) RPly is an RPython-compatible fork of PLY, and PLY in turn is an implementation of lexx and yacc in Python (Python Lexx Yacc = PLY). Whew! Lexx and yacc are a venerable, well-respected lexing and parsing combo, so we're cloning on the codebases of giants and so on.

Lexing

Lexing is the process of turning a string (of source code) into a set of tokens which can be parsed by the parser, based on rules we define about what constitutes a valid token, and what we'll ignore. Here's a sample of my language's final set of rules, defined by regular expressions:

from rply import LexerGenerator
lg = LexerGenerator()
 
# build up a set of token names and regexes they match 
lg.add('FLOAT', '-?\d+\.\d+')
lg.add('INTEGER', '-?\d+')
lg.add('STRING', '(""".*?""")|(".*?")|(\'.*?\')')
lg.add('BOOLEAN', "true(?!\w)|false(?!\w)")
lg.add('IF', 'if(?!\w)')
lg.add('ELSE', 'else(?!\w)')
lg.add('END', 'end(?!\w)')
lg.add('AND', "and(?!\w)")
lg.add('OR', "or(?!\w)")
lg.add('NOT', "not(?!\w)")
lg.add('LET', 'let(?!\w)')
lg.add('FUNCTION', 'func(?!\w)')
lg.add('MODULE', 'mod(?!\w)')
lg.add('IMPORT', 'import(?!\w)')
lg.add('IDENTIFIER', "[a-zA-Z_][a-zA-Z0-9_]*")
lg.add('==', '==')
lg.add('!=', '!=')
lg.add('>=', '>=')
lg.add('<=', '<=')
lg.add('>', '>')
lg.add('<', '<')
lg.add('=', '=')
lg.add('[', '\[')
lg.add(']', '\]')
lg.add('{', '\{')
lg.add('}', '\}')
lg.add('|', '\|')
lg.add(',', ',')
lg.add('DOT', '\.')
lg.add('COLON', ':')
lg.add('PLUS', '\+')
lg.add('MINUS', '-')
lg.add('MUL', '\*')
lg.add('DIV', '/')
lg.add('MOD', '%')
lg.add('(', '\(')
lg.add(')', '\)')
lg.add('NEWLINE', '\n')
 
# ignore whitespace 
lg.ignore('[ \t\r\f\v]+')
 
lexer = lg.build()

This step is fairly fun and straightforward. Define pretty basic regular expressions, and receive a list of tokens! It seems like we're halfway there, right?! Not even close. But it's handy that this is where we start, because it's a gentle introduction and we'll need to hold onto that feeling of accomplishment to make it through the next step.

It seems like standard practice to ignore whitespace, because this means you don't need to think about where whitespace is valid when it comes time to write your parser, and the parsing step becomes much more simple. So I ignored whitespace too. However, I've kept newlines as significant, because that's what I'm into.

An issue I ran into at the lexing stage was with identifiers (i.e. variable names) that started with other keywords. Because lexing works by moving through the string a letter at a time until it finds a match, and because whitespace is ignored, any identifier starting with a keyword will be matched as that keyword, with the lexer then resetting and trying to match the next token. To combat this, you can see I've added (?!\w) at the end of all my keywords, so they don't match if other characters appear afterwards. This feels like a terrible solution to me, but it's all I came up with.

You can't see it in that code snippet, but I'm also just stripping out comments manually at this stage, because they caused a frightful mess when I couldn't get the lexer to ignore them properly.

At the end of this stage, we can pass our lexer object some source code and get back a stream of RPly's Token objects. Exciting!

>>> lexer.lex("let a = 5")
<rply.lexer.LexerStream object at 0x7f26ac47c990>
 
>>> list(lexer.lex("let a = 5"))
[Token('LET', 'let'), Token('IDENTIFIER', 'a'), Token('=', '='), Token('INTEGER', '5')]

Parsing

Parsing is the process of turning a list of tokens into an Abstract Syntax Tree (AST). We combine our linear lists of single tokens into nested objects representing sets of tokens as expressions and statements. This is done by defining rules (AKA our previously-mentioned grammar) on how things go together. Or as I tend to think of it, "what goes into what".

First we tell our parser about all the valid tokens we'll accept from the lexing stage. Then we list them in order of precedence so that in an ambiguous situation, the parser knows in which order to handle rules.

pg = ParserGenerator(
    # A list of all token names, accepted by the parser. 
    ['STRING', 'INTEGER', 'FLOAT', 'IDENTIFIER', 'BOOLEAN', 
     'PLUS', 'MINUS', 'MUL', 'DIV', 
     'IF', 'ELSE', 'COLON', 'END', 'AND', 'OR', 'NOT', 'LET','WHILE', 
     '(', ')', '=', '==', '!=', '>=', '<=', '<', '>', '[', ']', ',', 
     '{','}', 
     '$end', 'NEWLINE', 'FUNCTION', 
 
    ],
    # A list of precedence rules with ascending precedence, to 
    # disambiguate ambiguous production rules. 
    precedence=[ 
        ('left', ['FUNCTION',]), 
        ('left', ['LET',]), 
        ('left', ['=']), 
        ('left', ['[',']',',']), 
        ('left', ['IF', 'COLON', 'ELSE', 'END', 'NEWLINE','WHILE',]), 
        ('left', ['AND', 'OR',]), 
        ('left', ['NOT',]), 
        ('left', ['==', '!=', '>=','>', '<', '<=',]), 
        ('left', ['PLUS', 'MINUS',]), 
        ('left', ['MUL', 'DIV',]),        
    ] 
)

That part is pretty boring.

But next we get to define the rules of our grammar. This is both massively rewarding and an exercise in frustration.

@pg.production('statement : expression')
def statement_expr(state, p):
    return p[0]

Here I'm using RPly's nifty decorator syntax to tell the parser that anything we call an 'expression' is also a valid 'statement'. This rule does nothing but return the expression.

@pg.production('statement : LET IDENTIFIER = expression')
def statement_assignment(state, p):
    return Assignment(Variable(p[1].getstr()),p[3])

Aha, this is more interesting. This rule declares that one valid type of statement is this set of "terminals" and "nonterminals". A terminal is in caps (or just a symbol, like the equals sign), so you know that it doesn't need to be expanded further — whatever token the parser put here is our final result. You can see that expression is lower-case, because its result can be anything matching any other rule that defines an expression.

So we return an Assignment object which contains a Variable (whose first argument is the literal identifier we've been passed), and the tree representing the expression. It could be a Call object, i.e. the result of a function call, which would also contain things like a list of arguments, themselves maybe other expressions too. You can see why it's called a Syntax Tree.

@pg.production('expression : IF expression COLON statement END')
def expression_if_single_line(state, p):
    return If(condition=p[1],body=p[3])

Here's a single-line if expression (not a statement, because it can return a value). In my toy language, if we wrote something like if true: 1 end, it would be parsed by this rule. And because I've designated that this is an expression that returns a value, I could hook it up with my earlier assignment rule like so:

let a = if true: 1 end

(There are other rules handling multi-line variants and else-blocks, don't worry.)

The grammar can get complicated as we define overlapping rules that the parser cannot apply cleanly. I received plenty of ParserGeneratorWarning: 1 shift/reduce conflict warnings along the way. These can have varying effects, from throwing exceptions to infinite loops. Keeping the entire web of connected rules in my head was one of the hardest parts of the project so far. The solution here, of course, is to make sure our rules do not conflict, but without more enlightening error messages we'll need to test everything to find the source of the issue.

Remember that our grammar's rules return nodes which are objects, like If and Variable. Each is an instance of a class that extends RPly's BaseBox, as RPython requires a common superclass. Node definitions can be quite simple:

class Variable(BaseBox):
    def __init__(self, name):
        self.name = str(name)
        self.value = None
 
    def getname(self):
        return str(self.name)
 
    def eval(self, env):
        if env.variables.get(self.name, None) is not None:
            self.value = env.variables[self.name].eval(env)
            return self.value
        raise LogicError("Not yet defined")
 
    def to_string(self):
        return str(self.name)

A Variable takes its identifier as an argument, as we saw in the earlier example. It can also hold a value, which isn't determined until we evaluate it. Each of our nodes also has a to_string method which we use when printing the AST for debugging.

Interpreting the AST

At this point we can actually "interpret" our code and get an answer. We do this by calling eval on each of our nodes, traversing the tree as required until we end up back at the root node (which is called Program) with a single result.

Looking at our Variable above, we can see that finding its value requires looking it up in our env.variables dict, which is where we're keeping the global state. When we get it, we evaluate it too for good measure. Our eval calls traverse down the tree as needed, evaluating if-expressions and variables and function calls and combining these until we end up with a literal, for example a Boolean.

class Boolean(BaseBox):
    def __init__(self, value):
        self.value = bool(value)
 
    def eval(self, env):
        return self
 
    def to_string(self):
        return str(self.value).lower()

The Boolean class is essentially a wrapper for a Python bool.

So at this stage you can combine your lexing, parsing, and "interpreting" steps like so:

>>> parser.parse(lexer.lex("if true: 1 end"), Environment())
<ast.Integer object at 0x7fe4fdd85790>
>>> parser.parse(lexer.lex("if true: 1 end"), Environment()).to_string()
'1'

Why the scare quotes around "interpreting"? Well, this isn't how we're meant to do it, and in fact I got stuck trying to travel too far down this road. Why? The short answer is that you can't really handle state in this simple way with one big dict. Real languages have more than one big global bag of state, otherwise anything you defined within a function would leak out into the global state too, for example. However, when making a start this is a great way to make easy wins. You can try out syntax quickly, and many different language constructs, like branches and loops — so long as you can deal with a single global state.

Back to RPython

The point of writing in RPython, not just regular Python, was that I could compile my interpreter to a binary.

This part was hard.

The RPython authors recommend your development process looks like this:

Write valid Python
Modify it until it's valid RPython too.

I stopped and attempted to compile my code with the RPython interpreter every few commits — any more frequently would have been just too frustrating. RPython's errors are fairly verbose, and always told me where the issue was, but the source of frustration was the fact that it couldn't explain why.

[translation:ERROR] UnionError:
[translation:ERROR]
[translation:ERROR] Offending annotations:
[translation:ERROR]   SomeString()
[translation:ERROR]   SomeInstance(can_be_None=False, classdef=rply.token.BaseBox)
[translation:ERROR]
[translation:ERROR]
[translation:ERROR]     v1 = getattr(v0, ('value'))
[translation:ERROR]
[translation:ERROR] In <FunctionGraph of (ast:569)Not.eval at 0x7f9198e8ff10>:
[translation:ERROR] Happened at file ast.py line 570
[translation:ERROR]
[translation:ERROR] ==>         return Boolean(not self.value.eval(env).value)
[translation:ERROR]
[translation:ERROR] Known variable annotations:
[translation:ERROR]  v0 = SomeInstance(can_be_None=True, classdef=rply.token.BaseBox)
[translation:ERROR]
[translation:ERROR] Processing block:
[translation:ERROR]  block@6 is a <class 'rpython.flowspace.flowcontext.SpamBlock'>
[translation:ERROR]  in (ast:569)Not.eval
[translation:ERROR]  containing the following operations:
[translation:ERROR]        v2 = getattr(self_0, ('value'))
[translation:ERROR]        v3 = getattr(v2, ('eval'))
[translation:ERROR]        v0 = simple_call(v3, env_0)
[translation:ERROR]        v1 = getattr(v0, ('value'))
[translation:ERROR]        v4 = bool(v1)
[translation:ERROR]  --end--

So here I'm trying to evaluate a Not AST node. I do this by returning a Boolean type where its value is not the result of evaluating its inner node (self.value). But RPython is having none of it, because it's decided it's possible that the type of self.value could be a String as well as being a BaseBox instance (remember this is our AST node base class). How could this be? Well, figuring out the answer to that is the tricky part.

In most cases, I solved these errors by either:

Changing inner property names to things like boolvalue and stringvalue so that RPython didn't get confused about our instances all having a value property with a different type, and
Liberal use of isinstance.

RPython will use if isinstance(value, Boolean) and assert isinstance(value,Boolean) checks as part of its own type annotations, and this will resolve the majority of typing errors. Still, I wasted many hours in this step.

Compiling

We can compile our target.py with RPython like so:

PYTHONPATH=python rpython --output=braid target.py

If RPython is happy, we'll get a lot of compiler output, and finally, a binary called braid... We did it!

>>> let a = 5
= 5
>>> a + 5
= 10
>>> "hi" * 2
= "hihi"

At this stage of the project, it only operates as a REPL, but of course we can turn this into a traditional interpreter later.

Again, if you're interested you can view and play with the Braid source code on GitHub. In part two, I'll cover bytecode compiling and interpreting, bringing us up to date with the current progress in the Git repo.

Useful resources

Without these references I would have been very stuck. May they also aid you on your journey:

Kermit, example interpreter from Pypy
Cycy, a tiny C interpreter in RPython
The Magic of RPython
RPython docs, particularly RPython specific modules if you do anything funky like regular expressions or unicode support