Josh

I'm a developer in Melbourne, Australia, and co-founder of Hello Code.

Published Sun 19 June 2016

There's never been an easier time to write your own language

As programming matures and shifts to higher-level abstractions, it becomes more and more about assembling pre-made building blocks and less about writing novel solutions of your own. When I create web apps (the bread and butter of my day job) I'm using an open-source OS, programming language, database, database connector library, web server, web framework, authentication libraries, API libraries, payment service libraries... You get the idea. Increasingly, the code I write is "glue" code, simply binding together my building blocks in the correct order, like a more complicated IFTTT, or textual Lego.

It's not surprising that this collection of building blocks is expanding to encompass ever-more-complicated areas of programming. Today, it's possible to build a simple programming language in the same way you'd build a web app.

What's involved in writing a language?

The components of a language usually boil down to a "front-end" and a "back-end".

The front-end traditionally consists of a lexer, for turning raw source code into tokens based on grammar rules, and a parser, for turning tokens into an Abstract Syntax Tree, or AST — nested blocks that represent operations.
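To make the lexer/parser split concrete, here is a minimal hand-rolled sketch using only Python's standard library. The token names, the `Number` and `BinaryOp` node classes, and the tiny plus/minus language are all invented for illustration; a real front-end would handle far more.

```python
import re
from dataclasses import dataclass

# Hypothetical token rules for a tiny arithmetic language (illustration only).
TOKEN_RULES = [
    ("NUMBER", r"\d+"),
    ("PLUS", r"\+"),
    ("MINUS", r"-"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_RULES))


@dataclass
class Number:
    value: int


@dataclass
class BinaryOp:
    op: str
    left: object
    right: object


def lex(source):
    """Lexer: turn raw source text into a list of (type, text) tokens."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def parse(tokens):
    """Parser: turn tokens into a left-associative AST of Number and BinaryOp nodes."""
    def parse_number(index):
        if index >= len(tokens) or tokens[index][0] != "NUMBER":
            raise SyntaxError("expected a number")
        return Number(int(tokens[index][1])), index + 1

    node, index = parse_number(0)
    while index < len(tokens):
        kind, text = tokens[index]
        if kind not in ("PLUS", "MINUS"):
            raise SyntaxError(f"unexpected token {text!r}")
        right, index = parse_number(index + 1)
        node = BinaryOp(text, node, right)
    return node


tree = parse(lex("1 + 2 - 3"))
```

The nesting of the resulting tree is what encodes evaluation order: `1 + 2` becomes the left child of the subtraction.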

The back-end is the "compiler" which performs steps like type-checking, optimisation, and finally converting the AST into a lower-level representation. If you're writing an interpreted language, this lower-level representation is probably bytecode rather than an executable, and you'll also need a corresponding interpreter which reads and performs the bytecode instructions. For example, Python is an interpreted language where .py denotes the raw source code file, and .pyc the compiled bytecode. When a .py module is first imported, Python will cache a bytecode-compiled equivalent it can load next time (until your source changes, anyway). If instead you're writing a compiled language, your back-end builds native executables that can be run directly by the OS.
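You can poke at Python's own bytecode step from inside Python. The sketch below uses the built-in `compile()` and the standard `dis` module; the one-line `source` string is just an example, and the exact instructions printed vary between Python versions.

```python
import dis

source = "result = 1 + 2"

# compile() turns source text into a code object holding bytecode.
code = compile(source, "<example>", "exec")

# The raw bytecode is just bytes; dis renders it as readable instructions.
raw_bytecode = code.co_code
dis.dis(code)

# The interpreter executes the code object against a namespace.
namespace = {}
exec(code, namespace)
```

A .pyc file is essentially a cached, serialised form of a code object like this one.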

So once you have a grammar for your syntax, to plug into your front-end, and you can compile the resulting AST with your back-end, what else is there to do? Well, don't forget about writing a standard library: defining your built-in types, classes (if you have those), and all of the standard functions to operate on them. This step is at least as much work as the front-end and back-end themselves.

Why is it so easy then?

If you're writing a language today, many of the parts we defined above already exist for you. Your back-end and your front-end can both be built out of existing parts.

There are plenty of lexers and parsers available. For example, Lex and Yacc are a venerable lexing and parsing combo that has spawned countless modern interpretations. RPly (an RPython-compatible port of PLY, the Python Lex-Yacc) is my current tool of choice in this area. If you use a lexer and parser, the work left for you is to write your grammar rules, objects representing your AST, and simple functions telling the parser how to turn grammar into AST. Defining a grammar and an AST is fun! If your syntax is simple, this step probably won't be that hard either.

For the back-end, LLVM is the best-known and most respected tool for the job. LLVM powers Clang, a C and C++ compiler, as well as the compilers for other well-known languages like Rust and Swift. The beauty of LLVM is its Intermediate Representation, or IR. IR looks a lot like assembly language, or really ugly C, but if you're up for translating your language to LLVM IR, LLVM will take care of optimising and compiling this to a native executable for you! This is huge. IR can even be run in interpreted mode for debugging, or LLVM can be used as a Just-In-Time (JIT) compiler within your interpreter, if that's your jam.

Fortunately, you needn't write IR by hand. Because LLVM has a C interface, there are many LLVM libraries in many languages to help you create your IR. llvmlite is the most mature Python binding I found, and what I'm currently using. Essentially, using llvmlite means you can ask it to create a new function definition for you, for example, and it'll take care of generating the right IR syntax, and give you back a higher-level interface for managing variables and code blocks within.
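As a small taste of that higher-level interface, this sketch builds a single function with llvmlite's `ir` module and prints the IR it generates. It assumes the `llvmlite` package is installed; the module name `example` and the function `add_one` are made up for illustration.

```python
from llvmlite import ir

# A module is the top-level container for functions and globals.
module = ir.Module(name="example")

# Declare a function taking one 32-bit integer and returning another.
int32 = ir.IntType(32)
func_type = ir.FunctionType(int32, [int32])
func = ir.Function(module, func_type, name="add_one")

# A function body is made of basic blocks; the builder appends instructions.
block = func.append_basic_block(name="entry")
builder = ir.IRBuilder(block)
(arg,) = func.args
result = builder.add(arg, ir.Constant(int32, 1), name="result")
builder.ret(result)

# Converting the module to a string yields the textual LLVM IR.
ir_text = str(module)
print(ir_text)
```

Nothing here compiles or runs the function; the output is just text, which is what makes it easy to dump to a file for other LLVM tools.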

So with this combination of pieces, many of the trickiest parts of writing a language can be wholly outsourced.

Let me give you a concrete example. I'll use my current toy language, written in Python, to demonstrate how it all goes together:

  • I write a list of tokens (essentially a set of strings) in Python, and feed them to RPly's lexer
  • I create a Python class to represent each type of AST node or block of code (a function, an if statement, and so on)
  • I write a set of grammar rules in Python, telling it how to turn tokens into AST instances, and feed them to RPly's parser
  • I write a compiler class which can turn each AST type into LLVM IR using llvmlite

Then, to compile some source code:

  • The lexer turns my source into tokens
  • The parser turns tokens into AST
  • The compiler turns AST into IR, and dumps it to a file
  • I use the llc tool to compile from IR to an object file
  • I use gcc to link that object file into an executable
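The last two steps can be scripted. The sketch below is one possible way to drive them from Python, assuming `llc` and `gcc` are on your PATH; the helper names and the `program.ll` file name are hypothetical, and only the command lists are built unless you call `build()`.

```python
import subprocess
from pathlib import Path


def build_commands(ir_path):
    """Return the llc and gcc command lines for one IR file (paths are illustrative)."""
    ir_file = Path(ir_path)
    obj_file = ir_file.with_suffix(".o")
    exe_file = ir_file.with_suffix("")
    return [
        # llc lowers LLVM IR to a native object file.
        ["llc", "-filetype=obj", str(ir_file), "-o", str(obj_file)],
        # gcc is used here only as a linker driver, producing the executable.
        ["gcc", str(obj_file), "-o", str(exe_file)],
    ]


def build(ir_path):
    """Run each step in order, stopping at the first failure."""
    for command in build_commands(ir_path):
        subprocess.run(command, check=True)


commands = build_commands("program.ll")
```

Depending on your platform, the link step may need extra flags. For instance, where gcc produces position-independent executables by default, llc may need to be asked for position-independent code too.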

Voilà, a working program I can run. It can't do much without a standard library, but if I define the grammar and AST for some "binary operations" like addition and subtraction, I can compile simple arithmetic and return or print the result.

What I'm trying to get at is that once you decide on a grammar for your language, most of the code you write is glue, turning output from one library into input for another. Of course, this glue code can get quite complex, but for the modern language developer, so much is already done for you and ready to plug in. If you have some ideas for a language, it's never been easier to try them out.
