Last time I said I’d be writing about kalerl, my Kaleidoscope implementation targeting BEAM. However, I’ve decided I’d rather build a new language targeting BEAM, and document it as I go along. It’s going to use an Ocaml-like syntax, because I figure Erlang’s syntax is too mainstream. The purpose of the project is eventually play with a ML type system. I’ve named the language Egret, because most BEAM targeted languages start with the letter “e,” and because it sounds a lot like “regret,” which is likely how this project will end.

For this initial post, my goal is to get a skeleton compiler up and running. I started with the kalerl compiler and stripped it down to the bare minimum. By the end of this post, it’ll be able to build the following:


# This is a comment
123_456.789 # <- that’s a number! Be still my heart!

…into a BEAM module. That’s right: the compiler can handle one line comments, and one — and only one — floating point number. Future posts will add small pieces of functionality one at a time.

The structure of this post will follow the basic components of a compiler, highlighting the mildly interesting bits. And the boring bits too, because I’ve got that sort of time, turns out. In case you forgot, it works like this: the lexer takes a string and generates tokens, which the parser uses to generate an AST, which the codegen uses to generate code, which is finally assembled into a binary.

Lexer

For the lexer, I’m using Erlang’s Leex, which is a lexer generator much like flex or lex. First, we start with “definitions,” by which I mean “dumb macros.”


Definitions.

DIGIT = [0-9]
DIGIT_ = [0-9_]
WHITESPACE = [\r\n\t\s]

These are standard regular expressions that we’ll use in the rules. A couple of somewhat non-obvious things: the names of the macros must be in all uppercase, and there must be spaces around the equal signs.

The most interesting lexical rule is for real numbers:


(-)?{DIGIT}{DIGIT_}*(\.{DIGIT_}*)?((E|e)(\+|-)?{DIGIT}{DIGIT_}*)? : {token, validate_real(TokenLine, TokenChars)}.

…which is actually not all that interesting; if you’ve seen one regex, you’ve seen them all and probably have the therapy bill to prove it. This one is copied more or less directly from the Ocaml docs. The one mildly interesting thing is Ocaml allows underscores (_) to appear in numbers for readability purposes. (e.g. in place of commas like 1_000_000 )

In the associated code, we use two predefined values: TokenLine and TokenChars. As you might expect, TokenLine contains the line number the token appears on, and TokenChars is a list of character codes that the regex matched.

Leex requires the token to be returned in a standard format: a tuple of {token, TokenData}. Here TokenData is itself a tuple of either {TokenCode, LineNumber, TokenValue} or {TokenCode, LineNumber}. TokenCode is an atom used to uniquely identify the type of token, the LineNumber is the line number, and TokenValue is the parsed value the token represents. In the second form of TokenData, the TokenValue is assumed to the same as TokenCode. This can be helpful for punctuation characters.

The validate_real function does the necessary conversion for real values into token tuples:


validate_real(Line, Characters) ->
  StrippedCharacters = lists:filter(fun is_used_char/1, Characters),
  try list_to_float(StrippedCharacters) of
    Float -> {real, Line, Float}
  catch
    error:badarg ->
      {real, Line, float(list_to_integer(StrippedCharacters))}
  end.

This is easy since we use Erlang’s built-in string to float conversion functions. They can’t handle random underscores, so we strip those out using lists:filter. list_to_float is a built-in Erlang function, which has the annoying habit of throwing an exception if it sees anything resembling an integer. So in that case, we catch the exception, convert the string to an integer via list_to_integer, then cast it to a float.

Comments are where we get lazy. The lexical rules:


{WHITESPACE}+                 : skip_token.
#.*                           : skip_token. %% comments

It’s pretty trivial to support one line comments in Leex. It does not appear to be possible to support multiline comments in Leex, because there’s no such thing as states like you might find in flex. You’ll note this is not very Ocaml-like, which only has multiline comments and actually uses the # character as a method dispatch operator. My reasoning behind this is as follows: I’m not interested in exploring how to scan multiline comments because I’ve done many times before. I may revisit this later, which would mean hand rolling my own scanner. Finally, the skip_token value means what you would think it means. Duh.

Parser

Like the lexer, the parser is simple. There are some placeholder productions, because I couldn’t bring myself to strip everything out from kalerl. However, the net effect of all the rules is parsing one real number literal.

I’m using Yecc, Erlang’s yacc-like parser generator. To begin with, we need to declare some boiler plate information:


Nonterminals
expression primary_expr number_expr toplevel.

Terminals real.

Rootsymbol toplevel.

We have to tell Yecc which identifiers are nonterminals (i.e. parser definitions) and which are terminals (i.e. tokens from the lexer). I was hoping it could infer that since it has the parsing rules defined next, but alas, it cannot. Keeping these lists up-to-date is tedious. Finally, we have to let yecc know what the root nonterminal is; i.e. what everything will reduce down to.

Here are the parser rules:


toplevel -> expression                    : {toplevel, [], ['$1']}.

expression -> primary_expr                : '$1'.

primary_expr -> number_expr               : '$1'.

number_expr -> real                       : {real, line('$1'), unwrap('$1')}.

Summarized: the toplevel can have one expression, the only kind of expression is a number expression, and the only kind of number expression is a real number literal. The right hand side (after the colon), is what’s returned if the left hand side (after the arrow) is matched. It generates the AST, which I’ll cover in the next section. The '$1' is a special placeholder used in the associated code that will be replaced by the value returned in the corresponding terminal/nonterminal. It’s a 1-based index, for whatever reason. So:


primary_expr -> number_expr               : '$1'.

is essentially a pass-through. primary_expr returns whatever number_expr returned.

There are a couple of helper functions used to extract data from the token tuples:


unwrap({_,_,V}) -> V.

line({_,Line}) -> Line;
line({_,Line,_}) -> Line.

These are trivial. unwrap extracts the token’s value, and line returns the line the token appeared on. Note that line handles both forms of token data, while unwrap only covers the three element form. That’s because I’m lazy.

Finally, there’s a bit of code defined in the parser module to deal with converting what the parser returns into a proper AST. The structure of Egret’s AST isn’t that convenient to being built up incrementally. So we use an intermediate data structure in the parser (which we call toplevel), and provide a function to convert that into the top level AST (a module, to be precise). Here’s the code:


-spec toplevel_to_module(toplevel(), string()) -> {ok, egret_ast:egret_module()}.
toplevel_to_module({toplevel, Funcs, MainExprs}, ModuleName) ->
  Main = {function, 1, {prototype, 1, "main", []}, main_exprs(MainExprs), none},
  {ok, {module, 1, ModuleName, Funcs ++ [Main]}}.

-spec main_exprs([egret_ast:expr()]) -> [egret_ast:expr()].
main_exprs([]) ->
  [{real, 1, 1.0}];
main_exprs(MainExprs) ->
  MainExprs.

It does one other thing, too: it defines a main/0 function. The Egret AST (and Erlang compiler) doesn’t actually allow expressions to be hanging out there all by themselves; they have to be inside a function. Therefore, it’ll take all the top level expressions (i.e. those that aren’t in a defined function) and dump them into main/0. This means to invoke our Egret expressions after the source file has been compiled, we’ll have to call the main/0 function in the resulting BEAM file.

AST

The AST, or abstract syntax tree, is an in memory representation of the language. At this stage, Egret’s is almost trivial. The only thing that’s a little weird is we carry around line numbers. That’s because the codegen will need them. Let’s walk through the AST, starting at the top:


-type egret_module() :: {module, lineno(), string(), [func()]}.

At the very top is the egret_module, so called because “module” is a reserved word and the Erlang compiler is very fussy about that sort of thing. The egret_module has a name represented as a string, and then a list of functions. In turn, functions are defined as:


-type variable() :: {variable, lineno(), string()}.

%% Function prototype. Name and arguments
-type proto() :: {prototype, lineno(), string(), [variable()]}.

%% Function definition
-type func() :: {function, lineno(), proto(), [expr()], atom() | none}.

A function is a prototype and a list of expressions that make up the body of the function. A prototype is a name as a string and a list of parameters as variables. Finally, variables are just strings with line numbers. Note there is no type information in the AST for this version. You’ll also note that right now functions cannot be created be created in Egret source code, but they’re still represented in the AST.

Finally, we come to expressions:


%% Base type for all expression nodes.
-type expr() ::
  %% variant for numeric literals like "1.0".
    {real, lineno(), float()}.

Expressions can only be real number literals. For now.

Codegen

The codegen is a bit of a misnomer in that we’re actually translating the Egret AST into the Erlang AST a.k.a. parse tree a.k.a. abstract form. The top level code generation:


module({module, Line, ModuleName, Functions}) ->
  ModuleForm = {attribute, Line, module, list_to_atom(ModuleName)},
  {ok, [ModuleForm | functions(Functions, [])]}.

The parameter is the egret_module AST, and returns an Erlang abstract form. The first line in the function defines the Erlang module, using the same name as the Egret module. It then walks all the functions in the Egret module and appends the results onto the module abstract form. The function walker is an accumulator:


functions([], Accumulator) ->
  Accumulator;
functions([Function | Rest], Accumulator) ->
  functions(Rest, [function(Function) | Accumulator]).

It walks each func AST, calls function on it, and accumulates the results. Converting a function is a bit more involved:


function(F = {function, Line, Prototype, _Exprs, _Module}) ->
  Args = egret_ast:prototype_args(Prototype),
  {function, Line, prototype_name_mangle(Prototype), length(Args), [function_clause(F)]}.

function_clause({function, Line, Prototype, Exprs, _Module}) ->
  Args = egret_ast:prototype_args(Prototype),
  {clause, Line, pattern_sequence(Args), [], body(Exprs)}.

Part of the complication is because in Erlang a function can have multiple clauses for pattern matching (e.g. the functions function just shown). That’s not currently expressible in Egret, but it still must be generated. So for each Egret function, we generate one Erlang function AST and one function clause AST. The function AST needs the name as an atom, the arity, and a list of the function clauses. Each function clause, in turn, needs a list of the patterns to match (in our case the formal arguments) and a list of expressions that form the body of the function.

(The empty list you see in the function clause AST is for the clause’s conditionals — the expressions after the when reserved word in Erlang. Egret doesn’t use them yet, so it’s left empty.)

Despite the name, prototype_name_mangle doesn’t currently do any mangling:


prototype_name_mangle({prototype, _Line, Name, _FormalArgs}) ->
  list_to_atom(Name).

This is a bit of future-proofing in that once we do implement user defined operators, we’ll need some mangling.

Since Egret doesn’t really support any pattern matching, generating the Erlang ASTs for the function parameters is simple:


pattern_sequence(Args) ->
  lists:map(fun pattern/1, Args).

pattern({variable, Line, Arg}) ->
  {var, Line, list_to_atom(Arg)}.

Each parameter is mapped, almost directly, onto an Erlang var AST. Finally, we come to the codegen for expressions:


body(Exprs) ->
  lists:map(fun expr/1, Exprs).

expr({real, Line, Number}) ->
  {float, Line, Number}.

body is just a list of expressions. Since there’s only one type of expr AST in Egret, that’s the only one we need to handle. And again, it maps almost directly onto Erlang’s AST. Egret calls them “real” where Erlang calls them “float.”

Binarygen

Binarygen is a term I made up because I was tired and a better option didn’t occur to me in the five minutes I dedicated to the task. Since our codegen just generates a different AST, as opposed to assembly or an intermediate language, I didn’t want to call it an assembler. But that’s more or less what it is. We give it an Erlang AST from codegen and it gives back an executable BEAM module. Here’s the code that does that:


forms(AbsForms) ->
  process_binary(compile:forms(AbsForms, [verbose, return, export_all])).

This is a basic wrapper around compile:forms, which is an Erlang standard library function. The only thing we do here is in the process_binary function, and all that does is normalize the errors into a standard format. And that’s even more boring that it sounds.

Compiler

The compiler is a module that wraps up the lexer, parser, codegen, and binarygen into a simpler API.


file(Filename) ->
  do([error_m ||
    Contents <- read_error(file:read_file(filename:absname(Filename)), Filename),
    Tokens <- lexer_error(egret_lexer:string(binary_to_list(Contents)), Filename),
    Toplevel <- parser_error(egret_parser:parse(Tokens), Filename),
    IRModule <- egret_parser:toplevel_to_module(Toplevel, filename:rootname(filename:basename(Filename))),
    AbsForms <- egret_codegen:module(IRModule),
    egret_binarygen:forms(AbsForms)
  ]).

This is straight forward, mostly. We’re using Erlando to make using the error monad a little cleaner. We take a file name, read in the contents, tokenize the contents, parse the tokens, convert the parse tree into the Egret AST, codegen the Egret AST into the Erlang abstract form, and finally convert the abstract forms into a binary. The read_error, lexer_error, and parser_error functions normalize the errors into a standard format that can be used by the driver.

Trying it out

To build the egret compiler just run make in the project directory.


    $ make

This will place the egret compiler in the bin subdirectory. The egret compiler can only take one parameter, which is the file to compile. For example:


    $ ./bin/egret test/real.egret

This will generate a BEAM file in the same directory as the source file. In this case it will be test/real.beam. From here, the BEAM file can be used just like any other; all functions are exported by default.


    $ cd test
    $ erl
    Eshell V7.2  (abort with ^G)
    $ 1> l(real).
    {module,real}

The above moves to the “test” directory, starts the Erlang shell, and loads the real BEAM file just built. To call the main/0 function:

$ 2> real:main().
112398.7

As you can see, our real number literal is returned.

Conclusion

In this post, we’ve put together a skeleton compiler, with all the necessary parts: a lexer, parser, codegen, and assembler. We explored a bit of the lexer and parser generators Leex and Yecc. We also looked into Erlang’s abstract forms and how to generate them from Egret’s AST. Finally, with all these parts put together, we have a foundation to incrementally add more features to the Egret language. The final code for this post can be found in the Immediate Egret git branch.