Toys » Safe from the Losing Fight

Archive for the 'Toys' Category

Immediate Egret

Last time I said I’d be writing about kalerl, my Kaleidoscope implementation targeting BEAM. However, I’ve decided I’d rather build a new language targeting BEAM, and document it as I go along. It’s going to use an Ocaml-like syntax, because I figure Erlang’s syntax is too mainstream. The purpose of the project is eventually play with a ML type system. I’ve named the language Egret, because most BEAM targeted languages start with the letter “e,” and because it sounds a lot like “regret,” which is likely how this project will end.

For this initial post, my goal is to get a skeleton compiler up and running. I started with the kalerl compiler and stripped it down to the bare minimum. By the end of this post, it’ll be able to build the following:


# This is a comment
123_456.789 # <- that’s a number! Be still my heart!

…into a BEAM module. That’s right: the compiler can handle one line comments, and one — and only one — floating point number. Future posts will add small pieces of functionality one at a time.

The structure of this post will follow the basic components of a compiler, highlighting the mildly interesting bits. And the boring bits too, because I’ve got that sort of time, turns out. In case you forgot, it works like this: the lexer takes a string and generates tokens, which the parser uses to generate an AST, which the codegen uses to generate code, which is finally assembled into a binary.

Lexer

For the lexer, I’m using Erlang’s Leex, which is a lexer generator much like flex or lex. First, we start with “definitions,” by which I mean “dumb macros.”


Definitions.

DIGIT = [0-9]
DIGIT_ = [0-9_]
WHITESPACE = [\r\n\t\s]

These are standard regular expressions that we’ll use in the rules. A couple of somewhat non-obvious things: the names of the macros must be in all uppercase, and there must be spaces around the equal signs.

The most interesting lexical rule is for real numbers:


(-)?{DIGIT}{DIGIT_}*(\.{DIGIT_}*)?((E|e)(\+|-)?{DIGIT}{DIGIT_}*)? : {token, validate_real(TokenLine, TokenChars)}.

…which is actually not all that interesting; if you’ve seen one regex, you’ve seen them all and probably have the therapy bill to prove it. This one is copied more or less directly from the Ocaml docs. The one mildly interesting thing is Ocaml allows underscores (_) to appear in numbers for readability purposes. (e.g. in place of commas like 1_000_000 )

In the associated code, we use two predefined values: TokenLine and TokenChars. As you might expect, TokenLine contains the line number the token appears on, and TokenChars is a list of character codes that the regex matched.

Leex requires the token to be returned in a standard format: a tuple of {token, TokenData}. Here TokenData is itself a tuple of either {TokenCode, LineNumber, TokenValue} or {TokenCode, LineNumber}. TokenCode is an atom used to uniquely identify the type of token, the LineNumber is the line number, and TokenValue is the parsed value the token represents. In the second form of TokenData, the TokenValue is assumed to the same as TokenCode. This can be helpful for punctuation characters.

The validate_real function does the necessary conversion for real values into token tuples:


validate_real(Line, Characters) ->
  StrippedCharacters = lists:filter(fun is_used_char/1, Characters),
  try list_to_float(StrippedCharacters) of
    Float -> {real, Line, Float}
  catch
    error:badarg ->
      {real, Line, float(list_to_integer(StrippedCharacters))}
  end.

This is easy since we use Erlang’s built-in string to float conversion functions. They can’t handle random underscores, so we strip those out using lists:filter. list_to_float is a built-in Erlang function, which has the annoying habit of throwing an exception if it sees anything resembling an integer. So in that case, we catch the exception, convert the string to an integer via list_to_integer, then cast it to a float.

Comments are where we get lazy. The lexical rules:


{WHITESPACE}+                 : skip_token.
#.*                           : skip_token. %% comments

It’s pretty trivial to support one line comments in Leex. It does not appear to be possible to support multiline comments in Leex, because there’s no such thing as states like you might find in flex. You’ll note this is not very Ocaml-like, which only has multiline comments and actually uses the # character as a method dispatch operator. My reasoning behind this is as follows: I’m not interested in exploring how to scan multiline comments because I’ve done many times before. I may revisit this later, which would mean hand rolling my own scanner. Finally, the skip_token value means what you would think it means. Duh.

Parser

Like the lexer, the parser is simple. There are some placeholder productions, because I couldn’t bring myself to strip everything out from kalerl. However, the net effect of all the rules is parsing one real number literal.

I’m using Yecc, Erlang’s yacc-like parser generator. To begin with, we need to declare some boiler plate information:


Nonterminals
expression primary_expr number_expr toplevel.

Terminals real.

Rootsymbol toplevel.

We have to tell Yecc which identifiers are nonterminals (i.e. parser definitions) and which are terminals (i.e. tokens from the lexer). I was hoping it could infer that since it has the parsing rules defined next, but alas, it cannot. Keeping these lists up-to-date is tedious. Finally, we have to let yecc know what the root nonterminal is; i.e. what everything will reduce down to.

Here are the parser rules:


toplevel -> expression                    : {toplevel, [], ['$1']}.

expression -> primary_expr                : '$1'.

primary_expr -> number_expr               : '$1'.

number_expr -> real                       : {real, line('$1'), unwrap('$1')}.

Summarized: the toplevel can have one expression, the only kind of expression is a number expression, and the only kind of number expression is a real number literal. The right hand side (after the colon), is what’s returned if the left hand side (after the arrow) is matched. It generates the AST, which I’ll cover in the next section. The '$1' is a special placeholder used in the associated code that will be replaced by the value returned in the corresponding terminal/nonterminal. It’s a 1-based index, for whatever reason. So:


primary_expr -> number_expr               : '$1'.

is essentially a pass-through. primary_expr returns whatever number_expr returned.

There are a couple of helper functions used to extract data from the token tuples:


unwrap({_,_,V}) -> V.

line({_,Line}) -> Line;
line({_,Line,_}) -> Line.

These are trivial. unwrap extracts the token’s value, and line returns the line the token appeared on. Note that line handles both forms of token data, while unwrap only covers the three element form. That’s because I’m lazy.

Finally, there’s a bit of code defined in the parser module to deal with converting what the parser returns into a proper AST. The structure of Egret’s AST isn’t that convenient to being built up incrementally. So we use an intermediate data structure in the parser (which we call toplevel), and provide a function to convert that into the top level AST (a module, to be precise). Here’s the code:


-spec toplevel_to_module(toplevel(), string()) -> {ok, egret_ast:egret_module()}.
toplevel_to_module({toplevel, Funcs, MainExprs}, ModuleName) ->
  Main = {function, 1, {prototype, 1, "main", []}, main_exprs(MainExprs), none},
  {ok, {module, 1, ModuleName, Funcs ++ [Main]}}.

-spec main_exprs([egret_ast:expr()]) -> [egret_ast:expr()].
main_exprs([]) ->
  [{real, 1, 1.0}];
main_exprs(MainExprs) ->
  MainExprs.

It does one other thing, too: it defines a main/0 function. The Egret AST (and Erlang compiler) doesn’t actually allow expressions to be hanging out there all by themselves; they have to be inside a function. Therefore, it’ll take all the top level expressions (i.e. those that aren’t in a defined function) and dump them into main/0. This means to invoke our Egret expressions after the source file has been compiled, we’ll have to call the main/0 function in the resulting BEAM file.

AST

The AST, or abstract syntax tree, is an in memory representation of the language. At this stage, Egret’s is almost trivial. The only thing that’s a little weird is we carry around line numbers. That’s because the codegen will need them. Let’s walk through the AST, starting at the top:


-type egret_module() :: {module, lineno(), string(), [func()]}.

At the very top is the egret_module, so called because “module” is a reserved word and the Erlang compiler is very fussy about that sort of thing. The egret_module has a name represented as a string, and then a list of functions. In turn, functions are defined as:


-type variable() :: {variable, lineno(), string()}.

%% Function prototype. Name and arguments
-type proto() :: {prototype, lineno(), string(), [variable()]}.

%% Function definition
-type func() :: {function, lineno(), proto(), [expr()], atom() | none}.

A function is a prototype and a list of expressions that make up the body of the function. A prototype is a name as a string and a list of parameters as variables. Finally, variables are just strings with line numbers. Note there is no type information in the AST for this version. You’ll also note that right now functions cannot be created be created in Egret source code, but they’re still represented in the AST.

Finally, we come to expressions:


%% Base type for all expression nodes.
-type expr() ::
  %% variant for numeric literals like "1.0".
    {real, lineno(), float()}.

Expressions can only be real number literals. For now.

Codegen

The codegen is a bit of a misnomer in that we’re actually translating the Egret AST into the Erlang AST a.k.a. parse tree a.k.a. abstract form. The top level code generation:


module({module, Line, ModuleName, Functions}) ->
  ModuleForm = {attribute, Line, module, list_to_atom(ModuleName)},
  {ok, [ModuleForm | functions(Functions, [])]}.

The parameter is the egret_module AST, and returns an Erlang abstract form. The first line in the function defines the Erlang module, using the same name as the Egret module. It then walks all the functions in the Egret module and appends the results onto the module abstract form. The function walker is an accumulator:


functions([], Accumulator) ->
  Accumulator;
functions([Function | Rest], Accumulator) ->
  functions(Rest, [function(Function) | Accumulator]).

It walks each func AST, calls function on it, and accumulates the results. Converting a function is a bit more involved:


function(F = {function, Line, Prototype, _Exprs, _Module}) ->
  Args = egret_ast:prototype_args(Prototype),
  {function, Line, prototype_name_mangle(Prototype), length(Args), [function_clause(F)]}.

function_clause({function, Line, Prototype, Exprs, _Module}) ->
  Args = egret_ast:prototype_args(Prototype),
  {clause, Line, pattern_sequence(Args), [], body(Exprs)}.

Part of the complication is because in Erlang a function can have multiple clauses for pattern matching (e.g. the functions function just shown). That’s not currently expressible in Egret, but it still must be generated. So for each Egret function, we generate one Erlang function AST and one function clause AST. The function AST needs the name as an atom, the arity, and a list of the function clauses. Each function clause, in turn, needs a list of the patterns to match (in our case the formal arguments) and a list of expressions that form the body of the function.

(The empty list you see in the function clause AST is for the clause’s conditionals — the expressions after the when reserved word in Erlang. Egret doesn’t use them yet, so it’s left empty.)

Despite the name, prototype_name_mangle doesn’t currently do any mangling:


prototype_name_mangle({prototype, _Line, Name, _FormalArgs}) ->
  list_to_atom(Name).

This is a bit of future-proofing in that once we do implement user defined operators, we’ll need some mangling.

Since Egret doesn’t really support any pattern matching, generating the Erlang ASTs for the function parameters is simple:


pattern_sequence(Args) ->
  lists:map(fun pattern/1, Args).

pattern({variable, Line, Arg}) ->
  {var, Line, list_to_atom(Arg)}.

Each parameter is mapped, almost directly, onto an Erlang var AST. Finally, we come to the codegen for expressions:


body(Exprs) ->
  lists:map(fun expr/1, Exprs).

expr({real, Line, Number}) ->
  {float, Line, Number}.

body is just a list of expressions. Since there’s only one type of expr AST in Egret, that’s the only one we need to handle. And again, it maps almost directly onto Erlang’s AST. Egret calls them “real” where Erlang calls them “float.”

Binarygen

Binarygen is a term I made up because I was tired and a better option didn’t occur to me in the five minutes I dedicated to the task. Since our codegen just generates a different AST, as opposed to assembly or an intermediate language, I didn’t want to call it an assembler. But that’s more or less what it is. We give it an Erlang AST from codegen and it gives back an executable BEAM module. Here’s the code that does that:


forms(AbsForms) ->
  process_binary(compile:forms(AbsForms, [verbose, return, export_all])).

This is a basic wrapper around compile:forms, which is an Erlang standard library function. The only thing we do here is in the process_binary function, and all that does is normalize the errors into a standard format. And that’s even more boring that it sounds.

Compiler

The compiler is a module that wraps up the lexer, parser, codegen, and binarygen into a simpler API.


file(Filename) ->
  do([error_m ||
    Contents <- read_error(file:read_file(filename:absname(Filename)), Filename),
    Tokens <- lexer_error(egret_lexer:string(binary_to_list(Contents)), Filename),
    Toplevel <- parser_error(egret_parser:parse(Tokens), Filename),
    IRModule <- egret_parser:toplevel_to_module(Toplevel, filename:rootname(filename:basename(Filename))),
    AbsForms <- egret_codegen:module(IRModule),
    egret_binarygen:forms(AbsForms)
  ]).

This is straight forward, mostly. We’re using Erlando to make using the error monad a little cleaner. We take a file name, read in the contents, tokenize the contents, parse the tokens, convert the parse tree into the Egret AST, codegen the Egret AST into the Erlang abstract form, and finally convert the abstract forms into a binary. The read_error, lexer_error, and parser_error functions normalize the errors into a standard format that can be used by the driver.

Trying it out

To build the egret compiler just run make in the project directory.


    $ make

This will place the egret compiler in the bin subdirectory. The egret compiler can only take one parameter, which is the file to compile. For example:


    $ ./bin/egret test/real.egret

This will generate a BEAM file in the same directory as the source file. In this case it will be test/real.beam. From here, the BEAM file can be used just like any other; all functions are exported by default.


    $ cd test
    $ erl
    Eshell V7.2  (abort with ^G)
    $ 1> l(real).
    {module,real}

The above moves to the “test” directory, starts the Erlang shell, and loads the real BEAM file just built. To call the main/0 function:

$ 2> real:main().
112398.7

As you can see, our real number literal is returned.

Conclusion

In this post, we’ve put together a skeleton compiler, with all the necessary parts: a lexer, parser, codegen, and assembler. We explored a bit of the lexer and parser generators Leex and Yecc. We also looked into Erlang’s abstract forms and how to generate them from Egret’s AST. Finally, with all these parts put together, we have a foundation to incrementally add more features to the Egret language. The final code for this post can be found in the Immediate Egret git branch.

Christmas and Compilers

Like most developers I suspect, I enjoy having a side project to work on during vacations. Almost always it’s something completely different from the day job; something I’m interested in, or never done before, or allows me to learn and explore something new. That’s the major attraction for me to software development: there’s always something new to learn. It’s also why my attention is span is like that of a labrador retriever’s in a field full of squirrels. My subconscious can never commit to which squirrel to chase for the next five minutes, let alone day.

This Christmas vacation is no different. Although I’m spending it with my immediate family, we’re all introverts (save Dad), so there’s plenty of down time for side projects. During these times, I usually work on a project from a set of domains that I, for whatever reason, find most interesting. And one of those domains happens to be compilers. I’m not sure why I settled on that this time, as opposed to, say, an RSS sync engine. Maybe my subconscious likes alliteration.

I’m not a compiler expert by any stretch of the imagination, but I’ve written a few of small ones before. They’re like a data structures winter wonderland. (Or maybe like a fat, slow moving squirrel, my subconscious helpfully points out.) I’ve read through the LLVM Kaleidoscope tutorial many times before and even used LLVM itself as a backend for a (half finished) Pascal compiler (codenamed “Pasco” for “Pascal & Company” or after the noted guitar amplifier photojournalist). So as I was reading through the tutorial recently, I thought: what can I do with this? It’s actually a great step-by-step tutorial for building a small, but interesting language, regardless of the chosen compiler toolset. But what toolset? Erlang seemed the most interesting, because my conscious self hates alliteration and/or my subconscious.

I taught myself Erlang a couple of years ago, I think. It was an easy dip into the functional languages pool, just to see what that was all about. Erlang has strict evaluation, a simple syntax, and a lack of pretentious academics (also known as monads), which made learning it easier than, say, Haskell. I love the actor model of concurrency and BEAM (the Erlang VM), but always wondered if there were a way to put a more powerful front end on it. (And apparently I’m not the only one who’s wondered this, given the existence of Elixir and other BEAM targeted languages.)

So that’s what I did. I used the Kaleidoscope tutorial to gradually implement the Kaleidoscope language in Erlang, targeting BEAM. Since the language didn’t map perfectly to the new runtime, I renamed the compiler and language to kalerl, which can be found on Github. (kalerl is pronounced “Kal-Earl”, like Superman’s redneck cousin. My subconscious thinks that’s hilarious.) I learned a lot from this exercise, mostly about generating Erlang parse trees, which are called “abstract forms” for some reason. I also had a lot of fun, a) because of learning of new stuff and b) because there’s something strangely satisfying about a tool that takes a programmer’s textual ramblings and spits out an executable binary. There’s magic going on inside of it, and I made that magic.

I’m contemplating writing a series of blog posts about the project (my subconscious likes love and acceptance, and has deluded itself into thinking it is achievable through these means). The difficulty is deciding how to break it up into digestible chunks, and figuring out what is interesting that someone (other than my subconscious) would consider worthwhile. Then I need to find the time and energy to do it, which is severely lacking in my current phase of life. But here’s to optimism, blog posts, and fat, slow moving squirrels.

(P.S. My subconscious also likes parenthetical asides and unedited prose, but I’d wager that’s obvious by now.)

Out with Wordpress in with Postword

At the risk of becoming part of something trendy (which my nerd contract expressly prohibits), I’ve finally ditched Wordpress in favor of static solution. Unlike everyone else though I’m not ditching it for performance reasons, because honestly, I don’t get enough traffic to bring down London bridge, which, according to a song I once heard is constantly falling down.

No, what prompted this change was my Wordpress install getting hacked for a third time (to inject spam links, of course). It’s not entirely Wordpress’s fault; each time my version was out of date, despite Wordpress making updating to the latest version super easy. Unfortunately I don’t go the Wordpress admin panel every day to check for updates, so I’m routinely out of date.

All of which to say, I needed something other than Wordpress to get around my update delinquency. Specifically here’s what I wanted:

Little to no code running on the server, with a strong preference for none. No database, either.
It should generate a static site, from flat text files.
It should be able to import all my Wordpress entries
Support the Wordpress site/link structure so that it wouldn’t break existing links into the site.

And a couple of nice to haves:

Support some version of Wordpress themes, or at least make it easy to port them.
Work with MarsEdit

In other words, I wanted Wordpress without the Wordpress.

There were a couple of solutions that got close, but none that did all that I wanted. So, being an engineer, I decided to write my own, which I codenamed Postword. I actually started the project back the last time I got hacked, and even used an early prototype for the blog on Hearts & Tactics.

Postword is more or less a collection of Ruby scripts. It has one to take a Wordpress XML file and convert it into a bunch of flat files (one for each post) that can be processed later by the build script. The flat files are nothing special; they have mail like headers for the categories and title, and then body which is just text with some HTML markup allowed. The build script takes these flat files and builds a full site that looks like Wordpress site structure wise. The script is unique in that it supports the idea of themes, whose API just happens to be the exact subset of Wordpress APIs that my custom Wordpress theme happens to use.

And that’s all Postword was up until a couple of days ago. But if was going to use Postword on this site, which I update somewhat frequently, I needed to reduce the friction. The first thing to do was to write a publish script that pushed the local static site to my server via FTP. I wrote my own script in Ruby, which actually turned out to be not that bad, since I could make a lot of assumptions. The important one was that the local site was always the truth. Which meant if the local and remote ever differed, the local overwrites the remote, and if the remote had a file which the local didn’t the remote gets deleted. In effect, the site becomes self-healing.

The last piece of the puzzle was create a local server that MarsEdit could talk to. MarsEdit supports a lot of different APIs, so I had to decide which one to implement. Wordpress was an obvious choice, but it offered a lot of functionality to I didn’t have or want, plus it was more complex than the others. I decided on MetaWeblog, with the Blogger API as a fall back for when MarsEdit tried to delete a post. It was surprisingly easy to implement, since almost all of the functionality was already in the other scripts, and Ruby provides an XMLRPC server out of the box.

The really nice part of all this is I now have the entire source to this site (flat post files + images + build scripts) in one Mercurial repository. I can download that and publish from any machine, as well as have instant back ups.

Not to say Postword isn’t without its flaws. The build script is slow; it takes a little while to generate all those archive pages that Wordpress creates on the fly. And if the build script is slow the publish script is glacial. When posting from MarsEdit, I can’t actually call the publish function after building the site, because MarsEdit will time out on the HTTP connection. That said, the pros easily outweigh the cons for me.

All in all, I’m tickled pink about how Postword turned out.

Safe from the Losing Fight

mac and ios development

About