Using ANTLRv3 with C/C++

Introduction

Recently I had to build a parser and lexer as part of a compilers group project in college. Our group had read around a bit and decided to use ANTLR for the project. We also wanted to write our compiler in C++. However, although the documentation for using ANTLR with C was fairly decent, concrete examples of using ANTLR with C or C++ are few and far between. Hopefully this short guide will help you do what we found very difficult - taking the first steps towards creating a powerful lexer and parser using ANTLR in C++.

A few notes before we get into the actual code:

This example will use ANTLRv3.2. Although this is not the latest version, it works very well with the C target and tools like ANTLRWorks and the ANTLR Eclipse plugin. Some functions and keywords have changed in the latest version of ANTLR; I'll try to point out the v3.2-specific stuff as we go along.
We used the C target library, NOT the C++ target library. This might seem a tad strange, but it was done for good reason - the lack of documentation and features in the C++ target (it does not have any tree building features) make it very difficult to use as part of a larger project, which is what we needed to do. The C target allows for fun stuff like re-write rules and parse tree generation. If you are coding a compiler in C++ and using ANTLR then I would recommend using the C target for your parser and lexer generation with ANTLR, then writing C++ on top of that. This means that some of the code in your C++ main function will be quite "C-like", but we had no issues with this.
This is not a tutorial for writing ANTLR grammars. I will use a very simple grammar and I assume the reader has prior knowledge of creating grammars. If you do not, then check out this to get you going. Once you have some code that correctly invokes your lexer and parser, you can extend the grammar as much as necessary without modifying that code, as long as the top-level rule keeps its name. I would also recommend the Stack Overflow ANTLR tag; this was the resource that helped me most when I was learning ANTLR grammars.

The Grammar

Here is the grammar that I will be using for the example:

grammar SimpleCalc;
options {
    language=C;
    output=AST;
    ASTLabelType=pANTLR3_BASE_TREE;
}
tokens {
    PLUS  = '+';
    MINUS = '-';
    MULT  = '*';
    DIV   = '/';
}
expr: term ( ( PLUS | MINUS ) term )*;
term: factor ( ( MULT | DIV ) factor )*;
factor: NUMBER;
NUMBER: (DIGIT)+;
WHITESPACE: ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ {$channel = HIDDEN;};
fragment
DIGIT: '0'..'9';

This grammar is taken from the 5 Minute Introduction article; all I've done is make the layout more readable (in my opinion). A full explanation of this grammar can be found there. Basically, this grammar will lex and parse expressions of any length with addition, subtraction, multiplication and division operators. It auto-magically sorts out operator precedence using nested rules: expr is built from terms, and term is built from factors, so multiplication and division bind tighter than addition and subtraction (which I think is the neatest way to do this).
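To make the nested-rule idea concrete, here is a minimal hand-written sketch in C++ that mirrors the three parser rules, one function per rule. It is not what ANTLR generates, and unlike the grammar (which only recognises input) it also computes a value so the precedence is visible. The names CalcSketch and evaluate are made up for this illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hand-written stand-in for the SimpleCalc grammar, for illustration only.
// Each member function mirrors one rule: expr -> term -> factor.
class CalcSketch {
public:
    explicit CalcSketch(const std::string& text) : text_(text), pos_(0) {}

    // expr: term ( ( PLUS | MINUS ) term )*;
    long expr() {
        long value = term();
        for (;;) {
            skipWhitespace();
            if (peek() == '+')      { ++pos_; value += term(); }
            else if (peek() == '-') { ++pos_; value -= term(); }
            else return value;
        }
    }

    // True once only whitespace (or nothing) is left in the input.
    bool atEnd() { skipWhitespace(); return pos_ == text_.size(); }

private:
    // term: factor ( ( MULT | DIV ) factor )*;
    long term() {
        long value = factor();
        for (;;) {
            skipWhitespace();
            if (peek() == '*') { ++pos_; value *= factor(); }
            else if (peek() == '/') {
                ++pos_;
                long divisor = factor();
                if (divisor == 0) throw std::runtime_error("division by zero");
                value /= divisor;
            }
            else return value;
        }
    }

    // factor: NUMBER;   NUMBER: (DIGIT)+;
    long factor() {
        skipWhitespace();
        if (!std::isdigit(static_cast<unsigned char>(peek())))
            throw std::runtime_error("expected a number");
        long value = 0;
        while (std::isdigit(static_cast<unsigned char>(peek())))
            value = value * 10 + (text_[pos_++] - '0');
        return value;
    }

    // Current character, or '\0' at the end of the input.
    char peek() const { return pos_ < text_.size() ? text_[pos_] : '\0'; }

    // Plays the role of the WHITESPACE rule sending tokens to the hidden channel.
    void skipWhitespace() {
        while (std::isspace(static_cast<unsigned char>(peek()))) ++pos_;
    }

    std::string text_;
    std::size_t pos_;
};

// Evaluate a whole input string, rejecting anything left over after expr.
long evaluate(const std::string& text) {
    CalcSketch calc(text);
    long value = calc.expr();
    if (!calc.atEnd()) throw std::runtime_error("unexpected input");
    return value;
}
```

Because expr only ever sees complete terms, evaluate("1 + 2 * 3") multiplies before it adds, and the loops make operators of the same level group from left to right.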

The Code

Let's start by looking at the code in full:

#include "SimpleCalcLexer.h"
#include "SimpleCalcParser.h"
int main(int argc, char* argv[]) {
  pANTLR3_UINT8 filename = (pANTLR3_UINT8) argv[1];
  // Set up our Lexer and Parser
  pANTLR3_INPUT_STREAM input;
  pSimpleCalcLexer lex;
  pANTLR3_COMMON_TOKEN_STREAM tokens;
  pSimpleCalcParser parser;
  // Get the input from the file
  input = antlr3AsciiFileStreamNew(filename);
  // Create the lexer using the input stream
  lex = SimpleCalcLexerNew(input);
  // Create the token stream using the lexer
  tokens = antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT, TOKENSOURCE(lex));
  // Create the parser using the token stream
  parser = SimpleCalcParserNew(tokens);
  // Parse the program from the top level rule (expr)
  SimpleCalcParser_expr_return r = parser->expr(parser);
  // Create the parse tree
  pANTLR3_BASE_TREE tree = r.tree;
  // Do what you want with the parse tree now
  // Remember to free things, C style
  parser->free(parser);
  tokens->free(tokens);
  lex->free(lex);
  input->close(input);
  return 0;
}

This code is a chopped, changed and compacted version of actual code from our compiler. The code is C++, but I think it's actually valid C (if the reader would like to hook up their parser in C). The main function will create an instance of your lexer and parser, parse the input file from the expr rule down, then create the parse tree in the form of a pANTLR3_BASE_TREE. Obviously you will need to add error checking on all the file stuff and the creation of the objects, but I have not included this here since I wanted to keep the code compact. Now I shall go through all the important lines of code, explaining what each does in more detail.

#include "SimpleCalcLexer.h"
#include "SimpleCalcParser.h"

These files are generated by ANTLR when your grammar is built (more on that later). There is no need to include any files from the ANTLR C Library, since everything that will be needed is included in these files.

pANTLR3_UINT8 filename = (pANTLR3_UINT8) argv[1];

We get the file name from the command line argument. This requires a cast to the custom type pANTLR3_UINT8 (which is actually just a pointer to uint8_t), so the file name can be passed to the runtime's functions later on.
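The full listing skips error checking, so argv[1] is used even when no argument was given. A minimal sketch of the kind of check you might put in front of that cast is shown below. The helper name inputFileName and the FileNamePtr typedef are hypothetical; the typedef stands in for pANTLR3_UINT8 so that the sketch compiles without the ANTLR headers.

```cpp
#include <cstdint>
#include <cstdio>

// Stand-in for the runtime's pANTLR3_UINT8 (an assumption made so this
// sketch builds on its own; in real code use the ANTLR type directly).
typedef std::uint8_t* FileNamePtr;

// Hypothetical helper: returns the file name taken from argv[1], or nullptr
// if the argument is missing or the file cannot be opened for reading.
FileNamePtr inputFileName(int argc, char* argv[]) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s <input-file>\n",
                     argc > 0 ? argv[0] : "calc");
        return nullptr;
    }

    // Probe the file up front so a bad path gives a readable message.
    std::FILE* probe = std::fopen(argv[1], "r");
    if (probe == nullptr) {
        std::perror(argv[1]);
        return nullptr;
    }
    std::fclose(probe);

    // Same cast as in the listing, just after the checks have passed.
    return reinterpret_cast<FileNamePtr>(argv[1]);
}
```

In main you would call this first and return a non-zero exit code if it gives back nullptr, before creating the input stream.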

pANTLR3_INPUT_STREAM input;
pSimpleCalcLexer lex;
pANTLR3_COMMON_TOKEN_STREAM tokens;
pSimpleCalcParser parser;

We declare pointers to the structs we need for the lexing and parsing. The input stream and token stream types are always called this, but the lexer and parser types are named after your grammar (the top line in your grammar file).

input = antlr3AsciiFileStreamNew(filename);

Call the special function which creates the input stream from the input file. Note: This is one of the things that has changed in ANTLRv3.4 - in 3.4 you can use antlr3FileStreamNew(filename, ANTLR3_ENC_8BIT) or something similar.

lex = SimpleCalcLexerNew(input);

This creates the lexer on top of the input stream. The lexing itself happens later, when the token stream asks the lexer for tokens. Yes, it really is that simple!

tokens = antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT, TOKENSOURCE(lex));

Create the token stream using the token source from the lexer. Note: I have read stuff about ANTLR3_SIZE_HINT being deprecated in the ANTLR C target documentation and other places. However, I haven't been able to find an alternative for this, and it worked fine for me, so I see no reason not to use it here.

parser = SimpleCalcParserNew(tokens);

Create the parser from the token stream.

SimpleCalcParser_expr_return r = parser->expr(parser);

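Parse the input from the token stream, starting at the top-level rule (which in this grammar is expr). When the grammar builds a tree, each rule function returns a struct named after the parser and the rule (here SimpleCalcParser_expr_return), which we save in r.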

pANTLR3_BASE_TREE tree = r.tree;

Get the tree from the rule's return value, which we can use as an AST and for semantic analysis if we so wish. The tree member only exists when the grammar sets output=AST in its options.
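The listing ends by freeing everything by hand, in reverse order of creation. If you are writing C++ on top of the C target, one option is to let a small scope guard do this for you. The sketch below is only an illustration of that idea: FreeGuard, FakeParser and guardFreesOnScopeExit are hypothetical names, and FakeParser is a stand-in that mimics the runtime's convention of a struct carrying its own free function pointer, so the example builds without the ANTLR headers.

```cpp
// Calls obj->free(obj) when the guard goes out of scope, which matches
// the parser->free(parser) style used in the listing.
template <typename T>
class FreeGuard {
public:
    explicit FreeGuard(T* obj) : obj_(obj) {}
    ~FreeGuard() {
        if (obj_ != nullptr) obj_->free(obj_);
    }

    // Non-copyable, so the same object is never freed twice.
    FreeGuard(const FreeGuard&) = delete;
    FreeGuard& operator=(const FreeGuard&) = delete;

private:
    T* obj_;
};

// Stand-in for a runtime object (an assumption for this sketch): a plain
// struct with a free function pointer, plus a flag so we can observe the call.
struct FakeParser {
    void (*free)(FakeParser* self);
    bool* freedFlag;
};

// Free function for the stand-in: just records that it was called.
static void markFreed(FakeParser* self) {
    *self->freedFlag = true;
}

// Returns true if free was called exactly when the guard's scope ended.
bool guardFreesOnScopeExit() {
    bool freed = false;
    FakeParser parser = { markFreed, &freed };
    {
        FreeGuard<FakeParser> guard(&parser);
        if (freed) return false;  // must not be freed while the guard is alive
    }
    return freed;
}
```

Guards declared in creation order are destroyed in reverse order, which is the same order the listing frees things in. Note that the input stream in the listing is released with close rather than free, so it would need its own variant of the guard.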

Building the Lexer, Parser and your code

So that covers the code you need to write to get a basic solution working. However, building your code is not as simple as you might think. This section will look at how to build the lexer and the parser together with your code, which creates an executable that you can run to parse input files.

The first thing you'll need to do is download the ANTLR3 C Runtime Library. Un-tar this archive, and move it into your project directory. We're going to include the library in our build. In the example makefile below I've put the library in a folder called libantlr3c, in the same folder as the makefile.

Now you need to generate your parser and lexer. This is done by executing the command java -jar antlr-3.2.jar SimpleCalc.g. Note that this requires you to have the ANTLRv3.2 jar in the current directory, which can be downloaded from here. This generates .h and .c files for our SimpleCalcLexer and SimpleCalcParser.

Next we build the parser and the lexer. Here I do this with gcc, using the -I flag to include the directory where we put the runtime library. We can also use g++ to build our main file (here I called it SimpleCalc.cpp) at the same time.

gcc -I libantlr3c -c -o SimpleCalcLexer.o SimpleCalcLexer.c
gcc -I libantlr3c -c -o SimpleCalcParser.o SimpleCalcParser.c
g++ -I libantlr3c -c -o SimpleCalc.o SimpleCalc.cpp

Finally, chuck it all together in an executable (called calc in this example):

g++ -I libantlr3c -o calc SimpleCalc.o SimpleCalcLexer.o SimpleCalcParser.o

And there you have it! We can put this all together in a makefile:

CPP = g++ -I libantlr3c
C   = gcc -I libantlr3c

all: grammar calc

# Generate the lexer and parser sources from the grammar
grammar:
	java -jar antlr-3.2.jar SimpleCalc.g

# Link everything into the calc executable
calc: SimpleCalc.o SimpleCalcLexer.o SimpleCalcParser.o
	$(CPP) -o $@ $^

.cpp.o:
	$(CPP) -c -o $*.o $<

.c.o:
	$(C) -c -o $*.o $<

clean:
	rm -rf *.o libantlr3c/*.o

.PHONY: all clean grammar

Now we can parse files by running ./calc file, which will output nothing if the file is parsed successfully, and will output (somewhat ugly) ANTLR parse errors if there are syntax errors in the file.

Hopefully this will be enough to get people started using this very powerful but somewhat murky tool. If you have any questions or suggestions of how to improve this post then please don't hesitate to comment.