Language Parser

Overview

In this project, you will implement a parser for a small C-like language. Your parser functions will consume tokens generated by the lexer to produce an abstract syntax tree (AST). The input could be an entire program, or a fragment of a program, and your parser will produce an AST that represents one or more statements (stmt) or an expression (expr). This document describes the tokens and grammar you will use.

The only requirement for error handling is that input that cannot be lexed or parsed according to the provided rules should raise an InvalidInputException. Informative error messages should be used when raising these exceptions to make debugging easier.

All tests will be run on direct calls to your code, comparing your return values to the expected return values. Any other output (for example, debugging information) will be ignored. You are free and encouraged to have additional output.

Here is an example program fragment:

int x;
x = 2 * 3 ^ 5 + 4;
print(x > 100);

An example use of the parser in the OCaml top-level:

parse_stmt (tokenize "int x; x = 2 * 3 ^ 5 + 4; print(x > 100);")

Testing

You can run your parser directly on a program by running

  dune exec bin/main.exe <filename>

where the <filename> argument is required.

You should write your own tests which only test the parser by feeding it a custom token list. For example, to see how the expression 1 + 2 ^ 3 * 4 would be parsed, you can construct the token list manually:

parse_expr [Tok_Int 1; Tok_Add; Tok_Int 2; Tok_Pow;
            Tok_Int 3; Tok_Mult; Tok_Int 4; EOF];;

This way, you can work on the parser even if your lexer is not complete yet.

The Parser

The parser consists of two main functions which parse expressions and statements (parse_expr and parse_stmt.) The two functions work together to parse a full program.

The two functions you need to implement are:

`parse_expr`

Type: token list -> token list * expr
Description: Takes a list of tokens and returns an AST representing the expression corresponding to the given tokens, along with the new, reduced list of tokens
Exceptions: If the next tokens in the token list do not represent an expression, raise InvalidInputException.

Examples:

parse_expr [Tok_Int(1); Tok_Add; Tok_Int(2); Tok_Semi; EOF] =
    ([Tok_Semi; EOF], Add (Int 1, Int 2))
parse_expr [Tok_Int(1); EOF] = ([EOF], Int 1)
parse_expr [Tok_Int_Type; EOF] (* InvalidInputException *)

`parse_stmt`

Type: token list -> token list * stmt
Description: Takes a list of tokens and returns an AST representing the statement corresponding to the given tokens, along with the new, reduced list of tokens
Exceptions: If the next tokens in the token list do not represent a statement, raise InvalidInputException.

Examples:

parse_stmt [Tok_Int_Type; Tok_ID("x"); Tok_Semi; EOF] =
    ([EOF], Seq (Declare (Int_Type, "x"), NoOp))
parse_stmt [Tok_ID("x"); Tok_Assign; Tok_Int(3); Tok_Semi; EOF] =
    ([EOF], Seq (Assign ("x", Int 3), NoOp))
parse_stmt [Tok_Int(3); Tok_Add; Tok_Int(4); EOF] = InvalidInputException

parser.ml is the only file you will write code in. The functions should be left in the order they are provided, as a good implementation will rely heavily on earlier functions.

Note: the function match_token is provided for you which takes the list of tokens and a single token as arguments, and will either:

Return a new token list with the first token removed IF the first token matches the second argument, or
Raise an exception IF the first token does not match the second argument to the function.

The lookahead function is also provided for you which returns the first token in the list of tokens, but does NOT modify the token list. This function will raise an exception if the token list is empty.

The CFG below is for the language of C-like expressions. This CFG is right-recursive, so something like 1 + 2 + 3 will parse as Add (Int 1, Add (Int 2, Int 3)), essentially implying parentheses in the form (1 + (2 + 3)).)

`parse_expr`

Expressions represent mathematical and boolean formulas that typically evaluate to either an integer or a boolean. Because expressions are a self-contained subset of the grammar, we can implement them first, and build the rest of the language on top of them later.

The unambiguous CFG of expressions, from which you should produce a value of expr AST type, is as follows:

$$ \begin{array}{rcl} Expr &\rightarrow& OrExpr \\ OrExpr &\rightarrow& AndExpr \; \texttt{||} \; OrExpr \\ &|& AndExpr \\ AndExpr &\rightarrow& EqualityExpr \; \texttt{\&\&} \; AndExpr \\ &|& EqualityExpr \\ EqualityExpr &\rightarrow& RelationalExpr \; EqualityOp \; EqualityExpr \\ &|& RelationalExpr \\ RelationalExpr &\rightarrow& AdditiveExpr \; RelationalOp \; RelationalExpr \\ &|& AdditiveExpr \\ AdditiveExpr &\rightarrow& MultiplicativeExpr \; AdditiveOp \; AdditiveExpr \\ &|& MultiplicativeExpr \\ MultiplicativeExpr &\rightarrow& PowerExpr \; MultiplicativeOp \; MultiplicativeExpr \\ &|& PowerExpr \\ PowerExpr &\rightarrow& UnaryExpr \; \texttt{\string^} \; PowerExpr \\ &|& UnaryExpr \\ UnaryExpr &\rightarrow& \texttt{!} \; UnaryExpr \\ &|& PrimaryExpr \\ EqualityOp &\rightarrow& \texttt{==} \; | \; \texttt{!=} \\ RelationalOp &\rightarrow& \texttt{<} \; | \; \texttt{>} \; | \; \texttt{<=} \; | \; \texttt{>=} \\ AdditiveOp &\rightarrow& \texttt{+} \; | \; \texttt{-} \\ MultiplicativeOp &\rightarrow& \texttt{*} \; | \; \texttt{/} \\ PrimaryExpr &\rightarrow& Tok\_Int \; | \; Tok\_Bool \; | \; Tok\_ID \; | \; \texttt{(} \; Expr \; \texttt{)} \end{array} $$

As an example, see how the parser will break down an input mixing a few different operators with different precedence:

Input:

2 * 3 ^ 5 + 4

Output (after lexing and parsing):

Add(
  Mult(
    Int(2),
    Pow(
      Int(3),
      Int(5))),
  Int(4))

`parse_stmt`

Statements typically represent side-effecting actions, for example, control flow. When parsing, a sequence of statements should be terminated as a NoOp.

The stmt type isn’t self-contained like the expr type, and instead refers both to itself and to expr; use your parse_expr function to avoid duplicate code. The following CFG is for statements:

$$ \begin{array}{rcl} Stmt &\rightarrow& StmtOptions \; Stmt \\ &|& \epsilon \\ StmtOptions &\rightarrow& DeclareStmt \\ &|& AssignStmt \\ &|& PrintStmt \\ &|& IfStmt \\ &|& ForStmt \\ &|& WhileStmt \\ DeclareStmt &\rightarrow& BasicType \; Tok\_ID \; \texttt{;} \\ BasicType &\rightarrow& Tok\_Int \; | \; Tok\_Bool \\ AssignStmt &\rightarrow& Tok\_ID \; \texttt{=} \; Expr \; \texttt{;} \\ PrintStmt &\rightarrow& \texttt{print} \; \texttt{(} \; Expr \; \texttt{)} \; \texttt{;} \\ IfStmt &\rightarrow& \texttt{if} \; \texttt{(} \; Expr \; \texttt{)} \; \texttt{\{} \; Stmt \; \texttt{\}} \; ElseBranch \\ ElseBranch &\rightarrow& \texttt{else} \; \texttt{\{} \; Stmt \; \texttt{\}} \\ &|& \epsilon \\ ForStmt &\rightarrow& \texttt{for} \; \texttt{(} \; Tok\_ID \; \texttt{from} \; Expr \; \texttt{to} \; Expr \texttt{)} \; \texttt{\{} \; Stmt \; \texttt{\}} \\ WhileStmt &\rightarrow& \texttt{while} \; \texttt{(} \; Expr \texttt{)} \; \texttt{\{} \; Stmt \; \texttt{\}} \end{array} $$

If we expand on our previous example, we can see how the expression parser integrates directly into the statement parser:

Input:

int x;
x = 2 * 3 ^ 5 + 4;
print(x > 100);

Output (after lexing and parsing):

Seq(Declare(Int_Type, "x"),
Seq(Assign("x",
  Add(
    Mult(
      Int 2,
      Pow(
        Int 3,
        Int 5)),
    Int 4)),
Seq(Print(Greater(ID "x", Int 100)), NoOp)))

Turning in the Assignment

To submit your assignment, create a zip file of a DIRECTORY named project4-handin containing ONLY the project related source files. Then submit that file to the appropriate folder on D2L.

Grading Criteria

OCaml Style Guide

Programming Project Rubric