Language Parser
Overview
In this project, you will implement a parser for a small C-like language. Your parser functions will consume tokens generated by the lexer to produce an abstract syntax tree (AST). The input could be an entire program, or a fragment of a program, and your parser will produce an AST that represents one or more statements (stmt) or an expression (expr). This document describes the tokens and grammar you will use.
The only requirement for error handling is that input that cannot be lexed or
parsed according to the provided rules should raise an InvalidInputException.
Informative error messages should be used when raising these exceptions to make
debugging easier.
All tests will be run on direct calls to your code, comparing your return values to the expected return values. Any other output (for example, debugging information) will be ignored. You are free and encouraged to have additional output.
Here is an example program fragment:
int x;
x = 2 * 3 ^ 5 + 4;
print(x > 100);
An example use of the parser in the OCaml top-level:
parse_stmt (tokenize "int x; x = 2 * 3 ^ 5 + 4; print(x > 100);")
Testing
You can run your parser directly on a program by running
dune exec bin/main.exe <filename>
where
the <filename> argument is required.
You should write your own tests which only test the parser by feeding
it a custom token list. For example, to see how the expression 1 + 2 ^ 3 * 4 would be parsed, you can construct the token list manually:
parse_expr [Tok_Int 1; Tok_Add; Tok_Int 2; Tok_Pow;
Tok_Int 3; Tok_Mult; Tok_Int 4; EOF];;
This way, you can work on the parser even if your lexer is not complete yet.
The Parser
The parser consists of two main functions which parse expressions and
statements (parse_expr and parse_stmt.) The two functions work
together to parse a full program.
The two functions you need to implement are:
parse_expr
Type:
token list -> token list * exprDescription: Takes a list of tokens and returns an AST representing the expression corresponding to the given tokens, along with the new, reduced list of tokens
Exceptions: If the next tokens in the token list do not represent an expression, raise
InvalidInputException.Examples:
parse_expr [Tok_Int(1); Tok_Add; Tok_Int(2); Tok_Semi; EOF] = ([Tok_Semi; EOF], Add (Int 1, Int 2)) parse_expr [Tok_Int(1); EOF] = ([EOF], Int 1) parse_expr [Tok_Int_Type; EOF] (* InvalidInputException *)
parse_stmt
Type:
token list -> token list * stmtDescription: Takes a list of tokens and returns an AST representing the statement corresponding to the given tokens, along with the new, reduced list of tokens
Exceptions: If the next tokens in the token list do not represent a statement, raise
InvalidInputException.Examples:
parse_stmt [Tok_Int_Type; Tok_ID("x"); Tok_Semi; EOF] = ([EOF], Seq (Declare (Int_Type, "x"), NoOp)) parse_stmt [Tok_ID("x"); Tok_Assign; Tok_Int(3); Tok_Semi; EOF] = ([EOF], Seq (Assign ("x", Int 3), NoOp)) parse_stmt [Tok_Int(3); Tok_Add; Tok_Int(4); EOF] = InvalidInputException
parser.ml is the only file you will write code in. The functions
should be left in the order they are provided, as a good
implementation will rely heavily on earlier functions.
Note: the function match_token is provided for you which takes
the list of tokens and a single token as arguments, and will either:
- Return a new token list with the first token removed IF the first token matches the second argument, or
- Raise an exception IF the first token does not match the second argument to the function.
The lookahead function is also provided for you which returns the
first token in the list of tokens, but does NOT modify the token list.
This function will raise an exception if the token list is empty.
The CFG below is for the language of C-like expressions. This CFG is
right-recursive, so something like 1 + 2 + 3 will parse as Add (Int 1, Add (Int 2, Int 3)), essentially implying parentheses in the form
(1 + (2 + 3)).)
parse_expr
Expressions represent mathematical and boolean formulas that typically evaluate to either an integer or a boolean. Because expressions are a self-contained subset of the grammar, we can implement them first, and build the rest of the language on top of them later.
The unambiguous CFG of expressions, from which you should produce a
value of expr AST type, is as follows:
$$ \begin{array}{rcl} Expr &\rightarrow& OrExpr \\ OrExpr &\rightarrow& AndExpr \; \texttt{||} \; OrExpr \\ &|& AndExpr \\ AndExpr &\rightarrow& EqualityExpr \; \texttt{\&\&} \; AndExpr \\ &|& EqualityExpr \\ EqualityExpr &\rightarrow& RelationalExpr \; EqualityOp \; EqualityExpr \\ &|& RelationalExpr \\ RelationalExpr &\rightarrow& AdditiveExpr \; RelationalOp \; RelationalExpr \\ &|& AdditiveExpr \\ AdditiveExpr &\rightarrow& MultiplicativeExpr \; AdditiveOp \; AdditiveExpr \\ &|& MultiplicativeExpr \\ MultiplicativeExpr &\rightarrow& PowerExpr \; MultiplicativeOp \; MultiplicativeExpr \\ &|& PowerExpr \\ PowerExpr &\rightarrow& UnaryExpr \; \texttt{\string^} \; PowerExpr \\ &|& UnaryExpr \\ UnaryExpr &\rightarrow& \texttt{!} \; UnaryExpr \\ &|& PrimaryExpr \\ EqualityOp &\rightarrow& \texttt{==} \; | \; \texttt{!=} \\ RelationalOp &\rightarrow& \texttt{<} \; | \; \texttt{>} \; | \; \texttt{<=} \; | \; \texttt{>=} \\ AdditiveOp &\rightarrow& \texttt{+} \; | \; \texttt{-} \\ MultiplicativeOp &\rightarrow& \texttt{*} \; | \; \texttt{/} \\ PrimaryExpr &\rightarrow& Tok\_Int \; | \; Tok\_Bool \; | \; Tok\_ID \; | \; \texttt{(} \; Expr \; \texttt{)} \end{array} $$
As an example, see how the parser will break down an input mixing a few different operators with different precedence:
Input:
2 * 3 ^ 5 + 4
Output (after lexing and parsing):
Add(
Mult(
Int(2),
Pow(
Int(3),
Int(5))),
Int(4))
parse_stmt
Statements typically represent side-effecting actions, for example,
control flow. When parsing, a sequence of statements should be
terminated as a NoOp.
The stmt type isn’t self-contained like the expr type, and instead
refers both to itself and to expr; use your parse_expr function to
avoid duplicate code. The following CFG is for statements:
$$ \begin{array}{rcl} Stmt &\rightarrow& StmtOptions \; Stmt \\ &|& \epsilon \\ StmtOptions &\rightarrow& DeclareStmt \\ &|& AssignStmt \\ &|& PrintStmt \\ &|& IfStmt \\ &|& ForStmt \\ &|& WhileStmt \\ DeclareStmt &\rightarrow& BasicType \; Tok\_ID \; \texttt{;} \\ BasicType &\rightarrow& Tok\_Int \; | \; Tok\_Bool \\ AssignStmt &\rightarrow& Tok\_ID \; \texttt{=} \; Expr \; \texttt{;} \\ PrintStmt &\rightarrow& \texttt{print} \; \texttt{(} \; Expr \; \texttt{)} \; \texttt{;} \\ IfStmt &\rightarrow& \texttt{if} \; \texttt{(} \; Expr \; \texttt{)} \; \texttt{\{} \; Stmt \; \texttt{\}} \; ElseBranch \\ ElseBranch &\rightarrow& \texttt{else} \; \texttt{\{} \; Stmt \; \texttt{\}} \\ &|& \epsilon \\ ForStmt &\rightarrow& \texttt{for} \; \texttt{(} \; Tok\_ID \; \texttt{from} \; Expr \; \texttt{to} \; Expr \texttt{)} \; \texttt{\{} \; Stmt \; \texttt{\}} \\ WhileStmt &\rightarrow& \texttt{while} \; \texttt{(} \; Expr \texttt{)} \; \texttt{\{} \; Stmt \; \texttt{\}} \end{array} $$
If we expand on our previous example, we can see how the expression parser integrates directly into the statement parser:
Input:
int x;
x = 2 * 3 ^ 5 + 4;
print(x > 100);
Output (after lexing and parsing):
Seq(Declare(Int_Type, "x"),
Seq(Assign("x",
Add(
Mult(
Int 2,
Pow(
Int 3,
Int 5)),
Int 4)),
Seq(Print(Greater(ID "x", Int 100)), NoOp)))
Turning in the Assignment
To submit your assignment, create a zip file of a DIRECTORY named
project4-handin containing ONLY the project related source
files. Then submit that file to the appropriate folder on D2L.