# Language Lexer

## Overview
In this project, you will implement a lexer for a small C-like language. Your lexer functions will convert the source text of a program into a list of tokens. The input could be an entire program, or a fragment of a program. This document describes the lexical specification.
The only requirement for error handling is that input that cannot be
lexed according to the specification should raise an
InvalidInputException. Informative error messages should be used
when raising these exceptions to make debugging easier.
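As a sketch of what an informative raise site might look like (the exception may already be defined in the project's provided code, and the helper name `fail_at` is purely illustrative):

```ocaml
(* Illustrative sketch: an exception carrying an informative message.
   Your project may already define InvalidInputException elsewhere. *)
exception InvalidInputException of string

(* Hypothetical helper: report the offending character and position. *)
let fail_at (pos : int) (ch : char) =
  raise (InvalidInputException
           (Printf.sprintf "unexpected character %C at position %d" ch pos))
```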
## Testing
You can run your lexer directly on a program by running
`dune exec bin/main.exe <filename>`,
where the `<filename>` argument is required.
All of the tests will compare the output of the reference implementation to your implementation.
## The Lexer (aka Scanner or Tokenizer)
The lexer transforms source text into tokens. The goal is to transform
a program, represented as a string, into a list of tokens that capture
the different elements of the program. This process can be handled by
using regular expressions. Information about OCaml’s regular
expressions library can be found in the Str module. You are not
required to use it, but you may find it useful.
Your lexer must be written in `lexer.ml`. For unit testing purposes,
you may want to implement a pure function with a type signature of
`string -> token list`.
The `token` type is implemented in `token.ml`.
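For example, `Str.string_match` anchors a regexp at a given position in a string, and `Str.matched_string` / `Str.match_end` recover the match afterward. A minimal sketch (the `str` library must be linked, e.g. a `(libraries str)` stanza in your dune file):

```ocaml
(* Sketch: matching the Tok_Int regexp at position 0 of a string.
   Requires linking the str library. *)
let () =
  let re_int = Str.regexp "-?[0-9]+" in
  let s = "-42;" in
  if Str.string_match re_int s 0 then
    Printf.printf "matched %S, next position %d\n"
      (Str.matched_string s) (Str.match_end ())
(* prints: matched "-42", next position 3 *)
```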
A few important notes to consider:

- Tokens can be separated by arbitrary amounts of whitespace, which your lexer should discard. Spaces, tabs (`\t`), and newlines (`\n`) are all considered whitespace.
- Tokens are case sensitive.
- Lexer output must be terminated by the `EOF` token, meaning that the shortest possible output from the lexer is `[EOF]`.
- If the beginning of a string could be multiple things, the longest match should be preferred. For example:
  - “while1” should not be lexed as `Tok_While`, but as `Tok_ID("while1")`, since it is an identifier.
  - “2-1” should be lexed as `Tok_Int(2)` and `Tok_Int(-1)`, since “-1” is a valid integer.
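The notes above can be sketched as a recursive matching loop. This is only one possible shape, with the token type abbreviated to the constructors used below (the real type is in `token.ml`) and the regexp names chosen for illustration:

```ocaml
(* Sketch of a lexer loop over an abbreviated token type. Matching
   the identifier regexp before checking for keywords makes "while1"
   lex as Tok_ID "while1"; the -? in the integer regexp makes "2-1"
   lex as Tok_Int 2, Tok_Int (-1). Requires the str library. *)
exception InvalidInputException of string

type token = Tok_Int of int | Tok_ID of string | Tok_While | EOF

let re_ws  = Str.regexp "[ \t\n]+"
let re_int = Str.regexp "-?[0-9]+"
let re_id  = Str.regexp "[a-zA-Z][a-zA-Z0-9]*"

let tokenize (input : string) : token list =
  let len = String.length input in
  let rec go pos =
    if pos >= len then [EOF]
    else if Str.string_match re_ws input pos then
      go (Str.match_end ())                 (* discard whitespace *)
    else if Str.string_match re_int input pos then
      (* bind lexeme/next before recursing: Str's match state is global *)
      let lexeme = Str.matched_string input in
      let next = Str.match_end () in
      Tok_Int (int_of_string lexeme) :: go next
    else if Str.string_match re_id input pos then
      let lexeme = Str.matched_string input in
      let next = Str.match_end () in
      (if lexeme = "while" then Tok_While else Tok_ID lexeme) :: go next
    else
      raise (InvalidInputException
               ("cannot lex at position " ^ string_of_int pos))
  in
  go 0
```

With this sketch, `tokenize "while1"` gives `[Tok_ID "while1"; EOF]`, and `tokenize "2-1"` gives `[Tok_Int 2; Tok_Int (-1); EOF]`.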
The following table shows all mappings of tokens to their lexical
representations. `Tok_Bool`, `Tok_Int`, and `Tok_ID` are listed as
regular expressions.
| Token Name | Lexical Representation |
|---|---|
| `Tok_LParen` | `(` |
| `Tok_RParen` | `)` |
| `Tok_LBrace` | `{` |
| `Tok_RBrace` | `}` |
| `Tok_Equal` | `==` |
| `Tok_NotEqual` | `!=` |
| `Tok_Assign` | `=` |
| `Tok_Greater` | `>` |
| `Tok_Less` | `<` |
| `Tok_GreaterEqual` | `>=` |
| `Tok_LessEqual` | `<=` |
| `Tok_Or` | `\|\|` |
| `Tok_And` | `&&` |
| `Tok_Not` | `!` |
| `Tok_Semi` | `;` |
| `Tok_Int_Type` | `int` |
| `Tok_Bool_Type` | `bool` |
| `Tok_Print` | `print` |
| `Tok_If` | `if` |
| `Tok_Else` | `else` |
| `Tok_For` | `for` |
| `Tok_From` | `from` |
| `Tok_To` | `to` |
| `Tok_While` | `while` |
| `Tok_Add` | `+` |
| `Tok_Sub` | `-` |
| `Tok_Mult` | `*` |
| `Tok_Div` | `/` |
| `Tok_Pow` | `^` |
| `Tok_Bool` | `/true\|false/` |
| `Tok_Int` | `/-?[0-9]+/` |
| `Tok_ID` | `/[a-zA-Z][a-zA-Z0-9]*/` |
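Note that `true` and `false` also match the `Tok_ID` regexp, at the same length, so the longest-match rule alone does not separate keywords from identifiers. One common approach, sketched here with an abbreviated token type and an illustrative helper name, is to match the word with the identifier regexp first and then consult a keyword table:

```ocaml
(* Abbreviated token type; the real one lives in token.ml. *)
type token = Tok_While | Tok_If | Tok_Bool of bool | Tok_ID of string

(* Hypothetical keyword table: extend with the remaining keywords. *)
let keywords =
  [ ("while", Tok_While); ("if", Tok_If);
    ("true", Tok_Bool true); ("false", Tok_Bool false) ]

(* Classify a word matched by /[a-zA-Z][a-zA-Z0-9]*/. *)
let token_of_word w =
  match List.assoc_opt w keywords with
  | Some tok -> tok
  | None -> Tok_ID w
```

Under this sketch, `token_of_word "true"` yields `Tok_Bool true`, while a longer word such as `"truex"` falls through to `Tok_ID "truex"`.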
## Turning in the Assignment

To submit your assignment, create a zip file of a DIRECTORY named
project3-handin containing ONLY the project-related source
files. Then submit that file to the appropriate folder on D2L.