# Language Lexer

## Overview
In this project, you will implement a lexer for a small C-like language. Your lexer functions will convert the source text of a program into a list of tokens. The input could be an entire program, or a fragment of a program. This document describes the lexical specification.
The only requirement for error handling is that input that cannot be
lexed according to the specification should raise an
InvalidInputException. Informative error messages should be used
when raising these exceptions to make debugging easier.
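As a sketch of what an informative raise site might look like (the exception may already be defined in the project's provided code, and the helper name `fail_at` is purely illustrative):

```ocaml
(* Illustrative sketch: an exception carrying an informative message.
   Your project may already define InvalidInputException elsewhere. *)
exception InvalidInputException of string

(* Hypothetical helper: report the offending character and position. *)
let fail_at (pos : int) (ch : char) =
  raise (InvalidInputException
           (Printf.sprintf "unexpected character %C at position %d" ch pos))
```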
## Testing
You can run your lexer directly on a program by running
`dune exec bin/main.exe <filename>`,
where the `<filename>` argument is required.
All of the tests will compare the output of the reference implementation to your implementation.
## The Lexer (aka Scanner or Tokenizer)
The lexer transforms source text into tokens. The goal is to transform
a program, represented as a string, into a list of tokens that capture
the different elements of the program. This process can be handled by
using regular expressions. Information about OCaml’s regular
expressions library can be found in the Str module. You are not
required to use it, but you may find it useful.
Your lexer must be written in `lexer.ml`. For unit testing purposes,
you may want to implement a pure function with a type signature of
`string -> token list`.
The `token` type is implemented in `token.ml`.
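For example, `Str.string_match` anchors a regexp at a given position in a string, and `Str.matched_string` / `Str.match_end` recover the match afterward. A minimal sketch (the `str` library must be linked, e.g. a `(libraries str)` stanza in your dune file):

```ocaml
(* Sketch: matching the Tok_Int regexp at position 0 of a string.
   Requires linking the str library. *)
let () =
  let re_int = Str.regexp "-?[0-9]+" in
  let s = "-42;" in
  if Str.string_match re_int s 0 then
    Printf.printf "matched %S, next position %d\n"
      (Str.matched_string s) (Str.match_end ())
(* prints: matched "-42", next position 3 *)
```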
A few important notes to consider:

- Tokens can be separated by arbitrary amounts of whitespace, which your lexer should discard. Spaces, tabs (`\t`), and newlines (`\n`) are all considered whitespace.
- Tokens are case sensitive.
- Lexer output must be terminated by the `EOF` token, meaning that the shortest possible output from the lexer is `[EOF]`.
- If the beginning of a string could be multiple things, the longest match should be preferred. For example:
  - “while1” should not be lexed as `Tok_While`, but as `Tok_ID("while1")`, since it is an identifier.
  - “2-1” should be lexed as `Tok_Int(2)` and `Tok_Int(-1)`, since “-1” is a valid integer.
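The notes above can be sketched as a recursive matching loop. This is only one possible shape, with the token type abbreviated to the constructors used below (the real type is in `token.ml`) and the regexp names chosen for illustration:

```ocaml
(* Sketch of a lexer loop over an abbreviated token type. Matching
   the identifier regexp before checking for keywords makes "while1"
   lex as Tok_ID "while1"; the -? in the integer regexp makes "2-1"
   lex as Tok_Int 2, Tok_Int (-1). Requires the str library. *)
exception InvalidInputException of string

type token = Tok_Int of int | Tok_ID of string | Tok_While | EOF

let re_ws  = Str.regexp "[ \t\n]+"
let re_int = Str.regexp "-?[0-9]+"
let re_id  = Str.regexp "[a-zA-Z][a-zA-Z0-9]*"

let tokenize (input : string) : token list =
  let len = String.length input in
  let rec go pos =
    if pos >= len then [EOF]
    else if Str.string_match re_ws input pos then
      go (Str.match_end ())                 (* discard whitespace *)
    else if Str.string_match re_int input pos then
      (* bind lexeme/next before recursing: Str's match state is global *)
      let lexeme = Str.matched_string input in
      let next = Str.match_end () in
      Tok_Int (int_of_string lexeme) :: go next
    else if Str.string_match re_id input pos then
      let lexeme = Str.matched_string input in
      let next = Str.match_end () in
      (if lexeme = "while" then Tok_While else Tok_ID lexeme) :: go next
    else
      raise (InvalidInputException
               ("cannot lex at position " ^ string_of_int pos))
  in
  go 0
```

With this sketch, `tokenize "while1"` gives `[Tok_ID "while1"; EOF]`, and `tokenize "2-1"` gives `[Tok_Int 2; Tok_Int (-1); EOF]`.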
The following table shows all mappings of tokens to their lexical
representations. `Tok_Bool`, `Tok_Int`, and `Tok_ID` are listed as
regular expressions.
| Token Name | Lexical Representation |
|---|---|
| `Tok_LParen` | `(` |
| `Tok_RParen` | `)` |
| `Tok_LBrace` | `{` |
| `Tok_RBrace` | `}` |
| `Tok_Equal` | `==` |
| `Tok_NotEqual` | `!=` |
| `Tok_Assign` | `=` |
| `Tok_Greater` | `>` |
| `Tok_Less` | `<` |
| `Tok_GreaterEqual` | `>=` |
| `Tok_LessEqual` | `<=` |
| `Tok_Or` | `\|\|` |
| `Tok_And` | `&&` |
| `Tok_Not` | `!` |
| `Tok_Semi` | `;` |
| `Tok_Int_Type` | `int` |
| `Tok_Bool_Type` | `bool` |
| `Tok_Print` | `print` |
| `Tok_If` | `if` |
| `Tok_Else` | `else` |
| `Tok_For` | `for` |
| `Tok_From` | `from` |
| `Tok_To` | `to` |
| `Tok_While` | `while` |
| `Tok_Add` | `+` |
| `Tok_Sub` | `-` |
| `Tok_Mult` | `*` |
| `Tok_Div` | `/` |
| `Tok_Pow` | `^` |
| `Tok_Bool` | `/true\|false/` |
| `Tok_Int` | `/-?[0-9]+/` |
| `Tok_ID` | `/[a-zA-Z][a-zA-Z0-9]*/` |
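Note that `true` and `false` also match the `Tok_ID` regexp, at the same length, so the longest-match rule alone does not separate keywords from identifiers. One common approach, sketched here with an abbreviated token type and an illustrative helper name, is to match the word with the identifier regexp first and then consult a keyword table:

```ocaml
(* Abbreviated token type; the real one lives in token.ml. *)
type token = Tok_While | Tok_If | Tok_Bool of bool | Tok_ID of string

(* Hypothetical keyword table: extend with the remaining keywords. *)
let keywords =
  [ ("while", Tok_While); ("if", Tok_If);
    ("true", Tok_Bool true); ("false", Tok_Bool false) ]

(* Classify a word matched by /[a-zA-Z][a-zA-Z0-9]*/. *)
let token_of_word w =
  match List.assoc_opt w keywords with
  | Some tok -> tok
  | None -> Tok_ID w
```

Under this sketch, `token_of_word "true"` yields `Tok_Bool true`, while a longer word such as `"truex"` falls through to `Tok_ID "truex"`.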
## Turning in the Assignment

To submit your assignment, create a zip file of a DIRECTORY named
project3-handin containing ONLY the project-related source
files. Then submit that file to the appropriate folder on D2L.