Markov Chains

Due:
11:00pm, Friday February 18, 20222

Description

Write a program named p2.py that trains a 1st order Markov chain based on a text file and then generates text from the trained model. The program must take the file name for training the model as the first command line argument and the number of words to generate for the output as the second command line argument.

Here is a description of the process. Given some text, for example

Jack and Jill went up the hill
To fetch a pail of water.
Jack fell down and broke his crown,
And Jill came tumbling after.
Then up got Jack and said to Jill,
As in his arms he took her,
Brush off that dirt for you’re not hurt,
Let’s fetch that pail of water.
So Jack and Jill went up the hill
To fetch the pail of water,
And took it home to Mother dear,
Who thanked her son and daughter.

We analyze the source text and for each word we keep track of the words that follow and how many times those words appear. The words are typically converted to lowercase before analysis to get more accurate counts. Based on the example source text above, here is the analysis of the word “and”:

and
- broke: 1
- daughter: 1
- jill: 3
- said: 1
- took: 1

Once all the words are analyzed, the numbers are normalized to probabilities:

and
- broke: 0.14
- daughter: 0.14
- jill: 0.42
- said: 0.14
- took: 0.14

To generate text, we randomly determine the first word and then use the probabilities to randomly choose the next word, and so on until the number of desired words is generated.

The program must create a text file named statistics.txt that contains the computed probabilities. The words must be in alphabetical order separated by newlines with the probabilities in the following form:

a
- pail: 1.00

and
- broke: 0.14
- daughter: 0.14
- jill: 0.42
- said: 0.14
- took: 0.14

as
- in: 1.00

...

The program must print the generated text to standard out.

Turning in the Assignment

For this assignment, you must turn in a zip file of a directory named project2 containing the following files:

  1. p2.py
  2. corpus.txt – a body of text that you used for testing your program

Submit the zip file to the appropriate folder on D2L.

Note: if your solution is based on pseudo-code (or code) from a source other than the textbook or the lecture slides, then you must cite the source of the pseudo-code. Otherwise, you will receive a grade of zero for this assignment.