Final Project

Due:: TBD

Description

You will build a system to solve a well-defined task. Which task you choose is completely open-ended, but the methods you use should draw on the ones from the course.

The final project must consist of the following stages:

Task definition: What does your system do (what is its input and output)? What real-world problem does this system try to solve?

Make sure that the scope of the project is not too narrow or broad. For example, building a system to answer any natural language question is too broad, whereas answering short factoid questions about movies is more reasonable. This is probably the most important part of the project.

The first task you might come up with is to apply binary classification on some standard dataset. This is probably not enough. If you are thinking in terms of binary classification, you are probably thinking too narrowly about the task. For example, when recommending news articles, you might not want to make predictions for individual articles, but might benefit from choosing a diverse set of articles.

An important part of defining the task is the evaluation. In other words, how will you measure the success of your system? For this, you need to obtain a reasonably sized dataset of example input-output pairs, either from existing sources, collecting one from scratch, or via simulation. A natural evaluation metric is accuracy, but it could be memory or running time. How big the dataset is depends on your task at hand.
Infrastructure: In order to do something interesting, you have to set up the infrastructure. For machine learning tasks, this involves collecting data (either by scraping, using crowdsourcing, or hand labeling). For game-based tasks, this involves building the game engine/simulator. While infrastructure is necessary, try not to spend too much time on it. You can sometimes take existing datasets or modify existing simulators to save time, but if you want to solve a task you care about, this is not always an option. Note that if you download existing datasets which are already preprocessed (for example, Kaggle), then you will be expected to do more with the project.
Approach: Identify the challenges of building the system and the phenomena in the data that you’re trying to capture. How should you model the task (for example, MDPs, machine learning, Bayes’ nets, etc.)? There will be many ways to do this, but you should pick one or two and explain how the methods address the challenges as well as any pros and cons. What algorithms are appropriate for handling the models that you came up with, and what are the tradeoffs between accuracy and efficiency? Are there any implementation choices specific to your problem?

Before developing your primary approach, you should implement baselines and oracles. These are really important as they give you intuition for how easy or hard the problem you’re solving is. Intuitively, baselines give lower bounds on the performance you will obtain and oracles give upper bounds. If this gap is too small, then you probably don’t have a good task. Importantly, baselines and oracles should be relatively easy to implement and can be done before you invest a lot of time in implementing a fancier approach. This can prune out problems early and save you a lot of time!

Baselines are simple algorithms, which might include using a small set of hand-crafted rules, training a simple classifier, etc. (Note that baselines are extremely simple, but you might be surprised at how effective they are.) While a predictor that guesses randomly provides a lower bound (and can be reported in the paper), it is too simple and doesn’t give you much information. Predicting the majority label is a slightly less trivial baseline, and whether it’s acceptable depends on how insightful it is. For classification, if the different labels have very different proportions, then it could be useful; otherwise it won’t be. You are encouraged to have multiple baselines.

Oracles are algorithms that “cheat” and look at the correct answer or involve humans. For human-like classification problems (for example, sentiment classification), you can try to annotate ~50 examples and measuring the agreement rate. Note that some tasks are subjective, so even though humans are providing ground truth labels, human accuracy will not be 100%. When the classification problem is not human-like, you can try to use the training error of an expressive classifier (for example, nearest neighbors) as a proxy for oracle error. The idea is that if you cannot fit the training data using an expressive classifier, there is probably a lot of noise in your dataset, and you have a slim chance of building any classifier that does well on test. While returning 100% is an upper bound, it is not a valid oracle since it is vacuously an upper bound. Sometimes, oracles might be difficult to come by. If you think that no good oracles exist, explain why.

Overall, the main point of baselines and oracles is to get you to look at the problem carefully and think about what’s possible. The accuracies of state-of-the-art systems on the dataset could be either a baseline or an oracle. Sometimes, there are data points that are neither baselines nor oracles: for example, in a two component system, you use an oracle for one and a baseline for another.
Error analysis: Design a few experiments to show the properties (both pros and cons) of your system. For example, if your system is supposed to deal with graphs with lots of cycles, then construct both examples with lots of cycles and ones without to test your hypothesis. Each experiment should ask a concise question, such as: Do we need to model the interactions between the ghosts in Pac-Man? or How well does the system scale up to large datasets? Analyze the data and show either graphs or tables to illustrate your point. What’s the take-away message? Were there any surprises?

An Example Strategy

This is a suggestion of how to approach the final project with an example.

Pick a topic that you’re passionate about (for example, food, language, energy, politics, sports, card games, robotics). As a running example, say we are interested in how people read the news to get their information.
Brainstorm to find some tasks on that topic: ask “wouldn’t it be nice to have a system that does X?” or “wouldn’t it be nice to understand X?” A good task should not be too easy (sorting a list of numbers) and not too hard (building a system that can automatically solve programming assignments). Please ask me for feedback on finding the right balance. Let us focus on recommending news to people.
Define the task you’re trying to solve clearly and convince yourself (and a few friends) that it’s important/interesting. Also state your evaluation metric – how will you know if you have succeeded or not? Concentrate on a small set of popular news sites: nytimes.com, slashdot.org, sfgate.com, onion.com, etc. For each user and each day, assume we have acquired a set of articles that the user is interested in reading (training data). Our task is to predict for a new day, given the full set of articles, the best subset to show the user; evaluation metric would be prediction accuracy.
Gather and clean the necessary data (this might involve scraping websites, filtering outliers, etc.). This step can often take an annoyingly large amount of time if you’re not careful, so do not try to get bogged down here. Simplify the task or focus on a subset of the data if necessary. You might find yourself adjusting the task you’re trying to solve based on new empirical insights you get by looking at the data. Notice that even if you’re not doing machine learning, it’s necessary to have data for evaluation purposes. Write some scripts that download the RSS feeds from the news sites, run some basic NLP processing (for example, tokenization), say, using NLTK.
Implement a baseline algorithm. For a classification task, this would be always predicting the most common label. If your baseline is too high, then your task is probably too easy. One baseline is to always produce the first document from each news site. Also implement an oracle, for example, recommending the document based on the number of comments. This is an oracle because you would not have the number of comments at the time you actually wanted to recommend the article!
Formulate a model and implement the algorithm for that model. You should try several variants and compare them. Remember to try as much as possible to separate model (what you want to compute) from algorithms (how you do it). You might train a classifier to predict, for each news article, whether to include it or not. You might try to include these predictions as factors in a weighted CSP and try to find a set of articles that balance diversity and relevance.
Perhaps the most important part of the project is the final step, which is to analyze the results. It’s more important that you do a thorough analysis and interpret your results rather than implement a huge number of complicated heuristics in trying to eke out the maximum performance. The analysis should begin with basic facts, for example, how much time/memory did the algorithm take, how does the accuracy vary with the amount of training data? What are the instances that your system does the worst on? Give concrete examples and try to understand why. Is there a bottleneck? Is it due to lack of training data?

Some Project Ideas

Predict the price of airline ticket prices given day, time, location, etc.
Predict the amount of electricity consumed over the course of a day.
Auto-complete code when you’re programming.
Answer natural language questions for a restricted domain (for example, movies, sports).
Search for a mathematical theorem based on an expression which normalizes over variable names.
Find the optimal way to get from one place in town to another place, taking into account uncertain travel times due to traffic.
Build an engine to play Go, chess, poker, etc.
Break substitution codes based on knowledge of English.
Automatically generate the harmonization of a melody.
Generate poetry on a given topic.

Turning in the Assignment

For this assignment, you must turn in a zip file of a directory named final-project.zip containing the following files:

final-project.ipynb
any additional files (libraries, datasets, etc.)

Submit the zip file to the appropriate folder on D2L.

Grading

Your project will be graded on the following criteria:

Task definition: is the task precisely defined and is the motivation for the task clear?
Approach: was a baseline, an oracle, and an advanced method described clearly, well justified, and tested?
Data and experiments: have you explained the data clearly, performed systematic experiments, and reported concrete results?
Analysis: did you interpret the results and try to explain why things worked (or didn’t work) the way they did? Do you show concrete examples?

Of course, the experiments may not always be successful. Getting negative results is normal, and as long as you make a reasonably well-motivated attempt and you explained why the results came out negative, you will get credit.

Note that these criteria must be described using markdown cells in the Jupyter notebook submission.