Welcome to Assignment 3!

In this assignment, your primary goal is to implement a bag-of-words Naive Bayes classifier with Laplace smoothing and evaluate its performance on a few datasets. While you're working on this, notice the interesting similarities to the language modeling we did last week!

Your primary source of inspiration for implementing the classifier will be Figure 4.2 in Chapter 4 of SLP, which gives pseudocode for the training and testing functions. The pseudocode is conceptually clear about what needs to be done, but translating it into actual running Python code will be less direct than it was for the pseudocode in A1. Make sure to think about what each piece is doing and why at each stage, as well as how best to store the intermediate computations.

Your evaluation function will use the equations for precision, recall, and F1 score (equation 4.16).
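As a reminder, with respect to the target class these are $\text{Precision} = \frac{TP}{TP+FP}$, $\text{Recall} = \frac{TP}{TP+FN}$, and $F_1 = \frac{2PR}{P+R}$, where $P$ and $R$ are precision and recall.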

The primary dataset you will use is a corpus of text messages from the Haiti Earthquake in 2010, and the classification task will be to determine whether each message is requesting help (relevant) or not (irrelevant). The data was collected in a real-world implementation of crowdsourcing and text classification called Mission 4636, and you can read more about it in an associated paper here as well.

Like in the last assignment, the primary code you'll be working on is in a NaiveBayesClassifier class.

There is a small interface given so you can test your program by running:

python naive_bayes.py

This will instantiate the classifier class, train it on the training set, and print out its performance on the development set. I have also implemented a print_top_features function which will display the features your model learns as most important. (If you want that challenge for yourself, you can delete this function and re-implement it as an extension!)

I have set the test set aside; it will be run with the autograder on Quest.

Your Jobs

train

Follow the TrainNaiveBayes pseudocode to update the relevant class variables: self.vocabulary, self.logprior, and self.loglikelihood. Note that to match up with the autograder, the keys for self.loglikelihood should be tuples of the form (w, c), where w is the word string and c is the class string.
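To make the mapping from pseudocode concrete, here is a rough sketch of what train might look like. The data attributes it uses (self.train_data as a list of (tokens, label) pairs, self.vocabulary as a set, self.logprior and self.loglikelihood as dicts) are assumptions, not necessarily how the starter code is organized, so adapt accordingly:

    import math
    from collections import Counter, defaultdict

    def train(self):
        # Sketch of TrainNaiveBayes (SLP Fig. 4.2) with add-1 (Laplace) smoothing.
        # Assumes self.train_data is a list of (tokens, label) pairs and that
        # self.vocabulary is a set -- adapt to the real starter code.
        class_doc_counts = Counter()          # N_c: documents per class
        word_counts = defaultdict(Counter)    # count(w, c) for each class
        for tokens, c in self.train_data:
            class_doc_counts[c] += 1
            for w in tokens:
                self.vocabulary.add(w)
                word_counts[c][w] += 1

        n_docs = sum(class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        for c in class_doc_counts:
            self.logprior[c] = math.log(class_doc_counts[c] / n_docs)
            total = sum(word_counts[c].values())
            for w in self.vocabulary:
                # Laplace-smoothed log P(w|c), keyed by (word, class) tuples
                self.loglikelihood[(w, c)] = math.log(
                    (word_counts[c][w] + 1) / (total + vocab_size)
                )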

score

Return the summed log-probability for a given document; this is analogous to the inside of the for loop in the TestNaiveBayes pseudocode.
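A minimal sketch, assuming the document arrives as a list of tokens and that score takes the class label as a second argument (check the starter code for the actual signature):

    def score(self, document, c):
        # Summed log-probability of this document under class c,
        # i.e., the body of the class loop in TestNaiveBayes.
        total = self.logprior[c]
        for w in document:
            # Words never seen in training are skipped, as in the pseudocode
            if w in self.vocabulary:
                total += self.loglikelihood[(w, c)]
        return total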

predict

Return the most likely class for a given document; this should use your score function in a loop over the possible classes.
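For example, something along these lines, assuming the set of classes can be read off the keys of self.logprior:

    def predict(self, document):
        # Return the class whose score() is highest for this document
        scores = {c: self.score(document, c) for c in self.logprior}
        return max(scores, key=scores.get)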

evaluate

This function evaluates the performance (precision, recall, F1 score) of the model on a given test set. Note that these metrics are computed with reference to a particular 'target' class, which plays the role of Positive in the True/False Positive/Negative notation from the book.
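A sketch of the idea, assuming the test set is a list of (tokens, gold_label) pairs and that the target class is passed in as a string (the default label name here is just a placeholder):

    def evaluate(self, test_set, target='relevant'):
        # Precision, recall, and F1 with respect to the target class.
        tp = fp = fn = 0
        for tokens, gold in test_set:
            pred = self.predict(tokens)
            if pred == target and gold == target:
                tp += 1
            elif pred == target and gold != target:
                fp += 1
            elif pred != target and gold == target:
                fn += 1
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        return precision, recall, f1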

Autograder

The autograder will create an instance of your NaiveBayesClassifier class, train it on the training data, and test it on a hidden test set.

python /projects/e31408/autograders/a3.py

At a minimum, your assignment should run and pass all tests using this autograder before you submit!

Alternative Models (Optional)

For those of you who have done Naive Bayes before and want to sink your teeth into something new, or just want a more substantial extension to the base assignment, I recommend you try out the Perceptron, specifically the Averaged Perceptron given in Algorithm 4 of Chapter 2 in NLP Notes. The Perceptron is interesting as a precursor to modern neural networks: it is like implementing a one-neuron network!

We'll discuss this in class, but to reiterate: this is a discriminative model, rather than a generative one like Naive Bayes. Roughly, this means it is more concerned with getting the right answer than with faithfully estimating the distribution of the training data.

So in Naive Bayes, you can think of the probabilities the model learns for each word as weights on that word: each time the word appears in a document, we add its log-probability (i.e., its weight) to our running sum for the current prediction.

By contrast, the Perceptron is mistake-driven. Each word still has a weight, but the weights have no inherent probabilistic meaning; we learn them by updating them in the appropriate direction whenever we make an incorrect prediction.
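To illustrate the update rule, here is a minimal sketch of the basic (non-averaged) binary perceptron; the feature representation and +1/-1 label encoding are assumptions for the example, and the averaged variant in Algorithm 4 additionally keeps a running sum of the weights:

    from collections import defaultdict

    def predict_sign(weights, features):
        # Weighted sum of feature counts; return +1 or -1
        total = sum(weights[f] * count for f, count in features.items())
        return 1 if total >= 0 else -1

    def perceptron_epoch(weights, data):
        # One pass over data: a list of (feature_counts_dict, label) pairs,
        # with labels encoded as +1/-1. Weights only change when a prediction
        # is wrong -- that's the mistake-driven part.
        for features, gold in data:
            if predict_sign(weights, features) != gold:
                for feat, count in features.items():
                    weights[feat] += gold * count   # nudge toward the gold label
        return weights

    # e.g. weights = defaultdict(float); run perceptron_epoch for several epochs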

The NLP Notes description might be a little hard to follow. In class I will talk through these slides from Graham Neubig, which are somewhat more direct and I think will help you get the idea.

The assignment zip also provides starter code for the Perceptron, with some commented explanations of the general strategy. One way to know you're doing it right is that this model should actually achieve better performance than Naive Bayes!

Extensions

For those about to (continue to) rock, we salute you.

As usual, please leave your working classifier class intact: once your initial implementation works, do e.g. cp naive_bayes.py naive_bayes_ext.py and edit the _ext.py file instead of the original.

  1. Show High- and Low-Probability Documents. Create a function that prints out documents in the dev set that your model thinks are particularly likely to be relevant or irrelevant. Do they look reasonable? Do the model predictions match the real labels?

  2. Error Analysis. Look qualitatively at the cases that your model(s) get wrong in the dev set - you could use an adapted version of the above extension to help find examples - and write up a short explanation of what you find. Why are you getting them wrong? Do you see any systematic issues, and/or have any hypotheses about what you could do to improve performance?

  3. Use your LM. We discussed in class the strong similarities between n-gram language models and the Naive Bayes classifier. Try copying over your language_modeling.py code and directly importing and using it to estimate the log-likelihood portion of the Naive Bayes equation. You can train one instance per class and then use your predict_ functions to return log-probabilities that capture $P(d|c)$. (A rough sketch of this idea appears below the list.)

  4. Implement Bigram Naive Bayes. Augment your model to use bigram features instead of (or in addition to) unigram features. Do they improve performance? If you did the above extension this would be as simple as a one-variable change in your code (e.g. a call to predict_unigram becomes predict_bigram).

  5. Try Other Datasets. A larger IMDB movie review dataset for sentiment is also given in an analogous format in /projects/e31408/data/a3/imdb. Try running your models on this new dataset - do they do better or worse at predicting positive reviews than they do at predicting requests for aid? Why do you think they differ? We talked in class about a few simple manipulations (binarization, handling negation) you can do to optimize for sentiment in particular - do these tweaks help?

  6. Implement Cross-Validation. The book discusses the idea of cross-validation: estimating performance by splitting the training data into folds and training/testing multiple times. Try implementing this and see whether your average performance across folds is similar to or worse than your dev set performance. (A sketch of the fold loop also appears below the list.)

  7. Read and report. Read a relevant article, post about it on Ed!
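For extension 3, the rough shape might be something like the following. The LanguageModel class name and its train()/predict_unigram() interface are assumptions about your A2 code, so adapt them to whatever your language_modeling.py actually exposes:

    import math
    from language_modeling import LanguageModel   # assumed class name from A2

    class LMNaiveBayes:
        # One language model per class supplies the log-likelihood term log P(d|c).
        def train(self, train_data):
            # train_data: list of (tokens, label) pairs
            class_docs, class_counts = {}, {}
            for tokens, c in train_data:
                class_docs.setdefault(c, []).append(tokens)
                class_counts[c] = class_counts.get(c, 0) + 1
            n_docs = len(train_data)
            self.logprior = {c: math.log(n / n_docs) for c, n in class_counts.items()}
            self.class_lms = {}
            for c, docs in class_docs.items():
                lm = LanguageModel()
                lm.train(docs)                    # assumed training interface
                self.class_lms[c] = lm

        def predict(self, tokens):
            scores = {c: self.logprior[c] + lm.predict_unigram(tokens)  # ~ log P(d|c)
                      for c, lm in self.class_lms.items()}
            return max(scores, key=scores.get)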
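For extension 6, a minimal k-fold loop might look like this; the way the classifier is constructed, trained, and evaluated here is an assumption, so adapt it to your own class's interface:

    def cross_validate(data, k=5):
        # data: list of (tokens, label) pairs; returns average F1 across k folds
        fold_size = len(data) // k
        f1_scores = []
        for i in range(k):
            dev_fold = data[i * fold_size:(i + 1) * fold_size]
            train_folds = data[:i * fold_size] + data[(i + 1) * fold_size:]
            clf = NaiveBayesClassifier()
            clf.train_data = train_folds          # assumed attribute name
            clf.train()
            precision, recall, f1 = clf.evaluate(dev_fold)
            f1_scores.append(f1)
        return sum(f1_scores) / len(f1_scores)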

And as usual, whatever else you can dream up is also welcome!

Submission

To officially submit your assignment, please flip the complete variable to True!

As always, if it has to be late, edit the expected_completion_date.

Lastly, if you did any extensions, please briefly explain what you did and where we should look to check out your work in the extensions_description variable.