Welcome to Assignment 6! (aka Alternative Final Project)

In this assignment you'll implement a Latent Dirichlet Allocation (LDA) topic model using collapsed Gibbs sampling for the learning procedure, and try it out on a toy dataset as well as some more real-world data.

If you aren't already, please ensure you're running module load python/anaconda3.6 each time you work on this assignment on Quest (or simply put that command as a line in your .bashrc file).

The starter code for this assignment is available at https://faculty.wcas.northwestern.edu/robvoigt/courses/2024_winter/ling334/assignments/a6.zip. Rob will discuss topic models in class, but to do this assignment you will also need to read and refer to Chapter 1 of Applications of Topic Models, specifically sections 1.4 and 1.5.

Your Jobs

Here you will fill in the LDATopicModel class given in lda.py to implement the functions necessary to train our topic model.

I recommend proceeding in the following order:

Familiarize yourself - __init__, filter_docs, print_topics

You don't need to edit these functions, but they set up the class attributes that will be available to you in running the code, so it's worth looking through and understanding what you're working with. print_topics is set up to show you the top associated words for each topic as your model trains.

initialize_counts, increment_counts, decrement_counts

The initialize_counts function has a long comment at the top explaining what you're intended to do here. Essentially, to compute the quantities we'll need later on, we have to keep track of a series of counts: e.g., how long each document is, how many times a certain topic has been assigned to a certain word in a document, and so on. Read this carefully and try to understand exactly what we want to keep track of and why.

increment_counts and decrement_counts are helper functions you need to implement that do most of this counting, specifically for self.doc_topic_counts, self.topic_word_counts, self.topic_counts, and self.doc_lengths. In initialize_counts it might be helpful to use your increment_counts function; you'll use decrement_counts at training time.
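
To make the bookkeeping concrete, here is a minimal sketch of what these two helpers might look like. It assumes the counters are dictionary-like structures keyed as their names suggest; the exact signatures and data structures in the starter code may differ, so treat this as an illustration rather than a drop-in solution.

    def increment_counts(self, doc_id, word, topic):
        # One more word in this document is assigned to this topic
        self.doc_topic_counts[doc_id][topic] += 1
        # One more instance of this word is assigned to this topic
        self.topic_word_counts[topic][word] += 1
        # One more word is assigned to this topic overall
        self.topic_counts[topic] += 1
        # This document is one word longer
        self.doc_lengths[doc_id] += 1

    def decrement_counts(self, doc_id, word, topic):
        # The exact mirror image of increment_counts
        self.doc_topic_counts[doc_id][topic] -= 1
        self.topic_word_counts[topic][word] -= 1
        self.topic_counts[topic] -= 1
        self.doc_lengths[doc_id] -= 1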

theta_d_i and phi_i_v

These correspond to equations 1.2 and 1.3 on pg. 15 of Applications of Topic Models. If you understand what each of our counters should be accumulating, you should be able to turn these equations directly into code here!
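
For reference, here is a sketch of how those equations might translate, assuming the count structures above, a single symmetric smoothing value for each prior, and hypothetical attribute names (self.alpha, self.beta, self.num_topics, self.vocab_size); check the starter code for the actual names, and check your implementation against equations 1.2 and 1.3 themselves:

    def theta_d_i(self, doc_id, topic):
        # Eq. 1.2: smoothed proportion of the words in this document
        # that are currently assigned to this topic
        return ((self.doc_topic_counts[doc_id][topic] + self.alpha) /
                (self.doc_lengths[doc_id] + self.num_topics * self.alpha))

    def phi_i_v(self, topic, word):
        # Eq. 1.3: smoothed proportion of this topic's assignments
        # that go to this word
        return ((self.topic_word_counts[topic][word] + self.beta) /
                (self.topic_counts[topic] + self.vocab_size * self.beta))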

train

The comment at the top here explains the core procedure, and the loop over iterations (with topics printing out) is set up for you. Having set up all your structure above, you won't need very many lines of code here, but be careful with them!

A conceptual reminder on collapsed Gibbs sampling: the core idea is that we pass through the entire corpus word by word multiple times, re-sampling the topic assignment for each word. So for as many iterations as we're doing, for each document, for each word, we want to calculate equation 1.4 on pg. 16 of Applications of Topic Models for each possible topic, and use the results as the weights argument when sampling from the possible topics with the random.choices function.

Two important notes:

  • Notice that equation 1.4 is simply multiplying together the results of equations 1.2 and 1.3, so you'll want to use your theta_d_i and phi_i_v functions here.
  • We want to make our Gibbs sampling estimates without reference to the current word, so we want to call our decrement_counts function before doing the sampling for the current word, and then call our increment_counts function after we've made the new sample. (A sketch of this inner loop follows below.)
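
Putting the pieces together, the inner loop might look roughly like the following. The names self.docs and self.assignments are hypothetical stand-ins for however the starter code stores the tokenized documents and the current topic assignments, and this assumes import random at the top of the file:

    # For every word token in the corpus, re-sample its topic assignment
    for doc_id, doc in enumerate(self.docs):
        for position, word in enumerate(doc):
            old_topic = self.assignments[doc_id][position]
            # Remove the current assignment before estimating (second note above)
            self.decrement_counts(doc_id, word, old_topic)
            # Eq. 1.4: the weight for each topic is theta * phi (first note above)
            weights = [self.theta_d_i(doc_id, t) * self.phi_i_v(t, word)
                       for t in range(self.num_topics)]
            # random.choices normalizes the weights and returns a one-element list
            new_topic = random.choices(range(self.num_topics), weights=weights)[0]
            self.assignments[doc_id][position] = new_topic
            # Record the new assignment in the counts
            self.increment_counts(doc_id, word, new_topic)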

Over time, the contrasting goals encoded in that equation (documents should have a few key topics, topics should have a few key words) will lead our assignments to become better and better representations of the proposed latent topic distribution.

Autograder

The assignment file has a main method at the bottom which you can edit to try out the model in different ways and see what it's able to learn. The basic dataset it will run on is a very small list of constructed examples about animals; 20 Newsgroups is a classic dataset for topic modeling, and the code already has functionality to load it in and run on the full set or a subset.

The autograder will also check pieces of your LDATopicModel class to try to ensure things look reasonable, as well as checking the final outputs.

python /projects/e31408/autograders/a6.py

At a minimum, your assignment should run and pass all tests using this autograder before you submit!

Extensions

To infinity and beyond, no?

As usual, please leave your working code intact and add an _ext.py file for any instance in which you need to change basic functionality for extensions.

  1. Run your model on the full 20 newsgroups dataset and write up some qualitative notes on your observations as to what the topics are able to pick up. Does it help to increase the number of iterations?
  2. Topic models have many interesting applications. Read one of the later chapters of Applications of Topic Models to learn more about a possible application area, and if inspired write up a note on Ed to share what you learned with other students in the class!
  3. Play around with filter_docs; how does it improve or hurt the subjective quality of the induced topics if we, for instance, don't filter the documents at all? Or let more words through the filter?
  4. Edit __init__ to have your model load and run on some other dataset. Search around on the web and see if you can find any interesting datasets, put them in your Quest dir, write some functionality to load them in, and see how your model can do.
  5. One thing to think about is that topic models can be used to generate feature representations: e.g., instead of a bag-of-words representation of a document as its features, each document is represented by its topic proportions. This would take some slightly complicated wiring together of your code, but you could try going back to the Haiti earthquake messages task from A3 and, instead of using bag-of-words features, use document-topic representations generated with self.theta_d_i (see the sketch after this list). Does this help or hurt performance?
  6. Check out this paper on using clustered embeddings as a topic model. The basic idea is to train embeddings on your corpus and run an algorithm like K-Means Clustering on those embeddings to find semantic clusters. You can then compare documents to these centroids in your embedding space. You could try this out as an alternative strategy and see if you get more coherent-seeming topics. This is probably reaching more toward the level of a final project, but something to consider!
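
For extension 5, the core wiring is small once the model is trained; a hypothetical sketch, reusing the attribute names assumed earlier:

    def document_features(self, doc_id):
        # Represent a document by its topic proportions rather than
        # by bag-of-words counts
        return [self.theta_d_i(doc_id, topic) for topic in range(self.num_topics)]

You would then feed these vectors to your A3 classifier in place of the bag-of-words features.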

And as usual, whatever else you can dream up is also welcome!

Submission

To officially submit your assignment, please flip the complete variable to True!

This assignment still has an expected_completion_date variable, but please note that here the date is a hard deadline for our grading purposes.

Lastly, if you did any extensions, please briefly explain what you did and where we should look to check out your work in the extensions_description variable.