In this assignment you'll implement a Latent Dirichlet Allocation (LDA) topic model using collapsed Gibbs sampling for the learning procedure, and try it out on a toy dataset as well as some more real-world data.
If you aren't already, please ensure you're running module load python/anaconda3.6 each time you work on this assignment on Quest (or simply put that command as a line in your .bashrc file).
The starter code for this assignment is available at https://faculty.wcas.northwestern.edu/robvoigt/courses/2023_spring/ling334/assignments/a6.zip. Rob will discuss topic models in the last week of class, but to do this assignment you will also need to read and refer to Applications of Topic Models Chapter 1, and most specifically sections 1.4 and 1.5.
Here you will fill in the LDATopicModel class given in lda.py to implement the functions necessary to train our topic model.
I recommend proceeding in the following order:
__init__, filter_docs, print_topics
You don't need to edit these functions, but they set up the class attributes that will be available to you in running the code, so it's worth looking through and understanding what you're working with. print_topics is set up to show you the top associated words for each topic as your model trains.
initialize_counts, increment_counts, decrement_counts
The initialize_counts function has a long comment at the top explaining what you're intended to do here. Essentially, to compute the probabilities we need later on, we have to keep track of a series of counts, e.g. how long each document is, how many times a certain topic has been assigned to a certain word in a document, etc. Read this carefully and try to understand exactly what we want to keep track of and why.
increment_counts and decrement_counts are helper functions you need to implement, that do most of this counting, specifically for self.doc_topic_counts, self.topic_word_counts, self.topic_counts, and self.doc_lengths. In initialize_counts it might be helpful to use your increment_counts function; you'll use decrement_counts at training time.
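To make the bookkeeping concrete, here is a minimal standalone sketch of the increment/decrement pattern. The module-level dictionaries and the (doc, word, topic) argument order are assumptions for illustration only; the attributes in the starter code may be shaped differently, so treat this as a picture of the idea rather than the required implementation.

```python
from collections import defaultdict

# Hypothetical standalone containers (not the starter code's attributes).
doc_topic_counts = defaultdict(lambda: defaultdict(int))   # doc index -> topic -> count
topic_word_counts = defaultdict(lambda: defaultdict(int))  # topic -> word -> count
topic_counts = defaultdict(int)                            # topic -> total words assigned
doc_lengths = defaultdict(int)                             # doc index -> words currently counted


def increment_counts(doc, word, topic):
    """Record that `word` in document `doc` is assigned to `topic`."""
    doc_topic_counts[doc][topic] += 1
    topic_word_counts[topic][word] += 1
    topic_counts[topic] += 1
    doc_lengths[doc] += 1


def decrement_counts(doc, word, topic):
    """Undo exactly what increment_counts did for the same triple."""
    doc_topic_counts[doc][topic] -= 1
    topic_word_counts[topic][word] -= 1
    topic_counts[topic] -= 1
    doc_lengths[doc] -= 1


# Assign two words of document 0 to topic 1, then take one assignment back.
increment_counts(0, "cat", 1)
increment_counts(0, "dog", 1)
decrement_counts(0, "cat", 1)
```

The point of keeping the two helpers as exact mirrors is that a decrement followed by an increment always leaves every counter consistent with the current assignments.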
theta_d_i and phi_i_v
These correspond to equations 1.2 and 1.3 on pg. 15 of Applications of Topic Models. If you understand what each of our counters should be accumulating, you should be able to turn these equations into some code here!
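As a rough illustration, both quantities have the usual "smoothed count over smoothed total" shape. The sketch below assumes symmetric scalar priors alpha and beta and uses hypothetical standalone functions named theta and phi that take plain numbers; check the exact form against the equations in the text and against how the starter code stores its hyperparameters.

```python
def theta(doc_topic_count, doc_length, num_topics, alpha):
    """Smoothed share of a document's words assigned to one topic.

    Assumes a symmetric prior: the same alpha for every topic.
    """
    return (doc_topic_count + alpha) / (doc_length + num_topics * alpha)


def phi(topic_word_count, topic_count, vocab_size, beta):
    """Smoothed share of a topic's assignments that are one word type.

    Assumes a symmetric prior: the same beta for every word.
    """
    return (topic_word_count + beta) / (topic_count + vocab_size * beta)


# A 10-word document with 3 words in topic 0 and 7 in topic 1.
thetas = [theta(count, 10, num_topics=2, alpha=0.1) for count in (3, 7)]
```

A quick sanity check on any implementation: summed over all topics, the theta values for one document should come to 1, and summed over the whole vocabulary, the phi values for one topic should too.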
train
The comment at the top here explains the core procedure, and the loop over iterations (with topics printing out) is set up for you. Having set up all your structure above, you won't need very many lines of code here, but be careful with them!
A conceptual reminder on collapsed Gibbs sampling: the core idea is that we go through the entire corpus word-by-word multiple times, re-sampling topic assignments for each word. So for as many iterations as we're doing, for each document, for each word, we want to calculate equation 1.4 on pg. 16 of Applications of Topic Models for each possible topic, and use the results as the weights argument in taking a sample from the possible topics using the random.choices function.
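If you haven't used random.choices before, here is a small sketch of drawing one topic from unnormalized weights. The topic ids and weight values are made up for illustration.

```python
import random

topics = [0, 1, 2]
# Made-up per-topic scores; random.choices only needs relative weights,
# so they do not have to sum to 1.
weights = [0.02, 0.30, 0.08]

random.seed(0)  # fixed seed only so repeated runs draw the same topic
# random.choices always returns a list, even for k=1, so take element 0.
new_topic = random.choices(topics, weights=weights, k=1)[0]
```

Because the weights are only relative, you can pass in the per-topic products directly without normalizing them first.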
Two important notes:
Use your theta_d_i and phi_i_v functions here.
Use our decrement_counts function before doing the sampling for the current word, so that the word's current assignment is excluded from the counts, and then use our increment_counts function after we've made the new sample.
Over time, the contrasting goals encoded in that equation (documents should have a few key topics, topics should have a few key words) will lead our assignments to become better and better representations of the proposed latent topic distribution.
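Putting the pieces together, the decrement, sample, increment cycle can be sketched on a tiny corpus as follows. Everything here is a hypothetical standalone setup (the make_counts, update, and gibbs_sweep helpers, the dictionary layout, and the symmetric alpha and beta values are all assumptions), not the structure of the LDATopicModel class.

```python
import random


def make_counts(num_docs, num_topics):
    """Hypothetical count tables for a toy corpus."""
    return {
        "doc_topic": [[0] * num_topics for _ in range(num_docs)],
        "topic_word": [dict() for _ in range(num_topics)],
        "topic": [0] * num_topics,
    }


def update(counts, d, word, topic, delta):
    """Add delta (+1 or -1) to every count touched by one assignment."""
    counts["doc_topic"][d][topic] += delta
    counts["topic_word"][topic][word] = counts["topic_word"][topic].get(word, 0) + delta
    counts["topic"][topic] += delta


def gibbs_sweep(docs, assignments, counts, num_topics, vocab_size, alpha, beta, rng):
    """One pass over every word: decrement, compute weights, sample, increment."""
    for d, doc in enumerate(docs):
        for n, word in enumerate(doc):
            # Remove the word's current assignment from the counts.
            update(counts, d, word, assignments[d][n], -1)
            weights = []
            for k in range(num_topics):
                # Document side: len(doc) - 1 because this word was just removed.
                doc_part = (counts["doc_topic"][d][k] + alpha) / (
                    len(doc) - 1 + num_topics * alpha
                )
                # Topic side: how strongly topic k is associated with this word.
                word_part = (counts["topic_word"][k].get(word, 0) + beta) / (
                    counts["topic"][k] + vocab_size * beta
                )
                weights.append(doc_part * word_part)
            new_topic = rng.choices(range(num_topics), weights=weights, k=1)[0]
            assignments[d][n] = new_topic
            # Put the word back with its newly sampled topic.
            update(counts, d, word, new_topic, +1)


docs = [["cat", "dog", "cat"], ["stock", "bond", "stock"]]
num_topics = 2
vocab_size = len({w for doc in docs for w in doc})
rng = random.Random(0)

# Random initial assignments, recorded in the counts as we go.
counts = make_counts(len(docs), num_topics)
assignments = []
for d, doc in enumerate(docs):
    assignments.append([])
    for word in doc:
        topic = rng.randrange(num_topics)
        assignments[d].append(topic)
        update(counts, d, word, topic, +1)

for _ in range(20):
    gibbs_sweep(docs, assignments, counts, num_topics, vocab_size, 0.1, 0.01, rng)
```

Note that the total number of assignments never changes across sweeps; only which topic each word belongs to moves around. If your counts drift over iterations, a missing or doubled increment or decrement is the likely culprit.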
The assignment file has a main method at the bottom which you can edit to try out the model in different ways and see what it's able to learn. The basic dataset that will run is a very small list of constructed examples about animals; 20 newsgroups is a classic dataset for the task of topic modeling and the code already has functionality to load it in and run on it or a subset.
The autograder will also check pieces of your LDATopicModel class to try and ensure things look reasonable, as well as the final outputs.
python /projects/e31408/autograders/a6.py
At a minimum, your assignment should run and pass all tests using this autograder before you submit!
To infinity and beyond, no?
As usual, please leave your working code intact and add _ext.py for any instance in which you need to change basic functionality for extensions.
filter_documents; how does it improve or hurt the subjective quality of the induced topics if we, for instance, don't filter the documents at all? Or let more words through the filter?
__init__ to have your model load and run on some other dataset. Search around on the web and see if you can find any interesting datasets, put them in your Quest dir, write some functionality to load them in, and see how your model can do.
self.theta_d_i. Does this help or hurt performance?
And as usual, whatever else you can dream up is also welcome!
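If you want to experiment with filtering, one common approach is to drop words by document frequency. The sketch below is a hypothetical standalone helper (the name filter_by_doc_frequency and both thresholds are made up for illustration); it is not how the starter code's filter works, which you should read for yourself.

```python
from collections import Counter


def filter_by_doc_frequency(docs, min_df=2, max_df_ratio=0.5):
    """Keep words found in at least min_df docs and at most max_df_ratio of all docs."""
    # Count each word once per document it appears in.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    max_df = max_df_ratio * len(docs)
    keep = {w for w, df in doc_freq.items() if min_df <= df <= max_df}
    return [[w for w in doc if w in keep] for doc in docs]


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]
filtered = filter_by_doc_frequency(docs)
```

Here "the" is dropped for appearing in every document and the one-off words are dropped for being too rare, so loosening either threshold lets more words through, which is the kind of change whose effect on topic quality you could compare.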
To officially submit your assignment, please flip the complete variable to True!
This assignment still has an expected_completion_date variable but please note that for this assignment the date is a hard deadline for our grading purposes.
Lastly, if you did any extensions, please briefly explain what you did and where we should look to check out your work in the extensions_description variable.