In class we talked about the crowdsourcing of annotations, a now-ubiquitous approach in both CL/NLP and AI more generally for obtaining human judgements to use as training data. In this assignment, we'll implement the PMI metric for measuring word or phrase association, and then apply it to a standard, crowdsourced natural language understanding dataset to see how human bias might creep into our models.
This assignment is heavily inspired by HW1 from Yulia Tsvetkov's Computational Ethics for NLP course, which is in turn inspired by Rudinger et al. 2017. Once you've done the basic assignment, feel free to look to these sources for more inspiration, but I want you to explore this data yourself first and develop your own impressions, especially before reading the Rudinger paper.
The starter code for this assignment is available at http://faculty.wcas.northwestern.edu/robvoigt/courses/2024_winter/ling334/assignments/a4.zip.
The Stanford Natural Language Inference corpus collects pairs of human-generated sentences with labels for their entailment relations. In each pair, the first sentence is called the premise and the second sentence is called the hypothesis. Given that the premise is true, the question is whether the hypothesis:
- is definitely also true (entailment),
- might or might not be true (neutral), or
- is definitely false (contradiction).

The dataset was collected with Flickr images as real-world grounding. The images were, in an earlier crowdsourcing effort, annotated with human captions describing their contents. Then for SNLI, crowdworkers were asked to provide an alternative caption that was either an entailment, a contradiction, or neutral. Annotators had a lot of freedom in providing these new captions (which became the hypotheses in the dataset).
There is a version of the dataset on Quest here:
/projects/e31408/data/a4/snli_1.0
You can also see the SNLI website or paper for more information, or download the data to your home computer for development. If you're not working on Quest, you'll want to change the infile argument to your local path.
The starter code establishes a PMICalculator
class, which sets up a structure for reading the data and running analyses.
preprocess
A huge part of most projects in computational linguistics is understanding the dataset you're working with and developing processes to read it into a workable format. Therefore, in this assignment you'll need to start by looking at the SNLI data and figuring out how to meaningfully read it in, and you'll do that in this function.
You should use the '.jsonl' files, which are in JSON Lines format. To a first approximation, you want to read the input file line by line, and turn each line into a python dictionary with json.loads
. It will also potentially be useful to look at the README.txt
file in the SNLI directory, which explains various aspects of the formatting.
Importantly when doing this preprocessing, since we are dealing with short documents of one sentence each, we're only going to be interested in whether words appear or not in documents (binary count). Therefore we will use a representation that will be very efficient for calculating PMI, namely dictionaries that map word strings to a set
of document ids in which those words appear. This allows us to use operations like set intersection to determine in how many documents a given pair of words co-occur.
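As a tiny illustration of this representation (made-up sentence, not from the dataset; defaultdict just saves some bookkeeping):

```python
from collections import defaultdict

# word -> set of ids of documents in which that word appears
hypothesis_vocab_to_docs = defaultdict(set)

# pretend document 17's hypothesis is "a woman is cooking dinner"
for token in "a woman is cooking dinner".split():
    hypothesis_vocab_to_docs[token].add(17)

# with binary counts, c(w) is just the size of a word's document set,
# and the joint count c(w1, w2) is the size of the intersection of two sets
c_woman = len(hypothesis_vocab_to_docs["woman"])
c_joint = len(hypothesis_vocab_to_docs["woman"] & hypothesis_vocab_to_docs["cooking"])
```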
Important things to note:
- You should track the premise and hypothesis vocabularies separately (in the self.premise_vocab_to_docs and self.hypothesis_vocab_to_docs objects respectively), for reasons that will become clear when we calculate PMI next.
- The label_filter variable can be passed in as any of 'entailment', 'contradiction', 'neutral', or None. You have to implement your function such that when the label_filter variable is not None, you skip any documents where the gold label doesn't match the filter (see the sketch below). This allows us to later compare whether different categories of hypothesis sentences lead to different sorts of associations.
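Putting the pieces above together, a preprocess implementation might look roughly like this sketch. The field names (gold_label, sentence1 for the premise, sentence2 for the hypothesis) are the standard SNLI ones, but check README.txt to confirm; the attribute names self.infile, self.label_filter, and self.n_docs are guesses at what the starter code sets up, and the whitespace/lowercase tokenization is deliberately naive:

```python
import json

def preprocess(self):
    """Sketch: populate the word -> set-of-doc-ids dictionaries."""
    self.n_docs = 0  # assumed attribute name for N, the number of pairs kept
    with open(self.infile) as f:
        for doc_id, line in enumerate(f):
            doc = json.loads(line)
            # skip pairs whose gold label doesn't match the filter, if one was given
            if self.label_filter is not None and doc['gold_label'] != self.label_filter:
                continue
            self.n_docs += 1
            # naive tokenization: lowercase and split on whitespace
            # (assumes the two dicts were initialized as defaultdict(set))
            for token in doc['sentence1'].lower().split():
                self.premise_vocab_to_docs[token].add(doc_id)
            for token in doc['sentence2'].lower().split():
                self.hypothesis_vocab_to_docs[token].add(doc_id)
```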
pmi

In this function, use your accumulated counts to calculate pointwise mutual information. The equation for PMI is given in equation 6.17 in Chapter 6 of SLP. This provides the general form of PMI, but in this context, with a simplified representation (binary counts of words in documents), we can simplify the PMI equation to operate on counts alone:
$\mathrm{PMI}(w_i, w_j) = \log_2 \frac{N \cdot c(w_i, w_j)}{c(w_i) \cdot c(w_j)}$
Where $N$ is the number of premise-hypothesis pairs (i.e., the number of documents), and $c$ just refers to the count of a word or the joint co-occurrence count of a pair of words. For the mathy among you, feel free to convince yourself why this works. In Python you can use the math.log2
function to do the log calculation.
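As a quick sanity check on this formula, here is a worked example with made-up counts (not real SNLI numbers):

```python
import math

N = 10_000            # number of premise-hypothesis pairs
c_w1, c_w2 = 50, 40   # documents containing word1 and word2 respectively
c_joint = 20          # documents containing both

pmi = math.log2((N * c_joint) / (c_w1 * c_w2))
print(pmi)  # log2(100) ≈ 6.64: the words co-occur far more often than chance predicts
```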
We want to be able to calculate two types of PMI:
- When the cross_analysis argument is False, we look only at the hypothesis sentences. This asks, "how likely were annotators to use these two words together in generating a hypothesis sentence?"
- When the cross_analysis argument is True, we'll look for the first word among the premises and the second word among the hypotheses. This asks something more like, "given that an annotator saw word1 in the original caption, what words were they likely to use in the hypothesis that they generated in response?"

You should actually be able to do this without much duplicated code. One strategy is to make new variables containing the set of documents in which word1 appears (where you look depends on the cross_analysis argument) and the set in which word2 appears (always look at the hypotheses). Use these sets to calculate PMI.
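Following that strategy, the pmi method might look something like the sketch below; the parameter names mirror the description above, but the exact signature in the starter code may differ:

```python
import math

def pmi(self, word1, word2, cross_analysis=True):
    """Sketch: PMI between word1 and word2 from binary document counts."""
    # word1 comes from the premises when cross_analysis is True,
    # otherwise from the hypotheses; word2 always comes from the hypotheses
    word1_lookup = self.premise_vocab_to_docs if cross_analysis else self.hypothesis_vocab_to_docs
    word1_docs = word1_lookup.get(word1, set())
    word2_docs = self.hypothesis_vocab_to_docs.get(word2, set())

    joint = len(word1_docs & word2_docs)
    if joint == 0:
        return float('-inf')  # or 0.0 / None; decide how to handle words that never co-occur
    return math.log2((self.n_docs * joint) / (len(word1_docs) * len(word2_docs)))
```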
print_top_associations
Here you'll make a loop over all the hypothesis words and calculate their PMI with some given target word, printing out the top n values. You can use this function to show yourself associations for qualitative purposes; it is not checked in the autograder, and it's up to you to figure out how you want to do it! Of course, feel free to make other functions to manipulate the data in other sorts of ways if you're interested.
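One possible shape for it, with a minimum document-count cutoff so that very rare words (whose PMI estimates are noisy) don't swamp the list; both the cutoff and the output format are entirely your call:

```python
def print_top_associations(self, target, n=10, cross_analysis=True, min_count=10):
    """Sketch: print the n hypothesis words with the highest PMI with `target`."""
    scored = []
    for word, docs in self.hypothesis_vocab_to_docs.items():
        if len(docs) < min_count:
            continue  # skip rare words
        scored.append((self.pmi(target, word, cross_analysis=cross_analysis), word))
    for score, word in sorted(scored, reverse=True)[:n]:
        print(f'{score:.3f}\t{word}')
```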
Now you have a working program that can read the data, calculate PMI scores, and show you top word associations. Your job is to investigate the possibility of associational biases in the SNLI dataset and write up your qualitative findings in the qualitative_notes.txt file contained in the starter code. Your writeup does not have to be extensive, but it should represent a significant effort to use your code to look through the dataset and ask meaningful questions of it.
You can either edit and add to the main function at the bottom of the starter code, make a new script of your own that imports your class (e.g. from bias_audit import PMICalculator
), or play with your class in interactive python mode to examine associations.
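For example, in interactive Python or a small script of your own (the constructor arguments and the explicit call to preprocess are assumptions about the starter code's interface):

```python
from bias_audit import PMICalculator

# look only at contradiction hypotheses, and ask which hypothesis words
# are associated with a given premise word
calc = PMICalculator(infile='/projects/e31408/data/a4/snli_1.0/snli_1.0_train.jsonl',
                     label_filter='contradiction')
calc.preprocess()
calc.print_top_associations('woman', n=20, cross_analysis=True)
```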
Many social biases relate to stereotyped associations with identity classes. The assignment zip also contains a .txt file (compiled by Tsvetkov) with a list of identity terms you can explore. You do not have to be limited in your analyses to identity-based biases, but this is a very useful starting point. It may be informative to compare contrasting classes, for instance "men" vs "women", or "american" vs some other nationality.
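If you want to sweep over the whole identity-term list, a loop like the following sketch can generate a lot of material to skim (the filename here is a placeholder; use whatever the .txt file in the zip is actually called):

```python
from bias_audit import PMICalculator

# placeholder filename for the identity terms list included in the assignment zip
with open('identity_terms.txt') as f:
    identity_terms = [line.strip() for line in f if line.strip()]

train_path = '/projects/e31408/data/a4/snli_1.0/snli_1.0_train.jsonl'
for label in ['entailment', 'neutral', 'contradiction']:
    calc = PMICalculator(infile=train_path, label_filter=label)
    calc.preprocess()
    for term in identity_terms:
        print(f'--- {label}: {term} ---')
        calc.print_top_associations(term, n=10, cross_analysis=True)
```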
You can also read more about how the data was collected in the SNLI paper, in particular Figure 1 shows the annotation task. This is, in the first place, a very clever annotation setup! It allows untrained annotators to make high-quality linguistic judgments about entailment, a concept they've probably never heard of. But consider how this setup may have influenced annotators to make certain sorts of judgments.
Be sure to use the things you've implemented - use the label_filter to consider different slices of the data by entailment type, the cross_analysis to consider the different questions those analysis types imply, and so on.
dev $\rightarrow$ train
For development purposes, the starter code points to the file for the development set, which is small and will allow you to try things out quickly. Once your code works, change this to point to the training set (or pass in the training set file as an argument when running the constructor, e.g. PMICalculator(infile='/projects/e31408/data/a4/snli_1.0/snli_1.0_train.jsonl')). When doing all these analyses, you should apply your methods to the training set - this is much larger than the dev set, and is what models trained on this data would actually be learning from.
The autograder will create a few instances of your PMICalculator class and do some sanity checks that they seem to be doing what they should.
python /projects/e31408/autograders/a4.py
At a minimum, your assignment should run and pass all tests using this autograder before you submit!
Keep on keepin' on?
As usual, please leave your working calculator class intact and do e.g. cp bias_audit.py bias_audit_ext.py
once your initial implementation works, and edit the _ext.py
file instead of the original.
And as usual, whatever else you can dream up is also welcome!
To officially submit your assignment, please flip the complete
variable to True
!
As always, if it has to be late, edit the expected_completion_date
.
Lastly, if you did any extensions, please briefly explain what you did and where we should look to check out your work in the extensions_description
variable.