In class we talked about the crowdsourcing of annotations, a now-ubiquitous approach in both CL/NLP and AI more generally for obtaining human judgements to use as training data. In this assignment, we'll implement the PMI metric for measuring word or phrase association, and then apply it to a standard, crowdsourced natural language understanding dataset to see how human bias might creep into our models.
This assignment is heavily inspired by HW1 from Yulia Tsvetkov's Computational Ethics for NLP course, which is in turn inspired by Rudinger et al. 2017. Once you've done the basic assignment, feel free to look to these sources for more inspiration, but I want you to explore this data yourself first and develop your own impressions, especially before reading the Rudinger paper.
The starter code for this assignment is available at http://faculty.wcas.northwestern.edu/robvoigt/courses/2024_winter/ling334/assignments/a4.zip.
The Stanford Natural Language Inference corpus collects pairs of human-generated sentences with labels for their entailment relations. In each pair, the first sentence is called the premise and the second sentence is called the hypothesis. Given that the premise is true, the question is whether the hypothesis:
- is definitely also true (entailment),
- might or might not be true (neutral), or
- is definitely false (contradiction).

The dataset was collected with Flickr images as real-world grounding. The images were, in an earlier crowdsourcing effort, annotated with human captions describing their contents. Then for SNLI, crowdworkers were asked to provide an alternative caption that was either an entailment, a contradiction, or neutral. Annotators had a lot of freedom in providing these new captions (which became the hypotheses in the dataset).
There is a version of the dataset on Quest here:
/projects/e31408/data/a4/snli_1.0
You can also see the SNLI website or paper for more information, or download the data to your home computer for development. If you're not working on Quest, you'll want to change the infile argument to your local path.
The starter code establishes a PMICalculator
class, which sets up a structure for reading the data and running analyses.
preprocess
A huge part of most projects in computational linguistics is understanding the dataset you're working with and developing processes to read it into a workable format. Therefore, in this assignment you'll need to start by looking at the SNLI data and figuring out how to meaningfully read it in, and you'll do that in this function.
You should use the '.jsonl' files, which are in JSON Lines format. To a first approximation, you want to read the input file line by line, and turn each line into a python dictionary with json.loads
. It will also potentially be useful to look at the README.txt
file in the SNLI directory, which explains various aspects of the formatting.
Importantly when doing this preprocessing, since we are dealing with short documents of one sentence each, we're only going to be interested in whether words appear or not in documents (binary count). Therefore we will use a representation that will be very efficient for calculating PMI, namely dictionaries that map word strings to a set
of document ids in which those words appear. This allows us to use operations like set intersection to determine in how many documents a given pair of words co-occur.
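As a tiny illustration of this representation (made-up sentence, not from the dataset; defaultdict just saves some bookkeeping):

```python
from collections import defaultdict

# word -> set of ids of documents in which that word appears
hypothesis_vocab_to_docs = defaultdict(set)

# pretend document 17's hypothesis is "a woman is cooking dinner"
for token in "a woman is cooking dinner".split():
    hypothesis_vocab_to_docs[token].add(17)

# with binary counts, c(w) is just the size of a word's document set,
# and the joint count c(w1, w2) is the size of the intersection of two sets
c_woman = len(hypothesis_vocab_to_docs["woman"])
c_joint = len(hypothesis_vocab_to_docs["woman"] & hypothesis_vocab_to_docs["cooking"])
```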
Important things to note:
- You should track the premise and hypothesis vocabularies separately (in the self.premise_vocab_to_docs and self.hypothesis_vocab_to_docs objects respectively), for reasons that will become clear when we calculate PMI next.
- The label_filter variable can be passed in as any of 'entailment', 'contradiction', 'neutral', or None. You have to implement your function such that when the label_filter variable is not None, you skip any documents where the gold label doesn't match the filter (see the sketch below). This allows us to later compare whether different categories of hypothesis sentences lead to different sorts of associations.
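Putting the pieces above together, a preprocess implementation might look roughly like this sketch. The field names (gold_label, sentence1 for the premise, sentence2 for the hypothesis) are the standard SNLI ones, but check README.txt to confirm; the attribute names self.infile, self.label_filter, and self.n_docs are guesses at what the starter code sets up, and the whitespace/lowercase tokenization is deliberately naive:

```python
import json

def preprocess(self):
    """Sketch: populate the word -> set-of-doc-ids dictionaries."""
    self.n_docs = 0  # assumed attribute name for N, the number of pairs kept
    with open(self.infile) as f:
        for doc_id, line in enumerate(f):
            doc = json.loads(line)
            # skip pairs whose gold label doesn't match the filter, if one was given
            if self.label_filter is not None and doc['gold_label'] != self.label_filter:
                continue
            self.n_docs += 1
            # naive tokenization: lowercase and split on whitespace
            # (assumes the two dicts were initialized as defaultdict(set))
            for token in doc['sentence1'].lower().split():
                self.premise_vocab_to_docs[token].add(doc_id)
            for token in doc['sentence2'].lower().split():
                self.hypothesis_vocab_to_docs[token].add(doc_id)
```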
pmi

In this function, use your accumulated counts to calculate pointwise mutual information. The equation for PMI is given in equation 6.17 in Chapter 6 of SLP. This provides the general form of PMI, but in this context, with a simplified representation (binary counts of words in documents), we can simplify the PMI equation to operate on counts alone:
$\mathrm{PMI}(w_i, w_j) = \log_2 \frac{N \cdot c(w_i, w_j)}{c(w_i) \cdot c(w_j)}$
Where $N$ is the number of premise-hypothesis pairs (i.e., the number of documents), and $c$ just refers to the count of a word or the joint co-occurrence count of a pair of words. For the mathy among you, feel free to convince yourself why this works. In Python you can use the math.log2
function to do the log calculation.
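As a quick sanity check on this formula, here is a worked example with made-up counts (not real SNLI numbers):

```python
import math

N = 10_000            # number of premise-hypothesis pairs
c_w1, c_w2 = 50, 40   # documents containing word1 and word2 respectively
c_joint = 20          # documents containing both

pmi = math.log2((N * c_joint) / (c_w1 * c_w2))
print(pmi)  # log2(100) ≈ 6.64: the words co-occur far more often than chance predicts
```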
We want to be able to calculate two types of PMI:
- When the cross_analysis argument is False, we look only at the hypothesis sentences. This asks, "how likely were annotators to use these two words together in generating a hypothesis sentence?"
- When the cross_analysis argument is True, we'll look for the first word among the premises and the second word among the hypotheses. This asks something more like, "given that an annotator saw word1 in the original caption, what words were they likely to use in the hypothesis that they generated in response?"

You should actually be able to do this without much duplicated code. One strategy is to make new variables containing the set of documents in which word1 appears (where you look depends on the cross_analysis argument) and the set in which word2 appears (always look at the hypotheses). Use these sets to calculate PMI.
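Following that strategy, the pmi method might look something like the sketch below; the parameter names mirror the description above, but the exact signature in the starter code may differ:

```python
import math

def pmi(self, word1, word2, cross_analysis=True):
    """Sketch: PMI between word1 and word2 from binary document counts."""
    # word1 comes from the premises when cross_analysis is True,
    # otherwise from the hypotheses; word2 always comes from the hypotheses
    word1_lookup = self.premise_vocab_to_docs if cross_analysis else self.hypothesis_vocab_to_docs
    word1_docs = word1_lookup.get(word1, set())
    word2_docs = self.hypothesis_vocab_to_docs.get(word2, set())

    joint = len(word1_docs & word2_docs)
    if joint == 0:
        return float('-inf')  # or 0.0 / None; decide how to handle words that never co-occur
    return math.log2((self.n_docs * joint) / (len(word1_docs) * len(word2_docs)))
```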
print_top_associations
Here you'll make a loop over all the hypothesis words and calculate their PMI with some given target word, printing out the top n values. You can use this function to show yourself associations for qualitative purposes; it is not checked in the autograder, and it's up to you to figure out how you want to do it! Of course, feel free to make other functions to manipulate the data in other sorts of ways if you're interested.
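One possible shape for it, with a minimum document-count cutoff so that very rare words (whose PMI estimates are noisy) don't swamp the list; both the cutoff and the output format are entirely your call:

```python
def print_top_associations(self, target, n=10, cross_analysis=True, min_count=10):
    """Sketch: print the n hypothesis words with the highest PMI with `target`."""
    scored = []
    for word, docs in self.hypothesis_vocab_to_docs.items():
        if len(docs) < min_count:
            continue  # skip rare words
        scored.append((self.pmi(target, word, cross_analysis=cross_analysis), word))
    for score, word in sorted(scored, reverse=True)[:n]:
        print(f'{score:.3f}\t{word}')
```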
Now you have a working program that can read the data, calculate PMI scores, and show you top word associations. Your job is to investigate the possibility of associational biases in the SNLI dataset and write up your qualitative findings in the qualitative_notes.txt file contained in the starter code. Your writeup does not have to be extensive, but it should represent a significant effort to use your code to look through the dataset and ask meaningful questions of it.
You can either edit and add to the main function at the bottom of the starter code, make a new script of your own that imports your class (e.g. from bias_audit import PMICalculator
), or play with your class in interactive python mode to examine associations.
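For example, in interactive Python or a small script of your own (the constructor arguments and the explicit call to preprocess are assumptions about the starter code's interface):

```python
from bias_audit import PMICalculator

# look only at contradiction hypotheses, and ask which hypothesis words
# are associated with a given premise word
calc = PMICalculator(infile='/projects/e31408/data/a4/snli_1.0/snli_1.0_train.jsonl',
                     label_filter='contradiction')
calc.preprocess()
calc.print_top_associations('woman', n=20, cross_analysis=True)
```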
Many social biases relate to stereotyped associations with identity classes. The assignment zip also contains a .txt file (compiled by Tsvetkov) with a list of identity terms you can explore. You do not have to be limited in your analyses to identity-based biases, but this is a very useful starting point. It may be informative to compare contrasting classes, for instance "men" vs "women", or "american" vs some other nationality.
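If you want to sweep over the whole identity-term list, a loop like the following sketch can generate a lot of material to skim (the filename here is a placeholder; use whatever the .txt file in the zip is actually called):

```python
from bias_audit import PMICalculator

# placeholder filename for the identity terms list included in the assignment zip
with open('identity_terms.txt') as f:
    identity_terms = [line.strip() for line in f if line.strip()]

train_path = '/projects/e31408/data/a4/snli_1.0/snli_1.0_train.jsonl'
for label in ['entailment', 'neutral', 'contradiction']:
    calc = PMICalculator(infile=train_path, label_filter=label)
    calc.preprocess()
    for term in identity_terms:
        print(f'--- {label}: {term} ---')
        calc.print_top_associations(term, n=10, cross_analysis=True)
```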
You can also read more about how the data was collected in the SNLI paper, in particular Figure 1 shows the annotation task. This is, in the first place, a very clever annotation setup! It allows untrained annotators to make high-quality linguistic judgments about entailment, a concept they've probably never heard of. But consider how this setup may have influenced annotators to make certain sorts of judgments.
Be sure to use the things you've implemented - use the label_filter to consider different slices of the data by entailment type, the cross_analysis to consider the different questions those analysis types imply, and so on.
dev $\rightarrow$ train
For development purposes, the starter code points to the file for the development set, which is small and will allow you to try things out quickly. Once your code works, change this to point to the training set (or pass in the training set file as an argument when running the constructor, e.g. PMICalculator(infile='/projects/e31408/data/a4/snli_1.0/snli_1.0_train.jsonl')). When doing all these analyses, you should apply your methods to the training set - this is much larger than the dev set, and is what models trained on this data would actually be learning from.
The autograder will create a few instances of your PMICalculator class and do some sanity checks that they seem to be doing what they should.
python /projects/e31408/autograders/a4.py
At a minimum, your assignment should run and pass all tests using this autograder before you submit!
Keep on keepin' on?
As usual, please leave your working calculator class intact and do e.g. cp bias_audit.py bias_audit_ext.py
once your initial implementation works, and edit the _ext.py
file instead of the original.
And as usual, whatever else you can dream up is also welcome!
To officially submit your assignment, please flip the complete
variable to True
!
As always, if it has to be late, edit the expected_completion_date
.
Lastly, if you did any extensions, please briefly explain what you did and where we should look to check out your work in the extensions_description
variable.