In this assignment, we'll get some hands-on experience using word vectors to calculate semantic similarity in various ways: word similarity, analogies, and sentence similarity.
The starter code for this assignment is available at https://faculty.wcas.northwestern.edu/robvoigt/courses/2024_winter/ling334/assignments/a5.zip.
In class we discussed the conceptual foundations of word2vec static embeddings. For this assignment we'll use a different but related model called GloVe. Specifically, we'll use a subset of the most common words in the 50-dimensional GloVe vectors pretrained on Wikipedia.
For the assignment we'll use only the 50k most common words, and they're in a file on Quest here:
/projects/e31408/data/a5/glove_top50k_50d.txt
This directory also contains several other smaller files used in the assignment and described below.
If you want to work on your local machine, you'll want to either scp these files over from Quest, or, for the GloVe vectors, download the vectors yourself, unzip them, and run
head -50000 glove.6B.50d.txt > glove_top50k_50d.txt
to make the same file. Either way, if you work locally you'll have to be careful to change the relevant file paths throughout.
Numpy is an incredibly useful and very commonly used library for scientific computing, machine learning, and vector-based math in Python. If you're not familiar, check out the quickstart in the official Numpy documentation - for this assignment you'll only need to know what's given here in "The Basics."
For more info, you can check out this cheatsheet, this written tutorial, or this short video.
Briefly, Numpy is based around the idea of an "array," which is like a python list but it can have multiple axes and mathematical operations are performed elementwise. So whereas with lists you get:
>>> a = [2,3,4]
>>> b = [5,6,7]
>>> a + b
[2, 3, 4, 5, 6, 7]
With Numpy you'll get:
>>> import numpy as np
>>> x = np.array([2,3,4])
>>> y = np.array([5,6,7])
>>> x + y
array([ 7, 9, 11])
By convention, Numpy is imported with import numpy as np, so you refer to its functions with np rather than numpy.
embeddings.py
In this assignment you'll be editing four different files, each representing a different task. In this first file, your task is to provide some basic machinery for working with embeddings that we'll use in other tasks, in an Embeddings class.
Code for reading in the GloVe embeddings is already given for you in the init (check it out for your reference). Note that the __getitem__ and __contains__ definitions allow us to refer to embeddings like we would in a normal Python dictionary, even though this class is not itself a dictionary:
embeddings = Embeddings()
embeddings['word'] # directly gives a numpy array
Your jobs here are to implement functions to return a vector norm (Equation 6.8 in SLP Ch. 6), calculate the cosine similarity between two vectors (Equation 6.10 in SLP Ch. 6), and find the most similar words to a given vector in the vocabulary (i.e., the words whose vectors have the highest cosine similarity with the given input). Once you've done this, play around a bit and see which words are most similar to other words using your function. Is there anything you notice?
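To make this concrete, here is a minimal sketch of what these three pieces might look like with Numpy. The function names and the .words attribute are illustrative assumptions rather than necessarily what the starter code expects, and np.linalg.norm would work just as well as the explicit formula.
import numpy as np

def vector_norm(v):
    # |v| = square root of the sum of squared components (SLP Eq. 6.8)
    return np.sqrt(np.sum(v ** 2))

def cosine_similarity(v, w):
    # cos(v, w) = (v . w) / (|v| |w|) (SLP Eq. 6.10)
    return np.dot(v, w) / (vector_norm(v) * vector_norm(w))

def most_similar_words(embeddings, vector, n=10):
    # Score every word in the vocabulary by cosine similarity to `vector`
    # and return the top n; assumes `embeddings.words` lists the vocabulary.
    scored = [(word, cosine_similarity(vector, embeddings[word]))
              for word in embeddings.words]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]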
word_similarity.py
In this module, you'll compare how well word embeddings capture human judgments of word similarity on two datasets: WordSim-353 and SimLex-999. WordSim came first chronologically, and SimLex was created in part to address some perceived issues with it, in particular that the WordSim ratings captured more "relatedness" than "similarity," as discussed in class.
Functions are provided to read in WordSim and SimLex, in both cases into dictionaries with (word, word) tuples as the keys and similarity ratings as the values. Here you'll simply implement a function to score either dataset relative to word vector similarity calculated with your cosine similarity function.
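Sketched roughly (the names here are illustrative, and this assumes the cosine_similarity function from the previous task), the scoring step boils down to collecting two parallel lists of scores for the pairs where both words are in the vocabulary:
def collect_scores(dataset, embeddings):
    # dataset maps (word1, word2) tuples to human similarity ratings
    human_scores, model_scores = [], []
    for (w1, w2), rating in dataset.items():
        if w1 in embeddings and w2 in embeddings:  # skip out-of-vocabulary pairs
            human_scores.append(rating)
            model_scores.append(cosine_similarity(embeddings[w1], embeddings[w2]))
    return human_scores, model_scores
These two lists can then be compared with Spearman's Rho, as described further below.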
sentence_similarity.py
In this module you'll investigate the task of "semantic textual similarity." This task is related to natural language inference, which we looked at in the last assignment with SNLI, but is more about general semantic similarity (i.e., do the sentences "mean the same thing") rather than logical entailment. You can read more about the task here.
Code is provided to read the STS benchmark, which provides graded similarity ratings like those in Task 2. As in the word similarity case, the dataset is read into a dictionary with (sentence, sentence) tuples as keys and the corresponding similarity ratings as values.
To test how well our embeddings agree with human judgments, we'll need to generate sentence embeddings. There are many possible ways to do this, but in this assignment we'll try two:
Once you've done this, you'll write code to calculate performance in a manner very similar to in Task 2.
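Whatever the two strategies are, one simple illustrative baseline for turning a sentence into a vector is to average the vectors of its in-vocabulary tokens; the lowercasing and whitespace tokenization here are simplifying assumptions, and cosine_similarity and np are assumed from Task 1:
def sentence_embedding(sentence, embeddings, dim=50):
    # Mean-pool the vectors of the sentence's in-vocabulary tokens
    tokens = sentence.lower().split()
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)
Sentence similarity is then just the cosine similarity between the two sentence vectors, scored against the human ratings as in Task 2.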
analogies.py
We talked in class about the common analogy-solving paradigm for intrinsic evaluation of word vector quality, as well as some pitfalls with it. Examples in this paradigm frequently consist of relatively simple cases, e.g. morphological relations like pluralization (hat:hats::bat:bats) or geographic relations like world capitals (London:England::Tokyo:Japan).
Here we'll look at a much more challenging testbed: SAT analogy questions. You can see some more information and state of the art results here. One thing to notice is that random chance performance would be 20% accuracy since there are five possible answers for each question, and average human performance is not super high - only 57%.
For clarity, we'll refer to these analogies as being of the form a:b::aa:bb.
Code is provided to read in the questions in a list. Each question is represented as a dictionary, where the 'question' key gives the two words a and b in a tuple. The 'choices' key gives an ordered list of the possible answers, each also a tuple of two words aa and bb. Your job is to return the index of the most likely answer. The 'answer' key for each question gives the true index.
You'll compare two strategies: the first is to use the analogy paradigm by returning the answer that maximizes $cos(a - b + bb, aa)$. The second is to look at linear relations more directly with a method we'll call parallelism, by returning the answer that maximizes $cos(a - b, aa - bb)$. You'll calculate the performance of these two strategies over the dataset.
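As a rough sketch of scoring a single question under both strategies (names are illustrative, out-of-vocabulary handling is omitted, and cosine_similarity and np are assumed from Task 1):
def solve_question(question, embeddings):
    # question['question'] is (a, b); question['choices'] is a list of (aa, bb) tuples
    a = embeddings[question['question'][0]]
    b = embeddings[question['question'][1]]
    analogy_scores, parallelism_scores = [], []
    for aa_word, bb_word in question['choices']:
        aa, bb = embeddings[aa_word], embeddings[bb_word]
        analogy_scores.append(cosine_similarity(a - b + bb, aa))      # analogy paradigm
        parallelism_scores.append(cosine_similarity(a - b, aa - bb))  # parallelism
    # Return the index of the highest-scoring choice under each strategy
    return int(np.argmax(analogy_scores)), int(np.argmax(parallelism_scores))
Accuracy for each strategy is then the fraction of questions where the returned index matches the 'answer' key.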
qualitative_notes.txt
This is a space to write out your overall qualitative analysis of your findings in the assignment. Like in the last assignment, this doesn't need to be super long, but you should provide at least some analysis for each section regarding how your hands-on findings relate to what we've talked about in class. What sorts of semantics are these dense vectors picking up, and what are they missing? What sorts of tasks might they be better or worse at? How do considerations of the construction of particular datasets impact whether we can be successful at their corresponding task?
In both Tasks 2 and 3, you'll evaluate the performance of the embeddings using Spearman's Rho with the spearmanr function, which calculates to what extent the models rank the items in the same order as the humans do. In Task 4 you'll evaluate with raw accuracy: $\frac{\text{num\_correct}}{\text{num\_total}}$.
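spearmanr comes from scipy.stats and returns the correlation along with a p-value, so the call looks roughly like this (the values below are toy numbers purely for illustration):
from scipy.stats import spearmanr

human_scores = [3.0, 7.5, 1.2, 9.0]      # toy human ratings
model_scores = [0.31, 0.80, 0.05, 0.92]  # toy cosine similarities
rho, p_value = spearmanr(human_scores, model_scores)
print(rho)  # 1.0 here, since the two lists rank the items in the same order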
Each file has a main method which will run the relevant task and print out corresponding outputs. Definitely run each of the files throughout as you're working to see what sorts of results you're getting.
The autograder will also check tidbits of your various modules to make sure things look like they're working about right.
python /projects/e31408/autograders/a5.py
At a minimum, your assignment should run and pass all tests using this autograder before you submit!
$\Uparrow$ Onward and upward! $\Uparrow$
As usual, please leave your working code intact and add _ext.py files for any files where you need to change basic functionality for extensions.
And as usual, whatever else you can dream up is also welcome!
To officially submit your assignment, please flip the complete variable to True!
As always, if it has to be late, edit the expected_completion_date.
Lastly, if you did any extensions, please briefly explain what you did and where we should look to check out your work in the extensions_description variable.