Welcome to Assignment 5!

In this assignment, we'll get some hands-on experience using word vectors to calculate semantic similarity in various ways: word similarity, analogies, and sentence similarity.

GloVe Vectors

In class we discussed the conceptual foundations of word2vec static embeddings. For this assignment we'll use a different but related model called GloVe. Specifically we'll use a subset of the most common words in the 50-dimensional GloVe vectors pretrained on Wikipedia.

For the assignment we'll use only the 50k most common words, and they're in a file on Quest here:

/projects/e31408/data/a5/glove_top50k_50d.txt

This directory also contains several other smaller files used in the assignment and described below.

If you want to work on your local machine, you'll want to scp these files from Quest. For the GloVe vectors, you can alternatively download the vectors, unzip them, and run head -50000 glove.6B.50d.txt > glove_top50k_50d.txt to make the same file yourself. Either way, if you work locally you'll have to be careful to change the appropriate file paths throughout.

Numpy

Numpy is an incredibly useful and very commonly used library for scientific computing, machine learning, and vector-based math in Python. If you're not familiar, check out the quickstart in the official Numpy documentation - for this assignment you'll only need to know what's given here in "The Basics."

For more info, you can check out this cheatsheet, this written tutorial, or this short video.

Briefly, Numpy is based around the idea of an "array," which is like a Python list except that it can have multiple axes and mathematical operations are performed elementwise. So whereas with lists you get:

>>> a = [2,3,4]
>>> b = [5,6,7]
>>> a + b
[2, 3, 4, 5, 6, 7]

With Numpy you'll get:

>>> x = np.array([2,3,4])
>>> y = np.array([5,6,7])
>>> x + y
array([ 7,  9, 11])

By convention, Numpy is imported with import numpy as np, so you refer to its functions with np rather than numpy.

Task 1: embeddings.py

In this assignment you'll be editing four Python files and a notes file, each representing a different task. In this first file, your task is to provide, in an Embeddings class, some basic machinery for working with embeddings that we'll use in the other tasks.

Code for reading in the GloVe embeddings is already given for you in __init__ (check it out for your reference). Note that the __getitem__ and __contains__ definitions allow us to refer to embeddings just as we would with a normal Python dictionary, even though this class is not itself a dictionary:

embeddings = Embeddings()
embeddings['word'] # directly gives a numpy array 
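For intuition only, a class with these dunder methods might look something like the minimal sketch below. This is not the actual starter code - the real __init__ in embeddings.py already handles reading the file - and the attribute name is just an assumption:

import numpy as np

class Embeddings:
    # Hypothetical skeleton, for intuition only.
    def __init__(self, path='/projects/e31408/data/a5/glove_top50k_50d.txt'):
        self.vectors = {}
        with open(path) as f:
            for line in f:                       # each line: word followed by 50 floats
                word, *values = line.split()
                self.vectors[word] = np.array(values, dtype=float)

    def __getitem__(self, word):                 # enables embeddings['word']
        return self.vectors[word]

    def __contains__(self, word):                # enables: 'word' in embeddings
        return word in self.vectors

With __contains__ defined, checks like 'word' in embeddings also work, which comes in handy for skipping out-of-vocabulary words later in the assignment.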

Your job here is to implement functions that return a vector's norm (Equation 6.8 in SLP Ch. 6), calculate the cosine similarity between two vectors (Equation 6.10 in SLP Ch. 6), and find the most similar words to a given vector in the vocabulary (i.e., the words with the highest cosine similarity to the given input). Once you've done this, play around a bit and see which words are most similar to other words using your function. Is there anything you notice?
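If you're unsure where to start, here is a minimal numpy-based sketch of the three pieces. The actual function names and signatures should follow the stubs in embeddings.py, and the vocab argument below is just a stand-in for however the class exposes its word list:

import numpy as np

def vector_norm(v):
    # |v| = sqrt(sum of squared components), SLP Eq. 6.8
    return np.sqrt(np.sum(v ** 2))        # equivalently, np.linalg.norm(v)

def cosine_similarity(v, w):
    # cos(v, w) = (v . w) / (|v| |w|), SLP Eq. 6.10
    return np.dot(v, w) / (vector_norm(v) * vector_norm(w))

def most_similar(target_vector, embeddings, vocab, n=10):
    # Rank every word in vocab by cosine similarity to the target vector.
    scored = [(word, cosine_similarity(target_vector, embeddings[word])) for word in vocab]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]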

Task 2: word_similarity.py

In this module, you'll compare how well word embeddings capture human judgments of word similarity on two datasets: WordSim-353 and SimLex-999. WordSim came first chronologically, and SimLex was created in part to address some perceived issues with it - in particular, that the WordSim ratings captured "relatedness" more than "similarity," as discussed in class.

Functions are provided to read in WordSim and SimLex, in both cases into dictionaries with (word, word) tuples as the keys and similarity ratings as the values. Here you'll simply implement a function to score either dataset relative to word vector similarity calculated with your cosine similarity function.
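One plausible shape for this scoring function is sketched below. The names are placeholders, it reuses the cosine similarity function from Task 1, and you'll need to decide for yourself how to handle pairs containing out-of-vocabulary words (here they are simply skipped):

from scipy.stats import spearmanr

def score_similarity_dataset(dataset, embeddings):
    # dataset maps (word1, word2) -> human similarity rating
    human_scores, model_scores = [], []
    for (w1, w2), rating in dataset.items():
        if w1 in embeddings and w2 in embeddings:    # skip out-of-vocabulary pairs
            human_scores.append(rating)
            model_scores.append(cosine_similarity(embeddings[w1], embeddings[w2]))
    rho, p_value = spearmanr(human_scores, model_scores)   # see Note on Evaluation below
    return rho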

Task 3: sentence_similarity.py

In this module you'll investigate the task of "semantic textual similarity." This task is related to natural language inference, which we looked at in the last assignment with SNLI, but is about general semantic similarity (e.g., do the sentences "mean the same thing"?) rather than logical entailment. You can read more about the task here.

Code is provided to read the STS benchmark, which provides graded similarity ratings like those in Task 2. As in the word similarity case, the dataset is read into a dictionary with (sentence, sentence) tuples as keys and the corresponding similarity ratings as values.

To test how well our embeddings agree with human judgments, we'll need to generate sentence embeddings. There are many possible ways to do this, but in this assignment we'll try two:

  • Simple sum. Define the sentence embedding to be the elementwise sum of the word embeddings making up the sentence.
  • Weighted sum. Like simple sum, but multiply each word vector by a scalar representing its importance. Inspired by tf-idf weighting, for each word we will multiply all the values in the vector by $\log(word\_rank)$. Word rank is the word's position in frequency order, so "the" is rank 1, "it" is rank 21, "see" is rank 254, "innocent" is rank 4115, and so on. This scalar is small for common words and large for rare words, so this weighting biases our model to pay more attention to the less common words in each sentence when making a similarity determination. (A sketch of both approaches follows this list.)
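As mentioned above, here is a rough sketch of both approaches. It assumes a sentence arrives as a list of tokens, skips out-of-vocabulary words, and uses a hypothetical rank dictionary mapping each word to its frequency rank (which you can recover from the order of words in the GloVe file, since it is sorted by frequency):

import numpy as np

def simple_sum_embedding(sentence, embeddings):
    # Elementwise sum of the word vectors in the sentence.
    vectors = [embeddings[w] for w in sentence if w in embeddings]
    return np.sum(vectors, axis=0)

def weighted_sum_embedding(sentence, embeddings, rank):
    # Same, but each vector is scaled by log(word rank), so very common words
    # like "the" (rank 1, log(1) = 0) contribute little or nothing.
    vectors = [np.log(rank[w]) * embeddings[w]
               for w in sentence if w in embeddings]
    return np.sum(vectors, axis=0)

Sentence pairs can then be scored with the cosine similarity between their two sentence vectors, just like word pairs in Task 2.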

Once you've done this, you'll write code to calculate performance in a manner very similar to Task 2.

Task 4: analogies.py

We talked in class about the common analogy-solving paradigm for intrinsic evaluation of word vector quality, as well as some of its pitfalls. Examples in this paradigm are frequently relatively simple cases, e.g. morphological relations like pluralization (hat:hats::bat:bats) or geographic relations like world capitals (London:England::Tokyo:Japan).

Here we'll look at a much more challenging testbed: SAT analogy questions. You can see some more information and state-of-the-art results here. One thing to notice is that random-chance performance would be 20% accuracy, since there are five possible answers for each question, and average human performance is not super high - only 57%.

For clarity, we'll refer to these analogies as being of the form a:b::aa:bb.

Code is provided to read the questions into a list. Each question is represented as a dictionary, where the 'question' key gives the two words a and b as a tuple. The 'choices' key gives an ordered list of the possible answers, each also a tuple of two words aa and bb. Your job is to return the index of the most likely answer. The 'answer' key for each question gives the true index.

You'll compare two strategies: the first is to use the analogy paradigm by returning the answer that maximizes $\cos(a - b + bb, aa)$. The second is to look at linear relations more directly with a method we'll call parallelism, by returning the answer that maximizes $\cos(a - b, aa - bb)$. You'll calculate the performance of these two strategies over the dataset.
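As a sketch, both strategies can share one scoring loop. The dictionary keys follow the description above, a small cosine helper is defined inline for self-containment, and handling of out-of-vocabulary words is left to you:

import numpy as np

def cosine(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

def solve_analogy(question, embeddings, strategy='analogy'):
    # Return the index of the choice (aa, bb) that best completes a:b::aa:bb.
    a, b = question['question']
    scores = []
    for aa, bb in question['choices']:
        va, vb = embeddings[a], embeddings[b]
        vaa, vbb = embeddings[aa], embeddings[bb]
        if strategy == 'analogy':
            scores.append(cosine(va - vb + vbb, vaa))    # cos(a - b + bb, aa)
        else:  # 'parallelism'
            scores.append(cosine(va - vb, vaa - vbb))    # cos(a - b, aa - bb)
    return int(np.argmax(scores))

Accuracy for each strategy is then just the fraction of questions where the returned index matches the 'answer' key.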

Task 5: qualitative_notes.txt

This is a space to write out your overall qualitative analysis of your findings in the assignment. Like in the last assignment, this doesn't need to be super long, but you should provide at least some analysis for each section regarding how your hands-on findings relate to what we've talked about in class. What sorts of semantics are these dense vectors picking up, and what are they missing? What sorts of tasks might they be better or worse at? How do considerations of the construction of particular datasets impact whether we can be successful at their corresponding task?

Note on Evaluation

In Tasks 2 and 3, you'll evaluate the performance of the embeddings using Spearman's rho via the spearmanr function (from scipy.stats), which measures the extent to which the model ranks the items in the same order as the humans do. In Task 4 you'll evaluate with raw accuracy: $\frac{num\_correct}{num\_total}$.

Autograder

Each file has a main method which will run the relevant task and print out corresponding outputs. Definitely run each of the files throughout as you're working to see what sorts of results you're getting.

The autograder will also check tidbits of your various modules to make sure things look like they're working about right.

python /projects/e31408/autograders/a5.py

At a minimum, your assignment should run and pass all tests using this autograder before you submit!

Extensions

$\Uparrow$ Onward and upward! $\Uparrow$

As usual, please leave your working code intact and add _ext.py versions of any files where you need to change basic functionality for extensions.

  1. At each stage there are many qualitative questions you could ask about why the results are coming out as they are. Implement some additional functions to print out particularly relevant examples or other information that will help you to understand where these models are going wrong. For instance:
    • For word similarity, you could find the instances where humans disagree most strongly with the model. Why does this disagreement seem to be happening?
    • For sentence similarity, if one method is performing better than another, you could find cases where the better-performing model succeeds but the worse-performing model does not. Why could this be happening?
    • For analogies, you could write code to print out a random question, corresponding choices, correct answer, and model-predicted answer from the dataset. Look at a number of examples and consider where the model is going wrong. Is it making reasonable mistakes, or is it totally off base?
  2. For all of the above experiments, you could try comparing different embedding models - for instance, you could use higher-dimensionality vectors downloaded from GloVe, or compare GloVe to another static embedding method like word2vec or fastText.
  3. For the more ambitious, you could train your own sparse PPMI or tf-idf vectors and see how they perform on these tasks when compared to dense vectors like GloVe.
  4. For the very ambitious, consider how you could increase performance on any of these tasks, but particularly the challenging SAT analogy task. Peter Turney has a long and interesting article on a strong approach to this problem from which you could draw inspiration!

And as usual, whatever else you can dream up is also welcome!

Submission

To officially submit your assignment, please flip the complete variable to True!

As always, if it has to be late, edit the expected_completion_date.

Lastly, if you did any extensions, please briefly explain what you did and where we should look to check out your work in the extensions_description variable.