Welcome to Assignment 1!

In this assignment, your primary goal is simple - implement the minimum edit distance algorithm. The pseudocode for this algorithm is given in Figure 2.17 on page 24 of Chapter 2 of SLP.

The skeleton code for this assignment is available at http://faculty.wcas.northwestern.edu/robvoigt/courses/2021_spring/ling334/assignments/a1.zip.

In this code the min_edit_distance function is defined for you, as well as del_cost, ins_cost, and sub_cost functions you should use. The costs are currently defined as the Levenshtein distances given in the book. Note that the sub_cost function requires you to pass in two characters to compare - presently this returns 0 if the characters are the same, and 2 otherwise.

There is a small interface given as well so you can test your program by running:

python edit_distance.py word1 word2

and it will print to the console what your program calculates as the edit distance between the two words you provide.

Assorted (Important) Logistics

As I said in class, you can work on your assignment anywhere and anyhow, but for this and all future assignments, we require you to submit your assignment on Quest, and require that it runs there. You should by now have figured out how to log into Quest; if you're still having trouble with this, please let us know right away.

Permissions

Once on Quest, please edit your .bashrc file to contain the following line:

umask 002

Before moving on, log out and log back into Quest once. This will ensure that we can edit your assignments to provide inline feedback.

Python 2 vs 3

The assignment and autograder assume you use Python 3. On Quest, the default Python is Python 2, to make your default be Python 3, run:

module load python/anaconda3.6

You have to do this each time you log in, or put it in your .bashrc. Alternatively when running things you can simply use python3 as the command rather than using python.

Course Directory

We have a course directory on Quest located at /projects/e31408. In there is a users subfolder. Please go in there and create a folder for your work named after your NetID. Then, inside that folder create a folder called a1 to hold this first assignment. E.g., in my case, to get started I might do all the following:

cd /projects/e31408/users/
mkdir rfj5679
cd rfj5679
mkdir a1
cd a1
wget http://faculty.wcas.northwestern.edu/robvoigt/courses/2021_spring/ling334/assignments/a1.zip
unzip a1.zip

Now I'll be in our course user dir, in my user subdir, in my a1 subdir, with the edit_distance.py skeleton code in there ready to go.

Autograder

The autograder will run your min_edit_distance function on a few pre-defined solutions to check its accuracy. To check this for yourself, run the following command on Quest while in the directory with your edit_distance.py file:

python /projects/e31408/autograders/a1.py

At a minimum, your assignment should run and pass all tests using this autograder before you submit!

Extensions

If this was quick or easy for you and you're ready for more of a challenge, here are some additional things to explore!

When implementing any of these, please leave your working min_edit_distance function intact and perhaps copy-paste it to a new function or new script to be modified, so the autograder still works. I suggest doing cp edit_distance.py edit_distance_ext.py once your initial function works, and editing the _ext.py file instead of the original.

  1. Print a table. In figure 2.18, the book shows a very helpful table of all the sub-problems that are solved along the way as part of the edit distance calculation. Implement an edit distance function which prints out to the console a table that looks like this. I actually recommend this extension for everyone as a sanity check to look at your outputs. (Hint: In Python 3, the print function has an end argument which determines what comes after the string that's printed - the default is a newline, but in this case it might be useful to do something like print(char, end='\t').

  2. Backtrace and alignment. Augment your edit distance function to also record the backtrace as described in the book. Use this backtrace to print an alignment between the strings (like Figure 2.14) or a path between them (like Figure 2.16). This is also a very useful exercise, because it requires you to understand precisely what each cell in the table actually means.

  3. Adjacency costs. The costs in this calculation are ultimately arbitrary decisions we get to make. Try modifying the cost functions to calculate a more intelligent version of edit distance that could be applicable to the task of spelling correction, or at least that could better handle messy web text. There are a few ways to do this. One could be to say that the substitution cost for hitting an adjacent key is lower than the cost of hitting a non-adjacent key. Assuming a QWERTY keyboard, there's some data on key adjacency here.

  4. Spelling error costs. Another, more complicated way to modify the costs could be to look at real spelling mistakes. There are some corpora of spelling errors here. So for instance, you could look at this corpus of spelling errors in Wikipedia, and say that the cost of an error is inversely proportional to how many examples we see containing that error. This would require also implementing a backtrace, because you would want to run each example through your plain edit distance, determine via alignment and record which characters were inserted/deleted/substituted, and then create new costs proportional to these real-world instances. For example, you might see that "currently" is misspelled as "currenly"; this is one example of a real-world deletion of "t". Alternatively, you could make direct use of the error matrices from the relevant readings paper on spelling correction (Kernighan et al. 1990), which are available in this repository.

  5. More dynamic programming. If you want to really go wild here, you could implement one of the other dynamic programming algorithms used in computational linguistics, such as the HMM decoder (SLP Chapter 8) or the CYK syntactic parser (SLP Chapter 13). This would be a lot of work with lots of moving parts; also something to think about for your final project if you're interested in the idea.

  6. Read and report. Read some of the resources in the relevant readings section for this week on the syllabus, or find something else that seems relevant and interesting to you, for instance on Google Scholar or in the ACL Anthology. This can be an academic paper or book chapter but can also be more broad - blogs and articles are fine too. Then make a short post on Ed (or contribute to an existing one) sharing a bit about what you learned and why it was interesting (or not).

And as usual, whatever else you can dream up is also welcome!

Submission

To officially submit your assignment, please fill out this form:

https://forms.gle/EJcdMvVWXzPEkMkL7

This will let us know your assignment is ready for us to go check out!