In this assignment, your primary goal is simple: implement the minimum edit distance algorithm. The pseudocode for this algorithm is given in Figure 2.17 on page 24 of Chapter 2 of SLP.
The skeleton code for this assignment is available at http://faculty.wcas.northwestern.edu/robvoigt/courses/2024_winter/ling334/assignments/a1.zip.
In this code the min_edit_distance function is defined for you, as well as the del_cost, ins_cost, and sub_cost functions you should use. The costs are currently defined as the Levenshtein distances given in the book. Note that the sub_cost function requires you to pass in two characters to compare; presently it returns 0 if the characters are the same, and 2 otherwise.
There is a small interface given as well so you can test your program by running:
python edit_distance.py word1 word2
and it will print to the console what your program calculates as the edit distance between the two words you provide.
As I said in class, for this and all future assignments, we require you to submit your work on Quest and to make sure it runs there.
Make sure your permissions and Python version are set properly as in a0, specifically with these lines in your .bashrc:
umask 002
module load python/anaconda3.6
As usual, make an a1 subdirectory in your user folder in our course directory; that is, the assignment should live in:
/projects/e31408/users/your_netid/a1
And then you can wget the assignment:
wget http://faculty.wcas.northwestern.edu/robvoigt/courses/2024_winter/ling334/assignments/a1.zip
unzip a1.zip
The autograder will run your min_edit_distance function on a few pre-defined test cases to check its accuracy. To check this for yourself, run the following command on Quest while in the directory with your edit_distance.py file:
python /projects/e31408/autograders/a1.py
At a minimum, your assignment should run and pass all tests using this autograder before you submit!
If this was quick or easy for you and you're ready for more of a challenge, here are some additional things to explore!
When implementing any of these, please leave your working min_edit_distance function intact (perhaps copy-paste it to a new function or a new script to be modified) so the autograder still works. Once your initial function works, I suggest running cp edit_distance.py edit_distance_ext.py and editing the _ext.py file instead of the original.
Print a table. In Figure 2.18, the book shows a very helpful table of all the sub-problems that are solved along the way as part of the edit distance calculation. Implement an edit distance function that prints such a table to the console. I actually recommend this extension for everyone as a sanity check on your outputs. (Hint: in Python 3, the print function has an end argument that determines what comes after the printed string; the default is a newline, but here it might be useful to do something like print(char, end='\t').)
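One way the table-printing idea could be sketched, assuming Levenshtein costs (the function name, variable names, and exact layout here are illustrative, not the skeleton's):

```python
def min_edit_distance_table(source, target):
    """Compute Levenshtein distance and print the DP table in the style
    of Figure 2.18 (rows = source characters, columns = target characters).
    A sketch; adapt it to your own implementation."""
    n, m = len(source), len(target)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i  # i deletions to turn a prefix of source into ""
    for j in range(1, m + 1):
        dp[0][j] = j  # j insertions to turn "" into a prefix of target
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 2
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    # Print a header row, then one row per source character,
    # using end-of-row newlines and tab separators.
    print('', '#', *target, sep='\t')
    for i in range(n + 1):
        label = '#' if i == 0 else source[i - 1]
        print(label, *dp[i], sep='\t')
    return dp[n][m]
```

Comparing the printed table against Figure 2.18 (e.g., for "intention" vs. "execution") is a quick way to spot off-by-one errors in your indices.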
Backtrace and alignment. Augment your edit distance function to also record the backtrace as described in the book. Use this backtrace to print an alignment between the strings (like Figure 2.14) or a path between them (like Figure 2.16). This is also a very useful exercise, because it requires you to understand precisely what each cell in the table actually means.
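The backtrace idea could be sketched roughly as follows, assuming Levenshtein costs; the names and the '*' gap symbol are illustrative choices, not something prescribed by the skeleton:

```python
def align(source, target):
    """Sketch: compute edit distance with backpointers and return an
    alignment as a list of (source_char, target_char) pairs, using '*'
    as a gap symbol for insertions and deletions."""
    n, m = len(source), len(target)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0], ptr[i][0] = i, 'del'
    for j in range(1, m + 1):
        dp[0][j], ptr[0][j] = j, 'ins'
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 2
            # Ties are broken arbitrarily here (by the operation name).
            choices = [(dp[i - 1][j - 1] + sub, 'sub'),
                       (dp[i - 1][j] + 1, 'del'),
                       (dp[i][j - 1] + 1, 'ins')]
            dp[i][j], ptr[i][j] = min(choices)
    # Walk the pointers back from the bottom-right cell to (0, 0).
    i, j, pairs = n, m, []
    while i > 0 or j > 0:
        op = ptr[i][j]
        if op == 'sub':
            pairs.append((source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif op == 'del':
            pairs.append((source[i - 1], '*'))
            i -= 1
        else:  # 'ins'
            pairs.append(('*', target[j - 1]))
            j -= 1
    return pairs[::-1]  # reverse: the walk recovers the alignment backwards
```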
Adjacency costs. The costs in this calculation are ultimately arbitrary decisions we get to make. Try modifying the cost functions to calculate a more intelligent version of edit distance that could be applicable to the task of spelling correction, or at least that could better handle messy web text. There are a few ways to do this. One could be to say that the substitution cost for hitting an adjacent key is lower than the cost of hitting a non-adjacent key. Assuming a QWERTY keyboard, there's some data on key adjacency here.
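The adjacency idea could look something like the sketch below. The tiny adjacency map here is hand-built for just a few keys purely for illustration; a real version would cover the whole keyboard, e.g. using the adjacency data linked above:

```python
# Illustrative only: a tiny hand-built QWERTY adjacency map for a few keys.
# A real implementation would load a full keyboard layout.
ADJACENT = {
    'a': set('qwsz'),
    's': set('awedxz'),
    'd': set('serfcx'),
}

def sub_cost_adjacent(c1, c2):
    # Same character: free. Adjacent keys: cheap, since such typos are
    # likely. Everything else: the full Levenshtein substitution cost.
    if c1 == c2:
        return 0
    if c2 in ADJACENT.get(c1, set()) or c1 in ADJACENT.get(c2, set()):
        return 1
    return 2
```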
Spelling error costs. Another, more complicated way to modify the costs could be to look at real spelling mistakes. There are some corpora of spelling errors here. So for instance, you could look at this corpus of spelling errors in Wikipedia, and say that the cost of an error is inversely proportional to how many examples we see containing that error. This would require also implementing a backtrace, because you would want to run each example through your plain edit distance, determine via alignment and record which characters were inserted/deleted/substituted, and then create new costs proportional to these real-world instances. For example, you might see that "currently" is misspelled as "currenly"; this is one example of a real-world deletion of "t". Alternatively, you could make direct use of the error matrices from the relevant readings paper on spelling correction (Kernighan et al. 1990), which are available in this repository.
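Once you have tallied operations from a corpus as described above, the cost side could be sketched like this. The counts below are made up for illustration, and the inverse-count scaling is one arbitrary choice among many:

```python
from collections import Counter

# Hypothetical: suppose alignments over a spelling-error corpus yielded
# these tallies of how often each character is deleted in real misspellings.
deletion_counts = Counter({'t': 250, 'e': 180, 'h': 40})

def del_cost_from_errors(char, counts=deletion_counts, base=1.0):
    # Characters that are deleted often in real errors get cheaper
    # deletion costs; the exact scaling is an illustrative choice.
    return base / (1 + counts.get(char, 0))
```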
More dynamic programming. If you want to really go wild here, you could implement one of the other dynamic programming algorithms used in computational linguistics, such as the HMM decoder (SLP Chapter 8) or the CYK syntactic parser (SLP Chapter 17). This would be a lot of work with lots of moving parts; also something to think about for your final project if you're interested in the idea.
Read and report. Read some of the resources in the relevant readings section for this week on the syllabus, or find something else that seems relevant and interesting to you, for instance on Google Scholar or in the ACL Anthology. This can be an academic paper or book chapter but can also be more broad - blogs and articles are fine too. Then make a short post on Ed (or contribute to an existing one) sharing a bit about what you learned and why it was interesting (or not).
And as usual, whatever else you can dream up is also welcome!
To officially submit your assignment, please flip the complete variable to True!
As always, if it has to be late, edit the expected_completion_date variable.
Lastly, if you did any extensions, flip extensions to True and please briefly explain what you did, and where we should look to check out your work, in the extensions_description variable.