Welcome to the Final Project!

Hi all, as you know, this course includes a final project component. This is an opportunity to take a self-directed approach and explore more deeply a topic in this field that interests you - I'm very open as to what exactly that looks like.

Initial Proposal (due 5/28)

By Friday, May 28th, please send Wes and me an email explaining what you intend to do for this project, along with any questions you have about how to get started. For group projects, please send only one email with all group members cc'd.

Group Policy

For this project you are welcome to work individually or in small groups of up to 3. Groups do not have to map onto your breakout rooms or anything else; if you have a topic you'd like to find a collaborator for, I suggest making a post on Ed.

Effort should scale roughly linearly with the number of group members, and group projects must include a statement in the writeup of who did what.

Possible Structures

Below I'll list some possibilities for what this final project could look like. Again, none of these are binding, and you are free to think creatively about what would be interesting to pursue.

Literature Survey

You could think of this option as an expanded, written version of your A6 group presentations (you can do the same topic or a different one). The idea is to pick a task, question, or area of interest, and do a write-up in which you investigate more thoroughly its history, its importance, and the methods associated with solving it. Is there a lot of work in this area? What have people tried? What perhaps remains to be tried? You're free to limit this however you want; for instance, you could do "Non-neural Machine Learning Approaches to X" or whatever sub-category of approaches you're interested in.

System Error Analysis

Pick a task or set of tasks, find an existing system that does those tasks, run it on some data, and do an error analysis. Error analysis in computational linguistics involves identifying, for a given system, what it does well, what it struggles with, and why. The system could be any publicly available NLP system, or one you train yourself.

There are many ways to do this. For instance, you could run the system on standard datasets and look at the outputs. You could also construct your own set of inputs which target some linguistic feature of interest and try running it on those to see how it does. For instance, does the system do fine on active voice constructions but fail on passive voice ones?

Error analysis is often a jointly qualitative-quantitative endeavor. You'll need to look at particular examples, but also potentially write code or calculate statistics to characterize the system's behavior more generally. One neat place to look is this info about Errudite, a recent proposal and codebase for a systematic approach to error analysis.
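
To make the qualitative-quantitative pairing concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `records` list, its sentences and labels, and the helper names `accuracy_by_group` and `errors_in_group` are all hypothetical, and in a real project the predictions would come from running your chosen system on your own inputs.

```python
from collections import defaultdict

# Hypothetical records: (construction, sentence, gold label, system prediction).
# These are made-up placeholders, not output from any real system.
records = [
    ("active", "The critic praised the film.", "pos", "pos"),
    ("active", "The critic panned the film.", "neg", "neg"),
    ("active", "Audiences loved the sequel.", "pos", "pos"),
    ("passive", "The film was praised by the critic.", "pos", "neg"),
    ("passive", "The film was panned by the critic.", "neg", "neg"),
    ("passive", "The sequel was loved by audiences.", "pos", "neg"),
]


def accuracy_by_group(records):
    """Quantitative view: {group: (n_correct, n_total, accuracy)}."""
    counts = defaultdict(lambda: [0, 0])  # group -> [n_correct, n_total]
    for group, _text, gold, pred in records:
        counts[group][1] += 1
        if gold == pred:
            counts[group][0] += 1
    return {g: (c, n, c / n) for g, (c, n) in counts.items()}


def errors_in_group(records, group):
    """Qualitative view: the misclassified examples in one group, for reading."""
    return [
        (text, gold, pred)
        for g, text, gold, pred in records
        if g == group and gold != pred
    ]


summary = accuracy_by_group(records)
for group, (correct, total, acc) in sorted(summary.items()):
    print(f"{group}: {correct}/{total} correct ({acc:.2f})")

# Read the actual failures, not just the numbers.
for text, gold, pred in errors_in_group(records, "passive"):
    print(f"gold={gold} pred={pred} :: {text}")
```

The per-group numbers tell you where to look, and the printed failures are what you would then read closely to form a hypothesis about why the system struggles there.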

Paper Replication

Find an existing CL/NLP paper with an available dataset and attempt to replicate the authors' methods and findings. This can be a tricky task! It's okay if you aren't able to fully do this for one reason or another, but in your writeup you should document the process of attempting to do so and why it might have been challenging. This includes partial replications - for instance, in a paper with five different feature sets, perhaps you could implement one of the feature sets and see how well you can do.

Some possible papers you could imagine trying this for:

Using LMs to quantify gender bias in tennis post-game interviews: https://www.cs.cornell.edu/~liye/tennis.html
Power and agency in films: https://homes.cs.washington.edu/~msap/movie-bias/
Linguistic indicators of betrayal in an online boardgame: https://www.cs.cornell.edu/~cristian/Betrayal.html
How to ask for a favor (e.g., do you get pizza if you ask for it on Reddit): https://ojs.aaai.org/index.php/ICWSM/article/view/14547/14396 and data

Another place to look is NLP "shared tasks", because they have existing datasets and even baseline systems. Here is a good summary of some possibilities there: https://cicl-iscl.github.io/

This cannot just mean running someone else's code. The point of replication is to follow the "recipe" given in the paper and reproduce the results yourself from scratch. If you want to run existing code, you need to do a much more detailed error analysis, and your project then becomes a "system error analysis" project as described above.

Your Own Project

This could be a new idea, or a follow-on to work you're doing elsewhere, using techniques or ideas we learned in class to advance your existing projects.

For ideas, refer to any of the materials from throughout the class, other chapters of the SLP or NLP Notes textbooks (and especially the exercises they provide), the A6 description with links to various sub-areas, or even the final project description from my other LING 300 class which has links to lots of potentially interesting datasets.

Deliverable 1. A writeup in ACL format.

Roughly, this should include at least an abstract (a short one-paragraph summary of what was done); a brief introduction covering the task, its importance, and any relevant references; a data section describing any datasets used; a methods section describing what precisely was done; a results section describing the findings; and a discussion section qualitatively exploring what those results might mean, along with any associated error analysis or thoughts on future work.

Length

If your project has a substantial coding component, 2-4 pages is a fine length. If your project is more writing-focused (e.g. literature review or system error analysis), I expect more like 6-8 pages, but definitely no more than 10 pages. Note that the ACL format fits a good amount of text on the page, so this is not like a traditional double-spaced Word document paper.

LaTeX

You'll note that the ACL format linked above is an Overleaf LaTeX template. LaTeX is a system for scientific and mathematical typesetting which is widely used in academic writing of various sorts, for making textbooks, and more (e.g., the SLP textbook is written in LaTeX). It's fundamentally a markup language that defines a rough structure, and the LaTeX compiler takes those instructions and produces a pretty document. It's a pretty amazing system; it can look intimidating at first, but it's really not so bad and is definitely worth learning.

If you click "Open as Template" in that link, it will create a new document starting from the template, which you can then edit. Fundamentally, you'll want to simply edit the text in the document. Overleaf is a very friendly editing interface which shows the compiled document on the right, periodically updating it. You can collaboratively edit on there, track changes, etc. They have great documentation too; for instance, if you want to insert images, read here.

Probably one of the trickiest parts is references - but also, once you learn it, one of the most useful! This ultimately saves a ton of headaches formatting references in the bibliography because it does it all for you. You'll notice that when you create a new document from the template, there is a list of files on the left in Overleaf. One is called acl2020.bib, which if you open it contains a list of (example) references in BibTeX format. Most citation management software (e.g. Zotero, Mendeley, etc.) can export references in this format, and Google Scholar generates these as well - click the little hollow double-quote to the bottom left of an item, then in the popup click BibTeX. You can simply copy-paste references in this format into acl2020.bib (or type them in / edit them yourself), and then cite them in the document with \citep{id} for parenthetical citations or \citet{id} when the authors' names should appear outside of the parens. The id is whatever you put at the top of the BibTeX entry right after the open curly brace - check out how they do it in the template and you'll understand.

If you have even the slightest passing interest in doing any scientific writing, typesetting, etc, I strongly recommend you use LaTeX for this writeup. If you're dead certain you will never do that and just want to use Word, there is an ACL 2020 Word template available here. Either way, submit a pdf of your writeup.

Deliverable 2. Any associated code or data you used or created for the project.

Please have this be runnable somewhere on Quest (e.g., in one group member's user directory), or in a GitHub repository that we can clone and run on Quest. This can be .py files or a Jupyter notebook. Either way, please do at least some level of commenting in the code so we can understand what's going on.

If you need to use packages that are not included in the default conda install, please follow the directions in the "Environments" section here to create a conda environment in your user directory that we can activate to run your code.

You are not required to do this, but I strongly recommend it and would love to see folks write a short post on Ed about their final project, so we can all see what everyone got up to!

This could literally be as simple as pasting your abstract and attaching your writeup pdf to your post. Check out what others did and comment if you're interested!

Submission

Please fill out this form when you're finished with your final project (you'll attach the pdf writeup here): https://forms.gle/JUeU71fL9oUKdwSd7