Welcome to the Final Project!

Hi all, as you know this course includes a final project component. This is an opportunity to take a self-directed approach to more deeply explore a topic in this field that you're interested in - I'm very open as to what exactly that looks like.

Initial Proposal (due 11/23)

By Tuesday, November 23rd, please send Thomas and me an email explaining what you intend to do for this project, along with any questions you have about getting started. For group projects, please send a single email with all group members cc'd.

Group Policy

For this project you are welcome to work individually or in small groups of up to three. These do not have to map onto your breakout rooms or anything else; if you have a topic you'd like to find a collaborator for, I suggest making a post on Ed.

Effort should scale roughly linearly with the number of group members, and group projects must include a statement in the writeup of who did what.

Possible Structures

Below I'll list some possibilities of what this final project could look like. Again none of these are binding and you are free to think creatively about what would be interesting to pursue.

Literature Survey

You could think of this option as being like an expanded, written version of your A6 group presentations (you can do the same topic or a different one). The idea is to pick a task, question, or area of interest, and do a write-up about it where you investigate more thoroughly its history, importance, and the methods associated with solving it. Is there a lot of work in this area? What have people tried? What perhaps remains to be tried? You're free to limit this however you want, for instance you could do "Non-neural Machine Learning Approaches to X" or whatever sub-category of approaches you're interested in.

System Error Analysis

Pick a task or set of tasks, find an existing system that performs them, run it on some data, and do an error analysis. Error analysis in computational linguistics involves identifying, for a given system: what does it do well, what does it struggle with, and why? The system could be any publicly available NLP system, or one you train yourself.

There are many ways to do this. For instance, you could run the system on standard datasets and look at the outputs. You could also construct your own set of inputs which target some linguistic feature of interest and try running it on those to see how it does. For instance, does the system do fine on active voice constructions but fail on passive voice ones?

Error analysis is often a jointly qualitative-quantitative endeavor. You'll need to look at particular examples, but also potentially write code or calculate statistics to attempt to characterize more generally the system behavior. One neat place to look is this info about Errudite, a recent proposal and codebase for a systematic approach to error analysis.
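To make the quantitative side concrete, here is a minimal sketch of breaking down accuracy by linguistic category on a small hand-built contrast set. The `predict` function is a toy stand-in (it is not any real system); in a real project you would replace it with calls to the system you are analyzing, and the example sentences and categories with your own.

```python
from collections import defaultdict

# Toy stand-in for a real NLP system: flags a sentence as "positive"
# if it contains the word "good". Replace with your actual system.
def predict(sentence):
    return "positive" if "good" in sentence.lower() else "negative"

# Small contrast set: each item pairs a linguistic category with a
# labeled input, so errors can be broken down per category.
contrast_set = [
    ("active",  "The critics liked the good film.",        "positive"),
    ("passive", "The good film was liked by the critics.", "positive"),
    ("active",  "The critics hated the film.",             "negative"),
    ("passive", "The film was hated by the critics.",      "negative"),
]

def accuracy_by_category(examples):
    # Tally correct predictions and totals for each category.
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, sentence, gold in examples:
        total[category] += 1
        if predict(sentence) == gold:
            correct[category] += 1
    return {c: correct[c] / total[c] for c in total}

print(accuracy_by_category(contrast_set))
```

A gap between categories (say, high accuracy on active-voice sentences but low on passives) is exactly the kind of quantitative finding you would then explore qualitatively by reading through the failing examples.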

Paper Replication

Find an existing CL/NLP paper with an available dataset and attempt to replicate the authors' methods and findings. This can be a tricky task! It's okay if you aren't able to fully do this for one reason or another, but in your writeup you should document the process of attempting to do so and why it might have been challenging. This includes partial replications - for instance, in a paper with five different feature sets, perhaps you could implement one of the feature sets and see how well you can do.

Some possible papers you could imagine trying this for:

Another place to look is NLP "shared tasks" because they have existing datasets and even baseline systems, here is a good summary of some possibilities there: https://cicl-iscl.github.io/

This cannot just mean running someone else's code. The point of replication here is to attempt to follow the "recipe" given in the paper and reproduce the results yourself from scratch. If you want to run existing code, you need to do much more detailed error analysis, and then your project becomes a "system error analysis" project as described above.

Your Own Project

This could be a new idea, or a follow-on to work elsewhere using techniques or ideas we learned in class to advance your existing projects.

For ideas, refer to any of the materials from throughout the class, other chapters of the SLP or NLP Notes textbooks (and especially the exercises they provide), the topic descriptions below, or even the final project description from my other LING 300 class which has links to lots of potentially interesting datasets.

Possible Topics

If you're interested in exploring a new area of NLP to come up with a topic for your project or find an existing paper to replicate, here are some resources. For example, check out the papers cited in the SLP chapters, since they are often canonical, important papers that are worth replicating.

Sequence tagging.

Rather than classify an entire sentence or document, many applications require giving a label to each individual word in a sequence; these include tagging for parts of speech as well as named entity recognition (distinguishing between types of proper nouns like people, countries, and organizations). SLP Chapter 8.
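As a sense of what sequence tagging looks like in code, here is a sketch of the classic most-frequent-tag baseline for POS tagging (not something from the course materials - the tiny training data and tags below are invented for illustration): tag each word with whatever tag it received most often in training data.

```python
from collections import Counter, defaultdict

# Tiny invented training set: each sentence is a list of (word, tag) pairs.
train = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("runs", "NOUN")],   # "runs" as a noun this time
    [("dog", "NOUN"), ("runs", "VERB")],
]

# Count how often each word receives each tag.
counts = defaultdict(Counter)
for sentence in train:
    for word, tag_ in sentence:
        counts[word][tag_] += 1

def tag(words, default="NOUN"):
    # Assign each word its most frequent training tag;
    # unknown words fall back to a default tag.
    return [counts[w].most_common(1)[0][0] if w in counts else default
            for w in words]

print(tag(["the", "dog", "runs"]))
```

Real taggers (HMMs, CRFs, neural models) improve on this by using context, but this baseline is a common point of comparison in the literature.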

Syntactic parsing.

Given a sentence, under some theory of syntax, provide a parse tree that represents its hierarchical structure. Historically, CL/NLP has focused on constituency grammars (SLP Chapters 12 and 13), but today dependency grammars are more prevalent (SLP Chapter 14).

Semantic representations and semantic parsing.

The word vector semantics approach we've been using has benefits, but so too does a more hand-built logical representation that explicitly captures various aspects of meaning. What makes a good representation? The task of inferring these logical representations is called semantic parsing. SLP Chapters 15 and 19.

Information extraction.

How do we take free-form text and extract information about relations between entities and events, or turn free text into a relational database upon which queries can be executed? SLP Chapter 17.

Word sense disambiguation.

When a word has multiple possible meanings, how do we determine which meaning is intended in a given instance? SLP Chapter 18.

Coreference resolution.

An individual entity could be referred to in any number of ways in a text: Kamala, VP Harris, she, the Vice President, etc. How can we "resolve" these different surface forms of reference in context to understand that they refer to the same person? SLP Chapter 21.

Discourse parsing.

How can we represent and computationally infer the way in which clauses and sentences are meaningfully tied together into larger, coherent discourses? SLP Chapter 22.

Question answering.

How can we build a system capable of intelligently answering natural language questions from a user? SLP Chapter 23.

Chatbots and dialogue systems.

How can we build a system that can engage in a coherent, interesting, or entertaining conversation? SLP Chapter 24.

Topic modeling.

Given a corpus of text documents but no labels, can we determine the broad topics the documents are about? Textbook from Boyd-Graber et al.

Common sense reasoning.

Does a cat have fur? Does a toaster? Is the moon bigger than a breadbox? Is it actually made of cheese? These sorts of commonsense questions are easy for humans but extremely non-trivial for computational models to infer from language. How do we approach this problem? ACL Tutorial from Sap et al.

Cross-lingual word embeddings.

We learned about embeddings in one language - is it possible to make embeddings for multiple languages but for which the dimensions are comparable, so "pants" in English would be near "pantalones" in Spanish? Textbook from Sogaard et al.

Text summarization.

Given a long document, can we make it shorter and still keep most of the important information? Older article by Radev et al., newer neural paper by See et al.

... and many more!

Feel free to explore beyond! In addition to the resources in the "places to look for information" section above, another good place to look for topics of interest is the Synthesis Lectures on Human Language Technologies book series - these are relatively short, freely available textbooks that attempt to synthesize the work in some area of CL/NLP. You could do this assignment by picking one of those, reading the introduction to get a sense of the problem, and going from there.

Additionally, the section of Workshop Proceedings in the ACL Anthology is, in its own way, a fantastic summary of areas of interest in CL/NLP, and the proceedings of each workshop could be a great place to start for learning about that topic. In particular most workshops have a summary paper (usually the first one in the proceedings) that explains why the workshop is a thing, and this is often a great starting point too.

Another interesting place to look is the world of shared tasks. These are tasks proposed by people in the CL/NLP community who collect data and set up an evaluation scheme, and then people compete to see what they can discover and who can get the best performance on the task. One long-running example of this is SemEval, which has yearly tasks related to semantic understanding. These are great places to start since the setup is often provided for you in a nice way. This is especially useful for tasks from previous years where you can already read the papers and see precisely what people did.

Deliverable 1. A writeup in ACL format.

Roughly, this should include at least: an abstract (a short one-paragraph summary of what was done); a brief introduction covering the task, its importance, and any relevant references; a data section describing any datasets used; a methods section describing precisely what was done; a results section describing the findings; and a discussion qualitatively exploring what those results might mean, along with any associated error analysis or thoughts on future work. You can deviate from this basic structure if your project calls for it - for instance, projects doing a literature review or another writing-focused approach might be better served by a different structure, and that's okay.

Length

If your project has a substantial coding component, 2-4 pages is a fine length. If your project is more writing-focused (e.g., a literature review or system error analysis), I expect more like 6-8 pages, but definitely no more than 10 pages. Note that the ACL format fits a good amount of text on the page, so these are not like traditional double-spaced Word documents.

LaTeX

You'll note that the ACL format linked above is an Overleaf LaTeX template. LaTeX is a system for scientific and mathematical typesetting which is widely used in academic writing of various sorts, for making textbooks and more (e.g., the SLP textbook is written in LaTeX). It's fundamentally a markup language that defines a rough structure, and the LaTeX compiler takes those instructions and produces a pretty document. It's a pretty amazing system, and can look intimidating at first but it's really not so bad and definitely worth learning.

If you click "Open as Template" in that link, it will create a new document starting from the template, which you can then edit. Fundamentally you'll want to simply edit the text in the document. Overleaf is a very friendly editing interface which shows the compiled document on the right, periodically updating it. You can collaboratively edit on there and track changes etc. They have great documentation too, for instance if you want to insert images read here.

Probably one of the trickiest parts is references - but also, once you learn it, one of the most useful! This ultimately saves a ton of headaches formatting references in the bibliography because it does it all for you. You'll notice when you create a new template, on the left in Overleaf there is a list of files. One is called acl2020.bib, which if you open it contains a list of (example) references in BibTeX format. Most citation management software (e.g. Zotero, Mendeley, etc.) can export references in this format, and Google Scholar generates these as well - click the little hollow double-quote to the bottom left of an item, then in the popup click BibTeX. You can simply copy-paste references in this format into acl2020.bib (or type them in / edit them yourself), and then cite them in the document with \citep{id} for parenthetical citations or \citet{id} when the author's name should appear outside of the parens. The id is whatever you put at the top of the BibTeX entry right after the open curly brace - check out how they do it in the template and you'll understand.
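To illustrate, here is a minimal sketch of how an entry in acl2020.bib and the corresponding citations fit together. The entry key and all its fields below are made up for the example; your own entries will come from Scholar or your citation manager.

```latex
% In acl2020.bib - an illustrative entry; "smith2020example" is the id.
@inproceedings{smith2020example,
  title     = {An Example Paper Title},
  author    = {Smith, Jane and Doe, John},
  booktitle = {Proceedings of ACL},
  year      = {2020}
}

% In your .tex file:
% Parenthetical: renders like "... has been studied (Smith and Doe, 2020)."
This task has been studied before \citep{smith2020example}.
% Narrative: renders like "Smith and Doe (2020) showed ..."
\citet{smith2020example} showed promising results.
```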

If you have even the slightest passing interest in doing any scientific writing, typesetting, etc, I strongly recommend you use LaTeX for this writeup. If you're dead certain you will never do that and just want to use Word, there is an ACL 2020 Word template available here. Either way, submit a pdf of your writeup.

Deliverable 2. Any associated code or data you used or created for the project.

Please have this be runnable somewhere on Quest (e.g., in one group member's user directory), or in a github repository that we can clone and run on Quest. This can be .py files or a Jupyter notebook. Either way, please do at least some level of commenting in the code so we can understand what's going on. Please be aware of plagiarism concerns - it's fine to, e.g., use an implementation of a useful function from a StackOverflow post, but in that case cite the URL of the post in a comment in your code.

If you need to use packages that are not included in the default conda install, please follow the directions in the "Environments" section here to create a conda environment in your user directory that we can activate to run your code.
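For reference, a conda environment is typically described by an environment.yml file along these lines - a sketch only, with placeholder package names (follow the linked Quest instructions for the exact steps and directory setup):

```yaml
# environment.yml - illustrative only; the name and packages are placeholders.
name: finalproject
channels:
  - conda-forge
dependencies:
  - python=3.9
  - numpy
  - pip
  - pip:
      - some-pip-only-package   # hypothetical pip-only dependency
```

An environment like this can then be created with `conda env create -f environment.yml` and activated with `conda activate finalproject`, which is what lets us reproduce your setup when grading.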

Deliverable 3. Sharing findings (optional).

You are not required to do this, but I strongly recommend it and would love to see folks write a short post about their final project on Ed, so we can all see what everyone got up to!

This could literally be as simple as copy-pasting your abstract and attaching your writeup pdf to your post. Check out what others did and comment if you're interested!

Submission

Please fill out this form when you're finished with your final project (you'll attach the pdf writeup here):

https://forms.gle/aRWTQTSbCbWZoDCEA

Also, once you're done, don't forget to fill out your final self-evaluation here:

https://forms.gle/5Tyz3jNVZZPCXBeZ6