Welcome to the Final Project!

Hi all, as you know this course includes a final project component. This is an opportunity to take a self-directed approach and explore more deeply a topic in this field of study that interests you - I'm very open as to what exactly that looks like.

Group Policy

For this project you are welcome to work individually or in small groups of up to 3. If you want a larger group, you need to ask in advance and have a strong justification for why it's necessary - but large-scale, ambitious projects are welcome, so do go for it! Just let us know. If you have a topic you'd like to find a collaborator for, I suggest making a post on Ed.

Effort should scale roughly linearly with the number of group members, and group projects must include a statement in the writeup of who did what.

Possible Structures

Below I'll list some possibilities of what this final project could look like. Again none of these are binding and you are free to think creatively about what would be interesting to pursue.

Literature Survey

The idea here is to pick a task, question, or area of interest, and do a write-up in which you investigate more thoroughly its history, importance, and the methods associated with solving it. Is there a lot of work in this area? What have people tried? What perhaps remains to be tried? You're free to limit this however you want; for instance, you could do "Non-neural Machine Learning Approaches to X" or whatever sub-category of approaches you're interested in. Relevant chapters from the textbook can be very good starting points.

System Error Analysis

Pick a task or set of tasks, and find an existing system that does those tasks, run it on some data, and do an error analysis. Error analysis in computational linguistics involves identifying, for a given system: what things does it do well, what does it struggle with, and why? The systems could be any publicly available NLP system, or one you train yourself.

There are many ways to do this. For instance, you could run the system on standard datasets and look at the outputs. You could also construct your own set of inputs which target some linguistic feature of interest and try running it on those to see how it does. For instance, does the system do fine on active voice constructions but fail on passive voice ones?

Error analysis is often a jointly qualitative-quantitative endeavor. You'll need to look at particular examples, but also potentially write code or calculate statistics to attempt to characterize more generally the system behavior. One neat place to look is this info about Errudite, a recent proposal and codebase for a systematic approach to error analysis.
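As a minimal sketch of the quantitative side, assuming you have already collected the system's predictions alongside gold labels and tagged each input with the linguistic feature you care about, per-feature accuracy could be computed roughly like this. The `Example` record, the "active"/"passive" tags, and the toy data are all hypothetical stand-ins for your own system output, not part of any particular tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Example:
    text: str        # input given to the system
    feature: str     # hand-assigned tag for the linguistic feature of interest
    gold: str        # correct label
    predicted: str   # label the system actually produced


def accuracy_by_feature(examples):
    """Return {feature: (num_correct, num_total, accuracy)}."""
    counts = defaultdict(lambda: [0, 0])  # feature -> [correct, total]
    for ex in examples:
        counts[ex.feature][1] += 1
        if ex.predicted == ex.gold:
            counts[ex.feature][0] += 1
    return {
        feature: (correct, total, correct / total)
        for feature, (correct, total) in counts.items()
    }


def errors_for(examples, feature):
    """Collect the misclassified examples for one feature, for manual inspection."""
    return [ex for ex in examples if ex.feature == feature and ex.predicted != ex.gold]


# Toy, made-up data: the (hypothetical) task is to name the agent of the action.
toy_examples = [
    Example("The dog chased the cat.", "active", "dog", "dog"),
    Example("The chef praised the waiter.", "active", "chef", "chef"),
    Example("The cat was chased by the dog.", "passive", "dog", "cat"),
    Example("The waiter was praised by the chef.", "passive", "chef", "chef"),
]

# Quantitative view: accuracy broken down by feature.
for feature, (correct, total, acc) in sorted(accuracy_by_feature(toy_examples).items()):
    print(f"{feature}: {correct}/{total} correct ({acc:.0%})")

# Qualitative view: the individual failures to read through by hand.
for ex in errors_for(toy_examples, "passive"):
    print("ERROR:", ex.text, "| gold:", ex.gold, "| predicted:", ex.predicted)
```

The breakdown gives you the numbers, and the list of individual errors gives you the examples to read closely, which is where the "why" usually comes from.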

Paper Replication

Find an existing CL/NLP paper with an available dataset and attempt to replicate the authors' methods and findings. This can be a tricky task! It's okay if you aren't able to fully do this for one reason or another, but in your writeup you should document the process of attempting to do so and why it might have been challenging. This includes partial replications - for instance, in a paper with five different feature sets, perhaps you could implement one of the feature sets and see how well you can do.

A completely non-exhaustive sample of some possible papers you could imagine trying this for:

Another place to look is NLP "shared tasks", because they have existing datasets and even baseline systems. Here is a good summary of some possibilities there: https://cicl-iscl.github.io/

This cannot just mean running someone else's code. The point of replication is to attempt to follow the "recipe" given in the paper to reproduce the results yourself from scratch. If you want to run existing code, you need to use it to do something more complicated, for example a much more detailed error analysis, in which case your project would become a "system error analysis" project as described above.

Your Own Project

This could be a new idea, or a follow-on to work elsewhere using techniques or ideas we learned in class to advance your existing projects.

For ideas, refer to any of the materials from throughout the class, other chapters of the SLP or NLP Notes textbooks (and especially the exercises they provide), the topic descriptions below, or even the final project description from my other LING 300 class which has links to lots of potentially interesting datasets.

Possible Topics

If you're interested in exploring a new area of NLP to come up with a topic for your project or find an existing paper to replicate, here are some resources. For example, check out the papers cited in the SLP chapters, since they are often canonical and important ones that are worth replicating.

Diving Deeper on Neural Networks / LLMs

There are many, many resources for this out on the web, but here's a starting point with many excellent possibilities compiled for a seminar I recently taught: https://lingmechlab.notion.site/be37a59785fd42159a690ebe692d469c?v=ca065b71a7d24bbda18f93a64c77ecb8

Sequence tagging.

Rather than classify an entire sentence or document, many applications require giving a label to each individual word in a sequence; these include tagging for parts of speech as well as named entity recognition (distinguishing between types of proper nouns like people, countries, and organizations). SLP Chapter 8.

Syntactic parsing.

Given a sentence, under some theory of syntax, provide a parse tree that represents its hierarchical structure. Historically, CL/NLP has focused on constituency grammars (SLP Chapter 17), but dependency grammars are now more prevalent (SLP Chapter 18).

Information extraction.

How do we take free-form text and extract information about relations between entities and events, or turn free text into a relational database upon which queries can be executed? SLP Chapter 19.

Word sense disambiguation.

When a word has multiple possible meanings, how do we determine which meaning is intended in a given instance? SLP Appendix Chapter G.

Coreference resolution.

An individual entity could be referred to in any number of ways in a text: Kamala, VP Harris, she, the Vice President, etc. How can we "resolve" these different surface forms of reference in context to understand that they refer to the same person? SLP Chapter 22.

Discourse parsing and coherence.

How can we represent and computationally infer the way in which clauses and sentences are meaningfully tied together into larger, coherent discourses? SLP Chapter 23.

Question answering.

How can we build a system capable of intelligently answering natural language questions from a user? SLP Chapter 14.

Chatbots and dialogue systems.

How can we build a system that can engage in a coherent, interesting, or entertaining conversation? SLP Chapter 15.

Topic modeling.

Given a corpus of text documents but no labels, can we determine the broad topics the documents are about? Textbook from Boyd-Graber et al.

Common sense reasoning.

Does a cat have fur? Does a toaster? Is the moon bigger than a breadbox? Is it actually made of cheese? These sorts of commonsense questions are easy for humans but extremely non-trivial for computational models to infer from language. How do we approach this problem? ACL Tutorial from Sap et al.

Cross-lingual word embeddings.

We learned about embeddings in one language - is it possible to make embeddings for multiple languages but for which the dimensions are comparable, so "pants" in English would be near "pantalones" in Spanish? Textbook from Sogaard et al.

Text summarization.

Given a long document, can we make it shorter and still keep most of the important information? Older article by Radev et al., newer neural paper by See et al.

... and many more!

Feel free to explore beyond! In addition to the resources in the "places to look for information" section above, another good place to look for topics of interest is the Synthesis Lectures on Human Language Technologies book series - these are relatively short, freely available textbooks that attempt to synthesize the work in some area of CL/NLP. You could do this assignment by picking one of those, reading the introduction to get a sense of the problem, and going from there.

Additionally, the section of Workshop Proceedings in the ACL Anthology is, in its own way, a fantastic summary of areas of interest in CL/NLP, and the proceedings of each workshop could be a great place to start for learning about that topic. In particular most workshops have a summary paper (usually the first one in the proceedings) that explains why the workshop is a thing, and this is often a great starting point too.

Another interesting place to look is the world of shared tasks. These are tasks proposed by people in the CL/NLP community who collect data and set up an evaluation scheme, and then people compete to see what they can discover and who can get the best performance on the task. One long-running example of this is SemEval, which has yearly tasks related to semantic understanding. These are great places to start since the setup is often provided for you in a nice way. This is especially useful for tasks from previous years where you can already read the papers and see precisely what people did.

Lastly

Deliverable 1. A writeup in ACL format.

Roughly, this should include at least an abstract (short one-paragraph summary of what was done); a brief introduction covering the task, its importance, and any relevant references; a data section describing any datasets used; a methods section describing what precisely was done; a results section describing the findings; and a discussion qualitatively exploring what those results might mean and any associated error analysis or thoughts on future work. You can deviate from this basic structure if your project calls for it. For instance, projects doing a literature review or other writing-focused approach might be better served by a different structure, and this is okay.

Length

If your project has a substantial coding component, 2-4 pages is a fine length. If your project is more writing-focused (e.g. literature review or system error analysis) I expect more like 6-8 pages, but definitely no more than 10 pages. Note that the ACL format fits a good amount of text on the page, so these are not like your traditional double-spaced Word document paper.

LaTeX

You'll note that the ACL format linked above is an Overleaf LaTeX template. LaTeX is a system for scientific and mathematical typesetting which is widely used in academic writing of various sorts, for making textbooks and more (e.g., the SLP textbook is written in LaTeX). It's fundamentally a markup language that defines a rough structure, and the LaTeX compiler takes those instructions and produces a pretty document. It's a pretty amazing system, and can look intimidating at first but it's really not so bad and definitely worth learning.

If you click "Open as Template" in that link, it will create a new document starting from the template, which you can then edit. Fundamentally you'll want to simply edit the text in the document. Overleaf is a very friendly editing interface which shows the compiled document on the right, periodically updating it. You can collaboratively edit on there and track changes etc. They have great documentation too, for instance if you want to insert images read here.

Probably one of the trickiest parts is references - but also, once you learn it, one of the most useful! This ultimately saves a ton of headaches formatting references in the bibliography because it does it all for you. You'll notice that when you create a new document from the template, there is a list of files on the left in Overleaf. One is called acl2020.bib, which if you open it contains a list of (example) references in BibTeX format. Most citation management software (e.g. Zotero, Mendeley, etc) can export references in this format, and Google Scholar generates these as well - click the little hollow double-quote to the bottom left of an item, then in the popup click BibTeX. You can simply copy-paste references in this format into acl2020.bib (or type them in / edit them yourself), and then cite them in the document with \citep{id} for parenthetical citations or \citet{id} when the authors' names should appear outside of the parens. The id is whatever you put at the top of the BibTeX entry right after the open curly brace - check out how they do it in the template and you'll understand.

If you have even the slightest passing interest in doing any scientific writing, typesetting, etc, I strongly recommend you use LaTeX for this writeup. If you're dead certain you will never do that and just want to use Word, there is an ACL 2020 Word template available here. Either way, submit a pdf of your writeup.

Deliverable 2. Any associated code or data you used or created for the project.

Please have this be runnable somewhere on Quest (e.g., in one group member's user directory), or in a GitHub repository that we can clone and run on Quest. This can be .py files or a Jupyter notebook. Either way, please do at least some level of commenting in the code so we can understand what's going on. Please be aware of plagiarism concerns - it's fine to e.g. use an implementation of a useful function from a StackOverflow post, but in that case cite the URL of the post in a comment in your code.
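As one possible way to handle that attribution, a borrowed helper could carry a comment like the one below. The `chunked` function is just an illustration, and the URL is a placeholder to be swapped for the actual post you adapted.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from `items`."""
    # Adapted from a StackOverflow answer:
    # https://stackoverflow.com/questions/<question-id>  (placeholder, use the real URL)
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Split a tokenized sentence into batches of two tokens.
print(list(chunked(["the", "cat", "sat", "down", "."], 2)))
```

Putting the citation right next to the borrowed code, rather than only in the writeup, makes it easy for us to see exactly which parts came from elsewhere.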

If you need to use packages that are not included in the default conda install, please follow the directions in the "Environments" section here to create a conda environment in your user directory that we can activate to run your code.

Deliverable 3. Sharing findings (optional).

You are not required to do this, but I strongly recommend writing a short post about your final project on Ed, and would love to see them, so we can all see what everyone got up to!

This could literally be as simple as copy-pasting your abstract and attaching your writeup pdf to your post. Check out what others did and comment if you're interested!

Submission

https://forms.gle/UgVmSy9ECdwpN3gJA