Welcome to Assignment 6!

This assignment is intended to be done mostly or entirely in class during week 8. The idea is that computational linguistics is a broad field, with more sub-topics and tasks of interest than we could possibly cover in a 10-week quarter.

So the plan here is to spend a week of class time working in small groups to learn about different areas of interest in the field. Each group will pick one area to investigate, and then next Monday (5/24), we'll have time in class for each group to give a very short (less than 5 min) lightning talk explaining what they found.

First read through this document individually to get a sense of the possibilities. Then as a group, talk about and decide which area you might like to dig into. I'll provide below a brief description of each topic area, as well as some resources to get you started on investigating the topic.

Note for Asynchronous Folks

There are a small number of people in the class who cannot attend synchronously due to distant time zones and other issues; if this is you, no worries. You can complete this assignment by doing essentially the same task on your own and writing a short (~1 page) write-up on what you find rather than preparing a presentation. Please email this to us and fill out the submission form at the bottom of the page.

Your Jobs

The main idea here is to understand the task/topic and its importance, understand why it might be difficult to solve computationally, and understand the approaches that people have tried to solve it. Some of these tasks are, at present, relatively "solved" in the sense that we have systems that do them very well. Nevertheless, they all have linguistic properties that make them interesting to think about and learn from.

Reading requirements

  • Read something introductory about the topic - most commonly a textbook chapter.
  • Read at least one research paper about the topic. It's okay if you don't fully understand all the details in the paper, but try to at least understand conceptually what is being done and why.

Places to look for information

Many of the topics below are simply chapters from SLP; each chapter there also has an extensive bibliography in which you can find papers to check out, which is a great starting point. Other resources:

  • The ACL Anthology is the official repository of research papers for the computational linguistics community. Searching for your topic here will turn up lots of results.
  • Of course, Google Scholar is also very useful; it can help to include "nlp" or "computational linguistics" in your search query.
  • nlpprogress.com is an open-source effort to track the latest and greatest results on many NLP/CL tasks.
  • SCiL is a non-ACL conference which has many interesting papers that are more "linguisticky."
  • The "related work" section of papers about the topic!

Presentation

Your group will give a presentation on the topic next Monday synthesizing what you've learned - strictly less than 5 minutes long. You can designate one person to do this, or all join in, as long as you all contribute to preparing the material to be presented. Beyond broadly introducing the topic area and its importance, I'd like you to pick a particular task within that area and aim to answer the following questions:

  • What precisely is the task? What are the inputs, what are the outputs?
  • Why is this task difficult? Ideally, show an example that illustrates why this task might be challenging.
  • What methods have been used historically to approach this task, and what methods are used today?
  • How is the task evaluated? How do we know when we're doing this task well? How well do the best systems do?

Possible Topics

Sequence tagging.

Rather than classify an entire sentence or document, many applications require giving a label to each individual word in a sequence; these include tagging for parts of speech as well as named entity recognition (distinguishing between types of proper nouns like people, countries, and organizations). SLP Chapter 8.
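
To make the input/output concrete, here is a minimal sketch of the classic "most frequent tag" baseline for POS tagging: label each word with whichever tag it received most often in training data. The training examples and tagset below are made up for illustration; real work uses corpora like the Penn Treebank.

```python
from collections import Counter, defaultdict

# Toy training data: sentences as (word, tag) pairs. Entirely made up.
train = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

# Count how often each word receives each tag in training.
counts = defaultdict(Counter)
for sentence in train:
    for word, tag_ in sentence:
        counts[word][tag_] += 1

def tag(words, default="NOUN"):
    """Most-frequent-tag baseline: give each word its commonest training
    tag, falling back to a default tag for unseen words."""
    return [(w, counts[w].most_common(1)[0][0] if w in counts else default)
            for w in words]

print(tag(["the", "dog", "sleeps"]))
# [('the', 'DET'), ('dog', 'NOUN'), ('sleeps', 'VERB')]
```

Despite its simplicity, this baseline is surprisingly strong for POS tagging, which is part of why papers on the task always compare against it.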

Syntactic parsing.

Given a sentence, under some theory of syntax, provide a parse tree that represents its hierarchical structure. Historically, CL/NLP has focused on constituency grammars (SLP Chapters 12 and 13), but these days dependency grammars are more prevalent (SLP Chapter 14).
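
For a feel of what constituency parsing involves, here is a sketch of the classic CKY algorithm over a tiny hand-written grammar in Chomsky normal form. The grammar, lexicon, and sentence are made up for illustration; real parsers learn rules (and rule probabilities) from treebanks.

```python
# Binary rules: (left child, right child) -> parent category.
GRAMMAR = {
    ("NP", "VP"): "S",
    ("DET", "N"): "NP",
    ("V", "NP"): "VP",
}
# One category per word, for simplicity (real lexicons are ambiguous).
LEXICON = {"the": "DET", "dog": "N", "cat": "N", "saw": "V"}

def cky(words):
    """CKY chart parsing: fill table[i][j] with trees spanning words[i:j],
    building longer spans from pairs of shorter ones."""
    n = len(words)
    table = [[[] for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        table[i][i + 1].append((LEXICON[w], w))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point
                for left in table[i][k]:
                    for right in table[k][j]:
                        parent = GRAMMAR.get((left[0], right[0]))
                        if parent:
                            table[i][j].append((parent, left, right))
    return table[0][n]

# Yields a single S tree as nested tuples.
print(cky("the dog saw the cat".split()))
```

Notice that the chart can hold multiple trees for an ambiguous sentence; ranking those analyses is a big part of what makes parsing hard.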

Semantic representations and semantic parsing.

The word vector semantics approach we've been using has benefits, but so too does a more hand-built logical representation that explicitly captures various aspects of meaning. What makes a good representation? The task of inferring these logical representations is called semantic parsing. SLP Chapters 15 and 19.

Information extraction.

How do we take free-form text and extract information about relations between entities and events, or turn free text into a relational database upon which queries can be executed? SLP Chapter 17.
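
One early and still-instructive approach is pattern-based extraction in the style of Hearst patterns: hand-written templates like "X such as Y and Z" that signal an is-a relation. The regex and example sentence below are simplified for illustration.

```python
import re

# "X such as Y, Z and W" suggests Y, Z, W are instances/kinds of X.
PATTERN = re.compile(r"(\w+) such as (\w+(?:, \w+)*(?:,? and \w+)?)")

def extract_is_a(text):
    """Extract (hyponym, 'is-a', hypernym) triples via one Hearst-style
    pattern. Real systems use many patterns, or learn them from data."""
    relations = []
    for m in PATTERN.finditer(text):
        hypernym = m.group(1)
        hyponyms = re.split(r",? and |, ", m.group(2))
        relations.extend((h, "is-a", hypernym) for h in hyponyms if h)
    return relations

text = "She studied languages such as Spanish, Basque and Nahuatl."
print(extract_is_a(text))
# [('Spanish', 'is-a', 'languages'), ('Basque', 'is-a', 'languages'),
#  ('Nahuatl', 'is-a', 'languages')]
```

Patterns like this are precise but brittle; much of the IE literature is about trading off that precision against the recall of learned models.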

Word sense disambiguation.

When a word has multiple possible meanings, how do we determine which meaning is intended in a given instance? SLP Chapter 18.
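
A classic knowledge-based approach is the (simplified) Lesk algorithm: pick the sense whose dictionary gloss shares the most words with the sentence's context. The two glosses below are made up stand-ins; real systems use WordNet glosses.

```python
# Toy sense inventory for "bank"; glosses invented for illustration.
SENSES = {
    "bank/finance": "a financial institution that accepts deposits of money",
    "bank/river": "sloping land beside a body of water such as a river",
}

def lesk(context_sentence):
    """Simplified Lesk: return the sense whose gloss has the largest
    word overlap with the context sentence."""
    context = set(context_sentence.lower().split())
    def overlap(sense):
        return len(context & set(SENSES[sense].split()))
    return max(SENSES, key=overlap)

print(lesk("he sat on the bank of the river fishing"))   # bank/river
print(lesk("she deposits money in the bank"))            # bank/finance
```

Even this crude overlap count disambiguates the toy examples, but you can see how it breaks when the context and gloss use different words for the same idea, which is where embeddings come in.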

Coreference resolution.

An individual entity could be referred to in any number of ways in a text: Kamala, VP Harris, she, the Vice President, etc. How can we "resolve" these different surface forms of reference in context to understand that they refer to the same person? SLP Chapter 21.

Discourse parsing.

How can we represent and computationally infer the way in which clauses and sentences are meaningfully tied together into larger, coherent discourses? SLP Chapter 22.

Question answering.

How can we build a system capable of intelligently answering natural language questions from a user? SLP Chapter 23.

Chatbots and dialogue systems.

How can we build a system that can engage in a coherent, interesting, or entertaining conversation? SLP Chapter 24.

Topic modeling.

Given a corpus of text documents but no labels, can we determine the broad topics the documents are about? Textbook from Boyd-Graber et al.

Common sense reasoning.

Does a cat have fur? Does a toaster? Is the moon bigger than a breadbox? Is it actually made of cheese? These sorts of commonsense questions are easy for humans but extremely non-trivial for computational models to infer from language. How do we approach this problem? ACL Tutorial from Sap et al.

Cross-lingual word embeddings.

We learned about embeddings in one language - is it possible to make embeddings for multiple languages but for which the dimensions are comparable, so "pants" in English would be near "pantalones" in Spanish? Textbook from Søgaard et al.

Text summarization.

Given a long document, can we make it shorter and still keep most of the important information? Older article by Radev et al., newer neural paper by See et al.
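
The extractive flavor of this task (covered in the Radev et al. article) can be sketched in a few lines: score each sentence by how frequent its content words are in the document, then keep the top scorers in document order. This is a Luhn-style heuristic; the stopword list and example are made up for illustration.

```python
import re
from collections import Counter

# Tiny stopword list for illustration; real systems use longer ones.
STOPWORDS = {"the", "a", "an", "is", "it", "and", "of", "to", "in"}

def summarize(text, n_sentences=1):
    """Luhn-style extractive summary: rank sentences by the document
    frequency of their content words; keep the top n in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"\w+", text.lower())
             if w not in STOPWORDS]
    freq = Counter(words)
    def score(sent):
        return sum(freq[w] for w in re.findall(r"\w+", sent.lower())
                   if w not in STOPWORDS)
    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in top]

text = ("Parsing is hard. Neural parsing models improved parsing accuracy. "
        "The weather is nice.")
print(summarize(text))
```

Abstractive summarization (as in See et al.) instead generates new text rather than selecting sentences, which is much harder to get right and to evaluate.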

Other Possibilities

You are not explicitly limited to these topics. Feel free to explore beyond! In addition to the resources in the "places to look for information" section above, another good place to look for topics of interest is the Synthesis Lectures on Human Language Technologies book series - these are relatively short, freely available textbooks that attempt to synthesize the work in some area of CL/NLP. You could do this assignment by picking one of those, reading the introduction to get a sense of the problem, and going from there.

Additionally, the Workshop Proceedings section of the ACL Anthology is, in its own way, a fantastic summary of areas of interest in CL/NLP, and the proceedings of each workshop could be a great place to start learning about that topic. In particular, most workshops have a summary paper (usually the first one in the proceedings) that explains why the workshop exists, and this is often a great starting point too.

Extensions

The general task is to do some reading, get a sense for an area, and collectively distill it down into 5 minutes. If you want to go further with it, here are a few possibilities:

  • Book exercises. For the topics that are in SLP, the book provides exercises at the end of the chapter; check them out and do a few!
  • Implementation. Write your own program to do a particular task in your chosen topic area.
  • Error analysis. Find an existing implementation on the web (or use your own), run it on some open data for the task, and find interesting patterns - where does the model fail, and why does that seem to happen?

Time permitting, it would be fantastic to incorporate your findings into your presentation! Either way, let us know on the submission form below if you do these or any other additional work here.

Submission

Please fill out this form when you're finished with this assignment: https://forms.gle/d8eapks6nBMppaNC9

And please have a group member submit your presentation materials here before class Monday: https://drive.google.com/drive/folders/1yPrcw7lFBD1NHdOEDR5rRsbswDEYz8Uq?usp=sharing