Welcome to Assignment 7 and the Final Project!

In this last assignment, the goal is to think creatively, explore available datasets, and use the skills you've learned from the course to process that data and do something interesting with it. What exactly that interesting thing will be is up to you.

I do not expect this to be a complete start-to-finish project; it's more an exercise in seeing what sorts of linguistic and textual data are out there and playing with your new skills. You will naturally encounter some bumps in the road with weird formatting, multiple files, and so on - that's part of the process. Do your best and see what you can uncover!

If you're doing the "default assignment," your tasks are fundamentally the following:

  1. Get Some Data
  2. Do Something With It
  3. Tell Me About It

The options for each of these will be described in further detail below. Depending on your dataset, one step may end up being a larger task than the others.

If you've already talked with me about your final project, you're good to go - feel free to stick with that plan.

Whether you're doing your own project or the default, here's what I'm asking for:

A short writeup (equivalent to 1-2 pages, which could be either a Word/PDF document or simply inside a Jupyter notebook) of what you did, why you did it, what hurdles and difficulties you ran into, and what you found, as well as your code in whatever format (bash scripts, Python scripts, Jupyter notebooks, etc). I expect you to put in about a normal assignment's worth of effort, and document as you go along; if you get that far but aren't "finished" in whatever sense, you are free to stop and let me know how far you got.

Please put all this on Quest in an assignment7 folder.

1. Get Some Data

To do text processing work, we need text data. Below I will provide a very large list of resources to consider as possible places to get data.

Some existing resources are in our course project directory, /projects/e31086/data. Be aware of space usage: we have 500GB free there, and you each have 80GB of space in your home directory. So don't download MASSIVE datasets; if you have any questions about space usage please post on Piazza.

For getting data from these sources onto Quest, one option is to use the information here: https://kb.northwestern.edu/page.php?id=70521

Alternatively you can use our old friend wget while logged in to Quest. If, for instance, a project page has a direct link to the dataset, right-click on it and choose 'Copy link address' (or similar), then paste that URL as the argument to wget.

Multilingualism

The data sources below are English-heavy, but to be 100% clear, you are MORE THAN WELCOME!!! to find and play with data in non-English languages if you speak and/or are interested in them. spaCy has support for several languages beyond English, documented here. If you want to work with a language not available in spaCy, Stanza is another NLP library with support for 60+ languages including many not in spaCy. If it turns out that essentially your entire assignment is e.g., learning how to use the basics of Stanza to work with some set of non-English data, I'm completely fine with that.
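
If you're curious what that looks like, here is a minimal sketch of running Stanza on a bit of French (just an example language; this assumes you've pip-installed stanza, and the sentence is made up):

import stanza
stanza.download('fr')                      # one-time download of the French models
nlp = stanza.Pipeline('fr')                # tokenization, POS tagging, lemmatization, parsing by default
doc = nlp("Le petit chat dort sur le canapé.")
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.upos)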

Getting data from the web

The internet is fundamentally an unimaginably huge book of interlinked pages, each of which is largely made up of text. So naturally it makes for a great place to get text data.

Web Scraping

One great way to get data is by web scraping, writing a computer program to read webpages and process them into a usable format. At your current stage of programming you have the necessary skill to learn web scraping! Python is a great language for it, in particular because of the BeautifulSoup library which makes it (relatively) easy to intelligently handle the sometimes-complex formatting of modern websites.

If you want to try out web scraping for this assignment, I recommend first checking out this very informative video, and going from there: https://www.youtube.com/watch?v=ng2o98k983k
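
To give you a flavor, here is a minimal sketch using the requests library together with BeautifulSoup (the URL is just a placeholder - point it at whatever page you're actually interested in, and check the site's terms of use first):

import requests
from bs4 import BeautifulSoup

url = 'https://example.com/some-page'         # placeholder URL
html = requests.get(url).text                 # fetch the raw HTML
soup = BeautifulSoup(html, 'html.parser')     # parse it into a navigable tree
for paragraph in soup.find_all('p'):          # grab every <p> tag on the page
    print(paragraph.get_text())               # just the text, with the HTML markup stripped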

APIs

Note that some large websites like Twitter, Facebook, and Reddit have relatively aggressive anti-scraping policies, so be careful: if you try to scrape them you may get your IP banned! Instead, they want to encourage you to use their API (application programming interface) - a controlled method of obtaining their data. Here's a brief primer on what APIs are: https://www.youtube.com/watch?v=OVvTv9Hy91Q

Using these APIs often requires registering for an account to "authenticate" before you can make requests. In the case of Twitter, for instance, you need to sign up here: https://developer.twitter.com/en

APIs themselves are generally based on going to certain (complicated) URLs, but there are often very convenient Python "wrappers" for APIs, for instance again for Twitter there's one called Tweepy: http://www.tweepy.org/
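
As a rough sketch of what using a wrapper looks like in practice (this assumes Tweepy 4.x and that you've already registered for developer credentials - the bearer token below is a placeholder):

import tweepy

client = tweepy.Client(bearer_token='YOUR_BEARER_TOKEN')              # placeholder credential
response = client.search_recent_tweets(query='linguistics', max_results=10)
for tweet in response.data:                                           # response.data is a list of Tweet objects
    print(tweet.text)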

Web scraping and working with APIs are big tasks unto themselves - if you go this way for the assignment, it may be that obtaining some reasonable dataset takes up all your time, and that's fine.

Using Existing Datasets

If you don't feel like web scraping, luckily there is a TON of natural language data that people have already collected that you can use.

LDC

One core repository for linguistically-motivated data in particular is the Linguistic Data Consortium, hosted at the University of Pennsylvania. This data requires a subscription to access, but at Northwestern we have a full access subscription. Check out their Catalog, try searching, look at the Top 10 Corpora, and look at corpora collected by "project".

There's a lot there - a ton of news text for one thing, and some very interesting transcribed speech corpora. These include the very classic Switchboard corpus of unscripted telephone conversations used in many computational linguistics papers, as well as the similar CALLHOME and CALLFRIEND corpora.

If you want to use LDC data for this assignment, please post a link to the corpus page(s) you're interested in on Piazza (so others know it's available) and I'll download it to our class data folder.

BYU Datasets: COCA, COHA, et al

Mark Davies at Brigham Young University has collected a number of large datasets that are in wide use, most notably the Corpus of Contemporary American English (COCA) and the Corpus of Historical American English (COHA). The former contains a genre-balanced set of contemporary texts, and the latter has documents from 1810 to the present.

These two datasets are in the class data directory, /projects/e31086/data. However, a very important note: due to licensing, these datasets may only be used for the purposes of this class and cannot be downloaded to your personal machine. I will get in trouble if this rule isn't followed, so please stick to it and work with these datasets on Quest only. If you want to work with this data for a project after the class, I'm glad to set that up with you - just contact me about it.

More information about the other BYU datasets is available here: https://www.corpusdata.org/

They also include datasets of scripts from TV and movies, non-American Englishes, and news. These are also pay-to-play, but we have access; if you want to use one of them, let me know.

Other Sources

Here are some other assorted places to go looking! The resources here are all generally freely available.

And just look around - use search engines and explore; there's lots out there! For instance, just randomly searching right now I ran into this JSON file of 200k+ Jeopardy questions, neat! https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/

2. Do Something With It

Okay, now you've got a dataset. Your next task is to look at the data, read the documentation for the dataset, and write a function or set of functions that reads that data into some usable format for processing.

Note on Reading Multiple and Large Files

I want to make you aware of some useful Python functionality for this in the os module, which provides an interface to the operating system for reading files and organizing file paths. Here is a theoretical example for reading the files in COCA:

import os
base_dir = '/projects/e31086/data/COCA'
for filename in os.listdir(base_dir):               # list everything in base_dir, like ls
    full_path = os.path.join(base_dir, filename)    # e.g. '/projects/e31086/data/COCA/w_mag_2002.txt'
    with open(full_path) as f:                      # now read the file as usual
        text = f.read()

So the key methods from os to be aware of are os.listdir (like ls) and os.path.join which takes pieces of a filename and joins them cleanly - in this case, our base_dir '/projects/e31086/data/COCA' and a particular filename, say 'w_mag_2002.txt', and combines them into a full_path like '/projects/e31086/data/COCA/w_mag_2002.txt' which we can then pass to open to read.

Another thing to be aware of: for the purposes of this project, you may want to sample only a portion of the files if there are a ton of them, or only a portion of the lines when the files are very big. For instance, across the 20 decades in COHA, there are more than 116,000 files, each of which is one document. So you might want to sample, for instance, only 1% of the files. You can do this like so:

import os, random
base_dir = '/projects/e31086/data/COHA'     # assuming the COHA data lives in a COHA subdirectory, organized by decade
for decade in os.listdir(base_dir):
    decade_dir = os.path.join(base_dir, decade)
    for filename in os.listdir(decade_dir):
        if random.random() < 0.99: continue # skip ~99% of the files, i.e. sample ~1%
        full_path = os.path.join(decade_dir, filename)

Things to Do

So I'm being deliberately vague here about what you should actually do with this data once you've got a way to read it, and it may well be that reading it is the biggest part of the challenge. It may also be that reading it is as simple as json.load and you're already there!
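
For instance, if your dataset is a single JSON file (like the Jeopardy questions mentioned above), reading it might just be this - the filename here is hypothetical, whatever you saved the download as:

import json

with open('jeopardy_questions.json') as f:    # hypothetical filename
    data = json.load(f)                       # for a file containing a JSON list, this gives you a Python list
print(len(data))
print(data[0])                                # peek at the first record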

Regardless, what will be interesting to do depends on what the dataset is and what you'd like to learn from it. Be creative! But here are some possibilities:

Count Stuff

How much language is there? How many words, speakers, texts? What are the most common words in different categories, time frames, or genres? How about the most common n-grams? What words tend to modify what other words?
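
Here's one hedged sketch of the counting approach, assuming you've already read your documents into a list of strings called texts:

from collections import Counter

word_counts = Counter()
for text in texts:                              # `texts`: a list of document strings (assumed)
    word_counts.update(text.lower().split())    # crude whitespace tokenization; spaCy or NLTK would be cleaner
print(word_counts.most_common(20))              # the 20 most frequent words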

Use Lexical Resources

Can you use the lexical resources we've seen (CMUdict, concreteness, emotion words, WordNet) to summarize or better understand the data? Note that Chapter 20 of SLP has information on additional lexical resources that might be useful!

Remember that you could also create your own lexicons that capture relevant dimensions of the data and see how words in each might be distributed in different contexts!
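
For instance, here is a small sketch of poking at two of those resources through NLTK (it assumes you've done the one-time nltk.download calls for 'cmudict' and 'wordnet'):

from nltk.corpus import cmudict, wordnet

pron = cmudict.dict()                 # maps a word to its ARPAbet pronunciation(s)
print(pron['linguistics'])
senses = wordnet.synsets('bank')      # all WordNet senses of "bank"
print(senses)
print(senses[0].definition())         # gloss of the first sense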

Capture Patterns

Write regular expressions to extract meaningful pieces of information from the text. Count how often they occur in different contexts or time periods. You might find that developing the patterns themselves is already a hard task; write about any difficult or surprising linguistic tidbits you encountered in doing so.
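
A minimal sketch of that workflow, using a toy pattern for four-digit years and again assuming a list of document strings called texts:

import re
from collections import Counter

year_pattern = re.compile(r'\b(1[89]\d\d|20[0-2]\d)\b')   # toy pattern: years from 1800 to 2029
year_counts = Counter()
for text in texts:                                        # `texts`: a list of document strings (assumed)
    year_counts.update(year_pattern.findall(text))
print(year_counts.most_common(10))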

Write a New Format

Perhaps the original data has a lot of cruft or is not well-formatted for text processing. You could work on finding a better way to represent that data and writing it to a JSON, CSV, or plaintext file or files.
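
For example, here's a sketch of writing cleaned-up records out as a CSV file - the records list and its columns are placeholders for whatever structure your data actually has:

import csv

# hypothetical: `records` is a list of (doc_id, year, text) tuples you've extracted
with open('cleaned_data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['doc_id', 'year', 'text'])   # header row
    writer.writerows(records)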

Try Out Classification

Chapter 6 of the NLTK book has basic information on how to do text classification. If your data has clear binary categories, trying this out could be a great thing to experiment with, and calculating the most informative features learned by your model will potentially provide great insights. Classification is a whole further world unto itself, but give it a shot!
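
If you want a sense of the shape of that, here's a hedged sketch in the style of the NLTK book - labeled_texts is a hypothetical list of (text, label) pairs you'd build from your data:

import random
import nltk

def bag_of_words(text):
    return {word: True for word in text.lower().split()}   # crude bag-of-words features

featuresets = [(bag_of_words(text), label) for text, label in labeled_texts]
random.shuffle(featuresets)
train_set, test_set = featuresets[100:], featuresets[:100]  # hold out 100 examples for evaluation
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(20)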

3. Tell Me About It

Now in your write-up, describe your process and findings. Why did you choose the dataset you did, and what was it like working with it? What did you learn from this data exploration? What challenges did you encounter? Did you learn anything about language use in the domain of the data? Tell me what you did and what you found!