Welcome to Assignment 5!

And welcome yet again to a new world in which to do programming. We started with the command line, moved on to directly editing .py files with our code using command-line text editors, and now we're here in Jupyter-land. This file is what's called a "Jupyter notebook", a user-friendly and highly readable document that allows us to not only write code but also directly see the outputs in this web browser editor application. Very cool!

So this is another transition but hopefully one that makes your life easier rather than harder. We're very lucky that the Quest infrastructure has an easy setup for Jupyter notebooks already, so we can essentially just go to a URL in a web browser, log in, and directly access our code and files. Check out the resources for this week on the course website for more information about Jupyter.

Instructions for logging in are here: https://kb.northwestern.edu/94116

In this assignment, we won't introduce a huge amount of new material, but we will go deeper on what we do know. We'll become more expert at working with different sorts of dictionaries, reading and writing files, and doing some interesting text processing.


0. Info and Preparation

One neat thing about Jupyter is there are two main cell types: code and Markdown. Cells default to code, but can be switched to Markdown in the menus above (Cell > Cell Type > Markdown).

All the normal-looking text in this assignment is in Markdown cells. Markdown is a very simple formatting language that allows you to do pretty text formatting with only ascii text input. No need to dig too deep into it, but there's a cheat sheet here if you're interested: https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet

Anyway, this means you can input your info in a normal text-looking way. Do that in the cell below - double-click to edit formatted Markdown cells, delete the [your_answer] and fill in.

Name:

[your_answer]

Hours this took:

[your_answer]

Comments or Questions:

[your_answer]

Jupyter has many useful keyboard shortcuts which are worth looking up if you're interested (e.g. as explained in the first four sections here). But I want to make sure you know at least one: Shift+Enter. This 'executes' a given cell. For a code cell, this runs the code. For a Markdown cell, this does formatting on the cell and makes it all pretty. Note you can also do the same thing with the "Run" button in the toolbar at the top. Make sure you execute the cell above to prettify it.

FYI, you downloaded this assignment in a zip file containing some auxiliary data files we'll be working with, so they should also be in your assignment5 directory


Here are some helper functions that you can feel free to use if they're useful. They strip a string down to certain character sets - so letters_only reduces a string to only alphabetic characters, remove_letters removes all alphabetic characters, analogously for digits_only and remove_digits, and letters_and_spaces_only removes everything but alphabetic characters and spaces. tokenize is a version of our simple tokenizer from the previous assignment which takes in a string and returns a list of words. Leave these as-is but run the cell.

In [2]:
import string, math

def letters_only(s):
    return ''.join([c for c in s if c.isalpha()])

def remove_letters(s):
    return ''.join([c for c in s if not c.isalpha()])

def digits_only(s):
    return ''.join([c for c in s if c.isdigit()])

def remove_digits(s):
    return ''.join([c for c in s if not c.isdigit()])

def letters_and_spaces_only(s):
    return ''.join([c for c in s if (c.isalpha() or c == ' ')])

def tokenize(s):
    return [w.strip(string.punctuation) for w in s.lower().split()]

And here's our old friend run_tests again! Also leave as is but run this cell.

In [3]:
def run_tests(func, tests):
    print('\tRunning {} tests on the `{}` function...'.format(len(tests), func.__name__))
    errors = 0
    for val, ret in tests:
        try:
            if type(val) == tuple:
                assert func(*val) == ret
            else:
                assert func(val) == ret
        except AssertionError:
            print('\t\terror for input {}'.format(val))
            errors += 1
    if errors == 0:
        print('\tAll tests passed!')

Important Note

You have to run the above cells each time you open the notebook to have access to these functions. In fact, I suggest re-running the entire notebook every time you open the assignment, either with Kernel > Restart and Run All in the menu or by clicking the fast-forward icon in the toolbar.


1. Different Dazzling Dictionaries

As we've very glancingly touched upon in class, Python is what's called an "object-oriented" programming language. Briefly, this means that everything we use is abstractly a certain type of "object" which has an inherent structure and is inherently associated with certain attributes and methods. Any given variable is an "instance" of that type of object, so if we had a variable my_string = 'hello!', my_string is an instance of a string object and therefore it has all the methods and attributes strings have like lower() and startswith(). More info on the object-oriented paradigm on Wikipedia if you're interested.

In Python the type of an object is called its class, and one great thing Python can do is have sub-classes, which inherit the attributes and methods of their parent class, but add additional or more specific functionality for certain situations.

This is all background to help you understand the two new types of dictionaries we'll be working with here, Counter and defaultdict. These are both subclasses of the basic dictionary class, so you can do anything with them that you can do with a dictionary but they have additional functionality on top of that.


We'll start with Counter.

In [4]:
from collections import Counter

As you can see, Counter is a class available in the collections module. Documentation here. It is a dictionary class tailor-made for, well, counting things. Unlike a normal dictionary, the values in a Counter default to integers set to 0. In the previous assignment we used dictionaries to count, but each time had to check if the key existed (and create it if not). With a Counter, we can simply increment the value for any key (e.g. counts[word] += 1) and it will do all that for us. To see how this works, let's re-implement our word_counts function from last week.

a. Complete the function word_counts to count the words in a string, using a Counter.

This should be very quick to do and look almost the same as what you did last week, except with fewer lines of code since we can rely on Counter to do some of the cleanup for us as described above. Remember that a tokenize function has been implemented above that you can use.

In [5]:
def word_counts(s):
    """Tokenize the str `s` and accumulate counts in a dictionary for each 
    word that appears.
      
    Parameters
    ----------
    s : str
        The input string.

    Returns
    -------
    dict of { str : int }
        Word counts in the string, with words as keys and their corresponding
        counts as values.
    """
    counts = Counter()
    # >>> YOUR ANSWER HERE
    tokens = tokenize(s)
    for token in tokens:
        counts[token] += 1
    # >>> END YOUR ANSWER
    return counts
In [6]:
tests = [
    ('a a b a b b c a', {'a': 4, 'b': 3, 'c': 1}),
    ('I wish to wish the wish you wish to wish', {'i': 1, 'wish': 5, 'to': 2, 'the': 1, 'you': 1}),
    ('      ', {}),
    ("RaZzLe dAzZlE", {'razzle': 1, 'dazzle': 1})
]
run_tests(word_counts, tests)
	Running 4 tests on the `word_counts` function...
	All tests passed!

Counters have several useful built-in methods too. For instance, the most_common() method takes an integer as an argument and returns key-value tuples of that many of the most common words in the Counter.


Now for defaultdict.

In [7]:
from collections import defaultdict

Counter above allows us to set the default type for any key to an integer (set to 0). defaultdict is similar but more general: it allows us to set the default type to any Python object (which can be arbitrarily complex). Documentation here. Below we'll practice using defaultdict to read in a complicated but fun sort of file: a pronouncing dictionary!

We'll be using the CMU Pronouncing Dictionary, the website of which is here. They describe it succinctly as follows:

The Carnegie Mellon University Pronouncing Dictionary is an open-source machine-readable pronunciation dictionary for North American English that contains over 134,000 words and their pronunciations. CMUdict is being actively maintained and expanded. We are open to suggestions, corrections and other input.

The phonemic alphabet it uses is called ARPAbet from the 70s, which represents the American English-y section of the IPA in ASCII characters. Vowel phonemes are annotated with a digit representing stress: 0 means no stress, 1 means primary stress, and 2 means secondary stress.

The sociolinguists in the room may note that such a dictionary is (almost by definition) relying upon a semi-arbitrary notion of "standardness" in pronuncication and that it is missing a lot of meaningful variation. The prosody afficionados will notice that its annotations of accent make some semi-arbitrary assumptions about information structure and focus. Fair criticisms all, and more where that came from, but nevertheless this dictionary and others like it have been used to great effect for many very exciting applications like speech recognition, so it's very worth playing with!

The dictionary itself is contained in your data zip as a file called 'cmudict-0.7b', and I highly encourage you to open and look at the file (e.g. in less) and its structure before doing the next problem.

b. Complete the function read_cmudict to load the CMU Pronouncing Dictionary into a Python dictionary, using a defaultdict.

In this problem you'll load the CMUdict file into a format we can work with directly in Python. Specifically, each word in CMUdict will be a key in the dictionary, and each value will be a list of the pronunciations that word can take.

Important things to note:

  • This is an old file from 1993 and therefore uses an out-of-date encoding. We don't need to worry about those details for now so the for loop over lines in the file has been provided for you with the encoding set up.
  • Skip (with continue) all lines for which the first character is a punctuation mark. This both removes comments from the file as well as some pronunciations we will not use, like the pronunciations for punctuation marks themselves.
  • All the words and pronunciations are in all caps - leave them that way.
  • Each word and its pronunciation is separated by two spaces ' '. Useful for split()ing.
  • Always append the pronunciation, don't assign it.

Possibly the trickiest part of this is that words with multiple pronunciations in the dictionary will appear on multiple separate lines, one pronunciation per appearance. Each instance after the first will have (N) appended to the end of the word, like this:

ACTUALLY  AE1 K CH UW2 AH0 L IY0
ACTUALLY(1)  AE1 K CH L IY0
ACTUALLY(2)  AE1 K SH AH0 L IY0

The goal is to ultimately collect three lines in the dictionary as a key-value pair where the key is the string "ACTUALLY" and the value is a list of strings like ["AE1 K CH UW2 AH0 L IY0", "AE1 K CH L IY0", "AE1 K SH AH0 L IY0"].

Play with split() to figure out how to do this. As a hint, note that given a string my_string and a character c, if you do my_string.split(c) but c is not in my_string, you get back a list of length one where the 0th index simply contains my_string.

In [8]:
def read_cmudict(cmudict_file='cmudict-0.7b'):
    """Read the CMU Pronouncing Dictionary into a Python dictionary.

    Parameters
    ----------
    cmudict_file : str
        The path to the cmudict file.

    Returns
    -------
    dict of { str : list of str }
        Dictionary mapping words (in ALLCAPS) to lists of strings
        containing their pronunciations from CMUdict.
    """
    cmudict = defaultdict(list)
    for row in open(cmudict_file, encoding='iso-8859-1'):
        # >>> YOUR ANSWER HERE
        pass
        if row[0] in string.punctuation: continue
        w, p = row.split('  ')
        word = w.split('(')[0]
        pron = p.strip()
        cmudict[word].append(pron)
        # >>> END YOUR ANSWER
    return cmudict

cmudict = read_cmudict()

print('CMUdict loaded with {} entries'.format(len(cmudict.keys())))
CMUdict loaded with 125002 entries
In [9]:
error = False
# Test the expected number of entries
print('Checking number of entries...')
try:
    assert len(cmudict) == 125002
except:
    error = True
    print("\tCMUdict doesn't contain the right number of entries.")

# Test CMUdict with particular words
print('Running checks on 5 particular words...')
tests = [
    ("COMPUTATIONAL", ["K AA2 M P Y UW0 T EY1 SH AH0 N AH0 L"]),
    ("LINGUISTS", ["L IH1 NG G W IH0 S T S"]),
    ("ARE", ["AA1 R", "ER0"]),
    ("SOMETIMES", ["S AH0 M T AY1 M Z", "S AH1 M T AY2 M Z"]),
    ("WHIMSICAL", ["W IH1 M Z IH0 K AH0 L", "HH W IH1 M Z IH0 K AH0 L"])
]
for val, ret in tests:
    try:
        assert cmudict[val] == ret
    except:
        error = True
        print('\tError on {}, expected {} but got {}'.format(val, ret, cmudict[val]))
if not error:
    print('All tests passed!')
Checking number of entries...
Running checks on 5 particular words...
All tests passed!

2. At The Speed of Sound

Now that we have our pronouncing dictionary, let's dig a little deeper into some things we can calculate about word pronunciations.

a. Complete the function syllable_count to calculate the number of syllables in a CMUdict pronunciation.

Remember that a syllable is defined by the presence of a vowel phoneme, and vowel phonemes in CMUdict have an accent marking on the end.

In [10]:
def syllable_count(p):
    """
    Takes a cmudict pronunciation string and returns the number of syllables.

    Parameters
    ----------
    p : str
        Pronunciation string from CMUdict (ARPAbet phonemes separated by spaces).

    Returns
    -------
    int
        Count of syllables in the pronunciation, defined as the number of items
        in the pronunciation that end with a digit (representing their stress).
    """
    # >>> YOUR ANSWER HERE
    return len(digits_only(p))
    # >>> END YOUR ANSWER
In [11]:
tests = [
    ("K AE1 N", 1),
    ("EH1 N IY0 W AH2 N", 3),
    ("AH2 N D ER0 S T AE1 N D", 3),
    ("DH IH1 S", 1)
]
run_tests(syllable_count, tests)
	Running 4 tests on the `syllable_count` function...
	All tests passed!

b. Complete the function final_syllable to extract the portion of a CMUdict pronunciation from the final vowel phoneme until the end.

Below we're going to calculate whether two words rhyme, but to do this we need to extract the final syllable in each pronunciation: specifically, the phonemes from the final vowel phoneme through any potential consonant coda phonemes to the end of the word. So for a word like "EXPLAIN", with pronunciation "IH0 K S P L EY1 N", we want to extract "EY1 N", and we could see then that this could rhyme with "RAIN", which also ends in "EY1 N".

This function also has an optional argument, remove_accent, which when True should remove the accent markings from the returned answer. This is because we might miss for instance a rhyme between "EXPLAIN" and "COLTRANE", since "COLTRANE" ends with "EY0 N". So when True, the pronunciations for both should return simply "EY N".

This is somewhat tricky, since we want to find the location of the last accent marking, but potentially return a string without any accents.

One potentially useful built-in function here is enumerate. Documentation here. Any sequence can be wrapped in enumerate to add an index during a loop. For example:

>>> tokens = ['hello', 'dear', 'friends']
>>> for idx, word in enumerate(tokens):
...     print('current index is {} and current word is {}'.format(idx, word))
...
current index is 0 and current word is hello
current index is 1 and current word is dear
current index is 2 and current word is friends

I would do this problem by splitting the pronunciation into a list, doing some processing, and re-joining it at the end.

In [12]:
def final_syllable(p, remove_accent=False):
    """
    Takes a cmudict pronunciation string and returns the portion from
    the last vowel phoneme to the end (inclusive).
    
    Parameters
    ----------
    p : str
        Pronunciation string from CMUdict.
    remove_accent : bool (optional)
        Whether or not to remove accent markings from the returned phonemes.

    Returns
    -------
    str
        The final portion of the pronunciation, from the final phoneme with an
        accent marking (e.g., that ends in a digit) to the end.
    """
    # >>> YOUR ANSWER HERE
    index_of_last_accent = -1
    phonemes = []
    for i, phoneme in enumerate(p.split()):
        if phoneme[-1] in string.digits: # these have accent markings
            index_of_last_accent = i
        if remove_accent: phoneme = remove_digits(phoneme)
        phonemes.append(phoneme)
    return ' '.join(phonemes[index_of_last_accent:])
    # >>> END YOUR ANSWER
In [13]:
def final_syllable(p, remove_accent=False):
    """
    Takes a cmudict pronunciation string and returns the portion from
    the last vowel phoneme to the end (inclusive).
    
    Parameters
    ----------
    p : str
        Pronunciation string from CMUdict, comprised of space-separated phonemes.
    remove_accent : bool (optional)
        Whether or not to remove accent markings from the returned phonemes.

    Returns
    -------
    str
        The final portion of the pronunciation, from the final phoneme with an
        accent marking (e.g., that ends in a digit) to the end.
    """
    # >>> YOUR ANSWER HERE
    
    
    
    # so we have to find the LOCATION OF the final vowel   (integer index)
    location_of_final_vowel = 0
    # out of all the phonemes   (p.split, list of strings)
    for idx, phoneme in enumerate(p.split()):
        # so we have to check each phoneme to see if it's a vowel       (phoneme[-1].isdigit)
        if phoneme[-1].isdigit():
            # and keep updating the location of the most recent one
            location_of_final_vowel = idx
    # until we've seen them all, and the most recent one is the last one
    
    # return the final syllable         
    # which is the final vowel to the end  (string)
    final_syll = ' '.join(p.split()[location_of_final_vowel:])
    if remove_accent:
        return remove_digits(final_syll)
    else:
        return final_syll
    # >>> END YOUR ANSWER
In [14]:
tests = [
    (("B OW1 G AH0 S"), "AH0 S"),
    (("B AA1 B S L EH2 D"), "EH2 D"),
    (("B AA1 G AH0 L", True), "AH L"),
    (("B OY1 L ER0 M EY2 K ER0", True), "ER"),
    (("B OW1 L SH AH0 V IH2 K S", True), "IH K S")
]
run_tests(final_syllable, tests)
	Running 5 tests on the `final_syllable` function...
	All tests passed!

c. Complete the function words_rhyme to determine if two words could rhyme.

Now that we can extract the final syllable from a pronunciation, we can calculate if two words rhyme by whether those final syllables match!

Important things to note:

  • This function should be robust to case, so you want to run .upper() on both input words.
  • This function should work regardless of accent markings, so use yourremove_accent flag when you call final_syllable.
  • If either word is not in the CMUdict, return False (because we don't have enough information to say).
  • We want this function to return True if any possible pairing of the pronunciations of the words could rhyme, so you'll likely need to use a nested for-loop.

Finally, you should assume the cmudict dictionary variable, loaded above, is in scope and available. This is not always great practice (to use a global variable like cmudict here inside a local function scope), but it's fine for now so you don't have to load cmudict every time you check if things rhyme. Just remember you'll have to have run that cell for this to work.

In [15]:
def words_rhyme(word1, word2):
    """Checks if the word1 and word2 strings could rhyme.
      
    Parameters
    ----------
    word1, word2 : str
        Input words as strings.

    Returns
    -------
    bool
        True if any pronunciation of word1 shares a final syllable with 
        any pronunciation of word2 (ignoring accents), False otherwise.
    """
    # >>> YOUR ANSWER HERE
    if not (word1.upper() in cmudict and word2.upper() in cmudict):
        return False
    for p1 in cmudict[word1.upper()]:
        for p2 in cmudict[word2.upper()]:
            if final_syllable(p1, remove_accent=True) == final_syllable(p2, remove_accent=True):
                return True
    return False
    # >>> END YOUR ANSWER
In [16]:
tests = [
    (("cat", "hat"), True),
    (("LINGUISTICS", "MYSTICS"), True),
    (("MARBLE","MORBLE"), False),
    (("ORANGE", "PURPLE"), False),
    (("EXPLAIN", "COLTRANE"), True)
]
run_tests(words_rhyme, tests)
	Running 5 tests on the `words_rhyme` function...
	All tests passed!

Not a problem here - the below function I've implemented for you as a demonstration. It uses cmudict to try and calculate whether a line is in iambic pentameter or not.

It has comments explaining what's happening, feel free to read if you want. If you want an extra challenge, delete everything but the comments and try to reimplement this function. If you want an extra extra challenge, reimplement this from scratch!

In [17]:
import itertools

def detect_iambic_pentameter(line):
    """Detects whether a line is in iambic pentameter or not."""
    
    # Turns the line into a list of tokenized uppercase words
    words = [w.upper() for w in tokenize(line)]

    # Returns False if we can't give a pronuncication for every word
    if not all(w in cmudict for w in words):
        return False

    def stressed_unstressed(p):
        """Calculates sequence of stressed/unstressed syllables"""
        sus = ''
        for c in ''.join(p):
            if c == '0':
                sus += '0'
            elif c.isdigit():
                sus += '1'
        return sus

    # Sets what a proper iambic pentameter pattern would look like
    iambic_pattern = '0101010101'

    # The goal of this section is to accumulate the set of every possible
    # combination of every possible pronunciation for all the words.
    # possible_patterns will be a list of sets with len(possible_patterns) == len(words).
    possible_patterns = []
    for word in words:
        # For each word we create 'cur', a set representing all the
        # stressed-unstressed patterns for the current word.
        cur = set([])
        for p in cmudict[word]:
            sus = stressed_unstressed(''.join(p))
            cur.add(sus)
        # This is a special case to allow single-syllable words to be unstressed
        if cur == set(['1']):
            cur.add('0')
        possible_patterns.append(cur)

    # Now we will loop over all possible combinations of the patterns, which
    # we do using itertools.product(*possible_patterns). itertools.product creates
    # all possible combinations of the sequences given as its arugments.
    #
    # The * is a sometimes tricky operator which can be prepended to sequences
    # to 'unpack' them into arguments. So if we had a three-word line, with sets of
    # pronunciations cur1, cur2, cur3 for each of the 3 words, possible_products would be:
    #   [cur1, cur2, cur3]
    # The * operator unpacks these into arguments, so itertools.product(*possible_patterns)
    # would be equivalent to itertools.product(cur1, cur2, cur3) in this example case.
    # But the star lets us do that unpacking every time no matter how many words are in
    # possible_products.
    #
    # Anyway, the point is we loop through each possible pattern, and if even one of
    # them matches the iambic pentameter pattern, we return True, and otherwise return False.
    detected = False
    for combo in itertools.product(*possible_patterns):
        if ''.join(combo) == iambic_pattern:
            detected = True
            break
    return detected

For fun, let's detect whether we're actually getting iambic pentameter on some classic Shakespeare from Twelfth Night.

In [18]:
poem="""If music be the food of love, play on;
Give me excess of it, that, surfeiting,
The appetite may sicken, and so die.
That strain again! it had a dying fall:
O, it came o'er my ear like the sweet sound,
That breathes upon a bank of violets,
Stealing and giving odour! Enough; no more:
'Tis not so sweet now as it was before."""

print('idx\tiambic pent?\tline text')
for idx, line in enumerate(poem.split('\n')):
    print(idx, '\t', detect_iambic_pentameter(line), '\t', line, '\t')
idx	iambic pent?	line text
0 	 True 	 If music be the food of love, play on; 	
1 	 False 	 Give me excess of it, that, surfeiting, 	
2 	 True 	 The appetite may sicken, and so die. 	
3 	 True 	 That strain again! it had a dying fall: 	
4 	 False 	 O, it came o'er my ear like the sweet sound, 	
5 	 False 	 That breathes upon a bank of violets, 	
6 	 False 	 Stealing and giving odour! Enough; no more: 	
7 	 True 	 'Tis not so sweet now as it was before. 	

3. Textmath

Now that we have some reasonable building blocks, we can begin to calculate some coarse statistics that summarize interesting aspects of a text. In the following problems you will implement functions that calculate two metrics of textual complexity: lexical diversity (as measured by the type-token ratio) and readability (as measured by the Flesch reading ease metric).

a. Complete the function type_token_ratio to return the type-token ratio of the words in a string.

The type-token ratio is a simple metric of lexical diversity determined by the number of word types divided by the number of word tokens. Higher values represent more lexical diversity, with the extreme being a type-token ratio of 1.0, where every word type occurs only once.

Note this is a somewhat problematic metric because it is very sensitive to the length of the text in question, but it gives us a first approximation.

Dictionaries have special methods .keys() and .values() which return list-like objects containing all the dictionary's keys and values, respectively. We'll use those to get the type-token ratio.

For this problem, follow these steps:

  • Get a dictionary of { word : count } using your word_counts function above.
  • Get the type count as the size of the dictionary's keys.
  • Get the token count as the sum of the dictionary's values.
  • Return the type count divided by the token count.
In [19]:
def type_token_ratio(s):
    """Calculate the type-token ratio on the words in a string. The type-token
    ratio is defined as the number of unique word types divided by the number
    of total words.

    Parameters
    ----------
    s : str
        The input string to process.

    Returns
    -------
    float
        A decimal value for the type-token ratio of the words in the string.
    """
    # Delete pass and fill in your function.
    # >>> YOUR ANSWER HERE
    counts = word_counts(s)
    type_count = len(counts.keys())
    token_count = sum(counts.values())
    return type_count / token_count
    # >>> END YOUR ANSWER
In [20]:
tests = [
    ("I do not like them, Sam I am. I do not like green eggs and ham.", 0.6875),
    ("Wait, wait, don't tell me.", 0.8),
    ("Every word is different, special, magical, unique.", 1.0)
]
run_tests(type_token_ratio, tests)
	Running 3 tests on the `type_token_ratio` function...
	All tests passed!

b. Complete the flesch_reading_ease function to implement a readability calculation.

There is a long literature dating back to the early twentieth century and before of people trying to come up with quantitative metrics for how easy or difficult a text is to read. A number of formulas have been devised and tested for this purpose.

For more info: https://en.wikipedia.org/wiki/Readability#Popular_readability_formulas

These sorts of formulas are actually in wide use (overuse?) outside of linguistics, perhaps because they are relatively easy to calculate, but they are very coarse. As an exercise, here we'll implement the "Flesch reading ease" statistic, so common that it is reportedly used by the U.S. Department of Defense and some state governments as a standard of readibility for documents and forms.

The Flesch reading ease is represented by the following equation, where higher values represent easier-to-read documents:

Flesch Reading Ease Equation

The first term is just a constant to establish the scale, the second term is the average sentence length, and the third term is the average word length (in syllables).

For our purposes at the moment we'll assume lines are equal to sentences. Also, when calculating the number of syllables in a word, if the word has multiple pronunciations in cmudict just choose the first one.

I've laid out a structure, you just have to calculate the value of the second and third terms.

In [43]:
def flesch_reading_ease(s):
    """Calculate the Flesch reading ease formula on a string,
    assuming that each line is a sentence.

    Parameters
    ----------
    s : str
        The input string to process.

    Returns
    -------
    float
        A decimal value representing the Flesch reading ease.
    """
    
    lines = [l for l in s.split('\n') if not l.strip() == ''] # skips blank lines so they don't inflate the count
    words = tokenize(s)
    # calculate these two terms
    average_line_length, average_syllables_per_word = 0.0, 0.0 
    # >>> YOUR ANSWER HERE
    average_line_length = len(words) / len(lines)
    
    total_syllables = 0
    total_words = 0
    for w in words:
        if not w.upper() in cmudict: continue # need 
        pron = cmudict[w.upper()][0]
        total_syllables += syllable_count(pron)
        total_words += 1
    average_syllables_per_word = total_syllables / total_words
    # >>> END YOUR ANSWER
    return round(206.835 - (1.015 * average_line_length) - (84.6 * average_syllables_per_word), 2)
In [44]:
try:
    score = flesch_reading_ease(open('shakes.txt').read())
    assert 90.0 < score < 92.0
    print('Correctly calculated Flesch reading ease for Shakespeare: should be around 91.03, you got {}'.format(score))
except Exception as err:
    import traceback, sys
    traceback.print_exc(file=sys.stdout)
    print("Not yet implemented, or not working.")
Correctly calculated Flesch reading ease for Shakespeare: should be around 91.03, you got 91.03

4. Lyrically Applicable

We've spent a bunch of time in this assignment building some tools - let's put them to use! Your data zip contains two files of lyrics, beyonce.txt and taylorswift.txt. These are text files containing (roughly) the entirety of all the lyrics for every Beyonce and Taylor Swift song, respectively. Songs are separated by blank lines.

In this section you'll play around with the functions we have at our disposal to do some comparisons of Beyonce and Taylor, printing results in a human-readable way. You can set this up however you want, but say given some output values bey_val and tsw_val a start might be e.g.:

print('Beyonce:', bey_val)
print('Taylor Swift:', tsw_val)

Or however you want to do it. print can take any number of comma-separated variables which it print out (separated by a space by default, or editable with the sep argument). Also note that strings have a format method, which you can use to insert variables into them. Lots of info on that here.

Skeleton functions and testing code are not provided - do something reasonable and readable. These will get a bit more complicated as we go. Do as much as you want of what's suggested, or do some other sort of analysis you can dream up that's interesting to you. When you're done, there's a spot at the bottom to reflect on what these findings might tell you about Beyonce and Taylor.

Let's load both lyrics files as text to start and look at their basic stats.

In [45]:
beyonce = open('beyonce.txt').read()
taylorswift = open('taylorswift.txt').read()

print("beyonce.txt contains {} songs, {} lines ({} unique), and {} words.".format(
    len([l for l in beyonce.split('\n') if l.strip() == '']) + 1, # blank lines separate songs and albums
    len(beyonce.split('\n')), # line count
    len(set(beyonce.split('\n'))), # unique line count
    len(tokenize(beyonce)))) # word count
print("taylorswift.txt contains {} songs, {} lines ({} unique), and {} words.".format(
    len([l for l in taylorswift.split('\n') if l.strip() == '']) + 1, # blank lines separate songs and albums
    len(taylorswift.split('\n')), # line count
    len(set(taylorswift.split('\n'))), # unique line count
    len(tokenize(taylorswift)))) # word count
beyonce.txt contains 153 songs, 8738 lines (5464 unique), and 59868 words.
taylorswift.txt contains 134 songs, 7088 lines (4661 unique), and 48275 words.

a. Use your type_token_ratio function to calculate who has more lexical diversity according to the type-token ratio.

In [46]:
# >>> YOUR ANSWER HERE
print("Beyonce TTR: ", type_token_ratio(beyonce))
print("Taylor Swift TTR: ", type_token_ratio(taylorswift))
# >>> END YOUR ANSWER
Beyonce TTR:  0.06746508986436828
Taylor Swift TTR:  0.06336613153806318

b. Use your flesch_reading_ease function to calculate who is more "readable" according to the Flesch reading ease formula.

In [47]:
# >>> YOUR ANSWER HERE
print("Beyonce Readability: ", flesch_reading_ease(beyonce))
print("Taylor Swift Readability: ", flesch_reading_ease(taylorswift))
# >>> END YOUR ANSWER
Beyonce Readability:  97.9
Taylor Swift Readability:  98.83

c. Print out the words that Beyonce uses more than 20 times, but Taylor Swift never uses, and vice versa.

In [48]:
# >>> YOUR ANSWER HERE
beyonce_wc = word_counts(beyonce)
tswift_wc = word_counts(taylorswift)
print("Beyonce's tops that Taylor Swift don't touch:")
for word in beyonce_wc:
    if beyonce_wc[word] >= 20 and tswift_wc[word] == 0:
        print('\t', beyonce_wc[word], '\t', word)
        
print("\nTaylor Swift's tops that Beyonce don't touch:")
for word in tswift_wc:
    if tswift_wc[word] >= 20 and beyonce_wc[word] == 0:
        print('\t', tswift_wc[word], '\t', word)

# >>> END YOUR ANSWER
Beyonce's tops that Taylor Swift don't touch:
	 23 	 b
	 26 	 woo
	 28 	 y'all
	 79 	 yo
	 20 	 vu
	 33 	 upgrade
	 24 	 drip
	 50 	 i’m
	 24 	 chick
	 20 	 clap
	 33 	 bodied
	 33 	 diva
	 26 	 di
	 25 	 hustla
	 31 	 mi
	 39 	 kick
	 35 	 don’t
	 25 	 suga
	 30 	 creole
	 20 	 www
	 31 	 de
	 20 	 9
	 51 	 que
	 34 	 tu
	 29 	 smack
	 33 	 woah
	 30 	 lo
	 22 	 te
	 26 	 voy
	 49 	 slay
	 32 	 motherland

Taylor Swift's tops that Beyonce don't touch:
	 46 	 knows
	 40 	 daylight
	 21 	 getaway
	 38 	 woods
	 22 	 starlight
	 21 	 wonderland

d. Write a function to find the proportion of adjacent lines that rhyme (but aren't just the same word), and use it to calculate who rhymes more frequently.

Do this by looping through lines looking at the final word of the line, but keeping track of what the previous line's final word was. I recommend creating a variable like prev_word before the loop, then getting the cur_word within each iteration of the loop, and then before looping again assign prev_word = cur_word.

Remember that songs and albums are separated by blank lines, so you'll want to reset prev_word to blank if you encounter a blank line.

In [49]:
# >>> YOUR ANSWER HERE
def adjacent_rhymes(s):
    rhyme, norhyme = 0, 0
    prev_word = ''
    for l in s.split('\n'):
        if l.strip() == '': 
            prev_word = ''
            continue
        cur_word = tokenize(l)[-1]
        if words_rhyme(cur_word, prev_word) and not cur_word == prev_word:
            rhyme +=1
        else:
            norhyme += 1
        prev_word = cur_word
    return(rhyme / (rhyme+norhyme))
# >>> END YOUR ANSWER

print("Beyonce's rhyminess: ", round(adjacent_rhymes(beyonce),3))
print("Taylor Swift's rhyminess: ", round(adjacent_rhymes(taylorswift),3))
Beyonce's rhyminess:  0.07
Taylor Swift's rhyminess:  0.078

e. Calculate differences in proportional word usage.

This one is a bit more complicated. To do this, first write a function to convert the values in a Counter dictionary to proportions instead of integer counts.

So normally a Counter maps from words to integer occurrences; to convert it to proportions, first get the total count of values in the dictionary, then loop through the keys and assign the value to its count divided by the total, and finally return the dictionary back.

Apply this function to word_counts dictionaries from Beyonce and Taylor. Since they're still instances of Counter, they have access to a useful method called subtract. Look it up, and use it to calculate delta dictionaries (Beyonce's proportions minus Taylor Swift's, and Taylor Swift's proportions minus Beyonce's).

So say if Beyonce uses "howdy" 0.05 of the time, and Taylor Swift uses it 0.02 of the time, we want to calculate that "howdy" is 0.03 in Bob's delta dictionary, and -0.03 in Taylor Swift's delta dictionary.

Then you can use the most_common method on Beyonce's delta dictionary to show the top n words which Beyonce uses disproportionately more than Taylor Swift, and vice versa using Taylor Swift's delta dictionary.

In [50]:
# >>> YOUR ANSWER HERE
def counter_to_proportion(d):
    total = sum(d.values())
    for key in d:
        d[key] = d[key] / total
    return d

beyonce_wc = word_counts(beyonce)
tswift_wc = word_counts(taylorswift)
beyonce_prop = counter_to_proportion(beyonce_wc)
tswift_prop = counter_to_proportion(tswift_wc)
beyonce_prop.subtract(tswift_prop)
tswift_prop.subtract(beyonce_prop)

print("Beyonce's proportional tops:")
for word, val in beyonce_prop.most_common(10):
    print('\t', round(val, 3), '\t', word)

print("\nTaylor Swift's proportional tops:")
for word, val in tswift_prop.most_common(10):
    print('\t', round(val, 3), '\t', word)
# >>> END YOUR ANSWER
Beyonce's proportional tops:
	 0.006 	 me
	 0.005 	 no
	 0.004 	 love
	 0.004 	 on
	 0.004 	 my
	 0.004 	 up
	 0.003 	 baby
	 0.003 	 can
	 0.003 	 ain't
	 0.003 	 let

Taylor Swift's proportional tops:
	 0.054 	 i
	 0.048 	 you
	 0.042 	 and
	 0.041 	 the
	 0.019 	 in
	 0.018 	 a
	 0.016 	 to
	 0.015 	 of
	 0.013 	 it
	 0.012 	 but

f. Anything else you want! If you want, play around and calculate any quantity of interest that might come to mind.

In [27]:
# >>> YOUR ANSWER HERE

# >>> END YOUR ANSWER

g. Conclusion. Qualitatively, what have we learned about the differences between Beyonce and Taylor Swift's lyrics from the above?

Beyonce is potentially less readable with more diverse vocabulary, but interestingly nevertheless rhymes more than Taylor. They have distinctive regional styles (morning' and nothin' vs. honky and boogie), and Beyonce talks about third person stories while Taylor talks about her own experience.