And welcome yet again to a new world in which to do programming. We started with the command line, moved on to directly editing .py files with our code using command-line text editors, and now we're here in Jupyter-land. This file is what's called a "Jupyter notebook", a user-friendly and highly readable document that allows us to not only write code but also directly see the outputs in this web browser editor application. Very cool!
So this is another transition but hopefully one that makes your life easier rather than harder. We're very lucky that the Quest infrastructure has an easy setup for Jupyter notebooks already, so we can essentially just go to a URL in a web browser, log in, and directly access our code and files. Check out the resources for this week on the course website for more information about Jupyter.
Instructions for logging in are here: https://kb.northwestern.edu/94116
In this assignment, we won't introduce a huge amount of new material, but we will go deeper on what we do know. We'll become more expert at working with different sorts of dictionaries, reading and writing files, and doing some interesting text processing.
One neat thing about Jupyter is there are two main cell types: code and Markdown. Cells default to code, but can be switched to Markdown in the menus above (Cell > Cell Type > Markdown).
All the normal-looking text in this assignment is in Markdown cells. Markdown is a very simple formatting language that allows you to do pretty text formatting with only ascii text input. No need to dig too deep into it, but there's a cheat sheet here if you're interested: https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet
Anyway, this means you can input your info in a normal text-looking way. Do that in the cell below - double-click to edit formatted Markdown cells, delete the [your_answer] and fill in.
Jupyter has many useful keyboard shortcuts which are worth looking up if you're interested (e.g. as explained in the first four sections here). But I want to make sure you know at least one: Shift+Enter. This 'executes' a given cell. For a code cell, this runs the code. For a Markdown cell, this does formatting on the cell and makes it all pretty. Note you can also do the same thing with the "Run" button in the toolbar at the top. Make sure you execute the cell above to prettify it.
FYI, you downloaded this assignment in a zip file containing some auxiliary data files we'll be working with, so they should also be in your assignment5 directory.
Here are some helper functions that you can feel free to use if they're useful. They strip a string down to certain character sets - so letters_only reduces a string to only alphabetic characters, remove_letters removes all alphabetic characters, analogously for digits_only and remove_digits, and letters_and_spaces_only removes everything but alphabetic characters and spaces. tokenize is a version of our simple tokenizer from the previous assignment which takes in a string and returns a list of words. Leave these as-is but run the cell.
import string, math

def letters_only(s):
    return ''.join([c for c in s if c.isalpha()])

def remove_letters(s):
    return ''.join([c for c in s if not c.isalpha()])

def digits_only(s):
    return ''.join([c for c in s if c.isdigit()])

def remove_digits(s):
    return ''.join([c for c in s if not c.isdigit()])

def letters_and_spaces_only(s):
    return ''.join([c for c in s if (c.isalpha() or c == ' ')])

def tokenize(s):
    return [w.strip(string.punctuation) for w in s.lower().split()]
And here's our old friend run_tests again! Also leave as is but run this cell.
def run_tests(func, tests):
    print('\tRunning {} tests on the `{}` function...'.format(len(tests), func.__name__))
    errors = 0
    for val, ret in tests:
        try:
            if type(val) == tuple:
                assert func(*val) == ret
            else:
                assert func(val) == ret
        except AssertionError:
            print('\t\terror for input {}'.format(val))
            errors += 1
    if errors == 0:
        print('\tAll tests passed!')
Important Note
You have to run the above cells each time you open the notebook to have access to these functions. In fact, I suggest re-running the entire notebook every time you open the assignment, either with Kernel > Restart and Run All in the menu or by clicking the fast-forward icon in the toolbar.
As we've very glancingly touched upon in class, Python is what's called an "object-oriented" programming language. Briefly, this means that everything we use is abstractly a certain type of "object" which has an inherent structure and is inherently associated with certain attributes and methods. Any given variable is an "instance" of that type of object, so if we had a variable my_string = 'hello!', my_string is an instance of a string object and therefore it has all the methods and attributes strings have like lower() and startswith(). More info on the object-oriented paradigm on Wikipedia if you're interested.
In Python the type of an object is called its class, and one great thing Python can do is have sub-classes, which inherit the attributes and methods of their parent class, but add additional or more specific functionality for certain situations.
This is all background to help you understand the two new types of dictionaries we'll be working with here, Counter and defaultdict. These are both subclasses of the basic dictionary class, so you can do anything with them that you can do with a dictionary but they have additional functionality on top of that.
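As a quick check of that subclass relationship, here is a minimal sketch. The variable names (c, has_key, keys) are just for illustration and are not part of the assignment.

```python
from collections import Counter, defaultdict

# Both classes inherit from the built-in dict class.
print(issubclass(Counter, dict))      # True
print(issubclass(defaultdict, dict))  # True

# So ordinary dictionary operations work on their instances.
c = Counter()
c['hello'] = 2
has_key = 'hello' in c
keys = list(c.keys())
print(has_key, keys)
```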
We'll start with Counter.
from collections import Counter
As you can see, Counter is a class available in the collections module. Documentation here. It is a dictionary class tailor-made for, well, counting things. Unlike a normal dictionary, the values in a Counter default to integers set to 0. In the previous assignment we used dictionaries to count, but each time had to check if the key existed (and create it if not). With a Counter, we can simply increment the value for any key (e.g. counts[word] += 1) and it will do all that for us. To see how this works, let's re-implement our word_counts function from last week.
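To make the difference concrete, here is a small sketch that counts the letters of a made-up string both ways. The names plain and counts are illustrative only.

```python
from collections import Counter

# With a plain dict, a missing key must be created before incrementing.
plain = {}
for letter in 'abca':
    if letter not in plain:
        plain[letter] = 0
    plain[letter] += 1

# With a Counter, missing keys behave as if their value were 0.
counts = Counter()
for letter in 'abca':
    counts[letter] += 1

print(plain)         # {'a': 2, 'b': 1, 'c': 1}
print(dict(counts))  # {'a': 2, 'b': 1, 'c': 1}
print(counts['z'])   # 0
```

Note that looking up a missing key such as 'z' returns 0 without adding that key to the Counter.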
word_counts to count the words in a string, using a Counter.
This should be very quick to do and look almost the same as what you did last week, except with fewer lines of code since we can rely on Counter to do some of the cleanup for us as described above. Remember that a tokenize function has been implemented above that you can use.
def word_counts(s):
    """Tokenize the str `s` and accumulate counts in a dictionary for each
    word that appears.

    Parameters
    ----------
    s : str
        The input string.

    Returns
    -------
    dict of { str : int }
        Word counts in the string, with words as keys and their corresponding
        counts as values.
    """
    counts = Counter()
    # >>> YOUR ANSWER HERE
    tokens = tokenize(s)
    for token in tokens:
        counts[token] += 1
    # >>> END YOUR ANSWER
    return counts
tests = [
    ('a a b a b b c a', {'a': 4, 'b': 3, 'c': 1}),
    ('I wish to wish the wish you wish to wish', {'i': 1, 'wish': 5, 'to': 2, 'the': 1, 'you': 1}),
    (' ', {}),
    ("RaZzLe dAzZlE", {'razzle': 1, 'dazzle': 1})
]
run_tests(word_counts, tests)
Counters have several useful built-in methods too. For instance, the most_common() method takes an integer as an argument and returns a list of that many (key, count) tuples for the most common keys in the Counter, ordered from most to least common.
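For example, a short sketch on a made-up list of words (the data here is invented for illustration):

```python
from collections import Counter

# A Counter can also be built directly from any sequence of items.
counts = Counter(['wish', 'to', 'wish', 'the', 'wish', 'to'])

top_two = counts.most_common(2)
print(top_two)  # [('wish', 3), ('to', 2)]
```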
Now for defaultdict.
from collections import defaultdict
Counter above gives every key a default value of the integer 0. defaultdict is similar but more general: it allows us to set the default value to any Python object (which can be arbitrarily complex). Documentation here. Below we'll practice using defaultdict to read in a complicated but fun sort of file: a pronouncing dictionary!
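As a minimal sketch of how this works, the example below groups some made-up words by their first letter. The name groups and the word list are illustrative, not part of the assignment.

```python
from collections import defaultdict

# The argument is a "factory": a callable that builds the default value.
# Here every missing key starts out as an empty list.
groups = defaultdict(list)
for word in ['apple', 'avocado', 'banana']:
    groups[word[0]].append(word)  # no need to create the list first

print(dict(groups))  # {'a': ['apple', 'avocado'], 'b': ['banana']}
```

One difference from Counter worth knowing: merely looking up a missing key in a defaultdict creates that key with its default value.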
We'll be using the CMU Pronouncing Dictionary, the website of which is here. They describe it succinctly as follows:
The Carnegie Mellon University Pronouncing Dictionary is an open-source machine-readable pronunciation dictionary for North American English that contains over 134,000 words and their pronunciations. CMUdict is being actively maintained and expanded. We are open to suggestions, corrections and other input.
The phonemic alphabet it uses is called ARPAbet, which dates from the 1970s and represents the American English section of the IPA in ASCII characters. Vowel phonemes are annotated with a digit representing stress: 0 means no stress, 1 means primary stress, and 2 means secondary stress.
The sociolinguists in the room may note that such a dictionary is (almost by definition) relying upon a semi-arbitrary notion of "standardness" in pronunciation and that it is missing a lot of meaningful variation. The prosody aficionados will notice that its annotations of accent make some semi-arbitrary assumptions about information structure and focus. Fair criticisms all, and more where that came from, but nevertheless this dictionary and others like it have been used to great effect for many very exciting applications like speech recognition, so it's very worth playing with!
The dictionary itself is contained in your data zip as a file called 'cmudict-0.7b', and I highly encourage you to open and look at the file (e.g. in less) and its structure before doing the next problem.
read_cmudict to load the CMU Pronouncing Dictionary into a Python dictionary, using a defaultdict.
In this problem you'll load the CMUdict file into a format we can work with directly in Python. Specifically, each word in CMUdict will be a key in the dictionary, and each value will be a list of the pronunciations that word can take.
Important things to note:
- A for loop over lines in the file has been provided for you with the encoding set up.
- Skip (with continue) all lines for which the first character is a punctuation mark. This removes both the comments in the file and some pronunciations we will not use, like the pronunciations for punctuation marks themselves.
- Separate each word from its pronunciation by split()ing the line.
- Because each value is a list, append the pronunciation, don't assign it.
Possibly the trickiest part of this is that words with multiple pronunciations in the dictionary will appear on multiple separate lines, one pronunciation per appearance. Each instance after the first will have (N) appended to the end of the word, like this:
ACTUALLY AE1 K CH UW2 AH0 L IY0
ACTUALLY(1) AE1 K CH L IY0
ACTUALLY(2) AE1 K SH AH0 L IY0
The goal is to ultimately collect these three lines in the dictionary as a single key-value pair where the key is the string "ACTUALLY" and the value is a list of strings like ["AE1 K CH UW2 AH0 L IY0", "AE1 K CH L IY0", "AE1 K SH AH0 L IY0"].
Play with split() to figure out how to do this. As a hint, note that given a string my_string and a character c, if you do my_string.split(c) but c is not in my_string, you get back a list of length one where the 0th index simply contains my_string.
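The hint can be seen directly on the two forms a word takes in the file. This is a small sketch; the variable names entry and plain are illustrative.

```python
entry = 'ACTUALLY(1)'  # an alternate-pronunciation entry
plain = 'ACTUALLY'     # the first entry for the same word

print(entry.split('('))  # ['ACTUALLY', '1)']
print(plain.split('('))  # ['ACTUALLY']

# Taking index 0 therefore gives the bare word in both cases.
same_word = entry.split('(')[0] == plain.split('(')[0]
print(same_word)  # True
```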
def read_cmudict(cmudict_file='cmudict-0.7b'):
    """Read the CMU Pronouncing Dictionary into a Python dictionary.

    Parameters
    ----------
    cmudict_file : str
        The path to the cmudict file.

    Returns
    -------
    dict of { str : list of str }
        Dictionary mapping words (in ALLCAPS) to lists of strings
        containing their pronunciations from CMUdict.
    """
    cmudict = defaultdict(list)
    for row in open(cmudict_file, encoding='iso-8859-1'):
        # >>> YOUR ANSWER HERE
        # Skip comments and the entries for punctuation marks themselves
        if row[0] in string.punctuation:
            continue
        # The word comes first; everything after the first run of
        # whitespace is the pronunciation
        w, p = row.split(maxsplit=1)
        # Drop any "(N)" suffix marking an alternate pronunciation
        word = w.split('(')[0]
        pron = p.strip()
        cmudict[word].append(pron)
        # >>> END YOUR ANSWER
    return cmudict
cmudict = read_cmudict()
print('CMUdict loaded with {} entries'.format(len(cmudict.keys())))
error = False

# Test the expected number of entries
print('Checking number of entries...')
try:
    assert len(cmudict) == 125002
except:
    error = True
    print("\tCMUdict doesn't contain the right number of entries.")

# Test CMUdict with particular words
print('Running checks on 5 particular words...')
tests = [
    ("COMPUTATIONAL", ["K AA2 M P Y UW0 T EY1 SH AH0 N AH0 L"]),
    ("LINGUISTS", ["L IH1 NG G W IH0 S T S"]),
    ("ARE", ["AA1 R", "ER0"]),
    ("SOMETIMES", ["S AH0 M T AY1 M Z", "S AH1 M T AY2 M Z"]),
    ("WHIMSICAL", ["W IH1 M Z IH0 K AH0 L", "HH W IH1 M Z IH0 K AH0 L"])
]
for val, ret in tests:
    try:
        assert cmudict[val] == ret
    except:
        error = True
        print('\tError on {}, expected {} but got {}'.format(val, ret, cmudict[val]))
if not error:
    print('All tests passed!')
Now that we have our pronouncing dictionary, let's dig a little deeper into some things we can calculate about word pronunciations.
syllable_count to calculate the number of syllables in a CMUdict pronunciation.
Remember that a syllable is defined by the presence of a vowel phoneme, and vowel phonemes in CMUdict have an accent marking on the end.
def syllable_count(p):
    """
    Takes a cmudict pronunciation string and returns the number of syllables.

    Parameters
    ----------
    p : str
        Pronunciation string from CMUdict (ARPAbet phonemes separated by spaces).

    Returns
    -------
    int
        Count of syllables in the pronunciation, defined as the number of items
        in the pronunciation that end with a digit (representing their stress).
    """
    # >>> YOUR ANSWER HERE
    return len(digits_only(p))
    # >>> END YOUR ANSWER

tests = [
    ("K AE1 N", 1),
    ("EH1 N IY0 W AH2 N", 3),
    ("AH2 N D ER0 S T AE1 N D", 3),
    ("DH IH1 S", 1)
]
run_tests(syllable_count, tests)
final_syllable to extract the portion of a CMUdict pronunciation from the final vowel phoneme until the end.
Below we're going to calculate whether two words rhyme, but to do this we need to extract the final syllable in each pronunciation: specifically, the phonemes from the final vowel phoneme through any potential consonant coda phonemes to the end of the word. So for a word like "EXPLAIN", with pronunciation "IH0 K S P L EY1 N", we want to extract "EY1 N", and we could see then that this could rhyme with "RAIN", which also ends in "EY1 N".
This function also has an optional argument, remove_accent, which when True should remove the accent markings from the returned answer. This is because we might otherwise miss, for instance, a rhyme between "EXPLAIN" and "COLTRANE", since "COLTRANE" ends with "EY0 N". So when True, the pronunciations for both should return simply "EY N".
This is somewhat tricky, since we want to find the location of the last accent marking, but potentially return a string without any accents.
One potentially useful built-in function here is enumerate. Documentation here. Any sequence can be wrapped in enumerate to add an index during a loop. For example:
>>> tokens = ['hello', 'dear', 'friends']
>>> for idx, word in enumerate(tokens):
...     print('current index is {} and current word is {}'.format(idx, word))
...
current index is 0 and current word is hello
current index is 1 and current word is dear
current index is 2 and current word is friends
I would do this problem by splitting the pronunciation into a list, doing some processing, and re-joining it at the end.
def final_syllable(p, remove_accent=False):
    """
    Takes a cmudict pronunciation string and returns the portion from
    the last vowel phoneme to the end (inclusive).

    Parameters
    ----------
    p : str
        Pronunciation string from CMUdict.
    remove_accent : bool (optional)
        Whether or not to remove accent markings from the returned phonemes.

    Returns
    -------
    str
        The final portion of the pronunciation, from the final phoneme with an
        accent marking (e.g., that ends in a digit) to the end.
    """
    # >>> YOUR ANSWER HERE
    index_of_last_accent = -1
    phonemes = []
    for i, phoneme in enumerate(p.split()):
        if phoneme[-1] in string.digits:  # these have accent markings
            index_of_last_accent = i
        if remove_accent:
            phoneme = remove_digits(phoneme)
        phonemes.append(phoneme)
    return ' '.join(phonemes[index_of_last_accent:])
    # >>> END YOUR ANSWER
def final_syllable(p, remove_accent=False):
    """
    Takes a cmudict pronunciation string and returns the portion from
    the last vowel phoneme to the end (inclusive).

    Parameters
    ----------
    p : str
        Pronunciation string from CMUdict, comprised of space-separated phonemes.
    remove_accent : bool (optional)
        Whether or not to remove accent markings from the returned phonemes.

    Returns
    -------
    str
        The final portion of the pronunciation, from the final phoneme with an
        accent marking (e.g., that ends in a digit) to the end.
    """
    # >>> YOUR ANSWER HERE
    # so we have to find the LOCATION OF the final vowel (integer index)
    location_of_final_vowel = 0
    # out of all the phonemes (p.split, list of strings)
    for idx, phoneme in enumerate(p.split()):
        # so we have to check each phoneme to see if it's a vowel (phoneme[-1].isdigit)
        if phoneme[-1].isdigit():
            # and keep updating the location of the most recent one
            location_of_final_vowel = idx
    # until we've seen them all, and the most recent one is the last one
    # return the final syllable
    # which is the final vowel to the end (string)
    final_syll = ' '.join(p.split()[location_of_final_vowel:])
    if remove_accent:
        return remove_digits(final_syll)
    else:
        return final_syll
    # >>> END YOUR ANSWER
tests = [
    (("B OW1 G AH0 S"), "AH0 S"),
    (("B AA1 B S L EH2 D"), "EH2 D"),
    (("B AA1 G AH0 L", True), "AH L"),
    (("B OY1 L ER0 M EY2 K ER0", True), "ER"),
    (("B OW1 L SH AH0 V IH2 K S", True), "IH K S")
]
run_tests(final_syllable, tests)
words_rhyme to determine if two words could rhyme.
Now that we can extract the final syllable from a pronunciation, we can calculate if two words rhyme by whether those final syllables match!
Important things to note:
- The keys of cmudict are in all caps, so call .upper() on both input words.
- Set the remove_accent flag when you call final_syllable.
- Each word can have several pronunciations, so compare every pronunciation of one word with every pronunciation of the other in a nested for-loop.
Finally, you should assume the cmudict dictionary variable, loaded above, is in scope and available. This is not always great practice (to use a global variable like cmudict here inside a local function scope), but it's fine for now so you don't have to load cmudict every time you check if things rhyme. Just remember you'll have to have run that cell for this to work.
def words_rhyme(word1, word2):
    """Checks if the word1 and word2 strings could rhyme.

    Parameters
    ----------
    word1, word2 : str
        Input words as strings.

    Returns
    -------
    bool
        True if any pronunciation of word1 shares a final syllable with
        any pronunciation of word2 (ignoring accents), False otherwise.
    """
    # >>> YOUR ANSWER HERE
    if not (word1.upper() in cmudict and word2.upper() in cmudict):
        return False
    for p1 in cmudict[word1.upper()]:
        for p2 in cmudict[word2.upper()]:
            if final_syllable(p1, remove_accent=True) == final_syllable(p2, remove_accent=True):
                return True
    return False
    # >>> END YOUR ANSWER

tests = [
    (("cat", "hat"), True),
    (("LINGUISTICS", "MYSTICS"), True),
    (("MARBLE","MORBLE"), False),
    (("ORANGE", "PURPLE"), False),
    (("EXPLAIN", "COLTRANE"), True)
]
run_tests(words_rhyme, tests)
Not a problem here - the below function I've implemented for you as a demonstration. It uses cmudict to try and calculate whether a line is in iambic pentameter or not.
It has comments explaining what's happening, feel free to read if you want. If you want an extra challenge, delete everything but the comments and try to reimplement this function. If you want an extra extra challenge, reimplement this from scratch!
import itertools

def detect_iambic_pentameter(line):
    """Detects whether a line is in iambic pentameter or not."""
    # Turns the line into a list of tokenized uppercase words
    words = [w.upper() for w in tokenize(line)]

    # Returns False if we can't give a pronunciation for every word
    if not all(w in cmudict for w in words):
        return False

    def stressed_unstressed(p):
        """Calculates sequence of stressed/unstressed syllables"""
        sus = ''
        for c in ''.join(p):
            if c == '0':
                sus += '0'
            elif c.isdigit():
                sus += '1'
        return sus

    # Sets what a proper iambic pentameter pattern would look like
    iambic_pattern = '0101010101'

    # The goal of this section is to accumulate the set of every possible
    # combination of every possible pronunciation for all the words.
    # possible_patterns will be a list of sets with len(possible_patterns) == len(words).
    possible_patterns = []
    for word in words:
        # For each word we create 'cur', a set representing all the
        # stressed-unstressed patterns for the current word.
        cur = set([])
        for p in cmudict[word]:
            sus = stressed_unstressed(''.join(p))
            cur.add(sus)
        # This is a special case to allow single-syllable words to be unstressed
        if cur == set(['1']):
            cur.add('0')
        possible_patterns.append(cur)

    # Now we will loop over all possible combinations of the patterns, which
    # we do using itertools.product(*possible_patterns). itertools.product creates
    # all possible combinations of the sequences given as its arguments.
    #
    # The * is a sometimes tricky operator which can be prepended to sequences
    # to 'unpack' them into arguments. So if we had a three-word line, with sets of
    # pronunciations cur1, cur2, cur3 for each of the 3 words, possible_patterns would be:
    #   [cur1, cur2, cur3]
    # The * operator unpacks these into arguments, so itertools.product(*possible_patterns)
    # would be equivalent to itertools.product(cur1, cur2, cur3) in this example case.
    # But the star lets us do that unpacking every time no matter how many words are in
    # possible_patterns.
    #
    # Anyway, the point is we loop through each possible pattern, and if even one of
    # them matches the iambic pentameter pattern, we return True, and otherwise return False.
    detected = False
    for combo in itertools.product(*possible_patterns):
        if ''.join(combo) == iambic_pattern:
            detected = True
            break
    return detected
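To see the unpacking step on its own, here is a minimal sketch using invented stress patterns for a three-word line. The list possible and its contents are hypothetical and do not come from cmudict.

```python
import itertools

# Hypothetical stress-pattern sets for a three-word line.
possible = [{'0', '1'}, {'01'}, {'1'}]

# The * unpacks the list, so this is the same as
# itertools.product({'0', '1'}, {'01'}, {'1'}).
# Sets have no fixed order, so the results are sorted for a stable display.
combos = sorted(''.join(c) for c in itertools.product(*possible))
print(combos)  # ['0011', '1011']
```

Each combination picks one pattern per word, and joining them gives one candidate stress pattern for the whole line.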
For fun, let's detect whether we're actually getting iambic pentameter on some classic Shakespeare from Twelfth Night.
poem="""If music be the food of love, play on;
Give me excess of it, that, surfeiting,
The appetite may sicken, and so die.
That strain again! it had a dying fall:
O, it came o'er my ear like the sweet sound,
That breathes upon a bank of violets,
Stealing and giving odour! Enough; no more:
'Tis not so sweet now as it was before."""
print('idx\tiambic pent?\tline text')
for idx, line in enumerate(poem.split('\n')):
    print(idx, '\t', detect_iambic_pentameter(line), '\t', line, '\t')
Now that we have some reasonable building blocks, we can begin to calculate some coarse statistics that summarize interesting aspects of a text. In the following problems you will implement functions that calculate two metrics of textual complexity: lexical diversity (as measured by the type-token ratio) and readability (as measured by the Flesch reading ease metric).
type_token_ratio to return the type-token ratio of the words in a string.
The type-token ratio is a simple metric of lexical diversity determined by the number of word types divided by the number of word tokens. Higher values represent more lexical diversity, with the extreme being a type-token ratio of 1.0, where every word type occurs only once.
Note this is a somewhat problematic metric because it is very sensitive to the length of the text in question, but it gives us a first approximation.
Dictionaries have special methods .keys() and .values() which return list-like objects containing all the dictionary's keys and values, respectively. We'll use those to get the type-token ratio.
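As a small sketch of those two methods, here they are applied to a hand-written dictionary of counts (the dictionary is typed in by hand for illustration):

```python
counts = {'wait': 2, "don't": 1, 'tell': 1, 'me': 1}

n_keys = len(counts.keys())     # how many distinct keys there are
total = sum(counts.values())    # the values added together

print(n_keys)  # 4
print(total)   # 5
```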
For this problem, follow these steps:
- Count the words in the string with the word_counts function above.
- Divide the number of types (the number of keys) by the number of tokens (the sum of the dictionary's values).
def type_token_ratio(s):
    """Calculate the type-token ratio on the words in a string. The type-token
    ratio is defined as the number of unique word types divided by the number
    of total words.

    Parameters
    ----------
    s : str
        The input string to process.

    Returns
    -------
    float
        A decimal value for the type-token ratio of the words in the string.
    """
    # >>> YOUR ANSWER HERE
    counts = word_counts(s)
    type_count = len(counts.keys())
    token_count = sum(counts.values())
    return type_count / token_count
    # >>> END YOUR ANSWER
tests = [
    ("I do not like them, Sam I am. I do not like green eggs and ham.", 0.6875),
    ("Wait, wait, don't tell me.", 0.8),
    ("Every word is different, special, magical, unique.", 1.0)
]
run_tests(type_token_ratio, tests)
flesch_reading_ease function to implement a readability calculation.
There is a long literature dating back to the early twentieth century and before of people trying to come up with quantitative metrics for how easy or difficult a text is to read. A number of formulas have been devised and tested for this purpose.
For more info: https://en.wikipedia.org/wiki/Readability#Popular_readability_formulas
These sorts of formulas are actually in wide use (overuse?) outside of linguistics, perhaps because they are relatively easy to calculate, but they are very coarse. As an exercise, here we'll implement the "Flesch reading ease" statistic, so common that it is reportedly used by the U.S. Department of Defense and some state governments as a standard of readability for documents and forms.
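The Flesch reading ease is represented by the following equation, where higher values represent easier-to-read documents:
$$206.835 - 1.015 \left( \frac{\text{total words}}{\text{total sentences}} \right) - 84.6 \left( \frac{\text{total syllables}}{\text{total words}} \right)$$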
The first term is just a constant to establish the scale, the second term is the average sentence length, and the third term is the average word length (in syllables).
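As a worked example of the arithmetic, here is a sketch using made-up totals. The three numbers below are invented for illustration and do not describe any real text.

```python
# Hypothetical text statistics, chosen only for illustration.
total_words, total_sentences, total_syllables = 90, 10, 135

avg_sentence_length = total_words / total_sentences     # 9.0 words per sentence
avg_syllables_per_word = total_syllables / total_words  # 1.5 syllables per word

score = round(206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word, 2)
print(score)  # 70.8
```

Longer sentences and longer words both pull the score down, which is why higher values mean easier reading.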
For our purposes at the moment we'll assume lines are equal to sentences. Also, when calculating the number of syllables in a word, if the word has multiple pronunciations in cmudict just choose the first one.
I've laid out a structure, you just have to calculate the value of the second and third terms.
def flesch_reading_ease(s):
    """Calculate the Flesch reading ease formula on a string,
    assuming that each line is a sentence.

    Parameters
    ----------
    s : str
        The input string to process.

    Returns
    -------
    float
        A decimal value representing the Flesch reading ease.
    """
    lines = [l for l in s.split('\n') if not l.strip() == '']  # skips blank lines so they don't inflate the count
    words = tokenize(s)
    # calculate these two terms
    average_line_length, average_syllables_per_word = 0.0, 0.0
    # >>> YOUR ANSWER HERE
    average_line_length = len(words) / len(lines)
    total_syllables = 0
    total_words = 0
    for w in words:
        # skip words that are not in cmudict, since we can't count their syllables
        if not w.upper() in cmudict:
            continue
        pron = cmudict[w.upper()][0]  # use the first pronunciation
        total_syllables += syllable_count(pron)
        total_words += 1
    average_syllables_per_word = total_syllables / total_words
    # >>> END YOUR ANSWER
    return round(206.835 - (1.015 * average_line_length) - (84.6 * average_syllables_per_word), 2)
try:
    score = flesch_reading_ease(open('shakes.txt').read())
    assert 90.0 < score < 92.0
    print('Correctly calculated Flesch reading ease for Shakespeare: should be around 91.03, you got {}'.format(score))
except Exception as err:
    import traceback, sys
    traceback.print_exc(file=sys.stdout)
    print("Not yet implemented, or not working.")
We've spent a bunch of time in this assignment building some tools - let's put them to use! Your data zip contains two files of lyrics, beyonce.txt and taylorswift.txt. These are text files containing (roughly) all the lyrics for every Beyonce and Taylor Swift song, respectively. Songs are separated by blank lines.
In this section you'll play around with the functions we have at our disposal to do some comparisons of Beyonce and Taylor, printing results in a human-readable way. You can set this up however you want, but say given some output values bey_val and tsw_val a start might be e.g.:
print('Beyonce:', bey_val)
print('Taylor Swift:', tsw_val)
Or however you want to do it. print can take any number of comma-separated values, which it prints out (separated by a space by default, or editable with the sep argument). Also note that strings have a format method, which you can use to insert variables into them. Lots of info on that here.
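A short sketch of both options, using placeholder numbers rather than real results:

```python
bey_val, tsw_val = 0.123, 0.456  # placeholder values, not real results

# sep replaces the default single space between the printed values.
print('Beyonce:', bey_val, 'Taylor Swift:', tsw_val, sep=' | ')

# format fills each {} in order; :.2f rounds to two decimal places.
line = 'Beyonce: {:.2f}, Taylor Swift: {:.2f}'.format(bey_val, tsw_val)
print(line)  # Beyonce: 0.12, Taylor Swift: 0.46
```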
Skeleton functions and testing code are not provided - do something reasonable and readable. These will get a bit more complicated as we go. Do as much as you want of what's suggested, or do some other sort of analysis you can dream up that's interesting to you. When you're done, there's a spot at the bottom to reflect on what these findings might tell you about Beyonce and Taylor.
Let's load both lyrics files as text to start and look at their basic stats.
beyonce = open('beyonce.txt').read()
taylorswift = open('taylorswift.txt').read()
print("beyonce.txt contains {} songs, {} lines ({} unique), and {} words.".format(
    len([l for l in beyonce.split('\n') if l.strip() == '']) + 1,  # blank lines separate songs and albums
    len(beyonce.split('\n')),  # line count
    len(set(beyonce.split('\n'))),  # unique line count
    len(tokenize(beyonce))))  # word count
print("taylorswift.txt contains {} songs, {} lines ({} unique), and {} words.".format(
    len([l for l in taylorswift.split('\n') if l.strip() == '']) + 1,  # blank lines separate songs and albums
    len(taylorswift.split('\n')),  # line count
    len(set(taylorswift.split('\n'))),  # unique line count
    len(tokenize(taylorswift))))  # word count
type_token_ratio function to calculate who has more lexical diversity according to the type-token ratio.
# >>> YOUR ANSWER HERE
print("Beyonce TTR: ", type_token_ratio(beyonce))
print("Taylor Swift TTR: ", type_token_ratio(taylorswift))
# >>> END YOUR ANSWER
flesch_reading_ease function to calculate who is more "readable" according to the Flesch reading ease formula.
# >>> YOUR ANSWER HERE
print("Beyonce Readability: ", flesch_reading_ease(beyonce))
print("Taylor Swift Readability: ", flesch_reading_ease(taylorswift))
# >>> END YOUR ANSWER
# >>> YOUR ANSWER HERE
beyonce_wc = word_counts(beyonce)
tswift_wc = word_counts(taylorswift)
print("Beyonce's tops that Taylor Swift don't touch:")
for word in beyonce_wc:
    if beyonce_wc[word] >= 20 and tswift_wc[word] == 0:
        print('\t', beyonce_wc[word], '\t', word)
print("\nTaylor Swift's tops that Beyonce don't touch:")
for word in tswift_wc:
    if tswift_wc[word] >= 20 and beyonce_wc[word] == 0:
        print('\t', tswift_wc[word], '\t', word)
# >>> END YOUR ANSWER
Do this by looping through lines looking at the final word of the line, but keeping track of what the previous line's final word was. I recommend creating a variable like prev_word before the loop, then getting the cur_word within each iteration of the loop, and then before looping again assign prev_word = cur_word.
Remember that songs and albums are separated by blank lines, so you'll want to reset prev_word to blank if you encounter a blank line.
# >>> YOUR ANSWER HERE
def adjacent_rhymes(s):
    rhyme, norhyme = 0, 0
    prev_word = ''
    for l in s.split('\n'):
        if l.strip() == '':
            prev_word = ''
            continue
        cur_word = tokenize(l)[-1]
        if words_rhyme(cur_word, prev_word) and not cur_word == prev_word:
            rhyme += 1
        else:
            norhyme += 1
        prev_word = cur_word
    return rhyme / (rhyme + norhyme)
# >>> END YOUR ANSWER
print("Beyonce's rhyminess: ", round(adjacent_rhymes(beyonce), 3))
print("Taylor Swift's rhyminess: ", round(adjacent_rhymes(taylorswift), 3))
This one is a bit more complicated. To do this, first write a function to convert the values in a Counter dictionary to proportions instead of integer counts.
So normally a Counter maps from words to integer occurrences; to convert it to proportions, first get the total count of values in the dictionary, then loop through the keys and assign the value to its count divided by the total, and finally return the dictionary back.
Apply this function to word_counts dictionaries from Beyonce and Taylor. Since they're still instances of Counter, they have access to a useful method called subtract. Look it up, and use it to calculate delta dictionaries (Beyonce's proportions minus Taylor Swift's, and Taylor Swift's proportions minus Beyonce's).
So say if Beyonce uses "howdy" 0.05 of the time, and Taylor Swift uses it 0.02 of the time, we want to calculate that "howdy" is 0.03 in Beyonce's delta dictionary, and -0.03 in Taylor Swift's delta dictionary.
Then you can use the most_common method on Beyonce's delta dictionary to show the top words which Beyonce uses disproportionately more than Taylor Swift, and vice versa using Taylor Swift's delta dictionary.
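One detail to watch for is that subtract changes the Counter it is called on and returns None. The sketch below uses invented proportions (the names a, b, and delta and all the numbers are hypothetical) and subtracts from a copy so the original stays intact.

```python
from collections import Counter

# Invented proportions for two speakers, for illustration only.
a = Counter({'howdy': 0.05, 'hello': 0.01})
b = Counter({'howdy': 0.02, 'hi': 0.04})

delta = Counter(a)  # copy first, because subtract() modifies in place
delta.subtract(b)   # returns None; keys only in b end up negative

print(round(delta['howdy'], 2))    # 0.03
print(round(delta['hi'], 2))       # -0.04
print(delta.most_common(1)[0][0])  # howdy
print(a['howdy'])                  # 0.05, the original is unchanged
```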
# >>> YOUR ANSWER HERE
def counter_to_proportion(d):
    total = sum(d.values())
    for key in d:
        d[key] = d[key] / total
    return d

beyonce_wc = word_counts(beyonce)
tswift_wc = word_counts(taylorswift)
beyonce_prop = counter_to_proportion(beyonce_wc)
tswift_prop = counter_to_proportion(tswift_wc)

# subtract() modifies a Counter in place, so subtract from copies; otherwise
# the second delta would be computed from the already-subtracted Beyonce values
beyonce_delta = Counter(beyonce_prop)
beyonce_delta.subtract(tswift_prop)
tswift_delta = Counter(tswift_prop)
tswift_delta.subtract(beyonce_prop)

print("Beyonce's proportional tops:")
for word, val in beyonce_delta.most_common(10):
    print('\t', round(val, 3), '\t', word)
print("\nTaylor Swift's proportional tops:")
for word, val in tswift_delta.most_common(10):
    print('\t', round(val, 3), '\t', word)
# >>> END YOUR ANSWER
# >>> YOUR ANSWER HERE
# >>> END YOUR ANSWER
Beyonce is potentially less readable with more diverse vocabulary, but interestingly nevertheless rhymes more than Taylor. They have distinctive regional styles (morning' and nothin' vs. honky and boogie), and Beyonce talks about third person stories while Taylor talks about her own experience.