FirstLines : Word Frequency Analysis of Poetry First Lines

FirstLines is a Python 3.x project presented as a Jupyter Notebook. Links to the code and input files, as well as a link to the live Jupyter Notebook, can be found at the bottom of the post.

This project was inspired by the book "Writing the Life Poetic" by Sage Cohen. Her book is filled with interesting ideas for seeing things differently, which have inspired creativity in my writing, both poetry and prose. One of her suggestions is to use other people's poem titles as jumping-off points for your own ideas. She has many other suggestions for creativity, and I recommend her book.

This got me thinking about the process of writing. To me, a poem or story title seems to be mostly an afterthought. What I was really interested in was the first line: the first line of a poem, the first line of a book. Anyone who has sat in front of a blank page, with an assignment due or a project to be started, understands that this beginning line is often the hardest part.

This led me to look at first lines: first lines of novels or, for this project, first lines of poems. I decided this would be a good project for creating another Jupyter Notebook and continuing to practice my Python coding.

Program: Python 3.6.2
Input: 6 poets, 225 poems randomly selected from each. A total of 1350 first lines.
Program Operation:
-- Read in 225 first lines from each poet.
-- Perform a word frequency analysis on the words used by each poet in their first lines.
Program Output: A CSV file for each poet showing the frequency analysis of words used in the first lines.
Analysis: The focus of this project was really the Python programming and the Jupyter Notebook, but I also did some simple analysis.

Some options are available. Common English words can be removed by adding them to the list. A frequency limit controls whether infrequently used words are reported or ignored (0 = report all). The results can be displayed on the screen, written to a file, or both.

In [ ]:
#  Python 3.6.2                                                                        #
#  1) read poetry first line files                                                                      #
#  2) turn each first line input file into a sorted frequency dictionary     #
#  >>>>  All code released as open source with no usage restrictions     #

# used to strip punctuation marks
import string

### ------ options ------ ###
RemoveCommon = True
# subset of most common words
CommonWords = ['the', 'to', 'of', 'and', 'a']
# optional freq limit, set to 0 to output all
optionalQty = 0
# optional output
GenerateScreenOutput = False
GenerateFileOutput = True

For this exercise the input filename and poet names are hardcoded.

In [ ]:
# input file names
f_rossetti = "225 First Lines from Christina Georgina Rossetti.txt"
f_dickinson = "225 First Lines from Emily Dickinson.txt"
f_longfellow = "225 First Lines from Henry Wadsworth Longfellow.txt"
f_emerson = "225 First Lines from Ralph Waldo Emerson.txt"
f_teasdale = "225 First Lines from Sara Teasdale.txt"
f_whitman = "225 First Lines from Walt Whitman.txt"

# strings for output display
rossetti = "Christina Georgina Rossetti"
dickinson = "Emily Dickinson"
longfellow = "Henry Wadsworth Longfellow"
emerson = "Ralph Waldo Emerson"
teasdale = "Sara Teasdale"
whitman = "Walt Whitman"

Functions: the comments describe the actions

In [ ]:
# read a file, return the contents in a string
def readInputFile(fileName):
    # the with block closes the file once it has been read
    with open(fileName, 'r') as f:
        textInputString = f.read()
    return textInputString

# takes a string, splits it and turns it into a list, returning the list
def stringToList(inputString):
    inList = inputString.split()
    return inList

#  takes a list of words and turns it into a frequency dictionary, returning the dictionary 
def listToFreqDict(inputList):
    wordFreq = [inputList.count(p) for p in inputList]
    return dict(zip(inputList,wordFreq))

# sorts a frequency dictionary in descending order
def sortFreqDict(inputFreqDict):
    result = [(inputFreqDict[key], key) for key in inputFreqDict]
    # highest count first; ties fall back to reverse alphabetical order
    result.sort(reverse=True)
    return result

# convert the input list into a sorted freq dictionary    
def getSortedFreqDict(inLst):
    tmpFreqDict = listToFreqDict(inLst)
    tmpSortedFreqDict = sortFreqDict(tmpFreqDict)
    return tmpSortedFreqDict
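
As an aside, the standard library's collections.Counter can build the same kind of descending (count, word) list in a single pass over the words, which avoids calling count() once per word. This is only a sketch of an alternative, not the notebook's code; the name counter_sorted_freq and the sample words are made up for illustration.

```python
from collections import Counter

def counter_sorted_freq(words):
    """Return (count, word) tuples, highest count first (hypothetical helper)."""
    counts = Counter(words)
    # Sort on the whole (count, word) tuple so ties are ordered
    # the same way a reverse sort of (count, word) pairs would order them.
    return sorted(((n, w) for w, n in counts.items()), reverse=True)

# Made-up sample input, not taken from the project's data files
sample = "hope is the thing with feathers hope sings".split()
print(counter_sorted_freq(sample)[:2])
```

For 225 short lines per poet the speed difference is unlikely to matter; the main benefit is a shorter function.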

Functions: the comments describe the text cleanup actions

In [ ]:
# text processing is messy business, many times custom cleanup is required 
# clean up the string and return a list
def CleanupInput(inStr):
    # change strings to lower case    
    inStr = inStr.lower()
    #clean up some dashes and apostrophes
    inStr = inStr.replace("'", "")
    inStr = inStr.replace("-", "")
    # change string into a list
    words = inStr.split()
    #strip out punctuation    
    tempList = [w.strip( string.punctuation) for w in words]
    #remove all the common words 
    cleanList = []    
    if RemoveCommon:
        for item in tempList:
            if not(item in CommonWords):
                cleanList.append(item)
        return cleanList
    # common-word removal is switched off, return every word
    return tempList

Functions: the comments describe the screen and file output code

In [ ]:
def generateScreenOutput(poetName, poetSortedDict):
    for item in poetSortedDict: 
        tmpstr = str(item)
        #remove all python dictionary characters
        tmpstr = tmpstr.replace("(", "")
        tmpstr = tmpstr.replace(")", "")   
        tmpstr = tmpstr.replace("'", "")
        tmpstr = tmpstr.replace('"', "")
        #optional freq output
        tmplst = tmpstr.split(',')
        if ( int(tmplst[0]) >= optionalQty):
            print (tmpstr)

def generateFileOutput(poetName, poetSortedDict):
    filename = 'firstline_frequency_' + poetName + '.csv'
    # newline='' writes the '\r\n' endings exactly as given on every platform;
    # the with block closes the file so every row is flushed to disk
    with open(filename, "w", newline='') as f:
        f.write(poetName + '\r\n' )
        f.write('usage count' + ' , ' + 'word used' + '\r\n')
        for item in poetSortedDict: 
            tmpstr = str(item)
            #remove all python tuple characters
            tmpstr = tmpstr.replace("(", "")
            tmpstr = tmpstr.replace(")", "")   
            tmpstr = tmpstr.replace("'", "")
            tmpstr = tmpstr.replace('"', "")
            #optional freq output
            tmplst = tmpstr.split(',')
            if ( int(tmplst[0]) >= optionalQty):
                f.write(tmpstr +'\r\n') 
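
The file output above builds each row by converting a tuple to a string and stripping characters from it. A possible alternative is to let the standard csv module format the rows. The sketch below assumes the same (count, word) pairs as input; the function name write_freq_rows, its min_count parameter and the sample data are hypothetical, and it writes to an in-memory buffer so it can be run without touching the disk.

```python
import csv
import io

def write_freq_rows(out_file, poet_name, sorted_freq, min_count=0):
    """Write (count, word) pairs as CSV rows to an open text file (hypothetical helper)."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow([poet_name])
    writer.writerow(["usage count", "word used"])
    for count, word in sorted_freq:
        # Same idea as optionalQty: skip words used fewer than min_count times
        if count >= min_count:
            writer.writerow([count, word])

# Made-up sample rows, written to memory instead of a real file
buffer = io.StringIO()
write_freq_rows(buffer, "Emily Dickinson", [(2, "hope"), (1, "thing")], min_count=2)
print(buffer.getvalue())
```

To write a real file, the same function could be passed a file opened with open(filename, "w", newline=""), which is how the csv documentation recommends opening files for a writer.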

The main section of the code. Individual lists and dictionaries were hard-coded to make debugging easier. It would be easy to make all this code generic. This exercise will be left to the reader.

In [ ]:
## read the files
s_rossetti = readInputFile(f_rossetti)
s_dickinson = readInputFile(f_dickinson)
s_longfellow = readInputFile(f_longfellow)
s_emerson = readInputFile(f_emerson)
s_teasdale = readInputFile(f_teasdale)
s_whitman  = readInputFile(f_whitman )

##  clean up the input and turn it into a list
l_rossetti = CleanupInput(s_rossetti)
l_dickinson = CleanupInput(s_dickinson)
l_longfellow = CleanupInput(s_longfellow)
l_emerson = CleanupInput(s_emerson)
l_teasdale = CleanupInput(s_teasdale)
l_whitman  = CleanupInput(s_whitman)

## get the sorted freq dictionary for each poet
sfd_rossetti = getSortedFreqDict(l_rossetti)
sfd_dickinson = getSortedFreqDict(l_dickinson)
sfd_longfellow = getSortedFreqDict(l_longfellow)
sfd_emerson = getSortedFreqDict(l_emerson)
sfd_teasdale = getSortedFreqDict(l_teasdale)
sfd_whitman  = getSortedFreqDict(l_whitman )

Output results code.

In [ ]:
## print out to the console
if (GenerateScreenOutput):
    generateScreenOutput( rossetti, sfd_rossetti)
    generateScreenOutput( dickinson, sfd_dickinson)
    generateScreenOutput( longfellow, sfd_longfellow)
    generateScreenOutput( emerson, sfd_emerson)
    generateScreenOutput( teasdale, sfd_teasdale)
    generateScreenOutput( whitman, sfd_whitman)

## write out to a CSV file
if (GenerateFileOutput):
    generateFileOutput( rossetti, sfd_rossetti)
    generateFileOutput( dickinson, sfd_dickinson)
    generateFileOutput( longfellow, sfd_longfellow)
    generateFileOutput( emerson, sfd_emerson)
    generateFileOutput( teasdale, sfd_teasdale)
    generateFileOutput( whitman, sfd_whitman)
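
For readers who want to try the generic version, one way to approach it is to keep the poets in a dictionary and loop over it, rather than naming a variable per poet. The following is a minimal sketch under that assumption, not the notebook's actual code: analyze_poets, COMMON_WORDS and the sample lines are all made up, and the text is passed in directly so the example runs without the input files. Unlike CleanupInput, it also drops words that become empty after punctuation is stripped.

```python
import string
from collections import Counter

# Same five words as the CommonWords option, stored as a set for fast lookups
COMMON_WORDS = {"the", "to", "of", "and", "a"}

def analyze_poets(texts_by_poet, common_words=COMMON_WORDS):
    """Map each poet name to a descending list of (count, word) pairs."""
    results = {}
    for poet, text in texts_by_poet.items():
        # Lower-case, then remove apostrophes and dashes as CleanupInput does
        cleaned = text.lower().replace("'", "").replace("-", "")
        words = [w.strip(string.punctuation) for w in cleaned.split()]
        # Drop common words and any empty strings left after stripping
        words = [w for w in words if w and w not in common_words]
        counts = Counter(words)
        results[poet] = sorted(((n, w) for w, n in counts.items()), reverse=True)
    return results

# Made-up lines standing in for the contents of the input files
samples = {
    "Poet A": "The sea, the sea!",
    "Poet B": "A rose and a song",
}
for poet, freq in analyze_poets(samples).items():
    print(poet, freq)
```

With real data, the dictionary values would be the strings returned by readInputFile, and adding a seventh poet would mean adding one dictionary entry instead of four new lines.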

Depending on the options set above, the output will scroll on the screen, be written as a CSV (comma-separated values) file that can be opened in a spreadsheet, or both.

Posting a Jupyter Notebook to Blogger is a challenge. I haven't figured out how to post the live/active notebook itself, only an HTML version. Even the HTML version required removing all style and MathJax configuration code. I also removed the top and bottom HTML tags.
The actual Jupyter Notebook file can be found on my GitHub Page.
A live viewer for the notebook can be found on the Jupyter Site. All code, files and Jupyter notebooks are open source and can be used without permission.
Once the CSV files have been created, it is up to the user to use other tools to analyze the data in whatever way they see fit. My focus was on the software coding, but I did the following quick analysis.

I removed the 50 most common English words and then graphed the frequency of the words. The graph above shows the remaining words still conform to Zipf's Law.
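
Zipf's Law says that a word's frequency is roughly proportional to 1 / rank, so a plot of log(frequency) against log(rank) should be close to a straight line with a slope near -1. A rough numeric version of that check could look like the sketch below. The function name zipf_slope is hypothetical, and the counts are an idealized made-up series rather than the project's results.

```python
import math

def zipf_slope(counts):
    """Least-squares slope of log(count) against log(rank), ranks starting at 1.

    Assumes at least two counts, all greater than zero.
    """
    ordered = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(c) for c in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Idealized made-up counts that follow count = 1200 / rank exactly
ideal = [1200 / rank for rank in range(1, 51)]
print(round(zipf_slope(ideal), 3))
```

Feeding in the count column from one of the CSV files would give a slope for that poet; real word counts will not sit exactly on the line, so a value only loosely near -1 is the most that should be expected.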

Finally, I removed the 100 most common words, sorted the remaining words in usage order, and present them here in stanza form as my homage to Sage Cohen for her book Writing the Life Poetic, which I continue to enjoy.


Are love was night am long
Oh heart never said life thought
Sea thy old had summer thou sun
Little still sweet thee came
Far has song spring were dead through wind
Beauty heard let upon where alone earth gone
Heaven man once why bird

Dream eyes many tell woman before days
Death each face god king last light morning
Rose saw shall should sleep these today
While world again beautiful down here into land
Mine sat soul stars too went within
Cannot cold dear did dreamed first flowers garden
Look much sing snow

Tale times trees word yet April
Autumn born bring call city die
Evening every gave grass green hand
Hope hour joy leaves live lost may
Moon music must roses set south spirit
Though town woods young youth air
Always died done fair great high lies lord

Meet men more pale poor proud red sky
Soft stand stood strong such those us whom afraid
Against ago child children dark dawn deep door early end
Fire flower forgotten free friend gray house
Lie maiden morn nature own peace
Perfect place pleasant poet river road
Storm them thing waves weary whose wild wine years