FirstLines : Word Frequency Analysis of Poetry First Lines

FirstLines is a Python 3.x project presented as a Jupyter Notebook. Links to the code and input files, as well as a link to the live Jupyter Notebook, can be found at the bottom of the post.

This project was inspired by the book "Writing the Life Poetic" by Sage Cohen. Her book is filled with interesting ideas for seeing things differently, which have inspired creativity in my writing, both poetry and prose. One of her suggestions is to use other people's poem titles as jumping-off points for your own ideas. She also has many other suggestions for creativity and I recommend her book.

This got me thinking about the process of writing. To me, a poem or story title seems to be mostly an afterthought. What I was really interested in was the first line. The first line of a poem, the first line of a book. Anyone who has sat in front of a blank page, with an assignment due or a project to be started, understands that this beginning line is often the hardest part.

This led me to look at first lines: first lines of novels or, for this project, first lines of poems. I decided this would be a good project for creating another Jupyter Notebook and continuing to practice my Python coding.

Program: FirstLines.py Python 3.6.2
Input: 6 poets, 225 poems randomly selected from each. A total of 1350 first lines.
Program Operation:
-- Read in 225 first lines from each poet.
-- Perform a word frequency analysis on the words used by each poet in their first lines.
Program Output: A CSV file for each poet showing the frequency analysis of words used in the first lines.
Analysis: The focus of this project was really the Python programming and the Jupyter Notebook, but I also did some simple analysis.

Some options are available. Common English words can be removed by adding them to the CommonWords list. A frequency limit controls whether infrequently used words are reported or ignored; 0 reports all words. The results can be displayed on the screen, written to a file, or both.

################################################
#  firstlines.py  Python 3.6.2
#  1) read poetry first line files
#  2) turn each first line input file into a sorted frequency dictionary
#  >>>>  All code released as open source with no usage restrictions
################################################

# used to strip punctuation marks
import string

### ------ options ------ ###
RemoveCommon = True
# subset of most common words
CommonWords = ['the', 'to', 'of', 'and', 'a']
# optional freq limit, set to 0 to output all
optionalQty = 0
# optional output
GenerateScreenOutput = False
GenerateFileOutput = True

For this exercise the input filename and poet names are hardcoded.

# input file names
f_rossetti = "225 First Lines from Christina Georgina Rossetti.txt"
f_dickinson = "225 First Lines from Emily Dickinson.txt"
f_longfellow = "225 First Lines from Henry Wadsworth Longfellow.txt"
f_emerson = "225 First Lines from Ralph Waldo Emerson.txt"
f_teasdale = "225 First Lines from Sara Teasdale.txt"
f_whitman = "225 First Lines from Walt Whitman.txt"


# strings for output display
rossetti = "Christina Georgina Rossetti"
dickinson = "Emily Dickinson"
longfellow = "Henry Wadsworth Longfellow"
emerson = "Ralph Waldo Emerson"
teasdale = "Sara Teasdale"
whitman = "Walt Whitman"

Functions: the comments describe the actions

# read a file, return the contents in a string
def readInputFile(fileName):
    # the with statement closes the file even if the read fails
    with open(fileName, 'r') as f:
        textInputString = f.read()
    return textInputString

# takes a string, splits it and turns it into a list, returning the list
def stringToList(inputString):
    inList = inputString.split()
    return inList

#  takes a list of words and turns it into a frequency dictionary, returning the dictionary 
def listToFreqDict(inputList):
    wordFreq = [inputList.count(p) for p in inputList]
    return dict(zip(inputList,wordFreq))

# sorts a frequency dictionary in descending order, returning a list of (count, word) tuples
def sortFreqDict(inputFreqDict):
    result = [(inputFreqDict[key], key) for key in inputFreqDict]
    result.sort()
    result.reverse()
    return result

# convert the input list into a sorted freq dictionary    
def getSortedFreqDict(inLst):
    tmpFreqDict = listToFreqDict(inLst)
    tmpSortedFreqDict = sortFreqDict(tmpFreqDict)
    return tmpSortedFreqDict
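
As an alternative sketch, the standard library's collections.Counter can build the same kind of frequency table in a single pass, where listToFreqDict calls count once for every word in the list. The function name sorted_freq_with_counter and the sample word list are made up for illustration and are not part of the original notebook.

```python
from collections import Counter

def sorted_freq_with_counter(words):
    """Return (count, word) tuples, highest count first (hypothetical helper)."""
    counts = Counter(words)  # one pass over the list
    # Same ordering as sortFreqDict: by count, then by word, both descending
    return sorted(((count, word) for word, count in counts.items()), reverse=True)

# Made-up sample list, not taken from the project's input files
sample = ["hope", "is", "thing", "with", "feathers", "hope", "thing", "hope"]
print(sorted_freq_with_counter(sample))
# [(3, 'hope'), (2, 'thing'), (1, 'with'), (1, 'is'), (1, 'feathers')]
```

For 225 short lines per poet the speed difference hardly matters, but the single pass scales better to larger inputs.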

Functions: the comments describe the text clean up actions

# text processing is messy business, many times custom cleanup is required 
# clean up the string and return a list
def CleanupInput(inStr):
    # change strings to lower case    
    inStr = inStr.lower()
    #clean up some dashes and apostrophes
    inStr = inStr.replace("'", "")
    inStr = inStr.replace("-", "")
    # change string into a list
    words = inStr.split()
    #strip out punctuation    
    tempList = [w.strip( string.punctuation) for w in words]
    #remove all the common words 
    if RemoveCommon:
        return [item for item in tempList if item not in CommonWords]
    return tempList
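
To see what the cleanup does to a single line, here is a small worked sketch that mirrors the same steps. The helper name clean_line, the constant COMMON_WORDS and the sample sentence are assumptions made for this illustration; the sentence is not one of the project's first lines.

```python
import string

# Same subset of common words as the CommonWords option above
COMMON_WORDS = ['the', 'to', 'of', 'and', 'a']

def clean_line(line, remove_common=True):
    """Mirror the CleanupInput steps on one line of text (hypothetical helper)."""
    line = line.lower()
    # apostrophes and dashes are deleted, so "poet's" becomes "poets"
    line = line.replace("'", "").replace("-", "")
    # strip punctuation from both ends of every word
    words = [w.strip(string.punctuation) for w in line.split()]
    if remove_common:
        words = [w for w in words if w not in COMMON_WORDS]
    return words

# Made-up sample line for illustration
print(clean_line("The well-known Poet's song, and a sigh!"))
# ['wellknown', 'poets', 'song', 'sigh']
```

Note the side effects: hyphenated words are joined into one word and possessives lose their apostrophe, which is worth remembering when reading the frequency tables.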

Functions: the comments describe the screen and file output code

def generateScreenOutput(poetName, poetSortedDict):
    print(poetName)
    for item in poetSortedDict: 
        tmpstr = str(item)
        #remove the tuple punctuation added by str()
        tmpstr = tmpstr.replace("(", "")
        tmpstr = tmpstr.replace(")", "")   
        tmpstr = tmpstr.replace("'", "")
        tmpstr = tmpstr.replace('"', "")
        #optional freq output
        tmplst = tmpstr.split(',')
        if ( int(tmplst[0]) >= optionalQty):
            print (tmpstr)
    print()          
   
    
def generateFileOutput(poetName, poetSortedDict):
    filename = 'firstline_frequency_' + poetName + '.csv'
    # newline='' writes the explicit '\r\n' endings unchanged on every platform
    with open(filename, "w", newline='') as f:
        f.write(poetName + '\r\n')
        f.write('usage count' + ' , ' + 'word used' + '\r\n')
        for item in poetSortedDict:
            tmpstr = str(item)
            #remove the tuple punctuation added by str()
            tmpstr = tmpstr.replace("(", "")
            tmpstr = tmpstr.replace(")", "")
            tmpstr = tmpstr.replace("'", "")
            tmpstr = tmpstr.replace('"', "")
            #optional freq output
            tmplst = tmpstr.split(',')
            if int(tmplst[0]) >= optionalQty:
                f.write(tmpstr + '\r\n')
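
The file output could also be sketched with Python's built-in csv module, which handles the commas, quoting and line endings so the tuples do not have to be converted to strings and cleaned by hand. The function name write_freq_csv, its min_count parameter and the demo file name are hypothetical; this is one possible approach, not the notebook's code.

```python
import csv
import os
import tempfile

def write_freq_csv(filename, poet_name, sorted_freq, min_count=0):
    """Write (count, word) tuples to a CSV file (hypothetical helper)."""
    # newline='' lets the csv module control the line endings itself
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([poet_name])
        writer.writerow(["usage count", "word used"])
        for count, word in sorted_freq:
            if count >= min_count:  # plays the role of optionalQty
                writer.writerow([count, word])

# Demo with made-up data, written to the system temp directory
demo_path = os.path.join(tempfile.gettempdir(), "firstline_frequency_demo.csv")
write_freq_csv(demo_path, "Demo Poet", [(3, "hope"), (1, "sky")], min_count=2)

with open(demo_path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows)
# [['Demo Poet'], ['usage count', 'word used'], ['3', 'hope']]
```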

The main section of the code. Individual lists and dictionaries were hard-coded to make debugging easier. It would be easy to make all this code generic. This exercise will be left to the reader.

## read the files
s_rossetti = readInputFile(f_rossetti)
s_dickinson = readInputFile(f_dickinson)
s_longfellow = readInputFile(f_longfellow)
s_emerson = readInputFile(f_emerson)
s_teasdale = readInputFile(f_teasdale)
s_whitman  = readInputFile(f_whitman )

##  clean up the input and turn it into a list
l_rossetti = CleanupInput(s_rossetti)
l_dickinson = CleanupInput(s_dickinson)
l_longfellow = CleanupInput(s_longfellow)
l_emerson = CleanupInput(s_emerson)
l_teasdale = CleanupInput(s_teasdale)
l_whitman  = CleanupInput(s_whitman)

## get the sorted freq dictionary for each poet
sfd_rossetti = getSortedFreqDict(l_rossetti)
sfd_dickinson = getSortedFreqDict(l_dickinson)
sfd_longfellow = getSortedFreqDict(l_longfellow)
sfd_emerson = getSortedFreqDict(l_emerson)
sfd_teasdale = getSortedFreqDict(l_teasdale)
sfd_whitman  = getSortedFreqDict(l_whitman )
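
For readers who want to try the generic version, one possible sketch keeps the poets in a dictionary that maps each display name to its input file and loops over it. The names first_line_frequencies, analyze_poets and COMMON_WORDS are hypothetical, the sketch uses collections.Counter in place of listToFreqDict, and the demo file is a throwaway stand-in for the real input files.

```python
import os
import string
import tempfile
from collections import Counter

# Same subset of common words as the CommonWords option
COMMON_WORDS = ['the', 'to', 'of', 'and', 'a']

def first_line_frequencies(path):
    """Read one input file and return (count, word) tuples, highest first."""
    with open(path, 'r') as f:
        text = f.read().lower().replace("'", "").replace("-", "")
    words = [w.strip(string.punctuation) for w in text.split()]
    counts = Counter(w for w in words if w not in COMMON_WORDS)
    return sorted(((c, w) for w, c in counts.items()), reverse=True)

def analyze_poets(poet_files):
    """Map each poet name to its sorted frequency list."""
    return {poet: first_line_frequencies(path) for poet, path in poet_files.items()}

# Demo with a throwaway file; the notebook would pass its six file names instead
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "demo_first_lines.txt")
with open(demo_file, "w") as f:
    f.write("The sea, the sea!\nA sea of stars\n")

results = analyze_poets({"Demo Poet": demo_file})
print(results["Demo Poet"])
# [(3, 'sea'), (1, 'stars')]
```

Adding a seventh poet then only means adding one entry to the dictionary.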

The code that outputs the results.

## print out to the console
if (GenerateScreenOutput):
    generateScreenOutput( rossetti, sfd_rossetti)
    generateScreenOutput( dickinson, sfd_dickinson)
    generateScreenOutput( longfellow, sfd_longfellow)
    generateScreenOutput( emerson, sfd_emerson)
    generateScreenOutput( teasdale, sfd_teasdale)
    generateScreenOutput( whitman, sfd_whitman)

#write out to a CSV file
if (GenerateFileOutput):
    generateFileOutput( rossetti, sfd_rossetti)
    generateFileOutput( dickinson, sfd_dickinson)
    generateFileOutput( longfellow, sfd_longfellow)
    generateFileOutput( emerson, sfd_emerson)
    generateFileOutput( teasdale, sfd_teasdale)
    generateFileOutput( whitman, sfd_whitman)

Depending on the options, the output will scroll on the screen, be written as a CSV (comma separated value) file that can be opened in a spreadsheet, or both.
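
The CSV files can also be loaded back into Python instead of a spreadsheet. The sketch below assumes the layout written by generateFileOutput (poet name on the first line, a header on the second, then one "count, word" pair per line); the function name read_freq_csv and the demo file are made up for illustration.

```python
import os
import tempfile

def read_freq_csv(path):
    """Return (poet name, {word: count}) from one output file (hypothetical helper)."""
    with open(path, newline='') as f:
        # drop blank lines that stray carriage returns can leave behind
        lines = [ln for ln in f.read().splitlines() if ln.strip()]
    poet = lines[0]
    freq = {}
    for line in lines[2:]:  # skip the poet name and the header line
        count, word = line.split(',', 1)
        freq[word.strip()] = int(count)
    return poet, freq

# Demo file with made-up data in the same layout as the program output
demo_path = os.path.join(tempfile.gettempdir(), "firstline_frequency_read_demo.csv")
with open(demo_path, "w", newline='') as f:
    f.write("Demo Poet\r\nusage count , word used\r\n3, sea\r\n1, stars\r\n")

poet, freq = read_freq_csv(demo_path)
print(poet, freq)
# Demo Poet {'sea': 3, 'stars': 1}
```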

Posting a Jupyter Notebook to Blogger is a challenge. I haven't figured out how to post the live/active notebook itself, only an HTML version. Even the HTML version required removing all style and MathJax configuration code. I also removed the top and bottom HTML tags.

The actual Jupyter Notebook file can be found on my GitHub Page.
A live viewer for the notebook can be found on the Jupyter Site. All code, files and Jupyter notebooks are open source and can be used without permission.

Once the CSV files have been created, it is up to the user to analyze the data with other tools in whatever way they see fit. My focus was on the software coding, but I did the following quick analysis.
