Although I was initially uncomfortable with Python (and with text mining in general), I've found that Python scripts can be pretty efficient at essentially reading things for me. Granted, they can't really analyze text, but they can do the next best thing: itemize it.
I began with a nice, neat .txt file--a diary of a soldier in the Mariposa Battalion--courtesy of The Internet Archive. The Battalion was supposedly the first group of white men in Yosemite Valley; its members evicted Chief Tenaya and the Ahwahneechee tribe in the 1850s, following attacks on nearby trading posts. I used a simple Python script to write an HTML file with the diary's contents.
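The script in question is simple enough to sketch. The function name and filenames below are stand-ins, not the ones I actually used: it just reads the plain-text diary and wraps it in a bare-bones HTML page.

```python
def txt_to_html(txt_path, html_path):
    """Read a plain-text file and write its contents into a minimal HTML page."""
    with open(txt_path, encoding='utf-8') as f:
        contents = f.read()
    with open(html_path, 'w', encoding='utf-8') as f:
        # <pre> preserves the diary's original line breaks
        f.write('<html><body><pre>\n')
        f.write(contents)
        f.write('\n</pre></body></html>')
```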
Great, but the data's still pretty crude. It also has some annoying metadata that I can dismiss using a tag-stripping script. The output that prints to Terminal is much cleaner, and all I had to do was write a separate script identifying where the good stuff starts. The real text inexplicably starts at the word 'WTKODTICTIOF,' which is where I point my 'strip_return' script. Notice I've defined 'pageContents' as whatever follows my 'startLoc.'
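The idea behind both steps can be sketched like this. The variable names ('startLoc', 'pageContents') mirror the ones mentioned above, but the function bodies are approximations of my scripts, not the scripts themselves:

```python
import re

def strip_tags(pageContents):
    # crude tag removal: delete anything between angle brackets
    return re.sub(r'<[^>]+>', '', pageContents)

def strip_front_matter(text, marker):
    """Drop everything before the first occurrence of `marker`."""
    startLoc = text.find(marker)
    if startLoc == -1:           # marker not found: keep the text as-is
        return text
    pageContents = text[startLoc:]
    return pageContents
```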
That's all well and good, but the diary is still in long form. By wrapping the stripping script in another function ('mariposa_list'), I can condense the diary into a list of words.
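At its core, 'mariposa_list' does something like the following (a sketch of the splitting step only, leaving out the stripping it wraps):

```python
def mariposa_list(text):
    # split the long-form diary text on whitespace into a list of words
    return text.split()
```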
The next step is normalizing the data, which in this case means making the text entirely lower case (wouldn't want 'Case' and 'case' counted as different words). All I had to do was append '.lower()' to the end of the line that defines the variable 'text.'
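In sketch form, that one-line change looks like this (the sample string is a placeholder for the stripped diary text):

```python
raw = 'The Captain saw the captain'  # stand-in for the stripped diary text
text = raw.lower()                   # the appended .lower() does the normalizing
wordlist = text.split()
```

After the change, 'The' and 'the', or 'Captain' and 'captain', land in the same bucket when counted.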
This text isn't too heavy on punctuation marks, but adding a regular expression to exclude them is never a bad idea. Now the variable 'wordlist' is not simply a list of words split from each other, but a list that refines my tag-stripping function by further removing any non-alphanumeric characters.
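A hedged sketch of that refinement, not my exact script: split the lowercased text on any run of characters that aren't letters, which drops punctuation (and digits) in the same pass.

```python
import re

def make_wordlist(text):
    # split on anything that isn't a lowercase letter, then
    # filter out the empty strings the split leaves behind
    return [w for w in re.split(r'[^a-z]+', text.lower()) if w]
```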
Now that numbers and case differences have been nixed, I've got to add some dictionary functions. The first makes word-frequency pairs from a list of words; the second sorts these pairs by frequency.
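The two dictionary functions might look roughly like this. They're modeled on the Programming Historian's lesson (which calls them 'wordListToFreqDict' and 'sortFreqDict'); my own names and bodies may differ slightly:

```python
def wordlist_to_freq_dict(wordlist):
    # pair each word with the number of times it appears in the list
    return {word: wordlist.count(word) for word in wordlist}

def sort_freq_dict(freqdict):
    # flip each pair to (frequency, word) and sort highest-frequency first
    return sorted(((freq, word) for word, freq in freqdict.items()), reverse=True)
```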
Lastly, I've got to add stopwords. A lot of them. With my modified and more powerful stripping script, my text-to-frequency script becomes unstoppable.
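The stopword step can be sketched like so. The list here is a tiny placeholder; the real one (borrowed from the Programming Historian) runs to a couple hundred words:

```python
# placeholder stopword list; the real list is much longer
stopwords = ['the', 'a', 'an', 'and', 'of', 'to', 'in', 'was', 'we']

def remove_stopwords(wordlist, stopwords):
    # keep only the words that carry some meaning on their own
    return [w for w in wordlist if w not in stopwords]
```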
Not surprisingly, the most frequent words are 'captain,' 'war,' 'trail,' and 'discovery.' By themselves, these word frequencies don't teach me an astounding amount about the diary; however, if I used similar techniques to mine early John Muir writings, the contrast would be pretty stark.
Thanks to The Programming Historian. Without their gentle tutorials, I would have never been able to do anything like this.