Topic Modeling

Sometimes, I ask very basic questions just to get the ball rolling. With topic modeling, the first one that came to mind was: just where does this lie on the digital methods family tree?

Unlike many inquiries in my DH seminar, this one had a simple answer: topic modeling is a specific kind of text mining. After all the fairly straightforward things I'd done to texts--stripping tags, adding stop words, counting word frequency--topic modeling seemed like a step up.
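
For anyone who hasn't done those preliminary steps, here is a minimal sketch of what they might look like in Python. It is not the code I actually ran; the stop word list, the function names, and the sample sentence are all made up for illustration.

```python
import re
from collections import Counter

# Illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}

def strip_tags(text):
    """Replace anything that looks like an HTML/XML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def word_frequencies(text, stop_words=STOP_WORDS):
    """Strip tags, lowercase, drop stop words, and count what remains."""
    words = re.findall(r"[a-z']+", strip_tags(text).lower())
    return Counter(w for w in words if w not in stop_words)

# Hypothetical sample text, not a real quotation.
sample = "<p>The ice carved the valley, and the ice is gone.</p>"
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

Counting like this only says how often each word appears; topic modeling goes further and looks at which words tend to appear together.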

When I first installed MALLET, I followed a great tutorial (with pre-set sample data) from Programming Historian. I felt great power but little comprehension, so I decided to try a familiar text: a description John Muir wrote of the features contained within the (then-proposed) Yosemite National Park.

The results were not especially comprehensible. On one hand, there is value in knowing that 'ice,' 'years,' and 'glacial' tend to congregate--but then again, I could have guessed that beforehand. The lesson is probably this: don't use abstract nature writing for topic modeling. If you know anything about John Muir, you can already take a stab at topic modeling his opinions.

After my relative triumph with MALLET and the command line, I decided to fall back on things of comfort--namely, the GUI. The text modeling program Overview was there to help.

Loading files into Overview is almost too easy. The program accepts many formats; I uploaded a folder full of articles (PDFs) I had used for my thesis research on Yosemite in the late 1960s.

I love quick and easy results, and Overview was happy to oblige. Before it creates your document tree, Overview asks you for any specific words to omit or emphasize; I didn't enter any on my first go-round. Unfortunately, that left words like 'JSTOR,' 'permission,' and 'copyright' in my tree.

Luckily, Overview let me create a new tree from the same documents, which meant I could add stop words and make other nitpicky adjustments. Once I stripped out all the copyright-related words, I was left with a much more helpful document hierarchy. Overview successfully separated works dealing with camping, policing, and domestic politics.
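
Boilerplate words like these can also be spotted before uploading anything. The sketch below is only an illustration, and not how Overview works internally: it flags words that show up in nearly every document, on the assumption that a word common to the whole folder (a database's copyright notice, say) won't help tell the documents apart. The function name, the 0.8 cutoff, and the three one-line "documents" are all hypothetical.

```python
import re
from collections import Counter

def candidate_stop_words(documents, min_share=0.8):
    """Return words that appear in at least min_share of the documents."""
    doc_counts = Counter()
    for doc in documents:
        # Use a set so each word counts once per document, not once per use.
        doc_counts.update(set(re.findall(r"[a-z]+", doc.lower())))
    threshold = min_share * len(documents)
    return sorted(w for w, n in doc_counts.items() if n >= threshold)

# Hypothetical stand-ins for the extracted text of three PDFs.
docs = [
    "JSTOR copyright permission camping in Yosemite",
    "JSTOR copyright permission policing the valley",
    "JSTOR copyright permission domestic politics",
]
print(candidate_stop_words(docs))
```

A list like this is only a set of candidates; a word such as 'Yosemite' might also appear in every document and still be worth keeping, so the final call stays with the researcher.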

Sometimes, digital methods can create more questions than they answer. I was expecting Overview to provide a couple of neat suggestions--and I was rewarded. Many of my documents discussed dissidence, and many others involved runaways in the 1970s. Overview recognized (however unintentionally) that both actions were forms of protest, and they were grouped together. That works for me.

If I choose to topic model again, it'll be with Overview. I'm all for liberating myself from GUIs, but MALLET seems to have an incredibly steep learning curve. It's not just that MALLET is somewhat difficult to use--its results are also fairly difficult to interpret (especially the various numerical values referencing paragraphs and correlative frequency data). With Overview, however, it takes about five minutes to upload AND remove stop words. After that, the document tree presents topic models in an easy-to-follow format. There is undoubtedly some very complex math behind these results, but Overview deals only in words, phrases, and punctuation. It's a powerful topic modeling tool specifically for humanists--it's only natural that I like it more.

In the future, I think I'll use Overview in conjunction with my Zotero library. It's one thing to form rough mental maps of how my sources relate to one another; it's quite another to have an impartial (even distant) reader organize them for me. The Zotero/Overview combo will be especially useful in between first and second drafts, when it's valuable to see one's source base categorized according to a metric other than one's own. I study Yosemite because it's dear to my heart--this is both a blessing and a curse. My familiarity with some subject matter can cause me to miss the forest for the trees, but Overview ensures an impartial--and occasionally idiosyncratic--organizational experience.