An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Topic Modeling with R

Working with MALLET from the command line takes some getting used to. Its output is somewhat difficult to work with, as the user has to load it into Excel or another spreadsheet to manipulate it. One thing we might like to do with the output is to create a table where our documents are down the side and our topics are arranged in order across the top, with the percentage composition filling out the cells. Such a matrix is not natively output by MALLET and, with larger datasets, can be quite time-consuming to create. It is possible to create a macro or a script in Excel that could do the work for you. However, we would recommend that you try the R language for statistical computing.

“Oh no! Do I have to learn a new programming language?” Yes. However, R is quite popular, it is widely supported, and people are creating new ‘packages’ (extensions or add-ons) that bundle together tools that you might wish to use. Thus, as your experience and comfort level grow with using R, you become able to do much more complicated work.

To begin with, download R and install it from http://cran.rstudio.com/. R uses a graphical interface that allows you to create ‘workspaces’ for your project. You load your data into the workspace, run your commands, and keep your output in this workspace (all without altering your original data). The ability to save and reload workspaces allows you to pick up where you left off.

We don’t intend this section to be a full-blown introduction to R and how to use the R environment. There are a number of tutorials available to get you started with R. Fred Gibbs has an excellent tutorial on computing document similarity with R that we highly recommend. Ben Marwick also has a good tutorial that shows some of the ins and outs of using R. One thing to note is that many people, like Marwick, share their ‘scripts’ for doing different tasks via Github. For instance, if you went to Marwick’s tutorial and clicked on the ‘download gist’ button, you’d get a zipped folder containing a file, ‘short-intro-R.R’. Unzip that file to your desktop. Then, in the R environment, click ‘file’ > ‘open script’. Browse to ‘short-intro-R.R’ and select it. A new window will open in R Studio containing the script. Within that script, any line beginning with a hash character (#) is a comment line. R will ignore these lines. You can now either run every command in the script at once, or run each line one at a time. To run each line one at a time, place the cursor at the top left of the script window and hit ctrl + r. You’ll see the line appear in the main console window. Since the first line in Marwick’s script begins with a #, the line is copied and nothing else happens. Hit ctrl + r again so that the line
 2 + 2
is copied into the console. R will return, directly underneath:
 [1] 4
As you proceed through Marwick’s script, you’ll see other ways of dealing with data. In line 22, you create a variable called ‘a’ and give it the value 2; in line 23 you create a variable called ‘b’ and give it the value 3. Line 24 has you add ‘a + b’, which will return ‘5’.
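To make that concrete, here is a minimal sketch of what those three steps look like when typed into the console. We are not reproducing Marwick’s script here; the exact wording and line numbers in his file may differ, so treat this as an illustration of the idea only.

```r
# Store the value 2 in a variable called 'a' and 3 in a variable called 'b'
a <- 2
b <- 3

# Add the two variables; R prints the result in the console
a + b
# [1] 5
```

The `<-` arrow is R’s usual assignment operator; `a = 2` would also work at the console.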

Sometimes, you will see a line that calls library() (or require()) with the name of a package. This line is telling R to use a particular package, which will provide R with more tools and algorithms to manipulate your data. If that package is not installed, you will receive an error message. If that happens, you can tell R to install the package quite easily by calling install.packages() with the package’s name in quotation marks.
R will ask you which mirror you wish to use; select a mirror that is geographically close to you, for the fastest download. These mirrors are the repositories that contain the latest versions of all the packages.
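The load-or-install pattern described above can be sketched as follows. The helper name `load_or_install` is our own invention for illustration, not part of R or of any script discussed here; `requireNamespace()`, `install.packages()`, and `library()` are standard R functions.

```r
# Hypothetical helper: load a package, installing it first if it is missing.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # R may ask you to choose a CRAN mirror here
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# 'stats' ships with R, so this call loads it without downloading anything
load_or_install("stats")
```

For a package you do not yet have, you would pass its name instead, and the download would happen the first time only.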

R can be used as a ‘wrapper’ for MALLET. That is, instead of running MALLET from the command line, we can run it from within R and use R’s capabilities to manipulate and visualize the output, including re-arranging the default composition file output into a more useful matrix (which we could then use to search for topics that correlate with one another, or documents that correlate). Ben Marwick has a script that does this, which we’ll now look at. Go to https://gist.github.com/benmarwick/4537873 , download the script, and open it in R. This script assumes that you have Mallet-2.0.7 installed and working on your machine, and it uses the sample tutorial materials included therein.

Marwick’s script is very well commented; before each command or task he clearly states what the following piece of code is going to do. You could select ‘edit’ >> ‘run all’ from the R Studio tool bar. If you then look in your Mallet directory, you’ll find three new files, ‘tutorial_keys.txt’, ‘tutorial_composition.txt’, and ‘topic_model_table.csv’, but you’ll miss how this all took place. Additionally, if there were any errors (if, for instance, you have Mallet set up in a different directory than this script envisions), you’ll have to scroll backwards through the console window to determine what happened. So let’s work through the script one line at a time.

The first few lines set the directory from which R will be working. Lines 9 to 20 set out a series of variables we will be using when we run the MALLET commands. Line 24 creates a string called ‘import’ that pulls together all of the commands that we would otherwise type directly into the command line to import our documents into a .mallet file for topic modeling. Line 25 creates a string called ‘train’ that pulls together the lengthy line that we would type in order to create a topic model directly on the command line.
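As a rough sketch of the idea (not a copy of Marwick’s code), the two strings can be assembled with paste(). The variable names, the number of topics, and the file paths below are our assumptions for illustration; the MALLET options themselves (import-dir, train-topics, and their flags) are the ones you would type at the command line.

```r
# Assumed settings; adjust to your own installation and data
mallet_cmd  <- "bin/mallet"            # on Windows, "bin\\mallet"
import_dir  <- "sample-data/web/en"    # folder of .txt files to model
mallet_file <- "tutorial.mallet"       # where the imported corpus is stored
n_topics    <- 20

# The command that imports the documents into a .mallet file
import <- paste(mallet_cmd, "import-dir",
                "--input", import_dir,
                "--output", mallet_file,
                "--keep-sequence --remove-stopwords")

# The command that trains the topic model and writes the output files
train <- paste(mallet_cmd, "train-topics",
               "--input", mallet_file,
               "--num-topics", n_topics,
               "--output-state", "topic-state.gz",
               "--output-topic-keys", "tutorial_keys.txt",
               "--output-doc-topics", "tutorial_composition.txt")

# Inspect the two commands before running anything
cat(import, train, sep = "\n")
```

Nothing is executed here; the strings are only printed. With a working MALLET installation and R’s working directory set to the MALLET folder, they could then be handed to the operating system with `system(import)` followed by `system(train)`.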

Line 34 takes these two strings, ‘import’ and ‘train’, and passes them directly out of R to the command line, or ‘shell’. Lines 40 and 41 take our output, the topic keys list and the topic composition (look carefully at line 25 and you’ll see these being created at the end of the line), and turn them into chunks that R can digest for the last bit of magic, which occurs between lines 44 and 50.
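Reading a MALLET output file into R is typically a matter of a single read.table() call. The sketch below does not use real MALLET output: it writes two made-up lines in the tab-separated layout of a topic keys file (topic number, weight, top words) to a temporary file, then reads them back in. The column names are our own choice.

```r
# Write a tiny, invented stand-in for 'tutorial_keys.txt' to a temporary file
keys_file <- tempfile(fileext = ".txt")
writeLines(c("0\t0.25\tcanal water river",
             "1\t0.25\tfilm war army"),
           keys_file)

# Read the tab-separated file into a data frame that R can work with
topic_keys <- read.table(keys_file, sep = "\t", header = FALSE,
                         col.names = c("topic", "weight", "words"),
                         quote = "", stringsAsFactors = FALSE)

topic_keys$words[1]
# [1] "canal water river"
```

With real output you would replace `keys_file` with the path to the keys file that MALLET wrote.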

These six lines are reshaping that topic composition file, where each document’s largest topic is listed first, with its percentage composition, then the next highest, and so on, into a matrix where the topics are listed across the top of the table and documents down the side. The very last line of the script opens up a window for the folder where the output has been written. You can then take that matrix file and load it into spreadsheet or visualization software for further manipulation.
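The reshaping step can be sketched with a toy example. This is our own minimal version in base R, not the code from Marwick’s script, and the two documents and their proportions are invented. It assumes the composition layout described above: one row per document, followed by topic and proportion pairs ordered from largest to smallest.

```r
# Invented stand-in for the composition file: (topic, proportion) pairs per row
composition <- data.frame(
  doc    = c("a.txt", "b.txt"),
  topic1 = c(2, 0), prop1 = c(0.6, 0.5),
  topic2 = c(0, 1), prop2 = c(0.3, 0.4),
  topic3 = c(1, 2), prop3 = c(0.1, 0.1)
)

n_topics   <- 3
topic_cols <- paste0("topic", 1:n_topics)
prop_cols  <- paste0("prop", 1:n_topics)

# Start with an empty documents-by-topics matrix
doc_topic <- matrix(0, nrow = nrow(composition), ncol = n_topics,
                    dimnames = list(composition$doc,
                                    paste0("topic_", 0:(n_topics - 1))))

# For each document, drop each proportion into the column of its topic
for (i in seq_len(nrow(composition))) {
  topics <- as.integer(unlist(composition[i, topic_cols]))
  props  <- as.numeric(unlist(composition[i, prop_cols]))
  doc_topic[i, topics + 1] <- props  # MALLET numbers topics from 0
}

doc_topic
```

Each row of `doc_topic` now holds one document’s proportions in a fixed topic order, which is the shape you need for correlating topics or documents; `write.csv(doc_topic, "topic_model_table.csv")` would save it for use in a spreadsheet.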

Marwick’s r2mallet.r script is valuable for us for a couple of reasons. It is clearly and well commented, so we understand what is happening at each step. It builds on skills we have already developed (working with MALLET from the command line). It performs a useful manipulation of the final output from MALLET that would be exceedingly difficult or cumbersome to do using a spreadsheet. It is not, however, the most effective implementation of MALLET using R. It is, in one regard, a bit of a sleight-of-hand: it uses R as a kind of more user-friendly interface for MALLET. To topic model a different dataset, one would only have to change lines 12, 18-20, and 53 to give the output different names (or one could just use strings in those lines, and define these right at the opening of the script).

MALLET Package for R

David Mimno has made MALLET available directly within R as its own package, which allows greater speed and efficiency, as well as turning R’s full strength to the analysis and visualization of the resulting data. Ben Marwick has used Mimno’s package to analyze the 2013 Day of Archaeology posts, providing both analysis and the scripts he’s used at https://github.com/benmarwick/dayofarchaeology.

To use the MALLET package in R once it has been installed, one simply types library(mallet) (or require(mallet)).
Now the whole suite of commands and parameters is available to you. At the ‘Historian’s Macroscope’ github page you can download and open our topicmodel.R script [nb, at the moment you can grab a copy here: https://github.com/shawngraham/R/blob/master/topicmodel.R]. This script is based on what Marwick does in his analysis of the Day of Archaeology. It will import a folder containing .txt files, topic model it (the default number of topics is 30), output various .csv files containing the regular MALLET outputs as well as a .csv file with the topics labeled, output a dendrogram showing how the topics cluster, create a similarity matrix showing how the documents are similar based on their proportions of topics, and visualize that matrix as a graphml file which can then be opened in Gephi (look for ‘g.graphml’ in your working directory once the script has finished running). It will also create an html file that uses d3 to represent this network interactively in a browser (look for ‘d3net.html’ in your working directory).
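To give a sense of what the package’s commands look like, here is a minimal sketch of the core steps: read a folder, import it, train a model, and pull out the two main matrices. It is not our topicmodel.R script. The wrapper function `fit_topic_model` and its argument names are hypothetical, and the sketch assumes that the mallet package (and the Java it relies on) is installed and that you have a stopword list saved as a text file.

```r
# Hypothetical wrapper around the core calls of the 'mallet' package
fit_topic_model <- function(txt_dir, stoplist_path, n_topics = 30, n_iter = 200) {
  # One row per .txt file in the folder, with columns 'id' and 'text'
  documents <- mallet::mallet.read.dir(txt_dir)

  # Convert the texts into MALLET's internal format, dropping stopwords
  instances <- mallet::mallet.import(documents$id, documents$text, stoplist_path)

  # Set up the model, load the documents, and run the sampler
  topic_model <- mallet::MalletLDA(num.topics = n_topics)
  topic_model$loadDocuments(instances)
  topic_model$train(n_iter)

  # Return documents-by-topics and topics-by-words matrices
  list(
    doc_topics  = mallet::mallet.doc.topics(topic_model,
                                            smoothed = TRUE, normalized = TRUE),
    topic_words = mallet::mallet.topic.words(topic_model,
                                             smoothed = TRUE, normalized = TRUE)
  )
}

# Example call (paths are placeholders, so this is left commented out):
# result <- fit_topic_model("C:/corpus", "C:/corpus/stoplist.txt")
# dim(result$doc_topics)  # documents x topics
```

The `doc_topics` matrix is the documents-down-the-side, topics-across-the-top table that took extra reshaping work in the wrapper approach; here the package hands it over directly.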

Change line 17 so that it contains the path to your folder containing the documents you wish to analyse. On Windows, make sure to use \\ instead of a single \ when specifying the path, or else you’ll receive an error.
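The reason is that R treats a single backslash inside a quoted string as the start of an escape sequence, so a literal backslash has to be written twice. A short sketch, using a made-up folder name:

```r
# Hypothetical Windows folder; replace with the path to your own documents
path_escaped <- "C:\\Users\\me\\Documents\\corpus"

# Each doubled backslash is stored as one character
nchar("\\")
# [1] 1

# cat() shows the path as R actually stores it
cat(path_escaped, "\n")
# C:\Users\me\Documents\corpus

# Forward slashes also work in R on Windows and need no doubling
path_forward <- "C:/Users/me/Documents/corpus"
```

Either form can be used wherever a script asks for a folder path.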

It was this script we used to analyze and visualize the results of a topic model fitted against the 8000 biographies of Canadians found in the Dictionary of Canadian Biography (biographi.ca) that we discuss in the next section.

For more on using R in the service of text analysis, see Matthew Jockers’s book-in-progress at http://www.matthewjockers.net/2013/09/03/tawr/

Source: http://www.themacroscope.org/?page_id=67