An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Advanced Topic Modeling with R

Working with MALLET from the command line takes some getting used to. Its output is somewhat difficult to work with, as the user has to load it into Excel or another spreadsheet to manipulate it. The STMT allows us to manipulate our topic model to a degree for better visualization, but it presents its own challenges.

One thing we might like to do with the output is to create a table where our documents are down the side and our topics are arranged in order across the top, with the percentage composition filling out the cells. Once we had the data arranged this way, we would be able to use functions in MS Excel to work out which documents are correlated with which other documents, or which topics are correlated with which other topics (the better to create network visualizations with, for instance). Such a matrix is not natively output by MALLET or the other tools we’ve discussed, and with larger datasets it can be quite time consuming to create. It is possible (but not easy) to create a macro or a script in Excel that could do the work for you. Rather than do that, we suggest that you experiment with creating topic models within the R statistical programming environment. The key advantage here: R is built to deal with large matrices of data, to manipulate them, to rearrange them, and to visualize them. Many common tasks have already been packaged together, so that instead of programming them from scratch, you can invoke the package as its own command. R also enables someone else to reproduce what you have done, and thus introduces the idea that another historian could work with your data and extend or critique your argument on those grounds.
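To make the idea of such a table concrete, here is a minimal sketch in base R. The matrix values are invented, and the names doc.topics.toy, topic.cor and doc.cor are ours, for illustration only; a real table would come from a topic model.

```r
# Toy document-topic matrix: 4 documents (rows) by 3 topics (columns).
# The values are invented for illustration; each row sums to 1.
doc.topics.toy <- matrix(
  c(0.70, 0.20, 0.10,
    0.60, 0.30, 0.10,
    0.10, 0.20, 0.70,
    0.20, 0.10, 0.70),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), paste0("topic", 1:3))
)

# Which topics rise and fall together across documents?
# cor() compares the columns of a matrix with one another.
topic.cor <- cor(doc.topics.toy)

# Which documents have similar topic profiles?
# Transpose first, so that documents become the columns.
doc.cor <- cor(t(doc.topics.toy))

round(topic.cor, 2)
round(doc.cor, 2)
```

Either correlation table could then be written out with write.csv() and used, for instance, as the edge weights of a network visualization.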

Getting started with R

To begin, we will need to download R as well as RStudio. R is the language; RStudio is a user interface that makes working in R a great deal easier. You can download R from http://cran.rstudio.com (select the appropriate version for your operating system), and then RStudio from http://www.rstudio.com/products/rstudio/download/ (make sure to select the free, open-source version). Both have standard installation processes that you can run like any other piece of software.

RStudio uses a graphical interface that allows you to create ‘workspaces’ for your project. You load your data into the workspace, run your commands, and keep your output in this workspace (all without altering your original data). The ability to save and reload workspaces allows you to pick up where you left off.

We do not intend this section to be a full-blown introduction to R and how to use the R environment. There are a number of excellent online tutorials available to get you started with R. Paul Torfs and Claudia Brauer have a very good general introduction to R.[1] Fred Gibbs has a tutorial on computing document similarity with R that we highly recommend.[2] Ben Marwick also has a good tutorial that shows some of the ins and outs of using R, programmed in R itself.[3]

One thing to note is that many people, like Marwick, share their ‘scripts’ for doing different tasks via GitHub. To see this in action, navigate to Marwick’s tutorial (https://gist.github.com/benmarwick/5403048) and click on the ‘download gist’ button. You will receive a zipped folder containing a file called ‘short-intro-R.R’. Unzip that file to your desktop. Then, in the R environment, select ‘file’ > ‘open script’. Browse to ‘short-intro-R.R’ and select it. (Our aim here is to show you how to run other people’s programs, and to make minor changes yourself.)

A new window will open containing the script. Within that script, any line beginning with a hash character is a comment line. R ignores all comment lines when running code. You can now run either every command in the script, or you can run each line one at a time to see what happens. To run a single line, place the cursor at the beginning of the line, and hit ctrl + r on Windows, or Command + Enter on OS X. You’ll see the line appear in the main console window. Since the first line in Marwick’s script begins with a # (meaning it is a comment), the line is copied and nothing else happens. Hit ctrl + r/Command + Enter so that the line

2 + 2

is copied into the console. R will return, directly underneath:

4

As you proceed through Marwick’s script, you’ll see other ways of dealing with data. In line 22, you create a variable called ‘a’ and give it the value 2; in line 23 you create a variable called ‘b’ and give it the value 3. Line 24 has you add ‘a + b’, which will return ‘5’. An excellent online, interactive tutorial that covers the basic ideas of R can be found at http://tryr.codeschool.com/ which we also recommend.


Extending R with others’ packages

Sometimes, you will see a line like this:

library(igraph)

This line is telling R to use a particular package that will provide R with more tools and algorithms to manipulate your data. If that package is not installed, you will receive an error message. If that happens, you can tell R to install the package quite easily:

install.packages("igraph")

R will ask you which download site you wish to use (which it calls a ‘mirror’); select one that is geographically close to you for the fastest download. These mirrors are the repositories that contain the latest versions of all the packages.
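If you share a script with others, you may not know whether they have a package installed. One common pattern, sketched here, is to check first and install only when needed. The helper name load_or_install is our own invention and is not part of R or of any package; requireNamespace(), install.packages() and library() are standard R functions.

```r
# Hypothetical helper (not part of any package): load a package,
# installing it first only if it is not already available.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # R may prompt for a mirror here, as described above.
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

# 'stats' ships with R itself, so this call never triggers a download.
load_or_install("stats")
```

Replacing "stats" with "igraph" would install igraph on a machine that lacks it, and simply load it everywhere else.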


Using Mimno’s MALLET Wrapper in R

David Mimno has written a wrapper for MALLET in R. Mimno’s wrapper installs MALLET’s topic modeling tools (which you were using on the command line earlier) directly inside R, allowing greater speed and efficiency, as well as turning R’s full strength to the analysis and visualization of the resulting data. You do not have to have the command-line version of MALLET installed on your machine already to use Mimno’s wrapper: the wrapper is MALLET, or at least, the part that does the topic modeling![4] Since MALLET is written in Java, the wrapper will install the rJava package to enable R to run it.

To use the MALLET wrapper in R, one simply types (remember, in R, any line with a # is a comment and so does not execute):

# the first time you wish to use it, you must install:
install.packages("mallet")
# R will then ask you which 'mirror' (repository) you wish to install from. Select one that is close to you.
# any subsequent time, after you've installed it:
require(mallet)

If you find you’re having trouble getting started, or there are error messages, please see our section on ‘Working with R’ on the website for this book at http://themacroscope.org/2.0/extrahelp


Now the whole suite of commands and parameters is available to you. A short demonstration script that uses the example data bundled with MALLET (since you already downloaded that data earlier in this chapter) can be found at the code section on themacroscope.org. (The full manual for the wrapper may be found at http://cran.r-project.org/web/packages/mallet/mallet.pdf ; our example is based on Mimno’s example.) Open it up. We will not provide every single line of code below, so we encourage you to work through the script you find online with this book in hand. We’ll explain the important lines!

Let’s look in more detail at how we build a topic model using this script.

documents <- mallet.read.dir("mallet-2.0.7/sample-data/web/en/")

This line reads every file in the given directory into a variable called ‘documents’: a data frame with one row per document, holding that document’s id and its text. In that directory, each document is its own unique text file. On a Windows machine, you would include the full path, e.g. "C:/mallet-2.0.7/sample-data/web/en/" (R accepts forward slashes in Windows paths; if you use backslashes instead, each one must be doubled). On OS X, if you followed the default installation instructions, the relative path above should work as it stands. Now we need to import those documents into MALLET. We do that by running this command:

mallet.instances <- mallet.import(documents$id, documents$text, "mallet-2.0.7/stoplists/en.txt", token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

It’s complicated, but what’s happening here is that your documents are being imported into MALLET’s own format, as a new object called ‘mallet.instances’. That is, every document is listed by its id with its associated text, a stoplist has been used to filter out common stopwords, and a regular expression keeps only those sequences of Unicode letters (optionally with punctuation inside them) that are at least three characters long. Note that inside an R string each backslash in the regular expression must be doubled. It is worth asking yourself: is the default stoplist provided by MALLET appropriate for my text? Are there words that should be added or removed? You can create a stopword list in any text editor by opening the default one, adding or deleting as appropriate, and then saving with a new name and the .txt extension. The next step is to create an empty container for our topic model:
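To get a feel for what the regular expression and the stoplist each do, the following sketch imitates both steps in base R on a single invented sentence. This is only an approximation for exploration: MALLET does its own tokenizing in Java, and the variable names and the three-word stand-in stoplist are our assumptions, not part of the wrapper.

```r
# An invented sample sentence.
text <- "The well-known U.S. historian didn't go to the archive."

# The same pattern passed to mallet.import(), with backslashes doubled
# as R string syntax requires. perl = TRUE enables the \p{...} classes.
token.regexp <- "\\p{L}[\\p{L}\\p{P}]+\\p{L}"
tokens <- regmatches(text, gregexpr(token.regexp, text, perl = TRUE))[[1]]

# Short words such as "go" and "to" never match, because the pattern
# requires at least three characters.
tokens

# A tiny stand-in for the contents of MALLET's en.txt stoplist.
default.stops <- c("the", "to", "go")

# Customize it: stop filtering "go", start filtering "historian".
custom.stops <- union(setdiff(default.stops, "go"), "historian")

# Drop stopwords from the tokens (comparing in lower case).
kept <- tokens[!(tolower(tokens) %in% custom.stops)]
kept

# Save the customized list, one word per line, under a new name.
stoplist.path <- file.path(tempdir(), "my-stoplist.txt")
writeLines(custom.stops, stoplist.path)
```

The path held in stoplist.path could then be given to mallet.import() in place of "mallet-2.0.7/stoplists/en.txt".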

n.topics <- 30
topic.model <- MalletLDA(n.topics)

We created a variable called ‘n.topics’. If you reran your analysis to explore a greater or lesser number of topics, you would only have to change this one line to the number you wished. Now we can load the container up with our documents:

topic.model$loadDocuments(mallet.instances)

At this point, you can begin to explore patterns in the word use in your documents, if you wish, by finding out what the vocabulary and word frequencies of the corpus are:

vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)

 

If you now type

length(vocabulary)

…you will receive a number; this is the number of unique words in your corpus. You can inspect the first 100 words in the vocabulary thus:

vocabulary[1:100]

As Mimno notes in the comments to his code, this information could be useful for you to customize your stoplist. Jockers also shows us how to explore some of the distribution of those words using the ‘head’ command (which returns the first few rows of a data matrix or data frame):

head(word.freqs)

You will be presented with a table with words down the side, and two columns: term.freq and doc.freq. This tells you the number of times the word appears in the corpus, and the number of documents in which it appears.
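One way to put that table to work when customizing a stoplist is to sort it and look for very frequent words that turn up in every document. The sketch below uses an invented stand-in table, since the real one depends on your corpus; the column name ‘words’ is an assumption on our part (check names(word.freqs) in your own session), while term.freq and doc.freq follow the description above.

```r
# Invented stand-in for the output of mallet.word.freqs().
# The 'words' column name is assumed; verify it with names(word.freqs).
word.freqs.toy <- data.frame(
  words = c("archive", "the", "letter", "said", "railway"),
  term.freq = c(40, 950, 35, 310, 12),
  doc.freq = c(9, 10, 7, 10, 2),
  stringsAsFactors = FALSE
)
n.docs <- 10  # pretend the corpus holds ten documents

# Sort from most to least frequent across the whole corpus.
by.freq <- word.freqs.toy[order(word.freqs.toy$term.freq, decreasing = TRUE), ]

# Words that occur in every single document do little to distinguish
# one document from another, so they are candidates for the stoplist.
candidates <- by.freq$words[by.freq$doc.freq == n.docs]
candidates
```

Whether a candidate really belongs on the stoplist remains a judgment about your sources, not something the numbers decide for you.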

The script now sets the optimization parameters for the topic model. In essence, you can tune the model.[5] This line sets the ‘hyperparameters’ (Mimno’s example describes it as optimizing the hyperparameters every 20 iterations, after 50 ‘burn-in’ iterations):

topic.model$setAlphaOptimization(20, 50)

You can play with these to see what happens, or you can choose to leave this line out and accept MALLET’s defaults. The next two lines generate the topic model:

topic.model$train(200)
topic.model$maximize(10)

The first line tells MALLET how many rounds or iterations to process through; the second then runs a few further iterations in which, as Mimno’s comments put it, the best topic is picked for each token rather than sampled. More iterations can sometimes lead to ‘better’ topics and clusters. Jockers reports that he finds that the quality increases with the number of iterations only so far, before beginning to plateau.[6] When you run these commands, output will scroll by as the algorithm iterates. As it runs, it will also report the log-likelihood of the model, a negative number; the closer that number is to zero, the better.

Now we want to examine the results of the topic model. These lines take the raw output and convert it to probabilities:

doc.topics <- mallet.doc.topics(topic.model, smoothed=TRUE, normalized=TRUE)
topic.words <- mallet.topic.words(topic.model, smoothed=TRUE, normalized=TRUE)

One last bit of transformation will give us a spreadsheet with topics down the side and documents across the top (compare this with the ‘native’ output of MALLET from the command line).

topic.docs <- t(doc.topics)
topic.docs <- topic.docs / rowSums(topic.docs)
write.csv(topic.docs, "topics-docs.csv")
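To see what the transposition and the division by rowSums() each do, here is the same pair of steps applied to a tiny invented matrix in base R. All names ending in .toy, and the output file name, are ours for illustration only.

```r
# Invented document-topic proportions: 3 documents by 2 topics,
# each row summing to 1 as in the normalized doc.topics.
doc.topics.toy <- matrix(
  c(0.8, 0.2,
    0.5, 0.5,
    0.1, 0.9),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("topic1", "topic2"))
)

# t() flips the table: topics down the side, documents across the top.
topic.docs.toy <- t(doc.topics.toy)

# Dividing by rowSums() rescales each topic's row to sum to 1, so a cell
# now reads as "the share of this topic that falls in this document".
topic.docs.toy <- topic.docs.toy / rowSums(topic.docs.toy)
round(topic.docs.toy, 3)

# Write the table out and read it back to check the round trip.
out.path <- file.path(tempdir(), "topics-docs-toy.csv")
write.csv(topic.docs.toy, out.path)
check <- read.csv(out.path, row.names = 1)
```

Note that after this rescaling the columns (documents) no longer sum to 1. If you want each document’s composition instead, work from doc.topics or from t(doc.topics) before the division.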

 

This script will not work ‘out of the box’ for you the first time, because your files and our files might not necessarily be in the same location. To use it successfully for yourself, you will need to change certain lines to point to appropriate locations on your own machine (regardless of whether it is a Windows, Mac, or Linux machine). Study the example carefully. Do you see which lines need to be changed to access your own data?

This is a good starting point for your future work with R! If you want to try it out on your own data, you can change the directory with documents on line 21. There are a couple of ways you might try to visualize these results, too, within R. Let’s begin by creating some topic labels with this code:

# Get a vector containing short names for the topics
topics.labels <- rep("", n.topics)
for (topic in 1:n.topics) topics.labels[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,], num.top.words=5)$words, collapse=" ")

# have a look at keywords for each topic
topics.labels

# write these to a file
write.csv(topics.labels, "topics-labels.csv")

After creating your topic model, we can now create a clustergram of our topics. The following code will work:

# create data.frame with columns as documents and rows as topics
topic_docs <- data.frame(topic.docs)
names(topic_docs) <- documents$id

## cluster based on shared words
plot(hclust(dist(topic.words)), labels=topics.labels)

 

These lines create a clustergram of your topics based on the similarity of word use within them! Imagine trying to perform such a visualization in MS Excel. R is an extremely powerful programming environment for analyzing the kinds of data that historians will encounter. For a fun example of using R to topic model and visualize patterns across texts, see Ben Marwick’s ‘A Distant Reading of the Day of Archaeology’ https://github.com/benmarwick/dayofarchaeology.
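If you would like to see what dist() and hclust() are doing before trying them on a real model, the following sketch runs the same steps on a small random matrix in base R. The matrix, the labels and the choice of two groups are all invented for illustration; the functions themselves are part of R’s built-in stats package.

```r
set.seed(1)  # make the random toy data repeatable

# Toy topic-word matrix: 5 topics (rows) over 8 vocabulary words (columns),
# rescaled so that each topic's word weights sum to 1.
topic.words.toy <- matrix(runif(40), nrow = 5)
topic.words.toy <- topic.words.toy / rowSums(topic.words.toy)
labels.toy <- paste("topic", 1:5)

# dist() measures how far apart each pair of rows (topics) is;
# hclust() then merges the closest topics step by step into a tree.
distances <- dist(topic.words.toy)
tree <- hclust(distances)

# Draw the tree, labelling each leaf with its topic name.
plot(tree, labels = labels.toy)

# Optionally cut the tree into a fixed number of groups (here, 2).
groups <- cutree(tree, k = 2)
groups
```

With a real model, topics that share a branch low down in the tree are those whose word distributions are most alike.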

Numerous online tutorials for visualizing and working with data exist; we would suggest Matthew Jockers’ Text Analysis with R for Students of Literature (http://link.springer.com/book/10.1007/978-3-319-03164-4) as your next port of call if you wish to explore R’s potential further.



[1] Paul Torfs and Claudia Brauer, “A (very) Short Introduction to R,” 3 March 2014, http://cran.r-project.org/doc/contrib/Torfs+Brauer-Short-R-Intro.pdf . Another very good interactive tutorial for R is available through codeschool.com at http://tryr.codeschool.com/

[2] Fred Gibbs, “Document Similarity with R,” fredgibbs.net, 4 June 2013, http://fredgibbs.net/tutorials/tutorial/document-similarity-with-r/.

[4] For an interesting use-case of this package, please see Ben Marwick’s analysis of the 2013 Day of Archaeology at https://github.com/benmarwick/dayofarchaeology. He published both the analysis and the scripts he used to perform the analysis on GitHub itself, making it an interesting experiment in publishing data, digital methods, and discussion.

[5] See Hanna Wallach, David Mimno and Andrew McCallum, “Rethinking LDA: Why Priors Matter,” in Proceedings of Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 2009.

[6] Matthew Jockers, Text Analysis with R for Students of Literature (New York: Springer, 2014), 147.



Source: http://www.themacroscope.org/?page_id=822