1 Thomas Bayes and Mr. Price (1763), “An Essay towards solving a Problem in the Doctrine of Chances,” Philosophical Transactions of the Royal Society of London, 53, 370–418.
2 Nate Silver (2012), The Signal and the Noise: Why so Many Predictions Fail — but Some Don’t, New York, NY: Penguin Press, 243–247.
3 Ted Underwood (7 April 2012), “Topic Modeling made just simple enough.” The Stone and the Shell, http://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/.
4 George E.P. Box and Norman R. Draper (1987), Empirical Model-Building and Response Surfaces, New York: Wiley, p. 424.
5 When we describe topic modeling here, we have in mind the most commonly used approach, “Latent Dirichlet Allocation” (LDA). There are many other possible algorithms and approaches, but most digital humanists and historians treat LDA as synonymous with topic modeling. It is worth keeping in mind that there are other options, which might shed useful light on your problem at hand. A special issue of the Journal of Digital Humanities treats topic modeling across a variety of domains and is a useful jumping-off point for a deeper exploration of the possibilities (http://journalofdigitalhumanities.org/2-1/). LDA was not the first technique now considered topic modeling, but it is by far the most popular. The myriad variations of topic modeling have resulted in an alphabet soup of techniques and programs to implement them that might be confusing or overwhelming to the uninitiated; for the beginner it is enough to know that they exist. MALLET primarily utilizes LDA.
6 David Blei (2012), “Topic modeling and the digital humanities,” Journal of Digital Humanities, 2(1), http://journalofdigitalhumanities.org/2-1/topic-modeling-and-digital-humanities-by-david-m-blei/.
7 Scott Weingart (6 May 2012), “The Myth of Text Analytics and Unobtrusive Measurement,” scottbot.net, http://www.scottbot.net/HIAL/?p=16713.
8 Andrew Gelman (March 31, 2010), “‘How Many Zombies Do You Know?’ Using Indirect Survey Methods to Measure Alien Attacks and Outbreaks of the Undead,” arXiv:1003.6087 [physics], http://arxiv.org/abs/1003.6087/.
9 Originally pointed out by Ben Schmidt, “When you have a MALLET, everything looks like a nail” on his wonderful blog SappingAttention.com. See http://sappingattention.blogspot.ca/2012/11/when-you-have-mallet-everything-looks.html.
10 We have previously published an online tutorial to help the novice install and use the most popular of the many different topic modeling programs available, MALLET, at programminghistorian.org. This section republishes elements of that tutorial but we recommend checking the online version in case of any upgrades or version changes.
11 One thing to be aware of is that, since many of the tools we are about to discuss rely on Java, changes to the Java run-time environment and to the Java development kit (for instance, when Oracle periodically updates Java) can break the other tools. We have tested everything and know that these tools work with Java 7. If you find that the tools do not run, you should check what version of Java is on your machine. In a terminal window, type ‘java -version’ at the prompt. You should then see something like ‘java version "1.7.0_05"’. If you’re not seeing this, it could be that you need to install a different version of Java.
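The version check described in this note can also be scripted. What follows is a minimal sketch in Python, not part of the original tutorial; the function names `java_major_version` and `installed_java_major` are our own illustrative choices. It assumes the two common version-string styles: the older "1.7.0_05" form, where the second number is the major version, and the newer "17.0.2" form, where the first number is.

```python
import re
import subprocess


def java_major_version(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    # Older scheme: "1.7.0_05" means Java 7. Newer releases report e.g. "17.0.2".
    if first == 1 and second is not None:
        return int(second)
    return first


def installed_java_major():
    """Run `java -version` and parse it; returns None if Java is not on the PATH."""
    try:
        # `java -version` writes its report to stderr, not stdout.
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True
        )
    except FileNotFoundError:
        return None
    return java_major_version(result.stderr)


if __name__ == "__main__":
    print(installed_java_major())
```

If the script prints a number other than the version the tools were tested with, that is a hint (not proof) that a Java mismatch may be behind any failures.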
12 Assuming you are using Excel, and the first cell where you wish to put a unique ID number is cell A1: put ‘1’ in that cell. In cell A2, type =a1+1 and hit return. Then, copy that cell, select the remaining cells you wish to fill with numbers, and hit enter. Other spreadsheet programs will have similar functionality.
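For readers who would rather not use a spreadsheet, the same unique-ID column can be produced with a short script. This is a sketch under our own assumptions rather than a step from the original workflow: the helper name `add_id_column` and the two sample rows are placeholders invented for illustration.

```python
import csv
import io


def add_id_column(rows, start=1):
    """Prefix each row with a sequential ID, mirroring the =A1+1 fill in Excel."""
    return [[str(i)] + list(row) for i, row in enumerate(rows, start=start)]


# Placeholder data standing in for a real spreadsheet export (hypothetical).
sample = io.StringIO("first entry,some text\nsecond entry,more text\n")
numbered = add_id_column(csv.reader(sample))

for row in numbered:
    print(",".join(row))
```

To use it on a real file, replace the `io.StringIO` object with an opened CSV file and write the result back out with `csv.writer`.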
13 Our version of this file may be found at http://themacroscope.org/2.0/datafiles/johnadams-for-stmt.csv.
14 If the scripts on the Stanford site now differ from what is recounted in this passage, please use the ones at http://themacroscope.org/2.0/code-stmt/.
15 Paul Torfs and Claudia Brauer (3 March 2014), “A (very) Short Introduction to R,” https://cran.r-project.org/doc/contrib/Torfs+Brauer-Short-R-Intro.pdf. Another very good interactive tutorial for R is available through codeschool.com at http://tryr.codeschool.com/.
16 Fred Gibbs (4 June 2013), “Document Similarity with R,” fredgibbs.net, http://fredgibbs.net/tutorials/tutorial/document-similarity-with-r/.
17 Downloadable at https://gist.github.com/benmarwick/5403048. Another version is lodged at http://themacroscope.org/2.0/datafiles/gist5403048. The other files in this folder contain Marwick’s notes to his class providing more context, and can be opened in RStudio.
18 For an interesting use-case of this package, please see Ben Marwick’s analysis of the 2013 Day of Archaeology at https://github.com/benmarwick/dayofarchaeology. He published both the analysis and the scripts he has used to perform the analysis on GitHub itself, making it an interesting experiment in publishing data, digital methods, and discussion.
19 See Hanna Wallach, David Mimno and Andrew McCallum (2009) “Rethinking LDA: Why Priors Matter,” in Proceedings of Advances in Neural Information Processing Systems (NIPS), 22, 1973–1981. Vancouver, BC: Curran Associates. http://papers.nips.cc/paper/3854-rethinking-lda-why-priors-matter.
20 Matthew Jockers, Text Analysis with R for Students of Literature (New York: Springer, 2014), 147.
22 Andrew Goldstone and Ted Underwood (May 28, 2014), “The Quiet Transformations of Literary Studies: What Thirteen Thousand Scholars Could Tell Us,” Preprint for New Literary History available at https://www.ideals.illinois.edu/handle/2142/49323.