
Topic Modeling as an Integral Part of the Historian’s Macroscope


Here we discuss in some depth different ways of building topic models of the patterns of discourse in your source documents. We explore what ‘topic models’ are and how topic modeling works, and consider why some tools might be better suited to certain circumstances than others. We provide detailed instructions for working with a number of tools, including the R statistical programming language, and work through an example study of 8,000 biographical essays from the Dictionary of Canadian Biography.

Keywords have their limitations: they require that we know what to search for. Topic modeling, on the other hand, allows us to come in with an open mind; the documents themselves ‘tell’ us what topics they contain. The ‘model’ in a topic model is an idea of how texts get written: authors compose texts by selecting words from distributions of words (or ‘bags of words’, or ‘buckets of words’) that describe various thematic topics. Got that? In the beginning there was the topic. The entire universe of writing is one giant warehouse whose aisles hold bins of words – here the bins of Canadian history, there the bins for major league sports (a very small aisle indeed). All documents (your essay, my dissertation, this book) are composed of words plucked from the various topic bins and combined. If that describes how an author actually writes, then the process is reversible: from the entire collection of words, it is possible to decompose the original distributions held in those bags and buckets.

In this section, we explore various ways of creating topic models, what they might mean, and how they might be visualized. We work through a number of examples, so that the reader might find a model to adapt to his or her own work. The essence of a topic model is in its input and its output: a corpus – a collection of texts – goes in, and a list of the topics that comprise those texts comes out the other side. Its mechanism rests on a deeply flawed assumption about how writing works, and yet the results of that mechanism are often surprisingly cogent and useful.

What is a topic, anyway? If you are a literary scholar, you will likely understand a ‘topic’ rather differently than a librarian might: as a discourse rather than as a subject heading. Then there is the question of how mathematicians and computer scientists understand a ‘topic’. To answer that question, we have to consider what they mean by a ‘document’. To the developers of these algorithms, a ‘document’ is simply a collection of words found in differing proportions (in the real world, it could thus be a blog post, a paragraph, a chapter, a ledger entry, or an entire book). To decompose a document into its constituent ‘topics’, we have to imagine a world in which every conceivable topic of discussion exists and is well defined, and each topic is perfectly represented as a distribution of particular words. Each distribution of words is unique, and thus you can infer a document’s topicality by carefully comparing its distribution of words to the set of ideal topics we already know exist.

Topic Modeling by Hand

Let’s look at the Gettysburg Address:

Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battlefield of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicate, we can not consecrate, we can not hallow this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.

This is a single document. How many topics are present, and what are their most significant words? Let us generate a topic model by hand. We have used this exercise in class as an introduction to text analysis more generally, but it can also highlight the differences between the ways computers ‘know’ something and the ways historians ‘know’.

Take out some highlighters and mark up this passage (print out a readily accessible copy found online if you do not want to sully this book). Use one colour to highlight words related to ‘war’ and another for words related to ‘governance’. You might find that some words get double-marked. As you do this, you are making some rather complicated inferences. What are ‘war’ words, anyway? Have someone else do the exercise at the same time. Depending on your experience with nineteenth-century rhetoric, or politics, or history, your marked-up pages will be subtly different. One person will see that ‘field’ perhaps ought to be part of the ‘war’ topic, while another will not make that connection. What makes the difference? Fundamentally, given our individual backgrounds, it comes down to a matter of probability.

Then, list the ‘war’ words down one side of a page, and the ‘governance’ words on the other. Add a ‘count’ column beside each list in which you record the number of times each word appears. You might like to do this in a spreadsheet, so you can sort your lists with the most frequent words at the top. The Gettysburg Address, as reproduced here, has 271 words. The most frequently occurring word, ‘that’, is found 13 times, or roughly 5%. Add up your ‘counts’ column, and figure out the proportion of the total document that those words – these topics – account for, by dividing by 271. (Incidentally, if you visualize your results as a histogram, swapping out the bars in the chart for the words themselves, you have more or less created a word cloud.)
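If you would rather let the machine do the counting, the same hand tally can be sketched in a few lines of R (a language we return to at length later in this chapter). This is a minimal sketch only: it assumes you have saved the address in a plain text file we have called gettysburg.txt, and the short word list stands in for your own highlighted one:

words <- scan("gettysburg.txt", what = "character", quote = "")
words <- tolower(gsub("[[:punct:]]", "", words))  # strip punctuation, lower-case
length(words)                                     # total word count, roughly 271
war.words <- c("dead", "war", "gave", "living", "fought")  # your own 'war' list
counts <- sort(table(words[words %in% war.words]), decreasing = TRUE)
counts                             # the 'count' column of the exercise
sum(counts) / length(words)        # the topic's share of the whole document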

Hollis Peirce, an undergraduate student at Carleton, created the following spreadsheet, working this out for himself (figure 4.1).

[insert Figure 4.1: Hollis Peirce’s marked up Gettysburg Address]

His count looks like this:

War Words Count

Dead 3
War 2
Gave 2
Living 2
Have 2
Engaged 1
Battlefield 1
Field 1
Final 1
Resting 1
Place 1
Lives 1
Live 1
Hallow 1
Ground 1
Brave 1
Struggled 1
Remember 1
Never 1
Forget 1
Fought 1
Nobly 1
Advanced 1
Honoured 1
Take 1
Cause 1
Died 1
Vain 1
Perish 1

Total: 35
35/271 ≈ 13%

This was Hollis’ hand-made topic model, but the magical, computational part – deciding what constituted a topic – was done in his head. We decided, a priori, that there were two topics in this document, and that they dealt specifically with ‘war’ and ‘governance’. We pored over the words, and fitted them probabilistically into one or the other topic (and sometimes, both). The finished list of words and their counts is the distribution-over-words that characterizes a topic; the proportion of the entire document that those words account for demonstrates the document’s topical composition. If we had several more documents, we could use the lists we have generated as a guide to colour-code, to mark up, those other documents. In the argot of topic modeling, this would be a ‘trained’ topic model (we use our intuition about the Gettysburg Address to find patterns in other documents). We could also run the same process from scratch on each of our new documents, and then iterate again through our lists, to understand the latent or hidden structure of our corpus as a whole. We should point out that while ‘document’ in everyday use means a diary entry, a single speech, or an entire book, for the purpose of data mining a ‘document’ could be every paragraph within that book, or every 1,000 words.

When the computer does the work for us, it pays close attention to words that might appear in multiple topics. Ted Underwood asks us to think about the word ‘lead’, which might be a verb, and thus part of a topic related to leadership (i.e., he took the lead in the charge), or it might be a noun, and thus part of a topic related to environmental contamination (i.e., lead in the pipes was responsible for the poisoning). How can we know the difference? That is, how can we encode our understanding of semantic differences and word usage in a series of steps for the computer to undertake? We ask the computer to figure out the probability that ‘lead’ belongs in a given topic, versus other topics. Hollis already knew that some words were more likely to be about war than governance; the computer does not.

Instead, we instruct the computer to pick topics for us, and it begins with a series of blind guesses, assigning words to bins at random. The computer knows a warehouse full of word bins exists, but it cannot see inside it. The topic model is the computer’s attempt at inferring the contents of each bin by looking at each document and working backwards to the topic bins from which its words were likely drawn. The computer starts from the assumption that if several documents contain the same groups of words, those words likely form a ‘topic’. As the computer scans through the text over and over again, it reorganizes its initially random bins into closer and closer approximations of what it guesses the “real” topic bins must look like. Internally, the computer is optimizing for this problem: given a distribution of words over an entire collection of documents, what is the probability that this distribution of words within a document belongs to a particular topic?
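To make that procedure concrete, here is a deliberately tiny sketch of it in R – a toy ‘collapsed Gibbs sampler’ over an invented three-document corpus. Everything here (the documents, the number of topics, the smoothing parameters) is made up for illustration; MALLET, discussed below, does something of this kind in a far more sophisticated and efficient way:

set.seed(42)
docs <- list(c("war", "dead", "battle", "nation"),
             c("nation", "people", "government", "freedom"),
             c("war", "battle", "dead", "fought"))
K <- 2; alpha <- 1; beta <- 0.1       # two topics; smoothing parameters
vocab <- unique(unlist(docs))
z <- lapply(docs, function(d) sample(1:K, length(d), replace = TRUE)) # random bins
cwt <- matrix(0, K, length(vocab), dimnames = list(NULL, vocab)) # topic-word counts
cdt <- matrix(0, length(docs), K)                                # doc-topic counts
for (d in seq_along(docs)) for (i in seq_along(docs[[d]])) {
  cwt[z[[d]][i], docs[[d]][i]] <- cwt[z[[d]][i], docs[[d]][i]] + 1
  cdt[d, z[[d]][i]] <- cdt[d, z[[d]][i]] + 1
}
for (iter in 1:200) for (d in seq_along(docs)) for (i in seq_along(docs[[d]])) {
  w <- docs[[d]][i]; t <- z[[d]][i]
  cwt[t, w] <- cwt[t, w] - 1; cdt[d, t] <- cdt[d, t] - 1  # pull the word out
  p <- (cwt[, w] + beta) / (rowSums(cwt) + beta * length(vocab)) * (cdt[d, ] + alpha)
  t <- sample(1:K, 1, prob = p)                           # re-guess its bin
  z[[d]][i] <- t
  cwt[t, w] <- cwt[t, w] + 1; cdt[d, t] <- cdt[d, t] + 1
}
round(cwt / rowSums(cwt), 2)   # each row: one topic's distribution-over-words

Even on this toy corpus, the two rows that emerge tend to separate the ‘war’ words from the ‘governance’ words – exactly the reorganization of random bins described above.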

This is a Bayesian approach to probability. Thomas Bayes was an 18th century clergyman who dabbled in mathematics. He was interested in problems of conditional probability, in light of prior knowledge.1 The formula that now bears Bayes’ name depends on assigning a prior probability, and then re-evaluating that probability in the light of what it finds. As the computer goes through this process over and over for each word in the collection, it changes its assumptions about the distribution. In his book The Signal and the Noise, the statistician Nate Silver offers the example of estimating the chances that you are being cheated on, given that you have discovered a pair of underwear in your house that does not belong to your partner.2 To estimate those chances, you have to decide (or estimate) three conditions.

  1. What are the chances that the underwear are there because you are being cheated on (call this ‘y’)?
  2. What are the chances that the underwear are there because you are not being cheated on (call this ‘z’)?
  3. And what chance would you have estimated, before finding the underwear, that your partner was prone to cheating (call this ‘x’, the prior probability)?

The formula is:

xy / (xy + z(1 - x))

You can do the math for yourself. You can also feed your result back into the equation, changing your prior probability as new information comes to light.
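Expressed in R, the calculation (and the feedback loop) might look like this; the numbers below are purely illustrative, not Silver’s:

posterior <- function(x, y, z) (x * y) / (x * y + z * (1 - x))
p1 <- posterior(x = 0.04, y = 0.5, z = 0.05)  # a 4% prior belief in cheating
p1                                  # about 0.29: the discovery shifts the odds
posterior(x = p1, y = 0.5, z = 0.05)  # new evidence? feed p1 back in as the prior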

Is this what topic modeling does? Yes, in essence, though the maths are a bit more complicated than this. Underwood writes,

[…As we iterate our estimates, adjusting our probabilities, fitting words into topics, fitting topics across documents], a) words will gradually become more common in topics where they are already common. And also, b) topics will become more common in documents where they are already common. Thus our model will gradually become more consistent as topics focus on specific words and documents. But it can’t ever become perfectly consistent, because words and documents don’t line up in one-to-one fashion. […] the tendency for topics to concentrate on particular words and documents will eventually be limited by the actual, messy distribution of words across documents.

That’s how topic modeling works in practice. You assign words to topics randomly and then just keep improving the model, to make your guess more internally consistent, until the model reaches an equilibrium that is as consistent as the collection allows.3

There is a fundamental difficulty, however. When we began looking at the Gettysburg Address, Hollis was instructed to look for two topics that we had already named ‘war’ and ‘governance’. When the computer looks for two topics, it does not know beforehand that there are two topics present, let alone what they might mean in human terms. In fact, we as the investigators have to tell the computer ‘look for two topics in this corpus of material’, at which point the machine will duly find two topics. At the moment of writing, there is no easily applied method for automatically determining the ‘best’ number of topics in a corpus, although this will no doubt be resolved. For the time being, the investigator has to try out a number of different scenarios to find out what is best. This is not a bad thing, as it forces the investigator continually to confront (or even, close-read) the data, the model, and the patterns that might be emerging.

The late statistician George Box once wrote, “Essentially, all models are wrong, but some are useful.”4 A topic model is a way of fitting semantic meaning against a large volume of text. The researcher has to generate many models against the same corpus until she finds one that reaches Box’s utility.5 We create topic models not to prove that our idea about phenomenon x in the past is true, but rather to generate new ways of looking at our materials, to deform them. In fact, there is a danger in using topic models as historical evidence; they are configurable and ambiguous enough that, no matter what you are looking for, you just might find it. Remember, a topic model is in essence a statistical model that describes the way that topics are formed. It might not be the right model for your corpus. It is, however, a starting point, and the topics that it finds (or fails to find) should become a lens through which you look at your material, reading closely to understand this productive failure. Ideally, you would then re-run the model, tweaking it so that it better describes the kind of structure you believe exists. You generate a model to embody your instincts and beliefs about how the material you are working on was formed. Your model could represent ideas about syntax, or about the level of ‘token’ (n-grams of a particular length, for instance) that it is appropriate to model. Then you use the algorithm to discover that structure in your real collection. And then repeat.6

Installing MALLET

There is an irony, of course, in naming the major topic modeling toolkit after a hammer, with all the caveats about the entire world looking like a nail once you have it installed.7 Nevertheless, here we describe the most basic usage and how to get started with this tool.8 If a historian ran it on a series of political speeches, for example, the program would return a list of topics and the keywords composing those topics. Each of these lists is a topic according to the algorithm. Using the example of political speeches, the list might look like:

  1. Job Jobs Loss Unemployment Growth
  2. Economy Sector Economics Stock Banks
  3. Afghanistan War Troops Middle-East Taliban Terror
  4. Election Opponent Upcoming President
  5. … etc.

By examining the keywords, we can discern that the politician who gave the speeches was concerned with the economy, jobs, the Middle East, the upcoming election, and so on.

There are many dangers that face those who use topic modeling without fully understanding it.9 For instance, we might be interested in word use as a proxy for placement along a political spectrum. Topic modeling could certainly help with that, but we have to remember that the proxy is not in itself the thing we seek to understand – as Andrew Gelman demonstrates in his mock study of zombies using Google Trends.10

To install MALLET on any platform, please follow these steps:

  1. Go to the MALLET project page, and download MALLET. (As of this writing, we are working with version 2.0.7.)11 Unzip it in your home directory – that is, in C:\ on Windows, or the directory with your username on OS X (it should have a picture of a house).
  2. You will also need the Java Development Kit, or JDK – that is, not the regular Java that one finds on every computer, but the one that lets you program things.12 Install this on your computer.

For computers running OS X or Linux, you’re ready to go! For Windows systems, you have a few more steps:

  1. Unzip MALLET into your C: directory. This is important: it cannot be anywhere else. You will then have a directory called C:\mallet-2.0.7 or similar. For simplicity’s sake, rename this directory to just mallet.
  2. MALLET uses an environment variable to tell the computer where to find all the various components of its processes when it is running. It’s rather like a shortcut for the program. A programmer cannot know exactly where every user will install a program, so the programmer creates a variable in the code that will always stand in for that location. We tell the computer, once, where that location is by setting the environment variable. If you moved the program to a new location, you’d have to change the variable.

To create an environment variable in Windows 7, click on your Start Menu -> Control Panel -> System -> Advanced System Settings. Click new and type MALLET_HOME in the variable name box. It must be like this – all caps, with an underscore – since that is the shortcut that the programmer built into the program and all of its subroutines. Then type the exact path (location) of where you unzipped MALLET in the variable value box, e.g., c:\mallet.

MALLET is run from the command line, also known as a Command Prompt. If you remember MS-DOS, or have ever played with a Unix computer Terminal (or have seen ‘hackers’ represented in movies or on television shows), this will be familiar. The command line is where you can type commands directly, rather than clicking on icons and menus.

On Windows, click on your Start Menu -> All Programs -> Accessories -> Command Prompt. On a Mac, open up your Applications -> Utilities -> Terminal.

You’ll get the command prompt window, which will have a cursor at c:\user\user> on Windows, or ~ username$ on OS X.

On Windows, type cd .. (that is: cd-space-period-period) to change directory. Keep doing this until you’re at C:\. On OS X, type cd ~ and you’ll be brought to your home directory.

Then type:

cd mallet

and you will be in the MALLET directory. Anything you type in the command prompt window is a command. There are commands like cd (change directory), and to see all the files in a directory you can type dir (Windows) or ls (OS X). You have to tell the computer explicitly that ‘this is a MALLET command’ when you want to use MALLET. You do this by telling the computer to grab its instructions from the MALLET bin, a subfolder in MALLET that contains the core operating routines. Type:

bin\mallet on Windows, or

./bin/mallet on OS X

at the prompt. If all has gone well, you should be presented with a list of MALLET commands – congratulations! If you get an error message, check your typing. Did you use the wrong slash? Did you set up the environment variable correctly? Is MALLET located at C:\mallet ?

For more instructions on using MALLET from the command line, and on Mac OS X, please see our online tutorial at The Programming Historian. There are many options for fine-tuning your results, and one can build chains of commands as well (such as removing stopwords, or filtering numbers out or leaving them in).13 But sometimes, to stimulate thought, we might like to run the basic algorithm as a kind of brainstorming exercise. Wouldn’t it be easier – simpler, better suited to use with students – if there were a way of running this program by clicking on icons, navigating menus, and so on?

Topic Modeling with the GUI Topic Modeling Tool

Fortunately, there is! The trade-off is that you cannot use all of the options MALLET makes available for fine-tuning the topic model. On the other hand, the GUI Topic Modeling Tool (GTMT) is an excellent way to introduce topic modelling in classroom settings and other contexts where technical expertise may be limited (our experience is that it is a good entryway into simple topic modelling), or where you wish to quickly explore a body of materials. Because it is a Java-based program, it also has the advantage of being natively cross-platform: it has been tested and will work on Windows, OS X, and even Linux systems.

Available in a Google Code repository, the GTMT provides quick and easy topic model generation and navigation.14 Compared to the tools of the previous section, the GTMT is easy to install: with a working Java instance on any platform, simply download the program and double-click on its icon to run it. Java will open the program, which presents you with a menu interface (figure 4.2).

57 Leave a comment on paragraph 57 0 [*Insert Figure 4.2: The GUI Topic Modeling Tool]*

Imagine that we were interested in knowing how discourses around the commemoration of heritage sites played out across a city (so we want to know not just what the topics or discourses are, but also whether there are any spatial or temporal associations). The first part of such a project would be to topic model the text of the historical plaques. The reader may download the full text of 612 heritage plaques in Toronto as a zip file from themacroscope.org to follow along. Unzip that folder, and start the GTMT.

Select the data to be imported by clicking the “Select Input File or Dir” button, which allows you to pick an individual file or an entire directory of documents. Tell the system where you want your output to be generated (by default it will be wherever you have the GTMT installed – so beware if you are running it out of your “Applications” folder, as that folder can get a bit cluttered), set the number of topics, and then click “Learn Topics” to generate a topic model. The advanced settings are important as well: they let you remove stopwords, normalize text by standardizing case, and tweak the number of iterations, the size of topic descriptors, and the threshold at which you want to cut topics off (figure 4.3).

[Insert Figure 4.3: The GUI Topic Modeling Tool settings]

Let’s run it! When you click Learn Topics, you’ll see in the program’s console output a stream of text that looks remarkably similar to what you saw when you ran MALLET on the command line. Pay attention to what happens. You might notice, for example, that it finishes more quickly than before; in that case, you may need to fiddle with the number of iterations or other parameters.

In the directory you selected as your output, you will now have two folders: output_csv and output_html. Take a moment to explore them. In the former, you will see three files: DocsInTopics.csv, Topics_Words.csv, and TopicsInDocs.csv. The first is a big file, which you can open in a spreadsheet program. It is arranged by topic, and then by the relative rank of each file within each topic. For example, using our sample data, you might find:

topicId rank docId filename
1 1 5 184_Roxborough_Drive-info.txt
1 2 490 St_Josephs_College_School-info.txt
1 3 328 Moulton_College-info.txt
1 4 428 Ryerson_Polytechnical_Institute-info.txt

In the above, we see that topic number 1 has a declining order of relevant documents, which are probably about education: three of the plaques are obviously educational institutions. The first (184 Roxborough Drive) is the former home of Nancy Ruth, a feminist activist who helped found the Women’s Legal Education and Action Fund. Opening the Topics_Words.csv file confirms our suspicions: topic #1 is school, college, university, women, toronto, public, institute, opened, association, residence.
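These csv files can of course be read by more than a spreadsheet. A minimal sketch in R, assuming the file and column names shown above:

docs.in.topics <- read.csv("output_csv/DocsInTopics.csv")
head(subset(docs.in.topics, topicId == 1))  # the top-ranked plaques in topic 1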

GTMT shines best, however, when you explore the HTML output. This allows you to navigate all of the information in an easy-to-use interface. In the output_html folder, open up the file all_topics.html. It should open in your default browser. The results of our model are visualized below (figure 4.4):

[insert Figure 4.4: sample html output from the GUI Topic Modeling Tool]

This is similar to the Topics_Words.csv file, but with the difference that each topic can be explored further. If we click on the first topic, the school-related one, we see our top-ranked documents from before (figure 4.5).

[Insert Figure 4.5: top-ranked documents in a topic, generated html output from the GUI Topic Modeling Tool]

We can then click on each individual document: we get a snippet of the text, and also the various topics attached to each file (figure 4.6). Those are each individually hyperlinked as well, letting you explore the various topics and documents that comprise your model. If you have your own server space, or use a service like Dropbox, you can easily move all these files online so that others can explore your results. (Part two of our hypothetical study, the mapping of these documents over space, could be accomplished using online services like Google Fusion Tables or CartoDB.)

[insert Figure 4.6: looking inside a document, using the html output from the GUI Topic Modeling Tool]

From beginning to end, then, we can quickly move through all the stages of a topic model within a useful graphical user interface. While the tool can be limiting, and we prefer the versatility of the command line, it is an essential and useful component of our topic modelling toolkit.

Topic Modeling with R

Working with MALLET from the command line takes some getting used to, and its output is somewhat difficult to work with, as the user has to load it into Excel or another spreadsheet to manipulate it. One thing we might like to do with the output is create a table with our documents down the side and our topics arranged in order across the top, the percentage composition filling out the cells. Once we had the data arranged this way, we could work out which documents are correlated with which other documents, or which topics are correlated with which topics (the better to create network visualizations with, for instance). Such a matrix is not natively output by MALLET and, with larger datasets, can be quite time-consuming to create. It is possible to create a macro or a script in Excel that could do the work for you. However, we would recommend that you try the R language for statistical computing, which we briefly encountered in Chapter Two.

“Oh no! Do I have to learn a new programming language?” It wouldn’t be a bad idea… R is quite popular, it is widely supported, and people are constantly creating new ‘packages’ (extensions or add-ons) that bundle together tools you might wish to use. As your experience and comfort level with R grows, you will be able to do much more complicated analyses.

To begin with, download R as well as RStudio and install these two programs.15 You can download R from http://cran.rstudio.com (select the appropriate version for your operating system), and then RStudio from http://www.rstudio.com/products/rstudio/download/ (make sure to select the free, open-source version). Both have standard installation processes that you can run like any other piece of software.

RStudio uses a graphical interface that allows you to create ‘workspaces’ for your project. You load your data into the workspace, run your commands, and keep your output in this workspace (all without altering your original data). The ability to save and reload workspaces allows you to pick up where you left off.

We do not intend this section to be a full-blown introduction to R and the R environment; there are a number of excellent online tutorials available to get you started. Paul Torfs and Claudia Brauer have a very good general introduction to R.16 Fred Gibbs has a tutorial on computing document similarity with R that we highly recommend.17 Ben Marwick also has a good tutorial showing some of the ins and outs of using R, programmed in R itself.18

One thing to note is that, like Marwick, many people share their ‘scripts’ for different tasks via GitHub. To see this in action, navigate to Marwick’s tutorial (https://gist.github.com/benmarwick/5403048) and click on the ‘download gist’ button. You will receive a zipped folder containing a file called ‘short-intro-R.R’. Unzip that file to your desktop. Then, in the R environment, select ‘file’ > ‘open script’, browse to ‘short-intro-R.R’, and select it. Our aim here is to show you how to run other people’s programs, and how to make minor changes yourself.

A new window will open containing the script. Within that script, any line beginning with a hash character (#) is a comment; R ignores comment lines when running code. You can run every command in the script at once, or run each line one at a time to see what happens. To run a line, place the cursor at the beginning of it, and hit ctrl + r on Windows, or Command + Enter on OS X. You’ll see the line appear in the main console window. Since the first line in Marwick’s script begins with a # (meaning, a comment), the line is copied over and nothing else happens. Keep hitting ctrl + r / Command + Enter until the line

2 + 2

is copied into the console. R will return, directly underneath:

4

As you proceed through Marwick’s script, you’ll see other ways of dealing with data. In line 22, you create a variable called ‘a’ and give it the value 2; in line 23 you create a variable called ‘b’ and give it the value 3. Line 24 has you add ‘a + b’, which will return ‘5’.
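In other words, the script walks you through lines like these:

a <- 2    # line 22: assign the value 2 to 'a'
b <- 3    # line 23: assign the value 3 to 'b'
a + b     # line 24: R returns 5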

Sometimes, you will see a line like this:

library(igraph)

This line tells R to use a particular package that provides additional tools and algorithms to manipulate your data. If that package is not installed, you will receive an error message. If that happens, you can tell R to install the package quite easily:

install.packages("igraph")

R will ask you which download site you wish to use (which it calls a ‘mirror’); select one that is geographically close to you for the fastest download. These mirrors are repositories that contain the latest versions of all the packages.

Mimno’s MALLET Wrapper in R

David Mimno has written a wrapper for MALLET in R. Mimno’s wrapper installs MALLET directly inside R, allowing greater speed and efficiency, as well as turning R’s full strength to the analysis and visualization of the resulting data. You do not have to have the command-line version of MALLET installed on your machine already to use Mimno’s wrapper: the wrapper is MALLET, or at least, the part that does the topic modeling!19 (The command-line version of MALLET can do many other things besides topic modeling.) Since MALLET is written in Java, the wrapper will install the rJava package to enable R to run it.

To use the MALLET wrapper in R, one simply types (remember: in R, any line beginning with # is a comment and does not execute):

# the first time you wish to use it, you must install:

install.packages("mallet")

# R will then ask you which ‘mirror’ (repository) you wish to install from. Select one that is close to you.

# any subsequent time, after you’ve installed it:

require(mallet)

(It may happen, for Mac users, that the install command does not work, giving you a cryptic error message: file ‘/var/folders/7r/_6pf946x0wd3m30wjgwhvjd80000gp/T/ /Rtmp3CAuAH/downloaded_packages/mallet_1.0.tgz’ is not an OS X binary package

If this happens, go to http://cran.r-project.org/web/packages/mallet/index.html and look for the ‘package source’ under downloads. Download that file. To install it in R, open a terminal window (in your applications, utilities folder), go to your Downloads folder (type cd Downloads) and enter this command:

sudo R CMD INSTALL mallet_1.0.tar.gz

It will ask you for your computer password; enter it and the package will install. You should now be able to follow the rest of this section. In your R console, type require(mallet) and if you receive no error message, then you’re ready to use MALLET!)

Now the whole suite of commands and parameters is available to you. A short demonstration script that uses the example data bundled with MALLET (you downloaded that data earlier in this chapter) can be found at http://j.mp/mimnowrapperexample, or by visiting https://gist.github.com/shawngraham and searching for ‘mimnowrapperexample.r’. (The full manual for the wrapper may be found at http://cran.r-project.org/web/packages/mallet/mallet.pdf; our example is based on Mimno’s.) Open it up.

Let’s look in more detail at how we build a topic model using this script.

documents <- mallet.read.dir("mallet-2.0.7/sample-data/web/en/")

This line creates a variable called ‘documents’ containing the path to the documents you wish to analyze. On a Windows machine, you would include the full path, i.e., "C:/mallet-2.0.7/sample-data/web/en/" (R accepts forward slashes in Windows paths). On OS X, if you unzipped MALLET into your home directory as described earlier, it will work out of the box. In that directory, each document is its own unique text file. Now we need to import those documents into MALLET. We do that by running this command:

mallet.instances <- mallet.import(documents$id, documents$text, "mallet-2.0.7/stoplists/en.txt", token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

It looks complicated, but the mallet.import function creates an instance list: it lists every document by its id with its associated text, uses a stoplist to remove the common stopwords, and uses a regular expression to keep all sequences of Unicode characters. It is worth asking yourself: is the default stoplist provided by MALLET appropriate for my texts? Are there words that should be added or removed? You can create your own stopword list in any text editor by opening the default one, adding or deleting as appropriate, and then saving it under a new name with the .txt extension.
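A minimal sketch of that editing process, done in R itself rather than a text editor (the added words and the new file name here are our own examples):

stops <- readLines("mallet-2.0.7/stoplists/en.txt")
stops <- c(stops, "colledge", "clowdy")   # add corpus-specific noise words
writeLines(stops, "en-custom.txt")        # then point mallet.import at this file

The next step is to create an empty container for our topic model: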

n.topics <- 30

topic.model <- MalletLDA(n.topics)

We created a variable called ‘n.topics’. If you re-ran your analysis to explore a greater or lesser number of topics, you would only have to change this one line to the number you wished. Now we can load the container up with our documents:

topic.model$loadDocuments(mallet.instances)

At this point, you can begin to explore patterns of word use in your documents, if you wish, by finding out the vocabulary and word frequencies of the corpus:

vocabulary <- topic.model$getVocabulary()

word.freqs <- mallet.word.freqs(topic.model)

If you now type

length(vocabulary)

…you will receive a number; this is the number of unique words in your corpus. You can inspect the top 100 words thus:

vocabulary[1:100]

As Mimno notes in the comments to his code, this information can be useful for customizing your stoplist. Jockers also shows us how to explore the distribution of those words using the ‘head’ command (which returns the first few rows of a data matrix or data frame):

head(word.freqs)

You will be presented with a table with words down the side and two columns: term.freq and doc.freq. These tell you the number of times each word appears in the corpus, and the number of documents in which it appears.
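Words with a very high doc.freq are good candidates for your custom stoplist. For instance (a sketch, using the data frame just created):

head(word.freqs[order(word.freqs$doc.freq, decreasing = TRUE), ], 20)  # 20 most widespread words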

The script now sets the optimization parameters for the topic model. In essence, you can tune the model.20 This line sets the ‘hyperparameters’:

topic.model$setAlphaOptimization(20, 50)

You can play with these to see what happens, or you can choose to leave this line out and accept MALLET’s defaults. The next two lines generate the topic model:

topic.model$train(200)

topic.model$maximize(10)

The first line tells MALLET how many rounds, or iterations, to process; more can sometimes lead to ‘better’ topics and clusters, though Jockers reports that quality increases with the number of iterations only up to a point before plateauing.21 When you run these commands, output will scroll by as the algorithm iterates. The algorithm also periodically reports the model’s log-likelihood, a measure of how well the model fits the collection; the closer this number is to zero, the better.

Now we want to examine the results of the topic model. These lines take the raw output and convert it to probabilities:

doc.topics <- mallet.doc.topics(topic.model, smoothed=T, normalized=T)

topic.words <- mallet.topic.words(topic.model, smoothed=T, normalized=T)
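At this stage you can already peek at the topics themselves. The mallet package provides a helper for this; for example, to see the ten most probable words in the first topic:

mallet.top.words(topic.model, topic.words[1, ], 10)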

One last bit of transformation will give us a spreadsheet with topics down the side and documents across the top (compare this with the ‘native’ output of MALLET from the command line):

topic.docs <- t(doc.topics)

topic.docs <- topic.docs / rowSums(topic.docs)

write.csv(topic.docs, "topics-docs.csv")

This script will not work ‘out of the box’ the first time, because your files and our files might not necessarily be in the same locations. To use it successfully, you will need to change certain lines to point to the appropriate locations on your own machine (whether it is a Windows, Mac, or Linux machine). Study the example carefully: do you see which lines need to be changed to point to your own data?
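Once the script runs, the document-topic matrix also delivers the correlations we wished for at the start of this section, with no Excel macros required. A sketch, using R’s built-in cor() function:

topic.cor <- cor(t(topic.docs))   # topic-to-topic correlations across documents
doc.cor <- cor(topic.docs)        # document-to-document correlations across topics
round(topic.cor[1:3, 1:3], 2)     # inspect a corner of the matrix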

This is a good starting point for your future work with R! If you want to try it out on your own data, change the directory with documents on line 21. There are also a couple of ways you might try visualizing these results within R. Consider the following:

# create data.frame with columns as documents and rows as topics

topic_docs <- data.frame(topic.docs)

names(topic_docs) <- documents$id

## label each topic with its three most probable words (this line is not in the
## original snippet, but the plot() call below needs topics.labels defined)
topics.labels <- mallet.topic.labels(topic.model, topic.words, 3)

## cluster based on shared words

plot(hclust(dist(topic.words)), labels=topics.labels)

You can add these lines at the end of the sample script found at http://j.mp/mimnowrapperexample. When you run them, R will plot how your topics cluster based on the similarity of their word use! Imagine trying to perform such a visualization in MS Excel. For a few other examples of potential visualizations (including a network of documents joined by positively correlated topics), look at lines 67 – 126 in this script: https://github.com/shawngraham/R/blob/master/topicmodel.R. Try adapting them to work with the sample script we’ve been discussing.

One final aspect of working with MALLET in R is the amount of memory available for rJava to employ. The default amount that rJava uses is 512 mb, which can often be too little, especially when topic modeling. You can increase the allocation by setting this parameter before rJava is loaded (remember, when you install and call MALLET with the library or require command, MALLET in turn installs and calls rJava):

options(java.parameters = "-Xmx5120m")

Put this line at the beginning of your script. As written above, it increases the heap size available to rJava to 5 gb (there are 1024 mb in a gb, not 1000, as some websites would have you believe!). The more working memory your machine has available, the quicker your analyses will run.

R is an extremely powerful programming environment for analyzing the kinds of data that historians encounter. Numerous online tutorials for visualizing and working with data exist; we would suggest Matthew Jockers’ Text Analysis with R for Students of Literature (http://link.springer.com/book/10.1007/978-3-319-03164-4) as your next port of call if you wish to explore R’s potential further.

Topic Modeling with the Stanford Topic Modeling Toolbox

Different tools give different results, even if they are all still ‘topic modeling’ writ large. This is a useful way to further understand how these algorithms shape our research, and a useful reminder to always be up front about the tools you are using. One tool historians should become familiar with is the Stanford Topic Modeling Toolbox (STMT), because its outputs make it easy to see how topics play out over time (it ‘slices’ its output by the dates associated with each document).22 The STMT allows us to track the evolution of topics within a corpus at a much finer level – that of each individual entry. The STMT is written in Java, and requires the latest version of Java to be installed on your machine (available for free from https://www.java.com/en/download/). You can find the toolbox at http://nlp.stanford.edu/software/tmt/tmt-0.4/.

Return, for a moment, to one of the online pages showing John Adams’ diary entries (for example, http://www.masshist.org/digitaladams/archive/doc?id=D0). Right-click on the page in your web browser, select ‘view source’ from the pop-up menu, and examine the tags used to mark up the html. Every so often, you will see something like:

<a id="D0.17530608"></a><div class="entry"> <div class="head">HARVARD <span title="College">COLLEDGE</span> JUNE 8TH. 1753.</div> <div class="dateline"><span title="1753-06-08">8 FRIDAY.</span></div> <div class="p">At <span title="College">Colledge</span>.

Using the scraper tool in Outwit Hub, we can grab everything that falls within the "entry" class. We want to make sure we get the dates first. In this case, we want the computer to find "1753-06-08" and grab it in a separate column. We see that every time a date appears, it is preceded by <div class="dateline"><span title=" and followed by ">.

To do this, let’s refresh how to use Outwit Hub. Load the program, and enter the URL http://www.masshist.org/digitaladams/archive/doc?id=D0 into Outwit’s navigation bar. Hit enter to be brought to the page. Now click ‘scrapers’ in the left-hand column, and then ‘new’ below. Our goal is to set up a scraper that produces two columns: one with dates, and the other with entries.

In the ‘scrapers’ section, we begin by double-clicking on empty fields in the table below, which allows us to enter information (figure 4.7). Under ‘description’ we type a descriptive label (so we can remember what is supposed to be scraped there) – here, either ‘date’ or ‘entry’. We then peruse the HTML to identify the relevant tag or ‘marker’ before the information we want, and the marker after it. Remember, from above, what we had before the date we want – 1753-06-08 – and what we have after it. Fill out the form like so:

description marker before marker after
Date <div class="dateline"><span title=" ">

Now it is time to grab the actual text of the diary entry. This is a bit more messy. When we look at the HTML around entries, we see the following:

<a id="D0.17530608"></a><div class="entry"> <div class="head">HARVARD <span title="College">COLLEDGE</span> JUNE 8TH. 1753.</div> <div class="dateline"><span title="1753-06-08">8 FRIDAY.</span></div> <div class="p">At <span title="College">Colledge</span>. … </div> </div>

(We added the ellipses there!) There is a lot of text cut out here, but in general, entries begin with the code <div class="entry"> and end with two </div> </div> tags (with two spaces between them). In Outwit Hub, then, we put the following as the next row in our scraper:

entry class="entry"> </div> </div>

This tells the scraper to grab everything that starts with the entry marker and ends with the pair of closing </div> tags. Don’t worry, you’ll see what we mean below.

[insert Figure 4.7: A scraper using Outwit Hub]

The resulting output will look like this, opening in a new panel at the bottom of the screen when we hit the ‘execute’ button and then the ‘catch’ button:

1753-06-08 At Colledge. A Clowdy Dull morning and so continued till about 5 a Clock when it began to rain moderately But continued not long But remained Clowdy all night in which night I watched with Powers.
1753-06-09 At Colledge the weather still remaining Clowdy all Day till 6 o’Clock when the Clowds were Dissipated and the sun brake forth in all his glory.
1753-06-10 At Colledge a clear morning. Heard Mr. Appleton expound those words in I.Cor.12 Chapt. 7 first verses and in the afternoon heard him preach from those words in 26 of Mathew 41 verse watch and pray that ye enter not into temptation

Perfect! We have a spreadsheet of that first online page of John Adams’ diary, thanks to the magic of Outwit Hub! The paid version of Outwit can crawl through the entire website automatically, exploring all subpages, matching your scraper against each page it finds, and saving the data to a single spreadsheet file. Using the free version, you can manually page through the site, getting Outwit to add the new data to your spreadsheet. Beside the ‘catch’ button there is a drop-down arrow; set this to ‘auto-catch’. Then, in the URL at the top of the screen, which ends with /doc?id=D0, you can page through the site by changing the 0 to 1, then to 2, then to 3… and so on. Each time the page loads, Outwit will automatically apply the scraper and paste the data into your spreadsheet. You can then click on the ‘export’ button at the bottom of the screen (the drop-down arrow beside export will let you choose your desired format). The free version is limited to 100 rows of data; when you push up against that barrier, you can hit the ‘empty’ button (after exporting, of course!) and continue onwards, using a new file name.

After exporting, open your file in your spreadsheet of choice (when you have more than one file, you can copy one and paste it into the other). Delete the first row (the one that reads URL, Date, and Entry). We then need to insert a column at the beginning of the data (in most spreadsheet programs, click ‘add column’ in the ‘Insert’ menu), so that each diary entry gets its own unique record number (you can do this either by manually inputting numbers, or by creating a quick formula in the first cell and pasting it into the remaining cells)23:

1 1753-06-08 At Colledge. A Clowdy Dull morning and so continued till about 5 a Clock when it began to rain moderately But continued not long But remained Clowdy all night in which night I watched with Powers.
2 1753-06-09 At Colledge the weather still remaining Clowdy all Day till 6 o’Clock when the Clowds were Dissipated and the sun brake forth in all his glory.
3 1753-06-10 At Colledge a clear morning. Heard Mr. Appleton expound those words in I.Cor.12 Chapt. 7 first verses and in the afternoon heard him preach from those words in 26 of Mathew 41 verse watch and pray that ye enter not into temptation

We save this as ‘johnadamsscrape.csv’. The STMT expects the data to be structured in three (or more) columns: a record id, a date, and the text itself (it could of course be structured differently, but for our purposes here, this suffices). With our data extracted, we turn to the toolbox. Installing the STMT is a matter of downloading and unzipping it, then double-clicking the file named tmt-0.4.0.jar, which brings up this interface (figure 4.8):

[insert Figure 4.8: Stanford Topic Modeling Toolbox]

The STMT operates by running various scripts the user creates.24 This allows a lot of flexibility, but it can also seem daunting to the first-time user. It is not as scary as it might first seem, however. The STMT uses scripts written in the Scala language, and for our purposes we can simply modify the sample scripts provided by the STMT team.25 Just scroll down the download page (http://nlp.stanford.edu/software/tmt/tmt-0.4/) to see them listed.

The four scripts provided with the program show how to build a workflow of several different operations in a single script. Scripts 1 and 2 show how to load data, and then how to load data and tokenize it (turn it into the chunks from which the topic model will be built). Script 3 loads the data, tokenizes it, and then runs it through a topic model. Script 4 loads the output of script 3 and then slices it.

Download the scripts, making sure to save them with the .scala file extension, in the same folder as your data csv file.

Example 2 from the STMT website trains a topic model (http://nlp.stanford.edu/software/tmt/tmt-0.3/examples/example-2-lda-learn.scala). Open this script using a text editor such as Notepad++ or TextWrangler, which will automatically provide line numbers (quite handy when examining code).26 The critical line is line #15:

15 val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1);

This line tells the STMT where to find your data, and that the first column is the unique ID number. Change the example file name to whatever you called your data file (in our case, "johnadamsscrape.csv").

The next line to examine is line #26, in this block (in this language, comments are noted by two forward slashes, //, explaining what each line does):

24 val text = {
25   source ~>                            // read from the source file
26   Column(3) ~>                         // select column containing text
27   TokenizeWith(tokenizer) ~>           // tokenize with tokenizer above
28   TermCounter() ~>                     // collect counts (needed below)
29   TermMinimumDocumentCountFilter(4) ~> // filter terms in <4 docs
30   TermDynamicStopListFilter(60) ~>     // filter out 60 most common terms
31   DocumentMinimumLengthFilter(5)       // take only docs with >=5 terms
32 }

That is, the line stating ‘Column(3)’. This tells the STMT to look in the third column for the actual text we wish to topic model. The block also extracts and filters out rare and very common words that might create noise; whether or not 60 is an appropriate number of common terms to remove for this corpus is something with which we should experiment.

Finally, you may wish to examine lines 38 and 39:

val params = LDAModelParams(numTopics = 30, dataset = dataset,
  topicSmoothing = 0.01, termSmoothing = 0.01);

Changing numTopics allows you to change, well, the number of topics fitted by the topic model. Save your script (we used the name ‘johnadams-topicmodel.scala’). In the STMT interface, select File >> Open script, and select your script. Your STMT interface will now look like figure 4.9:

172 Leave a comment on paragraph 172 0 [Insert Figure 4.9 Configuring the Stanford Topic Modeling Toolbox]

173 Leave a comment on paragraph 173 0 We find it useful to increase the memory available to the STMT for its calculations by changing the number in the Memory box in the interface (the default is 256 mb; type in 1024 or another multiple of 256, as appropriate to your machine’s available memory). Otherwise, if we feed too much text into the system, we can run out of memory!

174 Leave a comment on paragraph 174 0 Now click ‘run’, and you should soon have a new folder in your directory named something along the lines of “lda-afbfe5c4-30-4f47b13a”. Note that the algorithm for generating the topics is our old friend, LDA, as indicated in the folder name (the lda part of that long string).

175 Leave a comment on paragraph 175 0 Open this folder. There are many subfolders, each one corresponding to a progressive iteration of the topic model. Open the last one, labeled ‘01000’. There are a variety of .txt files; you will want to examine the one called ‘summary.txt’. The information within that file is arranged by topic and descending importance of words within each topic:

176 Leave a comment on paragraph 176 0 Topic 00        454.55948070466843
company    24.665848270843448
came     21.932877059649453
judge     14.170879738912884
who     13.427171229569737
sir     10.826043463533079

177 Leave a comment on paragraph 177 0 If you return to the folder, there is also a csv file indicating the distribution of topics over each of the diary entries, which can be visualized or explored further in a variety of ways. One capability in particular worth looking at in more detail is STMT’s ability to ‘slice’ topics by time.

178 Leave a comment on paragraph 178 0 Slicing a topic model

179 Leave a comment on paragraph 179 0 The Stanford Topic Modeling Toolbox allows us to ‘slice’ a model to show its evolution over time, and its contribution to the corpus as a whole. That is, we can look at the proportion of a topic at a particular point in time. To achieve a visualization comparing two topics (say) over the entire duration of John Adams’ diary entries, one has to create a ‘pivot table report’ (a summary of your data in aggregate categories, in Microsoft Excel or a similar spreadsheet). The exact steps will differ based on which version of the spreadsheet you have installed; more on that in a moment.

180 Leave a comment on paragraph 180 0 In the code for script 4 (http://nlp.stanford.edu/software/tmt/tmt-0.3/examples/example-4-lda-slice.scala), pay attention to line 16:

181 Leave a comment on paragraph 181 0 16 val modelPath = file("lda-59ea15c7-30-75faccf7");

182 Leave a comment on paragraph 182 0 Remember seeing a bunch of letters and numbers that looked like that before? Make sure the file name within the quotation marks matches the name of the output folder you created previously. In line 24, make sure that you have the original csv file name inserted. In line 28, make sure that the column indicated is the column with your text in it, i.e. Column 3. Line 36 is the key line for ‘slicing’ the topic model by date:

183 Leave a comment on paragraph 183 0 36 val slice = source ~> Column(2);

184 Leave a comment on paragraph 184 0 37 // could be multiple columns with: source ~> Columns(2,7,8)

185 Leave a comment on paragraph 185 0 Thus, in our data, column 2 contains the year-month-day date for the diary entry (whereas column 3 has the entry itself; check to make sure that your data is arranged the same way). One could instead have three separate columns for year, month, and day – or indeed use whatever other slicing criterion you wish.

186 Leave a comment on paragraph 186 0 Once you load and run that script, you will end up with a .csv file; its location will be noted in the TMT window (for us, for example, it was lda-8bbb972c-30-28de11e5/johnadams2-sliced-top-terms.csv). It looks something like this:

Topic Group ID Documents Words
Topic 00 1753-06-08 0.047680088 0.667521
Topic 00 1753-06-09 2.79E-05 2.23E-04
Topic 00 1753-06-10 0.999618435 12.99504
Topic 00 1753-06-11 1.62E-04 0.001781
Topic 00 1753-06-12 0.001597659 0.007988

187 Leave a comment on paragraph 187 0 …and so on for every topic, for every document (‘Group ID’), in your corpus. The numbers under ‘Documents’ and ‘Words’ will be decimals because, in the current version of the topic modeling toolbox, each word in the corpus is not assigned to a single topic, but is rather distributed over topics (i.e. ‘cow’ might account for .001 – or 0.1% – of topic 4, but .23 – or 23% – of topic 11). Similarly, the ‘Documents’ number indicates the total number of documents associated with the topic (again, as a distribution). Creating a pivot table report will allow us to take these individual slices and aggregate them in interesting ways, to see the evolution of patterns over time in our corpus.

188 Leave a comment on paragraph 188 0 To create a pivot table (see figures 4.10 – 4.15):

  1. Highlight all of the data on the page, including the column header (figure 4.10).
  2. Select ‘pivot table’ (under the ‘data’ menu option). The pivot table wizard will open. Arrange ‘topic’ under ‘column labels’, ‘Group ID’ under row labels, and under ‘values’ select either documents or words. Under ‘values’, select the ‘i’ and make sure that the value being represented is ‘sum’ rather than ‘count’ (figure 4.11).

190 Leave a comment on paragraph 190 0 You’ve now got a table as in Figure 4.12 that sums up how much the various topics contribute to each document. Let us now visualize the trends using simple line charts.

191 Leave a comment on paragraph 191 0 3. Highlight two columns: try ‘row labels’ and ‘topic 00’. Click charts, then line chart. You now have a visualization of topic 00 over time (figure 4.13).

192 Leave a comment on paragraph 192 0 4. To compare various topics over time, click the drop down arrow beside column labels, and select the topics you wish to visualize. You may have to first unselect ‘select all,’ and then click a few. For example, topics 04, 07, and 11 as in figure 4.14.

193 Leave a comment on paragraph 193 0 5. Your table will repopulate with just those topics displayed. Highlight row labels and the columns (don’t highlight the ‘grand totals’). Select line chart – and you can now see the fine-grained evolution over time of the various topics within the documents, as in figure 4.15.

194 Leave a comment on paragraph 194 0 [insert figures 4.10 – 4.15]
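If you would rather avoid the spreadsheet wizard altogether, the same aggregation can be done in a few lines of R. The following is a minimal sketch rather than our canonical workflow: it assumes your sliced output is in a file called johnadams2-sliced-top-terms.csv with the four columns shown above, and that the topic labels match those of your own run.

# read the sliced output from the STMT (adjust the file name for your own run)
sliced <- read.csv("johnadams2-sliced-top-terms.csv", stringsAsFactors = FALSE)
names(sliced) <- c("topic", "date", "documents", "words")

# sum the word weights for each topic within each time slice --
# this is exactly what the pivot table does for us
pivot <- xtabs(words ~ date + topic, data = sliced)

# plot one topic over time, then overlay a second for comparison
plot(pivot[, "Topic 00"], type = "l", xlab = "time slice", ylab = "topic weight")
lines(pivot[, "Topic 15"], lty = 2)

Because xtabs aggregates by summing, this reproduces the ‘sum rather than count’ choice we made in the pivot table wizard.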

195 Leave a comment on paragraph 195 0 Let’s look at our example topic model again. In our data, the word ‘congress’ appears in three different topics:

196 Leave a comment on paragraph 196 0 Topic 10 498.75611573955666
town 16.69292048824643
miles 13.89718543152377
tavern 12.93903988493706
through 9.802415532979191
place 9.276480769212077
round 9.048239826246022
number 8.488670753159125
passed 7.799961749511179
north 7.484974235702917
each 6.744740259678034
captn 6.605002560249323
coll 6.504975477980229
back 6.347642624820879
common 6.272711370743526
congress 6.1549912410407135
side 6.058441654893633
village 5.981146620989283
dozen 5.963423616121272
park 5.898152600754463
salem 5.864463108247379

217 Leave a comment on paragraph 217 0 Topic 15 377.279139869195
should 14.714242395918141
may 11.427645785723927
being 11.309756818192291
congress 10.652337301569547
children 8.983289013109097
son 8.449087061231712
well 8.09746455155195
first 7.432256959926409
good 7.309576510891309
america 7.213459745318859
shall 6.9669200007792345
thus 6.941222002768462
state 6.830011194555543
private 6.688248638768475
states 6.546277272369566
navy 5.9781329069165015
must 5.509903082873842
news 5.462992821996899
future 5.105010412312934
present 4.907616840233855

238 Leave a comment on paragraph 238 0 Topic 18 385.6024287288036
french 18.243384948219443
written 15.919785193963612
minister 12.110373497509345
available 10.615420801791679
some 9.903407524395778
who 9.245823795980353
made 8.445444930945051
congress 8.043713670428902
other 7.923965049197159
character 7.1039611800997005
king 7.048852185761656
english 6.856574786621914
governor 6.762114646057875
full 6.520903036682074
heard 6.255137288426042
formed 5.870660807641354
books 5.837244336904303
asked 5.83306916947137
send 5.810249556108117
between 5.776470078486788

259 Leave a comment on paragraph 259 0 These numbers give a sense of the general overall weight or importance of these words to the topic and to the corpus as a whole. Topic 10 seems to surround a discourse concerning local governance, while Topic 15 seems to be about ideas of what governance, at a national scale, ought to be, and Topic 18 concerns what is actually happening in terms of the nation’s governance. Thus, we might want to explore how these three topics play out against each other over time, to get a sense of how Adams’ differing scales of ‘governance’ discourse shift. Accordingly, we select topics 10, 15, and 18 from the drop down menu. The chart updates automatically, plotting the composition of the corpus with these three topics over time (Figure 4.16).

260 Leave a comment on paragraph 260 0 [insert Figure 4.16 Topics 10, 15, 18 over time in John Adams’ Diaries.]

261 Leave a comment on paragraph 261 0 The chart is a bit difficult to read, however. We can see a spike in Adams’ ‘theoretical’ musings on governance in 1774, followed by a rapid spike in his ‘realpolitik’ writings. It would be nice to be able to zoom in on a particular period. Using the dropdown arrow under the dates column, we can select the relevant date range. We could also achieve a dynamic time visualization by copying and pasting the entire pivot table (filtered for our three topics) into a new Google spreadsheet. At http://j.mp/ja-3-topics there is a publicly available spreadsheet that does just that (Figure 4.17 is a screen shot). We copied and pasted the filtered pivot table into a blank sheet. Then we clicked on the ‘insert chart’ button on the toolbar. Google recognized the data as having a date column, and automatically selected a scrollable/zoomable time series chart. At the bottom of that chart, we can simply drag the time slider to bracket the period that we are interested in.

262 Leave a comment on paragraph 262 0 [insert Figure 4.17 Using Google Sheets to visualize time slices of documents and topics in John Adams’ Diaries ]

263 Leave a comment on paragraph 263 0 The ability to slice our topic model into time chunks is perhaps, for the historian, the greatest attraction of the Stanford tool. That it also accepts input from a csv file, which we can generate from scraping online sources, is another important feature of the tool.

264 Leave a comment on paragraph 264 0 Working with scripts can be daunting at first, but the learning curve is worth the power that they bring us! We would suggest keeping an unchanged copy of the example scripts from the STMT website in their own folder. Then, copy and paste them into a unique folder for each dataset you will work with. Edit them in Notepad++ (or similar) to point to the particular dataset you wish to work with. Keep your data and scripts together, preferably in a Dropbox folder as a backup strategy. Folding Google spreadsheets into your workflow is also handy, especially if you plan on sharing or writing about your work online.

265 Leave a comment on paragraph 265 0 A Proviso Concerning Using Other People’s Tools

266 Leave a comment on paragraph 266 0 Tools and platforms are changing all the time, so there’s an awful lot of work for developers to make sure they’re always working. One tool that we quite like is called Paper Machines (http://papermachines.org), and it holds a lot of promise since it allows the user to data mine collections of materials kept within the Zotero reference manager. It was built by Jo Guldi and Chris Johnson-Roberson.27 It is a plugin that you install within Zotero; once enabled, a series of contextual commands, including topic modeling, become available when you right-click on a collection within your Zotero library. At the time of writing it can be quite fiddly to use, and much will depend on how your system is configured. However, it’s only at version 0.4, meaning it’s really not much more than a prototype. It will be something to keep an eye on. If you visit our draft website at http://www.themacroscope.org/?page_id=60 you can see the kinds of things which Paper Machines aspires to do.

267 Leave a comment on paragraph 267 0 We mention this here to highlight the speed with which the digital landscape of tools can change. When we initially wrote about Paper Machines, we were able to topic model and visualize John Adams’ diaries, scraping the page itself using Zotero. When we revisited that workflow a few months later, given changes that we had made to our own machines (updating software, moving folders around, and so on) and changes to the underlying html of the John Adams Diaries website, our workflow no longer worked! Working with digital tools can sometimes make it necessary to not update your software! Rather, keep in mind which tools work with which versions of other supporting pieces. Archive a copy of the tool that works for you and your set-up, and keep notes on the conditions under which the software works. Open source software can be ‘forked’ (copied) on github with a single click. This is a habit we should all get into (not only does it give us the archived software, but it also provides a kind of citation network demonstrating the impact of that software over time). In digital work, careful documentation of what works (and under what conditions) can be crucial to ensuring the reproducibility of your research. Historians are not accustomed to thinking about reproducibility, but it will become an issue.

268 Leave a comment on paragraph 268 0 A worked out example: Reading the lives of 8000 Canadians

269 Leave a comment on paragraph 269 0 As you read this section, please feel free to consult the online visualizations of the results of this analysis, at ‘8000 Canadians’ (http://themacroscope.org/interactive/dcbnet/) and ‘Topics connecting the lives of 8000 Canadians’ (http://themacroscope.org/interactive/dcbtopicnet/). How appropriate is a networked visualization? Does it help us understand, or does it obfuscate, the patterns?

270 Leave a comment on paragraph 270 0 In this section, we show a use-case of topic modeling for exploring the historiography of the Dictionary of Canadian Biography (DCB). P.B. Waite expressed it best in 1995, when confronted by the task of obtaining an overview of the DCB:

271 Leave a comment on paragraph 271 0 Reading thirteen volumes from cover to cover is not lightly to be essayed. This author tried it for eight volumes; then, in the middle of volume IX came volume XIII in the mail. The sheer size of volume XIII broke the resolution that had stood staunch until then. It occurred to this author, as it might well have done before, that there must be other ways of enjoying and assessing the quality and range of the Dictionary of Canadian Biography than the straight route he had chosen.28

272 Leave a comment on paragraph 272 0 There is a way! We can enjoy and assess the DCB through topic modeling. The DCB contains scholarly biographical essays in both French and English on over 8,000 Canadians (and sometimes, non-Canadians whose exploits are of significant interest to Canada, including the vikings Eric the Red and his son, Leif Ericsson). Its foundation was in the bequest of James Nicholson, who in 1952 left his estate to the University of Toronto for the purpose of creating ‘a biographical reference work for Canada of truly national importance’. Today, it is hosted and curated by the University of Toronto and the Université Laval, with the support of the Government of Canada.

273 Leave a comment on paragraph 273 0 Every biographical essay is available online.29 This makes it an excellent candidate as an example dataset for demonstrating a big history approach. What patterns will we see in this corpus of historical writing? The essays have been written over a period of 50 years, and so they span a number of different fashions in historiography, and are the work of hundreds of scholars. If there are global patterns in Canadian History writ large, then an examination of the DCB might reveal them. As a series of biographical sketches, the nature of what we might extract is naturally constrained by the dictates of the form. For instance, we note later in this section the relative paucity of individuals from First Nations, but as our colleague Michel Hogue points out, this paucity could be explained not only in terms of the ways outsiders have regarded First Nations’ histories, or the nationalism implicit in a project of this nature, but also the difficulty in crafting a biography of peoples with an oral tradition.30

274 Leave a comment on paragraph 274 0 There are a number of ways one could extract the information from the DCB.31 We did the following:

275 Leave a comment on paragraph 275 0 1. Used wget to download every biographical essay. The Programming Historian 2 has additional information on how to use wget in this manner.32

276 Leave a comment on paragraph 276 0 2. Stripped out the html. One could use a script in R such as Ben Marwick’s html2dtm.r (a copy is lodged at https://gist.github.com/shawngraham/6319543) or run a Python script such as that suggested in The Programming Historian. It is also possible to use Notepad++ to strip out everything before, and everything after, the text of interest across multiple files, using search and replace and the careful use of regex patterns, although that is a better solution when you are working with only a few files.

277 Leave a comment on paragraph 277 0 The easiest route is to put all of the html files into a single folder, open the R script html2dtm.r, and run it on the entire folder. With what you’ve learned about R in the previous sections, this should not be too difficult. Make sure to set the working directory in the script (the command ‘setwd’ in the second line) to point to the exact location of your folder with the downloaded html files. (A minimal sketch of this tag-stripping step appears just after this list.)

278 Leave a comment on paragraph 278 0 3. Fitted a topic model using Mimno’s wrapper for MALLET in R.
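To make the tag-stripping in step 2 concrete, here is a minimal sketch of the core move. The folder name (dcb-html) and the crude regular expression are ours for illustration; Marwick’s html2dtm.r script is more careful than this.

# point R at the folder of downloaded pages
setwd("dcb-html")

# read each html file and strip the markup with a regular expression
files <- list.files(pattern = "\\.html$")
for (f in files) {
  page <- paste(readLines(f, warn = FALSE), collapse = " ")
  text <- gsub("<[^>]+>", " ", page)  # crude, but removes the tags
  writeLines(text, sub("\\.html$", ".txt", f))
}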

279 Leave a comment on paragraph 279 0 Topic modeling should be done multiple times, in order to find the ‘right’ number of topics. In this case, we found that around 30 topics captured the spread of topics rather well. Using the additional R script mentioned in the previous section that produced a dendrogram of related topics, we found that topics clustered together rather neatly.

280 Leave a comment on paragraph 280 0 Here is some sample output of the results of two separate runs of the topic model, looking for 30 topics. In Run 1, we set the ‘optimization interval’ at 20. This value is arbitrary, for the most part, but it serves to order the topics in terms of their overall contributions to the corpus. In Run 2, we did not set an optimization interval and thus there is no indication of rank importance.

281 Leave a comment on paragraph 281 0 Run 1 (sorted in order of strongest topics; optimization interval was set at 20)

Topic Key words
7 made time years great left long end found good make work return man set part received order began number
4 family children years father son life wife died death daughter st john married time home house brother mary sons
26 government british canadian political french canada governor support system public king power influence control canadians policy american act opposition
18 life published young work letters wrote canadian great century world man author written education people history english social appeared
2 men man people day found trial days death case black evidence night court arrested young prison back claimed police
23 council assembly court law governor office general justice appointed judge lieutenant house legal chief william government province legislative john
24 business trade company firm merchant merchants timber john partnership land built year sold mill goods large commercial lumber william
17 london british canada england smith sir north john colonial lord america american governor united britain royal office william general
15 war military british army lieutenant general officer fort command militia french regiment colonel major troops captain officers force governor
10 quebec de saint qu rivi montreal la res bec canada oct cn sept anq nov trois fran du jean
5 toronto canada canadian ontario ont hamilton ottawa william rg association women upper globe history county london president john public
6 company montreal railway business bank canada president city canadian firm john st board william director financial years james commercial
14 government election macdonald party political liberal conservative provincial politics confederation sir minister federal john liberals railway cabinet canada elected
12 de la le montreal al qu montr saint canada bec du quebec des louis joseph french jean les fran
29 de la france le saint fran jean livres louis ff pierre ois des king quebec canada governor du intendant
13 upper canada toronto john york kingston land niagara william ont district township rg county ao papers mackenzie hamilton robert
11 halifax nova scotia john brunswick saint county fredericton howe william pans acadian province history boston pictou loyalist hist mg
20 school medical college education university hospital schools medicine board women montreal teaching students society surgeon public dr institution year
1 de bishop saint priest parish quebec la marie superior church des catholic seminary montreal priests minaire diocese joseph canada
3 church methodist college missionary canada bishop presbyterian england john toronto society baptist st mission school reverend minister anglican christian
27 canada published canadian society history montreal work natural literary author scientific science london journal american john william survey year
28 expedition coast bay voyage ship ships john london england captain island north newfoundland naval sailed sea arctic sir franklin
22 de la france french iroquois father indians quebec le des english fort champlain saint governor mission country jesuit louis
0 river company fort hbc columbia vancouver victoria bay fur british trade west simpson hudson red lake york north indians
21 indian indians chief detroit river american fort johnson white lake british treaty band michilimackinac war reserve hist ed nations
16 paper bell york montreal gazette newspaper printing printer american brown hart press editor theatre jewish published german city canadian
9 john newfoundland island st charlottetown edward prince catholic nfld government irish james roman william pope house land year colony
8 ottawa building engineer work music construction architect works buildings st church surveyor plans city architecture public engineers built canal
25 winnipeg manitoba west tis canadian riel canada north government red st calgary ottawa river labour man land western saskatchewan
19 art work montreal artist painting painter wood arts paintings works artists church canadian silver exhibition gallery museum painted portraits

282 Leave a comment on paragraph 282 0 Run 2 (no optimization, hence no rank ordering):

topics keywords
1 assembly government council governor office court house appointed law general public province colonial justice year
2 british part north position support american led important late london states interest influence large affairs
3 business company trade firm merchant year bank james stâ partnership city john estate commercial george
4 winnipeg indian manitoba river tis indians red chief west man riel police calgary canadian fort
5 newfoundland stâ johnâ john london years nfld william thomas england island bay harbour year fishery
6 quebec quã bec lower nov.â oct.â years sept.â canada juneâ son rue mayâ julyâ aprilâ
7 montreal canadian canada published london mcgill gazette john york paper literary early history newspaper œthe
8 river fort company hbc bay london john north expedition trade west years fur york indians
9 montreal res des quebec trois-riviã laâ roy franã greffe son montrã brh married pierre died
10 time made left great received men found man began end set order good wrote brought
11 canada upper john toronto william york kingston ont hamilton years london early niagara district county
12 french laâ indians fort iroquois france father english governor years year river quebec leâ indian
13 toronto canadian canada school years ontario college william year education university public city association board
14 montreal montrã quã leâ laâ des bec french canada les years franã res son histoire
15 leâ des music acadian acadians champlain gaspã franã musical les bert cartier french joseph charles
16 island british vancouver victoria charlottetown columbia prince edward john years douglas james william b.c marchâ
17 company railway canadian montreal construction bank line industry limited president mining works western companies railways
18 life work family young letters age social century author published friends world friend written form
19 church methodist england canada missionary college presbyterian society baptist mission school bishop london minister christian
20 halifax nova scotia john n.s years william county pans howe son early pictou cape d.â
21 medical women hospital medicine surgeon children health journal physician practice womenâ doctor daughter asylum mrs
22 france laâ colony livres governor french canada intendant marine minister ãžle quebec louisbourg paris franã
23 quebec priest bishopâ parish bishop canada church catholic years superior stâ seminary france diocese priests
24 government political macdonald election liberal party conservative federal minister provincial public canada politics canadian confederation
25 military war british army officer militia command fort regiment england service troops french officers royal
26 work art published society works survey scientific royal museum science natural painting arts artist ottawa
27 land settlement lands grant settlers acres alexander farm area scotland fraser lower surveyor townships township
28 american york indian hist john indians william boston detroit united soc johnson n.y black coll
29 john saint brunswick n.b fredericton years county william œthe son george family w.â d.â early
30 canada political canadian union upper reform baldwin mackenzie reformers john brown lower party united george

283 Leave a comment on paragraph 283 0 What a difference a single parameter can make! The thing is, both sets of results are ‘true’ for a given value of true. Reading the topics from top to bottom in Run 1 seems to give a sense of each biography’s overall sketch: a man or woman makes good, has a family background, becomes involved in politics or the law, and perhaps has business on the side. They might then incidentally be involved in a war somewhere. The second run seems to give a sense of global themes in the corpus as a whole.
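That parameter amounts to a single line of code. Here is a minimal sketch of how such a model can be fitted with Mimno’s mallet package in R; it is not our exact script. We assume a data frame called essays with id and text columns (built from the stripped files above), and a stopword file called en-stopwords.txt – both names are placeholders for your own setup.

library(mallet)  # Mimno's R wrapper for MALLET

# import the documents, dropping stopwords
instances <- mallet.import(essays$id, essays$text, "en-stopwords.txt",
                           token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

topic.model <- MalletLDA(num.topics = 30)
topic.model$loadDocuments(instances)

# Run 1: optimize the topic weightings every 20 iterations, after 50 of burn-in;
# comment this line out to reproduce the unoptimized Run 2
topic.model$setAlphaOptimization(20, 50)

topic.model$train(1000)

# inspect the ten strongest words in the first topic
topic.words <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE)
mallet.top.words(topic.model, topic.words[1, ], num.top.words = 10)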

284 Leave a comment on paragraph 284 0 One of the outputs of our topic modeling script using R as a wrapper for MALLET is a dendrogram of topics, produced by hierarchical clustering (the dendrogram will open as a new window in your R environment or RStudio; you can also save it as an image file directly from the toolbar at the top of the screen). This technique looks at the distribution of the topics over the entire corpus and clusters topics together based on the similarity of their proportions. The resulting diagram looks like the branches of a tree; hence, a dendrogram. While dendrograms are generated at progressive levels of similarity (thus, from the bottom up), we read them from the top down. In Figure 4.18, the chronological and thematic associations apparent in the labels make it sensible to read the diagram from top left to bottom right.

285 Leave a comment on paragraph 285 0 [insert Figure 4.18 Dendrogram of topics in the DCB ]
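The dendrogram comes almost for free: the mallet package includes a helper that clusters topics hierarchically. Continuing the sketch above – the balance value, which mixes document-level and word-level similarity, is an illustrative choice of ours:

# cluster the topics and draw the dendrogram
doc.topics <- mallet.doc.topics(topic.model, smoothed = TRUE, normalized = TRUE)
topic.words <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE)
plot(mallet.topic.hclust(doc.topics, topic.words, balance = 0.3))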

286 Leave a comment on paragraph 286 0 Broadly, there is a topic called ‘land township lands acres river’… and then everything else. In a land like Canada, it is unsurprising that so much of every biography should be connected with the idea of settlement – especially if we can take these words as indicative of a discourse surrounding the opening up of territory for new townships, especially given the importance of river transport. Harold Innis would not be surprised.33

287 Leave a comment on paragraph 287 0 The next branch down divides neatly in two, with sub-branches to the left neatly covering the oldest colonies (excluding the province of Newfoundland, which didn’t enter Confederation until 1949). Take another step down to the right, and we have topics related to the churches, then to education, then to medicine. Taking the left branch of the next step brings us to discourses surrounding the construction of railways, relationships with First Nations, and government. Alongside the government branch is a topic that tells us exactly what flavour of government as well: notably, the Liberals (the Liberal party has governed Canada for the overwhelming majority of years since Confederation).

288 Leave a comment on paragraph 288 0 Scanning along amongst the remaining branches of the dendrogram, we spot topics that clearly separate out military history, industry, French Canada, the Hudson’s Bay Company and exploration. Finally, we have topics that betray the DCB’s genesis in a scholarly milieu in the Toronto of the 1950s – ‘published canadian history author work’ – which speaks, perhaps, to a focus on published work.

289 Leave a comment on paragraph 289 0 This dendrogram contains within it not just the broad lines of the thematic unity of Canadian History as practised by Canadian historians, but also its chronological periodisation. This is perhaps more apparent when we represent the composition of these biographies as a kind of network. Recall that our topic modeling script in R also created a similarity matrix for all documents: the proportions of each topic in each document were correlated, so that documents with a similar overall composition would be tied (with weighting) to the documents most similar to them (figure 4.19), allowing us to visualize the result using the Gephi package.

290 Leave a comment on paragraph 290 0 [insert Figure 4.19 Positively correlated topics in the DCB, where edge weight indicates the strength of correlation.]
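How might such a matrix be built and handed to Gephi? A minimal sketch, continuing with the doc.topics matrix from the earlier sketches (our own script differed in its details, and the output file name is ours):

# correlate the topic proportions of every pair of documents;
# cor() works column-wise, so transpose the document-topic matrix first
doc.similarity <- cor(t(doc.topics))

# keep the positive correlations and write an edge list Gephi can open
edges <- which(doc.similarity > 0 & upper.tri(doc.similarity), arr.ind = TRUE)
edge.list <- data.frame(Source = essays$id[edges[, 1]],
                        Target = essays$id[edges[, 2]],
                        Weight = doc.similarity[edges])
write.csv(edge.list, "dcb-edges.csv", row.names = FALSE)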

291 Leave a comment on paragraph 291 0 The edges connecting the topics together represent the strength of the correlation, i.e., themes that tend to appear together in the same life. The heavier the edge, the more often those topics go hand in hand. We can then ask Gephi to colour-code those nodes (topics) and edges (correlations) so that those having similar patterns of connectivity are coloured the same way (see chapter 6 for how to modify the appearance of nodes given a particular metric; here we calculated ‘modularity’ and recoloured the nodes by module, that is, by community having similar patterns of connections). ‘Railway construction line industry works’ often appears in thematic space together with ‘canadian ottawa british victoria vancouver’, which makes sense to us knowing that British Columbia’s price for entry into Confederation was the construction of the transcontinental railway, a project whose associated scandals and controversies took up many years and governments in the late 19th century. This network diagram is really not all that different from the dendrogram visualization we examined first. What we can do, that we could not do with the dendrogram, is ask which topics tie the entire corpus together. Which topics do the heavy lifting, semantically? This is not the same thing as asking which topics are found most often. Rather, we are looking for the topic that most often is the pivot point on which an essay will hang. The metric for determining this is betweenness centrality. In the figure above, the nodes are sized according to their relative betweenness scores. The topic that ties Canadian history together, on this reading, is ‘government political party election liberal’.
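Gephi will compute betweenness centrality for you (again, see chapter 6), but the same measurement can be sketched in R with the igraph package, picking up the edge list written out above; note that this simple version treats the ties as unweighted.

library(igraph)

# build an undirected graph from the edge list and rank nodes by betweenness
g <- graph_from_data_frame(edge.list, directed = FALSE)
b <- betweenness(g)  # counts the shortest paths passing through each node
head(sort(b, decreasing = TRUE), 10)  # the ten biggest bridges in the network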

292 Leave a comment on paragraph 292 0 This throws up some interesting questions. Is the Liberal Party of Canada really the glue that holds our sense of national identity together? How has the concept of ‘liberalism’ in Canadian politics evolved over time (we are reminded that one of the first names of the current Conservative party of Canada was the Liberal-Conservative Party)? Do the authors of these essays feel that they need to discuss Liberal party affiliations (for surely not every individual in the corpus was a member of the party) when they see them, out of a sense of class solidarity (the present-day Liberal Party being chiefly a party of the centre-left, a traditional home for academics)? How representative of Canadian history is this corpus of 8000 people? Did those who joined the party enjoy a higher prominence in the (however-defined) Liberal-affiliated newspapers of the time (which are now the sources used by the academics)? The preponderance of liberal mentions also gives credence to Canadian historian Ian McKay’s theory that Canada should be understood as a conscious project of liberal rule, the ‘liberal order framework.’34 Clearly, the topic model throws up more questions than answers.

293 Leave a comment on paragraph 293 0 When we look at the network visualization of individuals (as a way of seeing patterns of clusters), where ties represent similar proportions of similar topics, we see a very complicated picture indeed. Of the 8,000 individuals, some 4,361 (or 55%) tie together into a coherent clump. The other 45% are either isolated or participate in smaller clumps. Let us focus on that giant component. We can search for groupings within this clump. Figure 4.20 shows the clump coloured by modules. These subgroups seem to make sense, again on periodisation grounds. Francophones from the New France era all clump together not because they are from New France or knew one another or had genuine social connections, but because the biographical sketches of this period all tend to tell the same kinds of stories. They are stories about coureurs-du-bois, of Seigneurs, of government officials. What is perhaps more interesting will be the cases where individuals are grouped together out-of-time.

294 Leave a comment on paragraph 294 0 [insert Figure 4.20 Positively correlated documents (biographical sketches of individuals in the DCB)]

295 Leave a comment on paragraph 295 0 Periodization

296 Leave a comment on paragraph 296 0 Treating the biographies of 8,000 individuals spread over the centuries as a single corpus is a very distant way of looking at the patterns. Alternatively, we could have begun by dividing the source documents into ‘century’ bins and topic modeling each separate group. If, however, we are happy with creating a single topic model of all 8,000, we can still examine patterns over chronology by sorting our individuals into distinct networks, filtering the results by time.

297 Leave a comment on paragraph 297 0 Broadly considered, let us divide the 8,000 into ‘17th century and earlier’, ‘the long 18th century’, and ‘the 19th century’ (which we’ll draw to a close with World War I). We add a new column to the spreadsheet, and copy in ‘17’, ‘18’ and ‘19’ alongside the appropriate rows. Then we use Excel’s ‘filter’ function to just show us the relevant period. What does Canadian history through lives lived look like this way?
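The same binning can be sketched in R, assuming a column called year holding the year of death for each biography (the column name and the exact cut points here are illustrative):

# bin each biography into a broad period by year of death
essays$century <- cut(essays$year,
                      breaks = c(0, 1700, 1800, 1920),
                      labels = c("17", "18", "19"))

# then filter, for example, just the long 18th century
eighteenth <- subset(essays, century == "18")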

298 Leave a comment on paragraph 298 0 Using the results of Run 1 of the topic model for 30 topics (to look at a slightly different outcome, thus adding another facet to our perspective on the data), we can look at the distribution of topics for the 17th century (figure 4.21).

299 Leave a comment on paragraph 299 0 [insert Figure 4.21 Topics along the x axis, greatest proportion along the y; each dot is an individual biography. 17th century ]

300 Leave a comment on paragraph 300 0 Topics 22, 28, and 29 dominate. Topic 1 is also well represented. These topics again are:

22 de la france french iroquois father indians quebec le des english fort champlain saint governor mission country jesuit louis
28 expedition coast bay voyage ship ships john london england captain island north newfoundland naval sailed sea arctic sir franklin
29 de la france le saint fran jean livres louis ff pierre ois des king quebec canada governor du intendant
1 de bishop saint priest parish quebec la marie superior church des catholic seminary montreal priests minaire diocese joseph canada

301 Leave a comment on paragraph 301 0 …suggesting that the 17th century is the story of New France, of the Iroquois, of clergy and explorers. So far, an outline familiar from the standard Canadian high school history textbook.

302 Leave a comment on paragraph 302 0 We can represent this as a network and run a query looking for nodes (individual biographies) with high ‘betweenness centrality’ (see chapter 6 on how to examine your network for various metrics). We are not arguing that these individuals are important on historical grounds, but rather that the themes discussed in each biographical sketch (in their various proportions) tie the clusters together in such a way as to suggest we should look at them more closely. The most ‘between’ individual biographies for the 17th century are:

303 Leave a comment on paragraph 303 0

  • LEGARDEUR DE REPENTIGNY, PIERRE; Governor Huault de Montmagny’s lieutenant.
  • LAUSON, JEAN DE (Junior); Grand seneschal of New France
  • BOURDON, JEAN (sometimes called M. de Saint-Jean or Sieur de Saint-François); Seigneur, engineer, surveyor, cartographer, business man, procurator-syndic of the village of Quebec, head clerk of the Communauté des Habitants, explorer, attorney-general in the Conseil Souverain
  • DUBOIS DE COCREAUMONT ET DE SAINT-MAURICE, JEAN-BAPTISTE; esquire, artillery commander and staff officer in the Carignan-Salières regiment.
  • LE VIEUX DE HAUTEVILLE, NICOLAS; lieutenant-general for civil and criminal affairs in the seneschal’s court at Quebec
  • MESSIER, MARTINE; wife of Antoine Primot; b. at Saint-Denis-le-Thiboult
  • LAUSON, JEAN DE (Senior); governor of New France.
  • HÉBERT, JOSEPH; grandson of Canada’s first settler, only son of Guillaume Hébert and Hélène Desportes
  • EROUACHY (Eroachi, Esrouachit); known to the French as “La Ferrière,” “La Forière,” “La Fourière,” “La Foyrière”; chief of the Montagnais Indians around Tadoussac
  • ATIRONTA (Aëoptahon), JEAN-BAPTISTE; a captain in the Huron Indian village of Cahiagué (near Hawkestone, Ontario)

304 Leave a comment on paragraph 304 0 In these ten individuals, we have encapsulated the history of the French regime in North America – governors and seneschals, officers and aboriginal allies, and one woman. Martine Messier is primarily remembered for her courage when under attack by three Iroquois warriors (a courage retroactively imputed to her, perhaps, as her grandsons, the Le Moyne brothers, were famed adventurers).

305 Leave a comment on paragraph 305 0 The individuals whose stories (that is, the proportions of their thematic topics) tie the long 18th century together in what became Canada are:

306 Leave a comment on paragraph 306 0

  • GADOIS, PIERRE; Montreal Island farmer, armourer, gunsmith, witchcraft victim
  • DUNN, THOMAS; businessman, seigneur, office holder, politician, judge, and colonial administrator
  • MARTEL DE MAGOS (Magesse), JEAN; soldier, merchant, trader, seigneur, clerk in the king’s stores
  • McGILL, JAMES; merchant, office holder, politician, landowner, militia officer, and philanthropist
  • KERR, JAMES; lawyer, judge, and politician
  • GRANT, Sir WILLIAM; lawyer, militia officer, and office holder
  • TODD, ISAAC; businessman, office holder, militia officer, and landowner
  • DOBIE, RICHARD; fur trader, businessman, and militia officer
  • POWNALL, Sir GEORGE; office holder, politician, and justice of the peace
  • TASCHEREAU, THOMAS-JACQUES; agent of the treasurers-general of the Marine, councillor in the Conseil Supérieur, seigneur

307 Leave a comment on paragraph 307 0 In these lives, we see the concern with reconciling the newly acquired Francophone colonists into the developing British world system. Echoes from the earlier regime, as evidenced by Gadois and Taschereau, still reverberate. There is no real reason why one would select the ‘top ten’ versus the ‘top twenty’ or ‘top one hundred’, but it is interesting that no aboriginal person appears on this list until the 230th place (of 2,954 individuals), suggesting perhaps the beginnings of the eclipse of First Nations’ history in the broader story of Canada (a supposition that would require deeper analysis to support or refute). If we look at the distribution of topics over this period, we see (figure 4.22):

308 Leave a comment on paragraph 308 0 [insert Figure 4.22 Topics along the x axis, greatest proportion along the y; each dot is an individual biography. 18th century]

309 Leave a comment on paragraph 309 0 The strongest topic again is 29, the topic concerned with the governance of New France. Topic 15, which deals with war between the English and French, is strong (as one would expect in the 18th century), as are topics 21 and 22, which concern the Iroquois and First Nations peoples more generally – a slightly different picture than when we considered the interplay of topics represented as a network. Whether we plot the output of the topic model as dots on a chart or as networked thematic ties between lives lived, nuanced patterns are cast into relief.

310 Leave a comment on paragraph 310 0 As we move through these broad periods, the overall thematic network structure becomes more atomized each time, with more and more individuals whose (thematic) lives do not tie into the larger group. In the nineteenth century, modern Canada is founded. For the first time, religious callings appear in the lives of the top ten individuals whose stories tie the network together:

311 Leave a comment on paragraph 311 0

  • GUIBORD, JOSEPH; typographer, member of the Institut Canadien
  • LANGEVIN, EDMOND (baptized Edmond-Charles-Hippolyte); priest and vicar general
  • GÉLINAS, ÉVARISTE; journalist, federal civil servant
  • BABY, LOUIS-FRANÇOIS-GEORGES; office holder, lawyer, politician, judge, and collector
  • STUART, GEORGE OKILL (O’Kill); lawyer, politician, and judge
  • SIMPSON, JOHN; government official and politician
  • DAY, CHARLES DEWEY; lawyer, politician, judge, and educationalist
  • CONNOLLY (Connelly), MARY, named Sister Mary Clare; member of the Sisters of Charity of Halifax and teacher
  • CREEDON, MARIANNE (Mary Ann), named Mother Mary Francis (Frances); member of the Congregation of the Sisters of Mercy, mother superior, and educator
  • WALSH, WILLIAM; Roman Catholic priest, archbishop, and author

312 Leave a comment on paragraph 312 0 Indeed, the top one hundred in this network are all connected either with the church (the women who appear are predominantly nuns or teachers or both), the state, or the law. While to call Guibord a typographer is correct, it hides what he was typesetting. Guibord’s life encapsulated battles within Catholicism over liberal learning (the Institut contained a library whose books placed it in opposition to mainstream Catholic teachings at the time). These individuals speak to the playing out of battles within Catholicism and Protestantism, and between them, in the development of modern Canada. In the British North America Act of 1867, the spheres of religious influence are rigorously defined, down to the specific rights and privileges that certain English-speaking (Protestant) ridings within the new province of Quebec were to have, and in other territories under the control of the new state. These decisions continue to have ramifications to this day; we can perhaps find their origins in these particular lives. When we look at the distribution of topics over time, one topic is striking in its absence: Topic 21: indian indians chief detroit river american fort johnson white lake british treaty band michilimackinac war reserve hist ed nations. That’s not to say it’s not present in the 19th century (figure 4.23); this is, after all, a graph of the greatest contribution of a single topic to a biographical sketch. Thus, this topic is relegated to second place, much like the First Nations peoples it describes.

313 Leave a comment on paragraph 313 0 [insert Figure 4.23 Topics along the x axis, greatest proportion along the y; each dot is an individual biography. 19th century]

314 Leave a comment on paragraph 314 0 In these thirty lives, we see a picture of Canadian history familiar and strange at the same time, suggesting deeper questions, further research, and other avenues to explore. We do not do this kind of data mining with the idea that we are trying to provide definitive, conclusive justification for the historical stories we are trying to tell. Rather, we are trying to generate new insights, and new kinds of questions.

315 Leave a comment on paragraph 315 0 Conclusion

316 Leave a comment on paragraph 316 0 In this chapter, you have been introduced to a number of tools for fitting topic models to a body of text, ranging from one document, to an entire diary, to multiple documents. Different tools implement topic modelling with contrasting algorithms, leading to different outputs. This is worth underlining! Indeed, this mathematical approach to human history does not produce the same specific pattern each time we run it, but rather a distribution of probabilistic outcomes. This can be difficult for us to get our heads around, but we must resist the temptation to run a topic model and accept its results at face value as being ‘the’ answer about our corpora of materials. Turns out historians still have a job! Phew.

317 Leave a comment on paragraph 317 0 The topic model generates hypotheses, new perspectives, and new questions: not simple answers, let alone some sort of mythical ‘truth’. The act of visualizing the results, too, is as much art as it is science, introducing new layers of interpretation and engagement. Andrew Goldstone and Ted Underwood published an article called ‘The Quiet Transformations of Literary Studies: What Thirteen Thousand Scholars Could Tell Us’, a topic modelling exercise on what scholars had written about literary history.35 In a companion blog post to that piece, Goldstone details the construction and implications of a ‘topic browser’ that he built to support the conclusions of the journal article.36 Goldstone writes,

318 Leave a comment on paragraph 318 0 just as social scientists and natural scientists grit their teeth and learn to program and produce visualizations when they need to in order to support their analysis, so too must those of us in the humanities. Solutions off the shelf are not great at answering expert research questions. What should come off the shelf are components that the researcher knows how to put together.

319 Leave a comment on paragraph 319 0 In the next chapter, we suggest some more bits and pieces that will help you build another component of your macroscope for history: the ways of representing and querying the social relationships between historical actors in the past. We turn now to social network analysis.


320 Leave a comment on paragraph 320 0

  1. Bayes, Thomas; Price, Mr. (1763). “An Essay towards solving a Problem in the Doctrine of Chances.” Philosophical Transactions of the Royal Society of London 53: 370–418. doi:10.1098/rstl.1763.0053
  2. Silver, Nate. The Signal and the Noise: Why So Many Predictions Fail – but Some Don’t. New York: Penguin Press, 2012. pp. 243-247.
  3. Ted Underwood, ‘Topic Modeling made just simple enough’, The Stone and the Shell, April 7, 2012. http://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/
  4. Box, George E. P.; Draper, Norman R. (1987). Empirical Model-Building and Response Surfaces, p. 424.
  5. When we describe topic modeling here, we have in mind the most commonly used approach, ‘Latent Dirichlet Allocation’. There are many other possible algorithms and approaches, but most usages of topic modeling amongst digital humanists and historians treat LDA as synonymous with topic modeling. It is worth keeping in mind that there are other options, which might shed useful light on your problem at hand. A special issue of the Journal of Digital Humanities treats topic modeling across a variety of domains and is a useful jumping off point for a deeper exploration of the possibilities: journalofdigitalhumanities.org. The LDA technique was not the first technique now considered topic modeling, but it is by far the most popular. The myriad variations of topic modeling have resulted in an alphabet soup of techniques and programs to implement them that might be confusing or overwhelming to the uninitiated; for the beginner it is enough to know that they exist. MALLET primarily utilizes LDA.
  6. Blei, David. ‘Topic modeling and the digital humanities’. Journal of Digital Humanities 2.1 (2012). http://journalofdigitalhumanities.org/2-1/topic-modeling-and-digital-humanities-by-david-m-blei/
  7. Originally pointed out by Ben Schmidt, ‘When you have a MALLET, everything looks like a nail’, on his wonderful blog Sapping Attention. See http://sappingattention.blogspot.ca/2012/11/when-you-have-mallet-everything-looks.html
  8. We have previously published an on-line tutorial to help the novice install and use the most popular of the many different topic modeling programs available, MALLET, at programminghistorian.org. This section republishes elements of that tutorial, but we recommend checking the online version in case of any upgrades or version changes.
  9. http://www.scottbot.net/HIAL/?p=16713
  10. http://arxiv.org/abs/1003.6087/
  11. http://mallet.cs.umass.edu/index.php
  12. http://www.oracle.com/technetwork/java/javase/downloads/index.html
  13. One thing to be aware of is that since many of the tools we are about to discuss rely on Java, changes to the Java run-time environment and to the Java development kit (as for instance when Oracle updates Java, periodically) can break the other tools. We have tested everything and know that these tools work with Java 7. If you are finding that the tools do not run, you should check what version of Java is on your machine. In a terminal window, type ‘java -version’ at the prompt. You should then see something like ‘java version “1.7.0_05”’. If you do not, it could be that you need to install a different version of Java.
  14. https://code.google.com/p/topic-modeling-tool/
  15. Available from http://cran.rstudio.com/. We also recommend using RStudio (http://www.rstudio.com/) as a more user-friendly environment for keeping track of what’s happening inside R.
  16. http://cran.r-project.org/doc/contrib/Torfs+Brauer-Short-R-Intro.pdf. A very good interactive tutorial for R, currently available, is from codeschool.com at http://tryr.codeschool.com/
  17. http://fredgibbs.net/tutorials/document-similarity-with-r/
  18. https://gist.github.com/benmarwick/5403048
  19. For an interesting use-case of this package, please see Ben Marwick’s analysis of the 2013 Day of Archaeology. He published both the analysis and the scripts he used to perform it on Github, making it an interesting experiment in publishing data, digital methods, and discussion. https://github.com/benmarwick/dayofarchaeology
  20. See Hanna Wallach, David Mimno and Andrew McCallum, ‘Rethinking LDA: Why Priors Matter’, in Proceedings of Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 2009.
  21. Matthew Jockers, Text Analysis with R for Students of Literature. Springer: 2014. p. 147.
  22. http://nlp.stanford.edu/software/tmt/tmt-0.4/
  23. Assuming you are using Microsoft Excel, and the first cell where you wish to put a unique ID number is cell A1: put ‘1’ in that cell. In cell A2, type =A1+1 and hit return. Then copy that cell and paste it into the remaining cells you wish to fill with numbers. Other spreadsheet programs have similar functionality.
  24. The scripts that we used at the time of writing are at https://github.com/shawngraham/stanford-tmt-scripts
  25. If the scripts on the Stanford site are now different from what is recounted in this passage, please use the ones housed in Graham’s github repository instead.
  26. Notepad++ is available for free at http://notepad-plus-plus.org/
  27. The code for this plugin is open source, and may be found at https://github.com/chrisjr/papermachines. The Paper Machines website is at http://papermachines.org
  28. P.B. Waite, ‘Journeys through thirteen volumes: The Dictionary of Canadian Biography’, The Canadian Historical Review 76.3 (Sept. 1995): 464-481.
  29. At www.biographi.ca
  30. Hogue, personal communication. It is worth pointing out that early reviewers of the DCB thought that its chronological arrangement (in this case, the year of death of an individual determining in which volume they would appear), rather than an alphabetical one, might reveal patterns in Canadian history that would otherwise be hidden – see for instance Conway, John, review of The Dictionary of Canadian Biography, Volume 1: 1000 to 1700, The Catholic Historical Review 55.4 (Jan. 1970): 645-648, at 646. Of course every volume of the DCB will be a product of its time; senior figures in Canadian history have all contributed essays to the DCB, and it is works such as Berger’s The Writing of Canadian History: Aspects of English-Canadian Historical Writing, 1900-1970 (Oxford UP: 1976) that signpost the world(s) in which these contributors were working. Waite’s 1995 review of the 13 volumes to that point notes the various lacunae – women, for one – and contexts (the 1960s put their indelible stamp on volume 10) that flavour the DCB.
  31. A plugin for the Firefox browser called ‘Outwit Hub’ can, for instance, be used to extract the biographical text from each webpage, saving it in a csv spreadsheet. The free version of Outwit Hub is limited to 100 rows of information. Using Outwit Hub, one can examine the html source of a page, identify which tags embrace the information one is interested in, and then direct the program to automatically page through the website, scraping the text between those tags. More information about using Outwit Hub is provided in the example concerning John Adams’ Diaries.
  32. Ian Milligan, ‘Automated Downloading with Wget’, Programming Historian 2, August 2012. Available online: http://programminghistorian.org/lessons/automated-downloading-with-wget
  33. Harold Innis, The Fur Trade in Canada: An Introduction to Canadian Economic History. Revised edition (1956). Toronto: University of Toronto Press, 1930.
  34. Ian McKay, ‘The Liberal Order Framework: A Prospectus for a Reconnaissance of Canadian History’, Canadian Historical Review 81.3 (September 2000): 617-645.
  35. Forthcoming in New Literary History; preprint available at http://www.rci.rutgers.edu/~ag978/quiet/preprint.pdf
  36. Discussion of method: http://andrewgoldstone.com/blog/2014/05/29/quiet/. Browser: Goldstone, Andrew, and Ted Underwood, Quiet Transformations: A Topic Model of Literary Studies Journals, http://www.rci.rutgers.edu/~ag978/quiet, 2014.