An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Topic Modeling with the Stanford Topic Modeling Toolbox

Stanford Topic Modeling Toolbox

General Principles

The STMT gives us a different perspective on our topic models by allowing us to see how topics and key words are distributed over time within our model. It was developed by Daniel Ramage and Evan Rosen of the Stanford Natural Language Processing Group (www-nlp.stanford.edu).

Description

The STMT requires the user to write a series of commands in a single file (a “script”) that points the tool to the user’s data, describes the data (‘here are dates, there is the actual text of the documents’), and specifies how to manipulate it.

Critique

From a usability perspective, modifying scripts can be quite daunting. The tool is also no longer maintained by its developers.

How to Use It

Open a sample script from the website in a text editor, change certain lines in the script to point to your own data and to describe it correctly, and save the script. Open the STMT and increase the available memory in the memory option box. Load your script and then hit ‘run’. The results will be written to the same folder.

How to Analyze the Results

The results can be visualized by opening the csv files in MS Excel or another spreadsheet program. Using the tools of the spreadsheet, one can explore how word usage plays out across various user-defined groupings, especially over time.

Different tools give different results even if they are all still ‘topic modeling’ writ large. This is a useful way to further understand how these algorithms shape our research, and a useful reminder to always be up front about the tools that you are using in your research. One tool that historians should become familiar with is the Stanford Topic Modeling Toolbox (STMT), because its outputs make it easy to see how topics play out over time (it ‘slices’ its output by the dates associated with each document). While it is not an easy tool to use, this particular feature makes it a useful addition to our toolbox, which is why we go into some depth about it here. The STMT allows us to track the evolution of topics within a corpus at a much finer level: that of each individual entry. STMT is written in Java, and requires the latest version of Java to be installed on your machine (available for free from https://www.java.com/en/download/). You can find the toolbox at http://nlp.stanford.edu/software/tmt/tmt-0.4/.


A brief review: scraping data from a website for the purposes of the STMT

Because the STMT is most useful for us as historians when we have documents with dates, let us consider the workflow for getting a digitized diary off a website and into the STMT. You may wish to skip this section; the scraped data csv can be obtained from our website.

Return, for a moment, to one of the online pages showing Adams’ diary entries (for example, http://www.masshist.org/digitaladams/archive/doc?id=D0). Right-click on the page in your web browser, select ‘view source’ in the pop-up menu, and examine the tags used to mark up the HTML. Every so often, you will see something like:

<a id="D0.17530608"></a><div class="entry">  <div class="head">HARVARD <span title="College">COLLEDGE</span> JUNE 8TH. 1753.</div>  <div class="dateline"><span title="1753-06-08">8 FRIDAY.</span></div>  <div class="p">At <span title="College">Colledge</span>.

Using the scraper tool in Outwit Hub we can grab everything that falls within the “entry” class. We want to make sure we get the dates first. In this case, we want the computer to find “1753-06-08” and grab it into a separate column. We see that every time a date appears, it is preceded by <div class="dateline"><span title=" and followed by ">.

To do this, let’s refresh how to use Outwit Hub. Load the program, and enter the URL http://www.masshist.org/digitaladams/archive/doc?id=D0 into Outwit’s navigation bar. Hit enter to be brought to the page. Now, click ‘scrapers’ in the left-hand column, and then ‘new’ below. Our goal is to set up a scraper that can produce two columns: one with dates, and the other with entries.

In the ‘scrapers’ section, we begin by double-clicking on empty fields in the table below, which allows us to enter information (figure 4.7). We type a descriptive label under ‘description’ (so we can remember what is supposed to be scraped there), which will be either ‘date’ or ‘entry.’ We then peruse the HTML to identify the relevant tag or ‘marker’ before the information we want, and the marker after it. Remember from above what we had before the date we want – 1753-06-08 – and what we have afterwards. Fill out the form like so:

description   marker before                         marker after
date          <div class="dateline"><span title="   ">


Now it is time to grab the actual text of the diary entry. It’s a bit messier. When we look at the HTML around entries, we see the following:

<a id="D0.17530608"></a><div class="entry">  <div class="head">HARVARD <span title="College">COLLEDGE</span> JUNE 8TH. 1753.</div>  <div class="dateline"><span title="1753-06-08">8 FRIDAY.</span></div>  <div class="p">At <span title="College">Colledge</span>. … </div> </div>

(We added the ellipses there!) There is a lot of text cut out of here, but in general, entries begin with the code <div class="entry"> and they end with a pair of </div> </div> tags. In Outwit Hub, then, we put the following as the next row in our scraper:

entry         <div class="entry">                   </div> </div>


This tells Outwit to grab everything that starts with the entry marker and ends with the pair of closing </div> tags. Don’t worry, you’ll see what we mean below.
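If you would like to check the scraper’s logic outside Outwit Hub, the same ‘marker before’ / ‘marker after’ idea can be sketched with regular expressions. This is only an illustration, not part of the Outwit workflow: the sample_html string below is a shortened stand-in for the real page source, and the live markup may vary slightly from page to page.

```python
import re

# A shortened stand-in for the real page source at doc?id=D0.
sample_html = (
    '<a id="D0.17530608"></a><div class="entry">  '
    '<div class="head">HARVARD COLLEDGE JUNE 8TH. 1753.</div>  '
    '<div class="dateline"><span title="1753-06-08">8 FRIDAY.</span></div>  '
    '<div class="p">At Colledge.</div> </div>'
)

# Dates: everything between <div class="dateline"><span title=" and ">
dates = re.findall(r'<div class="dateline"><span title="(.*?)">', sample_html)

# Entries: everything between <div class="entry"> and the paired closing divs.
# \s* tolerates variable whitespace between the two </div> tags.
entries = re.findall(r'<div class="entry">(.*?)</div>\s*</div>', sample_html, re.S)

print(dates)  # ['1753-06-08']
```

The two `findall` patterns play exactly the role of the two rows in the Outwit scraper: the text captured by `(.*?)` is what lands in each column.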


Screenshot of the scraper screen in Outwit Hub

 

The resulting output will look like this, appearing in a new panel at the bottom of the screen when we hit the ‘execute’ button and then the ‘catch’ button (important: make sure to ‘catch’ the data or you won’t be saving it):

1753-06-08 At Colledge. A Clowdy Dull morning and so continued till about 5 a Clock when it began to rain moderately But continued not long But remained Clowdy all night in which night I watched with Powers.
1753-06-09  At Colledge the weather still remaining Clowdy all Day till 6 o’Clock when the Clowds were Dissipated and the sun brake forth in all his glory.
1753-06-10  At Colledge a clear morning. Heard Mr. Appleton expound those words in I.Cor.12 Chapt. 7 first verses and in the afternoon heard him preach from those words in 26 of Mathew 41 verse watch and pray that ye enter not into temptation


Perfect! We have a spreadsheet of that first online page of John Adams’s diary thanks to the magic of Outwit Hub! The paid version of Outwit can crawl through the entire website automatically, exploring all subpages, matching your scraper against each page it finds, and saving the data to a single spreadsheet file.

Using the free version, you can manually page through the site, getting Outwit to add the new data to your spreadsheet. Beside the ‘catch’ button there is a drop-down arrow. Set this to ‘auto-catch’. Then, in the URL at the top of the screen, which ends with /doc?id=D0, you can page through the site by changing the 0 to 1, then to 2, then to 3, and so on. As an interesting exercise, one of these pages (the one ending id=D1) is slightly broken and won’t work: you won’t get the date information. Click back on ‘scrapers’, look at the code, and you’ll see that for this one page alone there is a space between <div class="dateline"> and <span title=". While we debated omitting this point, this is what research “in the wild” presents you with. So for this page, change your scraper to include the space, execute it, and then change it back for subsequent pages.

Each time a subsequent page loads, Outwit will automatically apply the scraper and paste the data into your spreadsheet. You can then click on the ‘export’ button at the bottom of the screen (the drop-down arrow beside export will let you choose your desired format). The free version is limited to 100 rows of data. When you push up against that barrier, you can hit the ‘empty’ button (after exporting, of course!) and continue onwards, using a new file name.
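The paging step itself is mechanical enough to sketch in code. In this illustration (ours, not Outwit Hub’s; page_urls is a hypothetical helper), we simply generate the sequence of URLs you would visit by changing the trailing 0 to 1, 2, 3, and so on; fetching each page could then be done in the browser, in Outwit, or with a library such as urllib.

```python
# Base URL of the diary pages; only the trailing number changes.
BASE = "http://www.masshist.org/digitaladams/archive/doc?id=D"

def page_urls(first, last):
    """Return the diary page URLs from D<first> through D<last>, inclusive."""
    return [BASE + str(n) for n in range(first, last + 1)]

urls = page_urls(0, 3)
print(urls[0])   # http://www.masshist.org/digitaladams/archive/doc?id=D0
print(len(urls))  # 4
```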

After exporting, open your file in your spreadsheet program of choice (when you have more than one file, you can copy one and paste it into the other). Delete the first row (the one that reads URL, Date, and Entry). We then need to insert a column at the beginning of this data (in most spreadsheet programs, click ‘add column’ in the ‘Insert’ menu), so that each diary entry gets its own unique record number (you can do this either by manually inputting numbers, or by creating a quick formula in the first cell and pasting it into the remaining cells)[1]:

1

1753-06-08 At Colledge. A Clowdy Dull morning and so continued till about 5 a Clock when it began to rain moderately But continued not long But remained Clowdy all night in which night I watched with Powers.

2

1753-06-09  At Colledge the weather still remaining Clowdy all Day till 6 o’Clock when the Clowds were Dissipated and the sun brake forth in all his glory.

3

1753-06-10  At Colledge a clear morning. Heard Mr. Appleton expound those words in I.Cor.12 Chapt. 7 first verses and in the afternoon heard him preach from those words in 26 of Mathew 41 verse watch and pray that ye enter not into temptation


You might need to rearrange some of the columns, but the data is there: some wrangling will put it into the format you see. We save this as ‘johnadamsscrape.csv’.[2] The STMT always expects the data to be structured in three (or more) columns – a record id, a date, and the text itself (of course, it could be structured differently, but for our purposes here, this suffices). Make sure to save the file in the same directory as the tmt-0.4.0.jar file.
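If you would rather build the three-column file in code than in a spreadsheet, the sketch below does the same wrangling (this is our own illustration, not part of the STMT; the two truncated sample rows are from the scrape above). It writes the record id, date, and text in the order the STMT expects.

```python
import csv

# Two sample (truncated) rows from the scrape above: (date, entry text).
rows = [
    ("1753-06-08", "At Colledge. A Clowdy Dull morning..."),
    ("1753-06-09", "At Colledge the weather still remaining Clowdy..."),
]

# Write id, date, text - the three columns the STMT script will look for.
with open("johnadamsscrape.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for record_id, (date, text) in enumerate(rows, start=1):
        writer.writerow([record_id, date, text])
```

The `enumerate(rows, start=1)` call plays the role of the spreadsheet formula in the footnote: it hands each entry a unique, consecutive record number.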


Installing the STMT and Inputting the Data

With our data extracted, we turn to the toolbox. Installing the STMT is a matter of downloading, unzipping, and then double-clicking the file named tmt-0.4.0.jar, which brings up this interface (figure 4.8):


Screenshot of the STMT interface

The STMT operates by running various scripts the user creates. This allows a lot of flexibility, but it can also seem daunting for the first-time user. However, it is not as scary as it might first seem. The STMT uses scripts written in the Scala language. For our purposes, we can simply modify the sample scripts provided by the STMT team.[3] Just scroll down the download page (http://nlp.stanford.edu/software/tmt/tmt-0.4/) to see them listed.

The four scripts provided in the documentation show how to build up a workflow of several different operations into a single script. The scripts that we are interested in are Example 2 and Example 4. Example 2 creates a topic model; Example 4 slices it so that it can be visualized in a spreadsheet by various groupings (such as chronology).

Download the two scripts (example-2-lda-learn.scala and example-4-lda-slice.scala), making sure to save them with the .scala file extension, in the same folder as your data csv file. We will use these example scripts as building blocks to work with our own data!

Example 2 from the STMT website trains a topic model (http://nlp.stanford.edu/software/tmt/tmt-0.3/examples/example-2-lda-learn.scala). To see how it works, open this script using a text editor such as Notepad++ or TextWrangler, which will automatically provide line numbers (quite handy when examining code). We used these programs earlier, in the section on regular expressions.

The critical line is line #16:

16 val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1);

This line tells the STMT where to find your data, and that the first column is the unique ID number. Change the example file name to whatever you called your data file (in our case, “johnadamsscrape.csv”). It should now read:


16 val source = CSVFile("johnadamsscrape.csv") ~> IDColumn(1);


The next line to examine is line #27, in this block (in Scala, comments are noted by two forward slashes, //, which here explain what each line is doing):

25 val text = {
26   source ~> // read from the source file
27   Column(4) ~> // select column containing text
28   TokenizeWith(tokenizer) ~> // tokenize with tokenizer above
29   TermCounter() ~> // collect counts (needed below)
30   TermMinimumDocumentCountFilter(4) ~> // filter terms in <4 docs
31   TermDynamicStopListFilter(60) ~> // filter out 60 most common terms
32   DocumentMinimumLengthFilter(5) // take only docs with >=5 terms
33 }

If you run the script as is, it will look for text in the fourth column; since there is no data there, you will get a java.lang.IndexOutOfBoundsException error message. You need to change line 27 so that it reads:


27 Column(3) ~> // select column containing text

This tells the STMT to look in the third column for the actual text we wish to topic model (if, for example, you had the text in column 2 and the dates in column 3, you would have to adjust these lines accordingly). The script also filters out common terms that might create noise; whether or not 60 is an appropriate number for this corpus is something with which we should experiment.
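To see what these filters actually do, here is a small illustration in Python (not the STMT’s own code; the toy documents and the smaller thresholds of 2, 1, and 2 are ours, chosen so the effect is visible on a handful of entries):

```python
from collections import Counter

docs = [
    ["at", "colledge", "clowdy", "morning"],
    ["at", "colledge", "clear", "morning"],
    ["at", "home"],
]

# Like TermMinimumDocumentCountFilter: drop terms found in fewer than 2 docs.
doc_freq = Counter(term for doc in docs for term in set(doc))
docs = [[t for t in doc if doc_freq[t] >= 2] for t_doc in [0] for doc in docs]

# Like TermDynamicStopListFilter: drop the 1 most common remaining term.
term_freq = Counter(t for doc in docs for t in doc)
stoplist = {t for t, _ in term_freq.most_common(1)}
docs = [[t for t in doc if t not in stoplist] for doc in docs]

# Like DocumentMinimumLengthFilter: keep documents with at least 2 terms left.
docs = [doc for doc in docs if len(doc) >= 2]
print(docs)  # [['colledge', 'morning'], ['colledge', 'morning']]
```

The rare terms (‘clowdy’, ‘clear’, ‘home’) disappear first, then the ubiquitous ‘at’ is treated as a stop word, and the short third entry drops out entirely; in the real script the thresholds 4, 60, and 5 do the same work at corpus scale.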

Finally, you may wish to examine line 38 and line 39:

val params = LDAModelParams(numTopics = 30, dataset = dataset,
  topicSmoothing = 0.01, termSmoothing = 0.01);

Changing numTopics changes, well, the number of topics fitted in the topic model. Save your script (we used the name “johnadams-topicmodel.scala”). In the STMT interface, select File >> Open script, and select your script. The interface will now look like figure 4.9:


Screenshot of the STMT interface with the script loaded

We find it useful to increase the memory available to the STMT for its calculations by changing the number in the Memory box in the interface (the default is 256 mb; type in 1024 or another multiple of 256, as appropriate to your machine’s available memory). With a large amount of text, the STMT can otherwise run out of memory.

Now click ‘run’, and you should soon have a new folder in your directory named something along the lines of “lda-afbfe5c4-30-4f47b13a”. Note that the algorithm for generating the topics is our old friend, LDA, as indicated in the folder name (the lda part of that long string).

Open this folder. There are many subfolders, each one corresponding to a progressive iteration of the topic model. Open the last one, labeled ‘01000’. There are a variety of .txt files; you will want to examine the one called ‘summary.txt’. The information within that file is arranged by topic and by descending importance of words within each topic:


Topic 00   454.55948070466843
company    24.665848270843448
came       21.932877059649453
judge      14.170879738912884
who        13.427171229569737
sir        10.826043463533079
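Rather than reading summary.txt entirely by eye, you can pull the top words per topic out with a short script. This is a hedged sketch: it assumes the layout shown above (a ‘Topic NN’ header line followed by word/weight pairs), and the summary string below is a stand-in, with a second topic whose words and numbers are made up for illustration.

```python
# Stand-in for the contents of summary.txt; Topic 01 is invented for the example.
summary = """Topic 00 454.559
company 24.665
came 21.932
judge 14.170
Topic 01 300.123
congress 19.001
letter 12.000
"""

topics = {}
current = None
for line in summary.splitlines():
    parts = line.split()
    if not parts:
        continue
    if parts[0] == "Topic":
        current = parts[1]                 # e.g. "00"
        topics[current] = []
    elif current is not None:
        topics[current].append(parts[0])   # keep the word, drop the weight

print(topics["00"])  # ['company', 'came', 'judge']
```

To run this against the real file, replace the inline string with `open("summary.txt").read()`.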


If you return to the folder, there is also a csv file indicating the distribution of topics over each of the diary entries, which can be visualized or explored further in a variety of ways; one feature particularly worth looking at in more detail is the STMT’s ability to ‘slice’ topics by time.
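Before turning to the STMT’s own slicing script, here is one way such a file could be explored by hand. This is a hedged sketch, not the STMT’s method: we assume each row of the document-topic csv holds the record id followed by one proportion per topic (check your own output to confirm its layout), and the inline strings stand in for the real files, with dates joined back in from the scraped csv by record id.

```python
import csv
from collections import defaultdict
from io import StringIO

# Stand-in for the document-topic csv: record id, then one proportion per topic.
topic_csv = StringIO("1,0.70,0.30\n2,0.20,0.80\n3,0.50,0.50\n")
# Stand-in for the dates from johnadamsscrape.csv, keyed by record id.
date_by_id = {"1": "1753-06-08", "2": "1753-06-09", "3": "1754-01-02"}

totals = defaultdict(lambda: [0.0, 0.0])  # year -> summed proportions (2 topics here)
counts = defaultdict(int)                 # year -> number of entries
for row in csv.reader(topic_csv):
    year = date_by_id[row[0]][:4]         # slice the date down to its year
    for i, p in enumerate(row[1:]):
        totals[year][i] += float(p)
    counts[year] += 1

# Mean proportion of each topic per year - a crude 'slice' over time.
means = {y: [round(s / counts[y], 2) for s in t] for y, t in totals.items()}
print(means)  # e.g. {'1753': [0.45, 0.55], '1754': [0.5, 0.5]}
```

Averaging per year is only one possible grouping; the same loop could slice by month, by decade, or by any other column in your data.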



[1] Assuming you are using Microsoft Excel, and the first cell where you wish to put a unique ID number is cell A1: put ‘1’ in that cell. In cell A2, type =A1+1 and hit return. Then copy that cell, select the remaining cells you wish to fill with numbers, and paste. Other spreadsheet programs have similar functionality.

[2] Our version of this file may be found at http://themacroscope.org/2.0/datafiles/johnadamsscrape.csv

[3] If the scripts on the Stanford site are now different from what is recounted in this passage, please use the ones at http://themacroscope.org/2.0/code/stmt/


Source: http://www.themacroscope.org/?page_id=803