An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Slicing a topic model

The Stanford Topic Modeling Toolbox allows us to ‘slice’ a model to show its evolution over time, and its contribution to the corpus as a whole. That is, we can look at the proportion of a topic at a particular point in time. To visualize a comparison of two topics (say) over the entire duration of John Adams’ diary entries, one has to create a “pivot table report” (a summary of your data in aggregate categories, built in Microsoft Excel or a similar spreadsheet program). The exact steps will differ based on which version of the spreadsheet you have installed; more on that in a moment.

In the code for the slicing script (http://nlp.stanford.edu/software/tmt/tmt-0.3/examples/example-4-lda-slice.scala), pay attention to line 16:

16 val modelPath = file("lda-59ea15c7-30-75faccf7");


Remember seeing a bunch of letters and numbers like that before? Make sure the file name within the quotation marks points to the output folder you created previously. In line 24, make sure that you have inserted your original csv file name, so it reads like:


val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1);


In line 28, make sure that the column indicated is the column with your text in it, i.e. column 3 (you will likely need to change it from 4 to 3 again). Line 36 is the key line for ‘slicing’ the topic model by date:


36  val slice = source ~> Column(2);
37  // could be multiple columns with: source ~> Columns(2,7,8)


Thus, in our data, column 2 contains the year-month-day date for the diary entry, while column 3 has the entry itself; check that your data is arranged the same way. If it is, this script should work as-is. One could instead have three separate columns for year, month, and day, or indeed use whatever other slicing criterion. Once you’ve made your edits, save the script as johnadams-slicing.scala.
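Before running the edited script, it is worth double-checking which column holds the date and which holds the text, since the Scala script counts columns starting at 1. A minimal sketch of such a check (the file name and sample rows here are stand-ins, not your real scraped data):

```python
import csv

# Stand-in rows mimicking a scraped diary csv: an ID column, a
# year-month-day date column, and the entry text column.
sample_rows = [
    ["1", "1753-06-08", "At Colledge. A clear morning."],
    ["2", "1753-06-09", "At Colledge. A fair day."],
]

# Write the stand-in file so the snippet is self-contained; point
# this at your own csv in practice.
with open("johnadamsscrape.csv", "w", newline="") as f:
    csv.writer(f).writerows(sample_rows)

# Read back the first row and number the columns the way the TMT
# scripts do (1-based): Column(2) should be the date, Column(3) the text.
with open("johnadamsscrape.csv", newline="") as f:
    first = next(csv.reader(f))

for i, value in enumerate(first, start=1):
    print(f"Column({i}): {value}")
```

If the date prints as Column(2) and the entry as Column(3), the script's defaults line up with your data.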

Once you load and run that script (same process as before), you will end up with a .csv file that looks something like the table below. Its location will be noted in the TMT window; for us, for example, it was lda-8bbb972c-30-28de11e5/johnadamsscrape-sliced-top-terms.csv:

Topic     Group ID     Documents     Words
Topic 00  1753-06-08   0.047680088   0.667521
Topic 00  1753-06-09   2.79E-05      2.23E-04
Topic 00  1753-06-10   0.999618435   12.99504
Topic 00  1753-06-11   1.62E-04      0.001781
Topic 00  1753-06-12   0.001597659   0.007988


…and so on for every topic, for every document (‘Group ID’), in your corpus. The numbers under ‘Documents’ and ‘Words’ are decimals because, in the current version of the topic modeling toolbox, each word in the corpus is not assigned to a single topic but distributed over all topics (i.e., ‘cow’ might account for .001, or 0.1%, of topic 4, but .23, or 23%, of topic 11). Similarly, the ‘Documents’ number indicates the total number of documents associated with the topic (again, as a distribution). Creating a pivot table report will allow us to take these individual slices and aggregate them in interesting ways to see the evolution of patterns over time in our corpus.
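The pivot table we are about to build essentially sums these fractional weights. As a sketch of that aggregation (the rows below are a tiny stand-in with the same four columns as the sliced output, not the real file):

```python
from collections import defaultdict

# Stand-in for the sliced output: (topic, group ID, documents, words).
rows = [
    ("Topic 00", "1753-06-08", "0.047680088", "0.667521"),
    ("Topic 00", "1753-06-10", "0.999618435", "12.99504"),
    ("Topic 01", "1753-06-08", "0.850000000", "11.20000"),
]

# Sum the 'Words' weight for each topic across all documents --
# the same 'sum of values' the pivot table report computes.
totals = defaultdict(float)
for topic, group_id, documents, words in rows:
    totals[topic] += float(words)

for topic, weight in sorted(totals.items()):
    print(topic, round(weight, 6))
```

The spreadsheet does exactly this per topic (and, with ‘Group ID’ as row labels, per date as well), which is what makes the line charts below possible.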

To create a pivot table (see figures 4.10 – 4.15):

  1.  Highlight all of the data on the page, including the column headers (figure 4.10).
  2.  Select ‘pivot table’ (under the ‘data’ menu option). The pivot table wizard will open. You can drag and drop the various options at the top of the box to other locations. Arrange ‘topic’ under ‘column labels’ and ‘Group ID’ under ‘row labels’, and under ‘values’ select either documents or words. Under ‘values’, select the ‘i’ and make sure that the value being represented is ‘sum’ rather than ‘count’ (figure 4.11).

Figure 4.10

Figure 4.11

Figure 4.12

You’ve now got a table, as in figure 4.12, that sums up how much the various topics contribute to each document. Let us now visualize the trends using simple line charts.


3. Highlight two columns: try ‘row labels’ and ‘topic 00’. Click charts, then line chart. You now have a visualization of topic 00 over time (figure 4.13).


Figure 4.13

4. To compare various topics over time, click the drop-down arrow beside column labels, and select the topics you wish to visualize. You may have to first unselect ‘select all’, and then click a few; for example, topics 04, 07, and 11, as in figure 4.14.


Figure 4.14

5. Your table will repopulate with just those topics displayed. Highlight the row labels and the topic columns (don’t highlight the ‘grand totals’). Select line chart, and you can now see the fine-grained evolution over time of the various topics within the documents, as in figure 4.15.
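Filtering the column labels down to a few topics is just selecting a subset of the data before charting. A sketch of the equivalent operation (topic names and weights here are invented stand-ins):

```python
from collections import defaultdict

# Stand-in sliced data: (topic, date, words-weight).
rows = [
    ("Topic 04", "1774-09-05", 3.2),
    ("Topic 07", "1774-09-05", 1.1),
    ("Topic 11", "1774-09-05", 0.4),
    ("Topic 04", "1774-09-06", 2.7),
    ("Topic 99", "1774-09-06", 5.0),  # not selected, so filtered out
]

# Unselecting 'select all' and ticking a few topics in the pivot
# table corresponds to keeping only these:
selected = {"Topic 04", "Topic 07", "Topic 11"}

# date -> {topic: weight}: one chart series per selected topic.
series = defaultdict(dict)
for topic, date, weight in rows:
    if topic in selected:
        series[date][topic] = weight

for date in sorted(series):
    print(date, series[date])
```

Each selected topic becomes one line in the chart, plotted against the dates in the row labels.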

Figure 4.15

Let’s look at our example topic model again. In our data, the word ‘congress’ appears in three different topics:


Topic 10       498.75611573955666
town      16.69292048824643
miles     13.89718543152377
tavern    12.93903988493706
through   9.802415532979191
place     9.276480769212077
round     9.048239826246022
number    8.488670753159125
passed    7.799961749511179
north     7.484974235702917
each      6.744740259678034
captn     6.605002560249323
coll      6.504975477980229
back      6.347642624820879
common    6.272711370743526
congress  6.1549912410407135
side      6.058441654893633
village   5.981146620989283
dozen     5.963423616121272
park      5.898152600754463
salem     5.864463108247379


Topic 15       377.279139869195
should    14.714242395918141
may       11.427645785723927
being     11.309756818192291
congress  10.652337301569547
children  8.983289013109097
son       8.449087061231712
well      8.09746455155195
first     7.432256959926409
good      7.309576510891309
america   7.213459745318859
shall     6.9669200007792345
thus      6.941222002768462
state     6.830011194555543
private   6.688248638768475
states    6.546277272369566
navy      5.9781329069165015
must      5.509903082873842
news      5.462992821996899
future    5.105010412312934
present   4.907616840233855


Topic 18       385.6024287288036
french    18.243384948219443
written   15.919785193963612
minister  12.110373497509345
available 10.615420801791679
some      9.903407524395778
who       9.245823795980353
made      8.445444930945051
congress  8.043713670428902
other     7.923965049197159
character 7.1039611800997005
king      7.048852185761656
english   6.856574786621914
governor  6.762114646057875
full      6.520903036682074
heard     6.255137288426042
formed    5.870660807641354
books     5.837244336904303
asked     5.83306916947137
send      5.810249556108117
between   5.776470078486788


These numbers give a sense of the overall weight, or importance, of these words to the topic and to the corpus as a whole. Topic 10 seems to surround a discourse concerning local governance, while topic 15 seems to be about ideas of what governance, at a national scale, ought to be, and topic 18 concerns what is actually happening in terms of the nation’s governance. Thus, we might want to explore how these three topics play out against each other over time, to get a sense of how Adams’ differing scales of ‘governance’ discourse shift. Accordingly, we select topics 10, 15, and 18 from the drop-down menu. The chart updates automatically, plotting the composition of the corpus with these three topics over time (figure 4.16).
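Comparing a shared word across topics can also be done directly from the top-terms output. A sketch, using abbreviated term lists with the ‘congress’ weights copied from the listings above:

```python
# Top-term weights for three topics, abbreviated from the output
# above; the 'congress' values are the ones reported there.
topics = {
    "Topic 10": {"town": 16.69292, "congress": 6.1549912410407135},
    "Topic 15": {"should": 14.71424, "congress": 10.652337301569547},
    "Topic 18": {"french": 18.24338, "congress": 8.043713670428902},
}

# Find every topic containing the word, with its weight there.
word = "congress"
hits = {t: terms[word] for t, terms in topics.items() if word in terms}

# Print topics from heaviest to lightest use of the word.
for topic, weight in sorted(hits.items(), key=lambda kv: -kv[1]):
    print(topic, weight)
```

Here the word weighs most heavily in topic 15, the ‘ought to be’ discourse on national governance, which matches the reading above.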

Figure 4.16


The chart is a bit difficult to read, however. We can see a spike in Adams’ ‘theoretical’ musings on governance in 1774, followed by a rapid spike in his ‘realpolitik’ writings. It would be nice to be able to zoom in on a particular period. Using the drop-down arrow under the dates column, we can select the relevant periods. We could also achieve a dynamic time visualization by copying and pasting the entire pivot table (filtered for our three topics) into a new Google spreadsheet. At http://j.mp/ja-3-topics there is a publicly available spreadsheet that does just that (figure 4.17 is a screenshot). We copied and pasted the filtered pivot table into a blank sheet, then clicked the ‘insert chart’ button on the toolbar. Google recognized the data as having a date column, and automatically selected a scrollable/zoomable time series chart. At the bottom of that chart, we can simply drag the time slider to bracket the period we are interested in.


Figure 4.17

The ability to slice our topic model into time chunks is perhaps, for the historian, the greatest attraction of the Stanford tool. That it also accepts input from a csv file, which we can generate by scraping online sources, is another important feature.

Working with scripts can be daunting at first, but the learning curve is worth the power they bring! We would suggest keeping an unchanged copy of the example scripts from the STMT website in their own folder. Then copy and paste them into a unique folder for each dataset you work with, and edit them in Notepad++ (or similar) to point to the particular dataset. Keep your data and scripts together, preferably in a Dropbox folder, as a backup strategy. Folding Google spreadsheets into your workflow is also handy, especially if you plan on sharing or writing about your work online.



Source: http://www.themacroscope.org/?page_id=806