An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Slicing a Topic Model

The Stanford Topic Modeling Toolbox allows us to ‘slice’ a model to show its evolution over time and its contribution to the corpus as a whole. To achieve a visualization comparing (say) two topics over the entire duration of John Adams’ diary entries, one has to create a pivot table report in Microsoft Excel (or a similar spreadsheet). The exact steps will differ depending on which version of the spreadsheet you have installed.

In the code for the script, pay attention to line 16:

val modelPath = file("lda-59ea15c7-30-75faccf7");

Make sure the file name within the quotation marks points to the output folder you created previously. In line 24, make sure that you have inserted the name of your original CSV file. In line 28, make sure that the column indicated is the column with your text in it. Line 36 is the key line for ‘slicing’ the topic model by date:

val slice = source ~> Column(3);
// could be multiple columns with: source ~> Columns(2,7,8)

Thus, in our data, column 3 contains the year-month-day date for the diary entry. One could instead have three separate columns for year, month, and day, or indeed slice on any other criterion.

Once you load and run that script, you will end up with a .csv file that looks something like this:

Topic      Group ID     Documents     Words
Topic 00   1753-06-08   0.047680088   0.667521
Topic 00   1753-06-09   2.79E-05      2.23E-04
Topic 00   1753-06-10   0.999618435   12.99504
Topic 00   1753-06-11   1.62E-04      0.001781
Topic 00   1753-06-12   0.001597659   0.007988

…and so on for every topic, for every document (‘Group ID’), in your corpus. The numbers under ‘Documents’ and ‘Words’ are decimals because, in the current version of the Topic Modeling Toolbox, each word in the corpus is not assigned to a single topic but is spread over a distribution of topics (i.e., ‘cow’ might be .001, or 0.1%, of topic 4, but .23, or 23%, of topic 11). Similarly, the ‘Documents’ number indicates the total number of documents associated with the topic (again, as a distribution). Creating a pivot table report will allow us to take these individual slices and aggregate them in interesting ways to see the evolution of patterns over time in our corpus.
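The aggregation that a pivot table performs can also be sketched outside the spreadsheet. The following is a minimal Python sketch, not part of the Toolbox: it assumes the sliced output is a comma-separated file whose header row reads Topic, Group ID, Documents, Words (as in the sample rows above), and it sums the fractional counts per topic per year-month. The function name `pivot` and its parameters are our own illustrative choices.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the table above. For a real file, replace
# io.StringIO(SAMPLE) with open("your-slice-output.csv", newline="").
SAMPLE = """Topic,Group ID,Documents,Words
Topic 00,1753-06-08,0.047680088,0.667521
Topic 00,1753-06-09,2.79E-05,2.23E-04
Topic 00,1753-06-10,0.999618435,12.99504
Topic 00,1753-06-11,1.62E-04,0.001781
Topic 00,1753-06-12,0.001597659,0.007988
"""


def pivot(rows, value="Words", period_len=7):
    """Sum `value` for every (period, topic) pair.

    The period is the first `period_len` characters of the year-month-day
    date: 4 gives years, 7 gives year-months, 10 keeps single days.
    """
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        period = row["Group ID"][:period_len]
        table[period][row["Topic"]] += float(row[value])
    return table


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
monthly_words = pivot(rows, value="Words")
monthly_docs = pivot(rows, value="Documents")

for period in sorted(monthly_words):
    for topic in sorted(monthly_words[period]):
        print(period, topic, round(monthly_words[period][topic], 6))
```

Because the counts are fractional, the sums are fractional too: they measure how much of each month's writing the model attributes to a topic, not a whole number of words or documents.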

Making the pivot table chart output

Select all of your data, and click on the insert pivot table option in Excel. Then, click on ‘insert chart’ and insert a line chart. Each version of Excel has a slightly different version of the pivot table wizard, so you will need to experiment with your version. What you are looking to create is a chart where ‘sum of words’ or ‘sum of documents’ is in the Values field, ‘topic’ is in the legend field, and ‘group ID’ is in the axis field, as in the figure. Then, on the spreadsheet, beside the ‘column labels’ title there is a drop-down arrow. Click on this, and you are presented with a filter list. Select just the topic(s) you wish to examine or compare. The chart will update automatically.

Imagine we were interested in Adams’ views on governance. We might reasonably expect that ‘congress’ would be a useful word to zero in on. In our data, the word ‘congress’ appears in three different topics:

Topic 10 498.75611573955666
town 16.69292048824643
miles 13.89718543152377
tavern 12.93903988493706
through 9.802415532979191
place 9.276480769212077
round 9.048239826246022
number 8.488670753159125
passed 7.799961749511179
north 7.484974235702917
each 6.744740259678034
captn 6.605002560249323
coll 6.504975477980229
back 6.347642624820879
common 6.272711370743526
congress 6.1549912410407135
side 6.058441654893633
village 5.981146620989283
dozen 5.963423616121272
park 5.898152600754463
salem 5.864463108247379

Topic 15 377.279139869195
should 14.714242395918141
may 11.427645785723927
being 11.309756818192291
congress 10.652337301569547
children 8.983289013109097
son 8.449087061231712
well 8.09746455155195
first 7.432256959926409
good 7.309576510891309
america 7.213459745318859
shall 6.9669200007792345
thus 6.941222002768462
state 6.830011194555543
private 6.688248638768475
states 6.546277272369566
navy 5.9781329069165015
must 5.509903082873842
news 5.462992821996899
future 5.105010412312934
present 4.907616840233855

Topic 18 385.6024287288036
french 18.243384948219443
written 15.919785193963612
minister 12.110373497509345
available 10.615420801791679
some 9.903407524395778
who 9.245823795980353
made 8.445444930945051
congress 8.043713670428902
other 7.923965049197159
character 7.1039611800997005
king 7.048852185761656
english 6.856574786621914
governor 6.762114646057875
full 6.520903036682074
heard 6.255137288426042
formed 5.870660807641354
books 5.837244336904303
asked 5.83306916947137
send 5.810249556108117
between 5.776470078486788
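Hunting through the topic lists by eye works for three topics, but a short script can do the same search. The sketch below is illustrative only: it assumes the summary is laid out as in the lists above (a ‘Topic N weight’ line followed by ‘word weight’ lines, separated by whitespace), and the excerpt it parses is abbreviated from those lists. The function names are our own, not part of the Toolbox.

```python
# Abbreviated excerpt of the three topic lists above.
SUMMARY = """\
Topic 10 498.75611573955666
town 16.69292048824643
congress 6.1549912410407135
Topic 15 377.279139869195
should 14.714242395918141
congress 10.652337301569547
Topic 18 385.6024287288036
french 18.243384948219443
congress 8.043713670428902
"""


def parse_summary(text):
    """Return {topic name: {word: weight}} from a whitespace-separated summary."""
    topics = {}
    current = None
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "Topic":
            # A header line such as "Topic 10 498.75..." starts a new topic.
            current = "Topic " + parts[1]
            topics[current] = {}
        elif len(parts) == 2 and current is not None:
            # A "word weight" line belongs to the most recent topic.
            topics[current][parts[0]] = float(parts[1])
    return topics


def topics_containing(topics, word):
    """Return {topic name: weight} for every topic that lists `word`."""
    return {name: words[word] for name, words in topics.items() if word in words}


topics = parse_summary(SUMMARY)
for name, weight in topics_containing(topics, "congress").items():
    print(name, weight)
```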

The numbers beside the key words give a sense of the overall weight or importance of these words to the topic and to the corpus as a whole. Topic 10 seems to surround a discourse on local governance, while Topic 15 seems to be about ideas of what governance, at a national scale, ought to be, and Topic 18 concerns what is actually happening in the nation’s governance. Thus, we might want to explore how these three topics play out against each other over time, to get a sense of how Adams’ ‘governance’ discourses at differing scales develop. Accordingly, we select topics 10, 15, and 18 from the drop-down menu. The chart updates automatically, plotting the contribution of these three topics to the corpus over time.
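The filtered pivot table has one row per date and one column per selected topic. As a rough illustration of that layout, here is a minimal Python sketch; the rows are placeholders with invented numbers (they are not values from the Adams model), and `wide_table` is a hypothetical helper name.

```python
from collections import defaultdict

# Placeholder (topic, group ID, words) rows. The numbers are invented
# for illustration only and do not come from the Adams diary model.
ROWS = [
    ("Topic 10", "1774-08-10", 4.0),
    ("Topic 15", "1774-08-10", 1.5),
    ("Topic 18", "1774-08-10", 0.5),
    ("Topic 03", "1774-08-10", 9.0),  # not selected, so it is filtered out
    ("Topic 10", "1774-08-11", 0.25),
    ("Topic 15", "1774-08-11", 6.0),
]


def wide_table(rows, topics):
    """Return (header, lines): one line per date, one column per selected topic."""
    cells = defaultdict(float)  # (date, topic) -> summed words
    dates = set()
    for topic, group_id, words in rows:
        if topic in topics:
            cells[(group_id, topic)] += words
            dates.add(group_id)
    header = ["Group ID"] + list(topics)
    # A topic with no entry for a date contributes 0.0 to that cell.
    lines = [[d] + [cells[(d, t)] for t in topics] for d in sorted(dates)]
    return header, lines


header, lines = wide_table(ROWS, ["Topic 10", "Topic 15", "Topic 18"])
print("\t".join(header))
for line in lines:
    print("\t".join(str(cell) for cell in line))
```

Each column of this table corresponds to one line in the chart, which is what putting ‘topic’ in the legend field achieves in the spreadsheet.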

The chart is a bit difficult to read, however. We can see a spike in Adams’ ‘theoretical’ musings on governance in 1774, followed by a rapid spike in his ‘realpolitik’ writings. It would be nice to be able to zoom in on a particular period. We could select the dates relevant to the era we wish to explore by clicking on the down arrow in the pivot table and ticking off the exact dates we are interested in. We could also achieve a dynamic time visualization by copying and pasting the entire pivot table (filtered for our three topics) into a new Google spreadsheet. At http://j.mp/ja-3-topics there is a publicly available spreadsheet that does just that. We copied and pasted the filtered pivot table into a blank sheet. Then we clicked on the ‘insert chart’ button on the toolbar. Google recognized the data as having a date column, and automatically selected a scrollable/zoomable time series chart. At the bottom of that chart, we can simply drag the time slider to bracket the period in which we are interested.
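Bracketing a period amounts to keeping only the rows whose date falls inside a window. A minimal Python sketch of that filter follows; it assumes the dates are in the year-month-day form used in our data, the rows are placeholders with invented values, and `in_window` is a hypothetical helper name.

```python
from datetime import date

# Placeholder (group ID, topic, words) rows. The values are invented
# for illustration only.
ROWS = [
    ("1773-12-17", "Topic 15", 0.5),
    ("1774-06-20", "Topic 15", 7.25),
    ("1774-09-05", "Topic 18", 3.0),
    ("1775-04-19", "Topic 18", 6.5),
]


def in_window(rows, start, end):
    """Keep rows dated between `start` and `end`, inclusive (year-month-day)."""
    first = date.fromisoformat(start)
    last = date.fromisoformat(end)
    return [row for row in rows if first <= date.fromisoformat(row[0]) <= last]


zoomed = in_window(ROWS, "1774-01-01", "1774-12-31")
for row in zoomed:
    print(*row)
```

Parsing the strings into dates, rather than comparing them as text, means a malformed date raises an error instead of silently landing in the wrong period.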

The ability to slice our topic model into time chunks is perhaps, for the historian, the greatest attraction of the Stanford tool. That it also accepts input from a CSV file, which we can generate by scraping online sources, is another important feature of the tool. Working with scripts can be daunting at first. We would suggest keeping a ‘vanilla’ copy of the example scripts from the TMT website in their own folder. Then, copy and paste them to a unique folder for each dataset you will work with. Edit them in Notepad++ (or similar) to point to the particular dataset you wish to work with. Keep your data and scripts together, preferably in a Dropbox folder as a backup strategy. Folding Google spreadsheets into your workflow is also handy, especially if you plan on sharing or writing about your work online.

Source: http://www.themacroscope.org/?page_id=429