The Stanford Topic Modeling Toolbox allows us to ‘slice’ a model to show its evolution over time and its contribution to the corpus as a whole. To produce a visualization comparing, say, two topics over the entire span of John Adams’ diary entries, one has to create a pivot table report in Microsoft Excel (or in a similar spreadsheet). The exact steps will differ depending on which version of the spreadsheet you have installed.
Make sure the file name within the quotation marks points to the output folder you created previously. In line 24, make sure that you have inserted the name of the original csv file. In line 28, make sure that the column indicated is the column containing your text. Line 36 is the key line for ‘slicing’ the topic model by date:
Thus, in our data, column 3 contains the year-month-day date for the diary entry. One could instead have three separate columns for year, month, and day, or indeed use any other slicing criterion.
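If you would rather slice by year or by month than by the full date, the single date column can be split into separate columns before the csv file goes into the toolbox. The following is a minimal sketch in Python, not part of the toolbox itself. The function name `add_date_parts`, the sample rows, and the column order (id, text, date) are assumptions made for illustration; only the position of the date in column 3 comes from our data.

```python
import csv
import io

def add_date_parts(rows, date_col=2):
    """Append year, month, and day columns parsed from a YYYY-MM-DD date column.

    date_col is zero-based, so 2 means the third column of the csv file.
    """
    out = []
    for row in rows:
        year, month, day = row[date_col].split("-")
        out.append(row + [year, month, day])
    return out

# Hypothetical sample rows (id, text, date); these are not real diary entries.
sample = io.StringIO(
    '1,"At home all day.",1753-06-08\n'
    '2,"Rode to Boston.",1774-09-05\n'
)
rows = add_date_parts(list(csv.reader(sample)))
for row in rows:
    print(row)
```

With the extra columns in place, the slicing line of the script could point at the year column instead of the full date.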
…and so on for every topic, for every document (‘Group ID’), in your corpus. The numbers under ‘documents’ and ‘words’ will be decimals because, in the current version of the Topic Modeling Toolbox, each word in the corpus is not assigned to a single topic but is spread over a distribution of topics (i.e., ‘cow’ might be .001, or 0.1%, of topic 4, but .23, or 23%, of topic 11). Similarly, the ‘documents’ number indicates the total number of documents associated with the topic (again, as a distribution). Creating a pivot table report will allow us to take these individual slices and aggregate them in interesting ways to see the evolution of patterns over time in our corpus.
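To see why these totals come out as decimals, it can help to work through a toy case. The sketch below is only an illustration of fractional assignment under invented numbers; it is not how the toolbox performs its inference. Each dictionary stands for one occurrence of a word, with hypothetical shares across topics that sum to 1.

```python
# Hypothetical shares: each word occurrence is spread across topics,
# and the shares for one occurrence sum to 1.0.
occurrences = [
    {"Topic 04": 0.25, "Topic 11": 0.75},
    {"Topic 04": 0.5, "Topic 11": 0.5},
    {"Topic 11": 1.0},
]

def topic_word_totals(occurrences):
    """Sum the fractional shares per topic, giving a decimal 'words' total."""
    totals = {}
    for shares in occurrences:
        for topic, share in shares.items():
            totals[topic] = totals.get(topic, 0.0) + share
    return totals

totals = topic_word_totals(occurrences)
print(totals)
```

Three word occurrences still add up to three words overall, but no single topic receives a whole number of them.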
Select all of your data, and click on the insert pivot table option in Excel. Then, click on ‘insert chart’ and insert a line chart. Each version of Excel has a slightly different pivot table wizard, so you will need to experiment with your version. What you are looking to create is a chart where ‘sum of words’ or ‘sum of documents’ is in the Values field, ‘topic’ is in the legend field, and ‘group ID’ is in the axis field, as in the figure. Then, on the spreadsheet, beside the ‘column labels’ title there is a drop-down arrow. Click on this, and you are presented with a filter list. Select just the topic(s) that you wish to examine or compare. The chart will update automatically. Imagine we were interested in Adams’ views on governance. We might reasonably expect that ‘congress’ would be a useful word to zero in on. In our data, the word ‘congress’ appears in three different topics:
Topic 10 498.75611573955666
Topic 15 377.279139869195
Topic 18 385.6024287288036
The numbers beside the key words give a sense of the overall weight or importance of these words to the topic and to the corpus as a whole. Topic 10 seems to concern a discourse on local governance, Topic 15 seems to be about ideas of what governance at a national scale ought to be, and Topic 18 concerns what is actually happening in the nation’s governance. Thus, we might want to explore how these three topics play out against each other over time, to get a sense of how Adams’ discourses on ‘governance’ at these differing scales develop. Accordingly, we select topics 10, 15, and 18 from the drop-down menu. The chart updates automatically, plotting the composition of the corpus with these three topics over time.
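For readers who prefer a script to a spreadsheet, the same aggregation and filtering can be sketched in Python using only the standard library. This is a rough equivalent of the pivot table, not a replacement for it. It assumes the sliced output has the column headers ‘Topic’, ‘Group ID’, ‘Documents’, and ‘Words’; the function name `pivot_usage` and all of the sample values are invented for illustration.

```python
import csv
import io
from collections import defaultdict

def pivot_usage(rows, value="Words", topics=None):
    """Return {group_id: {topic: summed value}}, optionally limited to some topics."""
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        if topics is not None and row["Topic"] not in topics:
            continue  # plays the role of the pivot table's filter list
        table[row["Group ID"]][row["Topic"]] += float(row[value])
    return table

# Hypothetical sliced output; the numbers are made up, not taken from the diary.
SAMPLE = """Topic,Group ID,Documents,Words
Topic 10,1774-01-01,0.5,12.5
Topic 15,1774-01-01,0.25,30.0
Topic 18,1774-01-01,0.25,4.5
Topic 03,1774-01-01,0.5,8.0
Topic 10,1774-01-02,0.25,2.5
Topic 15,1774-01-02,0.5,10.0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
table = pivot_usage(rows, value="Words", topics={"Topic 10", "Topic 15", "Topic 18"})
for group_id in sorted(table):
    print(group_id, dict(table[group_id]))
```

Each group ID becomes a point on the horizontal axis and each topic a separate line, which mirrors the axis and legend fields of the pivot chart.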
The chart is a bit difficult to read, however. We can see a spike in Adams’ ‘theoretical’ musings on governance in 1774, followed by a rapid spike in his ‘realpolitik’ writings. It would be nice to be able to zoom in on a particular period. We could restrict the chart to the dates relevant to the era we wish to explore by clicking on the down arrow in the pivot table and ticking off the exact dates we are interested in. We could also achieve a dynamic time visualization by copying and pasting the entire pivot table (filtered for our three topics) into a new Google spreadsheet. At http://j.mp/ja-3-topics there is a publicly available spreadsheet that does just that. We copied and pasted the filtered pivot table into a blank sheet. Then we clicked on the ‘insert chart’ button on the toolbar. Google recognized the data as having a date column, and automatically selected a scrollable, zoomable time series chart. At the bottom of that chart, we can simply drag the time slider to bracket the period in which we are interested.
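Zooming in on a period amounts to keeping only the group IDs that fall inside a date window. A minimal sketch in Python follows, assuming the group IDs are year-month-day strings as in our data; the helper name `in_window` and the sample dates are hypothetical.

```python
from datetime import date

def in_window(group_id, start, end):
    """True if a YYYY-MM-DD group ID falls within [start, end], inclusive."""
    return start <= date.fromisoformat(group_id) <= end

# Hypothetical group IDs; a real list would come from the sliced output.
group_ids = ["1773-12-31", "1774-03-15", "1774-09-05", "1775-01-01"]
window = [g for g in group_ids if in_window(g, date(1774, 1, 1), date(1774, 12, 31))]
print(window)
```

Parsing the strings into real dates, rather than comparing text, keeps the filter correct even if the group IDs are later reformatted.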
The ability to slice our topic model into time chunks is perhaps, for the historian, the greatest attraction of the Stanford tool. That it also accepts input from a csv file, which we can generate by scraping online sources, is another important feature of the tool. Working with scripts can be daunting at first. We would suggest keeping a ‘vanilla’ copy of the example scripts from the TMT website in their own folder. Then, copy and paste them to a unique folder for each dataset you will work with. Edit them in Notepad++ (or similar) to point to the particular dataset you wish to work with. Keep your data and scripts together, preferably in a Dropbox folder as a backup strategy. Folding Google spreadsheets into your workflow is also handy, especially if you plan on sharing or writing about your work online.
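The copy-then-edit habit described above can also be scripted. The sketch below is one possible way to do it in Python, under the assumption that the example scripts carry a `.scala` extension; the function name `start_dataset_folder` and the folder names are hypothetical. It never overwrites a script that already exists in the dataset folder, so edits you have made are safe.

```python
import shutil
import tempfile
from pathlib import Path

def start_dataset_folder(vanilla_dir, dataset_dir):
    """Copy every .scala script from the untouched folder into a per-dataset folder."""
    vanilla_dir, dataset_dir = Path(vanilla_dir), Path(dataset_dir)
    dataset_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for script in sorted(vanilla_dir.glob("*.scala")):
        target = dataset_dir / script.name
        if not target.exists():  # never overwrite scripts you have already edited
            shutil.copy2(script, target)
            copied.append(target.name)
    return copied

# Demonstration in a throwaway temporary directory with a placeholder script.
with tempfile.TemporaryDirectory() as tmp:
    vanilla = Path(tmp) / "tmt-vanilla"
    vanilla.mkdir()
    (vanilla / "example-script.scala").write_text("// untouched example\n")
    copied = start_dataset_folder(vanilla, Path(tmp) / "adams-diary")
    copied_again = start_dataset_folder(vanilla, Path(tmp) / "adams-diary")
print(copied, copied_again)
```

Running the function a second time copies nothing, which is the point: the vanilla folder stays pristine and the dataset folder keeps your changes.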