The Stanford Topic Modeling Toolbox allows us to ‘slice’ a model to show its evolution over time and its contribution to the corpus as a whole. That is, we can look at the proportion of a topic at a particular point in time. To visualize a comparison of, say, two topics over the entire duration of John Adams’ diary entries, one has to create a “pivot table report” (a summary of your data in aggregate categories, made in Microsoft Excel or a similar spreadsheet program). The exact steps will differ based on which version of the spreadsheet you have installed; more on that in a moment.
Remember seeing a string of letters and numbers like that before? Make sure the file name within the quotation marks points to the output folder you created previously. In line 24, make sure that you have inserted the original csv file name, so that it reads:
In line 28, make sure that the column indicated is the column containing your text, i.e. column 3 (the default is 4, so you will change it to 3 again). Line 36 is the key line for ‘slicing’ the topic model by date:
Thus, in our data, column 2 contains the year-month-day date for each diary entry, whereas column 3 has the entry itself (check to make sure that your data is arranged the same way). It just so happens that, by default, this script should work as-is! One could instead have three separate columns for year, month, and day, or indeed any other slicing criterion. Once you’ve made your edits, save the script as johnadams-slicing.scala.
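Pulled together, the edits above look something like the following. This is a sketch reconstructed from the STMT example scripts, not a verbatim copy: the exact line numbers, the surrounding tokenizing filters, and the variable names may differ in the copy you downloaded, so use it as a guide rather than a replacement.

```scala
// around line 24: point the source at your own csv file,
// using column 1 as the document id
val source = CSVFile("johnadamsscrape.csv") ~> IDColumn(1)

// around line 28: the column holding the diary text -- column 3 in our data
// ('model' here is the previously trained topic model loaded earlier
// in the script)
val text = {
  source ~> Column(3) ~>
    TokenizeWith(model.tokenizer.get) ~>
    TermCounter() ~>
    DocumentMinimumLengthFilter(5)
}

// line 36: the key line -- slice the model by the date column (column 2)
val slice = source ~> Column(2)
// slicing on several criteria at once would look like:
// source ~> Columns(2, 7, 8)
```

The `~>` operator chains each processing step onto the previous one; changing a single `Column(...)` number is all it takes to re-point the script at a differently arranged csv file.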
Once you load and run that script (the same process as before), you will end up with a .csv file that looks something like this. Its location will be noted in the TMT window – for us, for example, it was lda-8bbb972c-30-28de11e5/johnadamsscrape-sliced-top-terms.csv:
¶ 27 Leave a comment on paragraph 27 0 …and so on for every topic, for every document (‘Group ID’), in your corpus. The numbers under ‘documents’ and ‘words’ will be decimals, because in the current version of the topic modeling toolbox, each word in the corpus is not assigned to a single topic, but over a distribution of topics (ie ‘cow’ might be .001 of topic 4 – or 0.1%, but .23 or 23% of topic 11). Similarly, the ‘documents’ number indicates the total number of documents associated with the topic (again, as a distribution). Creating a pivot table report will allow us to take these individual slices and aggregate them in interesting ways to see the evolution of patterns over time in our corpus.
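To make those fractional counts concrete, here is a toy sketch in Scala (the word and the weights are invented for illustration; they are not taken from the John Adams model). Each word’s weight is spread over all the topics as a distribution summing to 1, and the fractional ‘words’ column in the sliced output is the sum of such per-word weights for one topic in one document:

```scala
object TopicWeights {
  // Hypothetical distribution of the single word 'cow' over three topics:
  // 0.1% of its weight in topic 4, 23% in topic 11, and the rest in topic 7.
  val cow: Map[Int, Double] = Map(4 -> 0.001, 11 -> 0.23, 7 -> 0.769)

  def main(args: Array[String]): Unit = {
    // the per-word weights form a distribution, so they sum to 1
    println(f"total weight of 'cow' across topics: ${cow.values.sum}%.3f")
    println(f"share of 'cow' in topic 11: ${cow(11) * 100}%.0f%%")
  }
}
```

Summing many such fractions for all the words in a document is what produces decimal, rather than whole-number, counts in the output.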
- Highlight all of the data on the page, including the column header (figure 4.10).
- Select ‘pivot table’ (under the ‘data’ menu option). The pivot table wizard will open. You can drag and drop the various options at the top of the box to other locations. Arrange ‘topic’ under ‘column labels’ and ‘Group ID’ under ‘row labels’, and under ‘values’ select either documents or words. Under ‘values’, click the ‘i’ and make sure that the value being represented is ‘sum’ rather than ‘count’ (figure 4.11).
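The ‘sum’ aggregation the pivot table performs can be sketched in a few lines of Scala. The rows below are invented stand-ins for the sliced csv – (date slice, topic id, fractional word weight) – and the point is only to show what grouping by date and topic, then summing, does:

```scala
object PivotSketch {
  // Invented rows mimicking the sliced output:
  // (date slice, topic id, fractional 'words' weight)
  val rows: Seq[(String, Int, Double)] = Seq(
    ("1774-03", 10, 0.40), ("1774-03", 10, 0.15),
    ("1774-03", 15, 0.70), ("1774-04", 15, 0.25)
  )

  // Group by (date, topic) and SUM the fractional weights -- the same
  // operation the pivot table performs when 'sum' (not 'count') is chosen.
  def pivot(rs: Seq[(String, Int, Double)]): Map[(String, Int), Double] =
    rs.groupBy { case (date, topic, _) => (date, topic) }
      .map { case (key, group) => key -> group.map(_._3).sum }

  def main(args: Array[String]): Unit =
    pivot(rows).toSeq.sortBy(_._1).foreach { case ((date, topic), w) =>
      println(f"$date topic $topic: $w%.2f")
    }
}
```

Choosing ‘count’ instead of ‘sum’ would tally how many rows fall in each cell rather than adding their weights, which is why the distinction in the wizard matters.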
You’ve now got a table, as in figure 4.12, that sums up how much the various topics contribute to each document. Let us now visualize these trends using simple line charts.
4. To compare various topics over time, click the drop-down arrow beside ‘column labels’ and select the topics you wish to visualize. You may have to unselect ‘select all’ first, and then click a few – for example, topics 04, 07, and 11, as in figure 4.14.
6. Your table will repopulate with just those topics displayed. Highlight the row labels and the columns (don’t highlight the ‘grand totals’). Select line chart, and you can now see the evolution, over fine-grained time, of the various topics within the documents, as in figure 4.15.
These numbers give a sense of the overall weight or importance of these words to the topic and to the corpus as a whole. Topic 10 seems to surround a discourse concerning local governance; topic 15 seems to be about ideas of what governance, at a national scale, ought to be; and topic 18 concerns what is actually happening in the nation’s governance. Thus, we might want to explore how these three topics play out against each other over time, to get a sense of the differing scales of Adams’ ‘governance’ discourse. Accordingly, we select topics 10, 15, and 18 from the drop-down menu. The chart updates automatically, plotting the composition of the corpus with these three topics over time (figure 4.16).
The chart is a bit difficult to read, however. We can see a spike in Adams’ ‘theoretical’ musings on governance in 1774, followed by a rapid spike in his ‘realpolitik’ writings. It would be nice to be able to zoom in on a particular period. Using the drop-down arrow under the dates column, we can select just the relevant periods. We could also achieve a dynamic time visualization by copying and pasting the entire pivot table (filtered for our three topics) into a new Google spreadsheet. At http://j.mp/ja-3-topics there is a publicly available spreadsheet that does just that (figure 4.17 is a screenshot). We copied and pasted the filtered pivot table into a blank sheet, then clicked the ‘insert chart’ button on the toolbar. Google recognized the data as having a date column and automatically selected a scrollable, zoomable time series chart. At the bottom of that chart, we can simply drag the time slider to bracket the period we are interested in.
The ability to slice our topic model into time chunks is, for the historian, perhaps the greatest attraction of the Stanford tool. That it also accepts input from a csv file, which we can generate by scraping online sources, is another important feature.
Working with scripts can be daunting at first, but the learning curve is worth the power that they bring us! We would suggest keeping an unchanged copy of the example scripts from the STMT website in their own folder. Then, copy and paste them into a unique folder for each dataset you will work with, and edit them in Notepad++ (or similar) to point to the particular dataset in question. Keep your data and scripts together, preferably in a Dropbox folder as a backup strategy. Folding Google spreadsheets into your workflow is also handy, especially if you plan on sharing or writing about your work online.