An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Topic Modeling as Index Locorum: The Networked Corpus Tool

It may help to visualize what, precisely, a topic model produces by remembering that the word ‘topic’ derives from the Greek for ‘place’, topos. Figuratively, topos in this sense came to mean the rhetorical places in an argument where themes and ideas were drawn out. Jeffrey Binder and Collin Jennings of CUNY and NYU use this observation to connect ancient habits of rhetoric and speech-making (Cicero’s house of memory, for instance, associates rhetorical topoi with the places within a house through which the orator moves while performing the speech) with the generation of indices in early print culture.[1] ‘Index’ is short for ‘index locorum’ or ‘index locorum communium’, something that points out places or commonplaces. Binder and Jennings argue that a topic model and the process by which indices are generated both emerge from a similar impetus: to provide navigation and structure to a text.

The index and the topic both belonged to cultural practices of organizing and storing ideas that could be recalled by memory. The development of written and print cultures through the early modern period in Europe transformed these oral rhetorical concepts into our modern notions of indexicality and topicality that refer to any system of categorical organization.[2]

The emergence of the printed index took place at a moment similar to our own, when media of expression were shifting from one kind of ‘platform’ to another. Binder and Jennings’s insight is that this earlier shift can help us understand the signals in the noise found through such techniques as topic modeling. Rather than worry about what label to assign to a given ‘topic’, Binder and Jennings imagine that topic modeling (as with MALLET) acts first to identify passages of similar discourse. Their tool, ‘the Networked Corpus’, then generates linkages between passages of text that share topic similarities (rather than working at the level of the whole document, which is what the Java GUI for MALLET does). They describe their tool as ‘echoing’ the practice of creating a commonplace book, where items are collected together under thematic headings.
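To make the idea of passage-level linkage concrete, here is a minimal sketch in Python. It is not Binder and Jennings's code and does not reproduce how the Networked Corpus decides which passages to connect. It simply assumes that we already have, for each passage, the proportion of its words assigned to each topic, and it links any two passages that both give a topic at least some threshold share. The function name, the passage identifiers, the proportions, and the threshold are all invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def link_passages(topic_weights, threshold=0.3):
    """Pair up passages that share a strongly expressed topic.

    topic_weights maps a passage id to a list of topic proportions
    (hypothetical input; index i in the list is topic i).
    Returns {topic: [(passage_a, passage_b), ...]}.
    """
    strong = defaultdict(list)
    # Sort the passage ids so the output order is predictable.
    for passage_id, weights in sorted(topic_weights.items()):
        for topic, weight in enumerate(weights):
            if weight >= threshold:
                strong[topic].append(passage_id)
    # A topic only produces links if at least two passages express it.
    return {topic: list(combinations(ids, 2))
            for topic, ids in strong.items() if len(ids) > 1}


# Invented proportions for three passages and two topics.
example = {
    "book1-ch8-p3": [0.70, 0.30],
    "book1-ch10-p1": [0.55, 0.45],
    "book4-ch2-p7": [0.10, 0.90],
}
links = link_passages(example, threshold=0.5)
print(links)
```

With these made-up numbers, only the first two passages are linked (through topic 0); the third passage is the sole strong carrier of topic 1 and so has nothing to link to. The point of working with passages rather than whole documents is that two chapters that differ overall can still be joined where they briefly speak the same language.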

They built their tool in Python, and used it to compare a topic model fitted to Adam Smith’s ‘The Wealth of Nations’ against that book’s index (both text and index were downloaded from the Project Gutenberg website).[3] They found that many of the topics generated by MALLET, when read through the Networked Corpus tool (which, remember, ties together passages rather than documents), corresponded closely to headings in the index: the topic “wages labour common workmen employments year employment”, for instance, lines up with the index heading “Labour”. More interesting are the points where the two do not match. Binder and Jennings suggest that exploring these points of non-congruence will reveal the assumptions that underlie both the computational and the human models.
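A rough way to picture this comparison is sketched below. This is our own toy illustration, not the procedure Binder and Jennings followed: it simply checks whether an index heading appears among a topic's key words, and then lists the topics with no matching heading, which are the points of non-congruence worth reading closely. The first topic and the heading “Labour” come from their example; the second topic and the heading “Taxes” are invented, as is the function name.

```python
def match_topics_to_headings(topics, headings):
    """Map each topic id to the index headings found among its key words.

    topics maps a topic id to a space-separated string of key words;
    headings is a list of index headings. The match is a naive,
    case-insensitive comparison of whole words.
    """
    matches = {}
    for topic_id, key_words in topics.items():
        words = set(key_words.lower().split())
        matches[topic_id] = [h for h in headings if h.lower() in words]
    return matches


topics = {
    0: "wages labour common workmen employments year employment",  # from the text
    1: "price silver corn quantity market",  # invented for illustration
}
headings = ["Labour", "Taxes"]  # "Taxes" is an invented heading

matches = match_topics_to_headings(topics, headings)
# Topics with no heading at all are the mismatches to investigate.
unmatched = [topic_id for topic_id, found in matches.items() if not found]
print(matches)
print(unmatched)
```

A real comparison would need something subtler than exact word matching (an index heading such as “Labour” gathers sub-entries that never use the word itself), but even this crude version separates the topics that an indexer anticipated from those that only the model surfaced.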

This kind of computational work, which explores the things that do not work and so enables us to unpack both the human and the computational assumptions at play, is in our view a very important outcome of being able to play with ‘big data’ in this way. It defamiliarizes things that have become so familiar as to be transparent. Treating a topic model as a kind of index (and thinking through what that means) becomes a rich exercise for humanities scholars.

Binder and Jennings have made their code available via GitHub; see their networkedcorpus.com page for the links and the code. Running it on your own machine is quite straightforward, though it does require a few more packages to be installed. You will need Python 2.7, ‘numpy’, and ‘scipy’. Search for these and follow the installation instructions for whatever operating system you use (Windows, for instance, has self-extracting installers).
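Before going further, it can be worth confirming that Python can actually find the two packages. The short script below is our own convenience sketch and is not part of the Networked Corpus; the function name is invented. It tries to import each named package and reports any that fail, and it is written so that it should run under Python 2.7 as well as later versions.

```python
import importlib


def missing_packages(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# The two packages named in the installation instructions above.
missing = missing_packages(["numpy", "scipy"])
if missing:
    print("Still to install: " + ", ".join(missing))
else:
    print("numpy and scipy are both available.")
```

Save this under any file name you like and run it with the same Python interpreter you intend to use for the Networked Corpus script, since a machine can have several Python installations with different packages.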

Then download the networked corpus zip file, and unzip it somewhere on your machine.

In the networked-corpus folder, there is a subfolder called ‘res’ and a script called gen-networked-corpus.py. Move these two items to your MALLET folder.

Generate a topic model from the command line as you normally would. It is very important to follow the instructions under ‘Preparing the texts’ on the Networked Corpus GitHub page, https://github.com/jeffbinder/networkedcorpus.

Once your topic model is generated, type at the command prompt:

[code]
gen-networked-corpus.py --input-dir --output-dir
[/code]

The output folder will contain an index.html file, all of the supporting HTML and style sheets, and the data. Together these let you browse your output, flipping between documents with their topics and exemplary passages that contain those topics.

References
  1. http://ach.org/2013/12/30/cultures-of-visualization-adam-smiths-index-and-topic-modeling/
  2. http://ach.org/2013/12/30/cultures-of-visualization-adam-smiths-index-and-topic-modeling/
  3. The results may be viewed at http://www.networkedcorpus.com/

Source: http://www.themacroscope.org/?page_id=412