An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Topic Modeling with the GUI Topic Modeling Tool

The GUI Topic Modeling Tool (GTMT) is an excellent way to introduce topic modelling in classroom settings and other contexts where technical expertise may be limited (in our experience, it is a good entryway into simple topic modelling), or to explore a body of materials quickly. Because it is a Java-based program, it also has the advantage of being natively cross-platform: it has been tested and works on Windows, OS X, and Linux systems.

Available in a Google Code repository at https://code.google.com/p/topic-modeling-tool/, the GTMT provides quick and easy topic model generation and navigation. With a working Java installation on any platform, simply download the GTMT program and double-click its icon to run it. Java will open the program, which presents you with a menu interface (figure 4.2). Click ‘Select Input File or Dir’ to choose the materials you wish to fit a topic model to, select a location for the output, set the number of topics you are looking for, and click ‘Learn Topics’. The actual topic modeling is done by routines incorporated from the MALLET toolkit (see below), so you do not need to install MALLET separately: it comes included!

Advanced Settings

Let’s work through an example. Imagine that we were interested in knowing how discourses around the commemoration of heritage sites played out across a city (so we want to know not just what the topics or discourses are, but also whether they have any spatial or temporal associations). The first part of such a project would be to topic model the text of the historical plaques. To follow along, the reader may download the full text of 612 heritage plaques in Toronto as a zip file from http://themacroscope.org/2.0/datafiles/toronto-plaques.zip. Unzip that folder, and start the GTMT.
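If you would rather unzip the download from a script than by hand, the step can be sketched in Python roughly as follows. This is a minimal sketch under our own assumptions: the function name extract_corpus and the local paths are hypothetical, and we assume the archive holds plain .txt files (as the file names later in this section suggest).

```python
import zipfile
from pathlib import Path


def extract_corpus(zip_path, dest_dir):
    """Extract a zip archive and return the sorted .txt files found under dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    # rglob also finds files that sit inside a subfolder of the archive
    return sorted(dest.rglob("*.txt"))


if __name__ == "__main__":
    # Hypothetical local path: adjust to wherever you saved the download.
    zip_file = Path("toronto-plaques.zip")
    if zip_file.exists():
        files = extract_corpus(zip_file, "toronto-plaques")
        print(f"{len(files)} text files ready for the GTMT")
```

The folder that the function fills is the one you would then pick as the input directory in the GTMT.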

Select the data to be imported by clicking the “Select Input File or Dir” button, which allows you to pick an individual file or an entire directory of documents. Tell the system where you want your output to be generated (by default it will be wherever you have the GTMT installed, so beware if you’re running it out of your “Applications” folder, as it can get a bit cluttered), set the number of topics, and then click “Learn Topics” to generate a topic model. The advanced settings are important as well: they let you remove stopwords, normalize text by standardizing case, and tweak the number of iterations, the size of the topic descriptors, and the threshold at which you want to cut topics off (Fig. 4.3).
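To make concrete what two of these advanced settings do, here is a minimal Python sketch of case normalization and stopword removal. It is an illustration only, not the GTMT's or MALLET's actual code: the function name preprocess, the tiny STOPWORDS set, and the sample sentence are all invented for the example.

```python
import re

# A tiny, hypothetical stopword list; real lists contain hundreds of words.
STOPWORDS = {"the", "of", "a", "an", "and", "in", "to", "was", "is"}


def preprocess(text, stopwords=STOPWORDS, lowercase=True):
    """Return the word tokens of text after optional case folding and stopword removal."""
    if lowercase:
        text = text.lower()
    # Keep runs of letters only, so digits and punctuation are dropped
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t for t in tokens if t.lower() not in stopwords]


# An invented sentence in the style of a plaque text
print(preprocess("The College was opened in 1888 to the public"))
# prints ['college', 'opened', 'public']
```

Without these steps, very common words such as "the" would tend to dominate every topic, and "College" and "college" would be counted as two different words.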

Advanced Settings

Let’s run it! When you click Learn Topics, you’ll see a stream of text in the program’s console output. Pay attention to what happens. For example, you might notice that it finishes very quickly. In that case, you may need to fiddle with the number of iterations or other parameters.

In the directory you selected as your output, you will now have two folders: output_csv and output_html. Take a moment to explore them. In the former, you will see three files: DocsInTopics.csv, Topics_Words.csv, and TopicsInDocs.csv. The first will be a big file, which you can open in a spreadsheet program. It is arranged by topic, and then by the rank of each file within each topic. For example, using our sample data, you might find:

topicId  rank  docId  filename
1        1     5      184 Roxborough Drive-info.txt
1        2     490    St Josephs College School-info.txt
1        3     328    Moulton College-info.txt
1        4     428    Ryerson Polytechnical Institute-info.txt

In the above, we see the documents most relevant to topic number 1, listed in declining order of relevance. The topic is probably about education: three of the four plaques obviously mark educational institutions. The first (184 Roxborough) is the former home of Nancy Ruth, a feminist activist who helped found the Canadian Women’s Legal Education Fund. Opening the Topics_Words.csv file confirms our suspicions: topic #1 is school, college, university, women, toronto, public, institute, opened, association, residence.
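The same lookup can be done outside a spreadsheet. The following is a minimal Python sketch, assuming that DocsInTopics.csv is comma-delimited and begins with the header row topicId, rank, docId, filename shown in the listing above; the function name top_documents and the example path are our own, hypothetical choices.

```python
import csv
from collections import defaultdict
from pathlib import Path


def top_documents(csv_path, n=3):
    """Map each topicId to the filenames of its n best-ranked documents."""
    by_topic = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # Assumes a header row with topicId, rank, docId, filename columns
        for row in csv.DictReader(handle):
            by_topic[int(row["topicId"])].append((int(row["rank"]), row["filename"]))
    return {
        topic: [name for _, name in sorted(rows)[:n]]
        for topic, rows in sorted(by_topic.items())
    }


if __name__ == "__main__":
    # Hypothetical path: point this at the file in your own output directory.
    path = Path("DocsInTopics.csv")
    if path.exists():
        for topic, names in top_documents(path).items():
            print(topic, names)
```

Reading the top few filenames per topic in this way is a quick check on whether a topic hangs together before you open the word lists.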

The GTMT shines brightest, however, when you explore the HTML output, which lets you navigate all of this information in an easy-to-use interface. In the output_html folder, open the file all_topics.html. It should open in your default browser. The results of our model are visualized below (Fig. 4.4).

Topics visualized in an HTML Document.

The Constituent Parts of Our Education Topic

This is similar to the Topics_Words.csv file, with the difference that each topic can be explored further. If we click on the first topic, the school-related one, we see our top-ranked documents from before (Fig. 4.5).

We can then click on each individual document: we get a snippet of the text and also the various topics attached to each file (Fig. 4.6). These are each individually hyperlinked as well, letting you explore the various topics and documents that comprise your model. If you have your own server space, or use a service like Dropbox, you can easily move all these files online so that others can explore your results.

From beginning to end, then, we quickly go through all the stages of a topic model and have a useful graphical user interface to work with. While the tool can be limiting, and we prefer the versatility of the command line, this is an essential and useful component of our topic modeling toolkit.

The MALLET GUI interface output for a single document, showing text and related topics.

Source: http://www.themacroscope.org/?page_id=796