The GUI Topic Modeling Tool (GTMT) is an excellent way to introduce topic modelling in classroom settings and other areas where technical expertise may be limited (our experience is that this is a good entryway into simple topic modelling), or to explore a body of materials quickly. Because it is a Java-based program, it also has the advantage of being natively cross-platform: it has been tested and will work on Windows, OS X, and even Linux systems.
Available in a Google Code repository at https://code.google.com/p/topic-modeling-tool/, the GTMT provides quick and easy topic model generation and navigation. With a working Java instance on any platform, simply download the GTMT program and double-click on its icon to run it. Java will open the program, which will present you with a menu interface (figure 4.2). You then click on ‘Select Input File or Dir’ to select the materials you wish to fit a topic model to, select a location for the output, identify the number of topics you are looking for, and click ‘Learn Topics’. The actual topic modeling is done using routines incorporated from the MALLET toolkit (see below), although you do not need to install MALLET separately: it comes included!
Let’s work through an example. Imagine that we were interested in knowing how discourses around the commemoration of heritage sites played out across a city (so we want to know not just what the topics or discourses are, but also if there are any spatial or temporal associations too). The first part of such a project would be to topic model the text of the historical plaques. The reader may download the full text of 612 heritage plaques in
Select the data to be imported by clicking the “Select Input File or Dir” button, which allows you to pick an individual file or an entire directory of documents. Tell the system where you want your output to be generated (by default it will be wherever you have the GTMT installed, so beware if you’re running it out of your “Applications” folder, as it can get a bit cluttered), note the number of topics, and then click “Learn Topics” to generate a topic model. The advanced settings are important as well: they let you remove stopwords, normalize text by standardizing case, and tweak the number of iterations, the size of the topic descriptors, and the threshold at which you want to cut topics off (Fig. 4.3).
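To make the two text-cleaning options concrete, here is a minimal Python sketch of what stopword removal and case normalization do to a line of text before modeling. It is an illustration only, not the GTMT's own code: the `preprocess` function and the tiny `STOPWORDS` list are hypothetical, and the stopword list the tool actually applies is far longer.

```python
import re

# A tiny illustrative stopword list (an assumption for this sketch);
# real stopword lists contain hundreds of words.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "was", "is"}

def preprocess(text, stopwords=STOPWORDS, lowercase=True):
    """Split text into alphabetic tokens, optionally lowercase them,
    and drop any token that appears in the stopword list."""
    if lowercase:
        text = text.lower()
    tokens = re.findall(r"[A-Za-z]+", text)
    # Compare in lowercase so "The" is removed even when case is preserved.
    return [t for t in tokens if t.lower() not in stopwords]

sample = "The College was opened in Toronto"
print(preprocess(sample))
```

With case normalization switched on, "College" and "college" count as the same word, which usually gives cleaner topics on a small corpus such as the plaques.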
Let’s run it! When you click Learn Topics, you’ll see a stream of text in the program’s console output. Pay attention to what happens. For example, you might notice that it finishes very quickly. In that case, you may need to fiddle with the number of iterations or other parameters.
In the directory you selected as your output, you will now have two folders: output_csv and output_html. Take a moment to explore them. In the former, you will see three files: DocsInTopics.csv, Topics_Words.csv, and TopicsInDocs.csv. The first one will be a big file, which you can open in a spreadsheet program. It is arranged by topic, and then by the relative rank of each file within each topic. For example, using our sample data, you might find:
| Topic | Rank | Document ID | Filename |
|---|---|---|---|
| 1 | 1 | 5 | 184 Roxborough Drive-info.txt |
| 1 | 2 | 490 | St Josephs College School-info.txt |
| 1 | 4 | 428 | Ryerson Polytechnical Institute-info.txt |
In the above, we see that in topic number 1, we have a declining order of relevant documents, which are probably about education: three of the plaques are obviously educational institutions. The first (184 Roxborough) is the former home of Nancy Ruth, a feminist activist who helped found the Canadian Women’s Legal Education Fund. Opening the Topics_Words.csv file confirms our suspicions: topic #1 is school, college, university, women, toronto, public, institute, opened, association, residence.
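If you would rather inspect this ranking in code than in a spreadsheet, the following Python sketch groups the rows of a DocsInTopics.csv-style file by topic and keeps the top-ranked documents for each. It assumes the column order seen in the sample rows above (topic, rank, document number, filename); the `top_documents` function and its `has_header` switch are hypothetical helpers, so check them against your own output file before relying on them.

```python
import csv
import io
from collections import defaultdict

def top_documents(csv_text, n=3, has_header=False):
    """Return {topic: [(rank, filename), ...]} with the n best-ranked
    documents per topic. Assumes columns: topic, rank, doc number, filename."""
    rows = csv.reader(io.StringIO(csv_text))
    if has_header:
        next(rows, None)  # skip a header line if the file has one
    by_topic = defaultdict(list)
    for row in rows:
        if len(row) < 4:
            continue  # ignore blank or malformed lines
        topic, rank, _doc_id, filename = row[:4]
        by_topic[int(topic)].append((int(rank), filename))
    # Sorting the (rank, filename) pairs puts the lowest rank number first.
    return {t: sorted(docs)[:n] for t, docs in by_topic.items()}

# Sample rows taken from the plaque example in the text.
sample = """1,1,5,184 Roxborough Drive-info.txt
1,2,490,St Josephs College School-info.txt
1,4,428,Ryerson Polytechnical Institute-info.txt
"""

for topic, docs in top_documents(sample).items():
    for rank, name in docs:
        print(topic, rank, name)
```

To run this on a real file, read its contents with `open(path, encoding="utf-8").read()` and pass the resulting string in, setting `has_header=True` if the first line holds column names.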
GTMT shines best, however, when you explore the HTML output. This allows you to navigate all of the information in an easy-to-use interface. In the output_html folder, open the file all_topics.html. It should open in your default browser. The results of our model are visualized below (Fig. 4.4).
This is similar to the Topics_Words.csv file, but the difference is that each topic can be explored further. If we click on the first topic, the school-related one, we see our top-ranked documents from before (Fig. 4.5).
We can then click on each individual document: we get a snippet of the text and also the various topics attached to each file (Fig. 4.6). These are each individually hyperlinked as well, letting you explore the various topics and documents that comprise your model. If you have your own server space, or use a service like Dropbox, you can easily move all these files online so that others can explore your results.
From beginning to end, then, we quickly go through all the stages of a topic model and have a useful graphical user interface to work with. While the tool can be limiting, and we prefer the versatility of the command line, this is an essential and useful component of our topic modeling toolkit.