An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Topic Modelling with the GUI Topic Modelling Tool

Once you understand how to run MALLET from the command line, we want to introduce you to a version with a graphical user interface (GUI) that can facilitate basic computations. In our experience, it is a good entryway into simple topic modelling, and an excellent way to introduce topic modelling in classroom settings and other areas where technical expertise may be limited. However, we recommend that you begin on the command line, as that gives you the most control over your topic model.

Available in a Google Code repository, the “Topic Modelling Tool” (TMT) provides quick and easy topic model generation and navigation. It can be found at https://code.google.com/p/topic-modeling-tool/. Compared to the command-line installation in the previous section, the TMT is easy to install. With a working Java installation on any platform, simply download the TMT program and click on it to run it. Java will open the program.

The Topic Modelling Tool on OS X

There is not much documentation built into the TMT, but since you have a firm understanding from the previous section of what the various categories do, this is not an issue. To select the data to be imported, the “Select Input File or Dir” button allows you to pick an individual file or an entire directory of documents. You then tell the system where you want your output to be generated (by default it will be wherever you have the TMT installed, so beware if you’re running it out of your “Applications” folder, as it can get a bit cluttered), set the number of topics, and click “Learn Topics” to generate a topic model. The advanced settings are important as well (see Figure 2 below): they let you decide whether to remove stopwords and whether to normalize the text by making it all one case, and they let you adjust the number of iterations, the size of the topic descriptors, and the threshold at which you want to cut topics off.

Advanced Settings
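
To make the stopword and case-normalization settings more concrete, here is a minimal Python sketch of what those two options do to a text before any topics are learned. It is not the TMT's own code: the function name `preprocess`, the tiny stopword list, and the sample sentence are all made up for illustration, and the real tool relies on its own, much longer stopword list.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# it is not the list that the TMT or MALLET ships with.
STOPWORDS = {"the", "of", "a", "and", "in", "was", "is", "to"}


def preprocess(text, remove_stopwords=True, lowercase=True):
    """Return the tokens of `text` after optional case folding and stopword removal."""
    if lowercase:
        text = text.lower()
    # Keep runs of letters (and apostrophes) as tokens; drop punctuation and digits.
    tokens = re.findall(r"[A-Za-z']+", text)
    if remove_stopwords:
        # Compare in lower case so "The" is removed even when case is preserved.
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return tokens


# Made-up sample sentence, not taken from the plaque corpus.
sample = "The College was opened in Toronto and the School of Music followed."

print(preprocess(sample))
print(preprocess(sample, remove_stopwords=False, lowercase=False))
```

With both options switched on, only the content-bearing words survive, which is why topics built from such text tend to be easier to read.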

Let’s run it! When you click “Learn Topics”, you’ll see console output that’s remarkably similar to what you saw when you ran MALLET on the command line. Pay attention to what happens. For example, you might notice that it finishes more quickly than before. In that case, you may need to fiddle with the number of iterations and other settings.

In the example below, we have taken the full text of 612 Toronto heritage plaques and run them through the tool. In the directory you selected as your output, you will now have two folders: output_csv and output_html. Take a moment to explore them. In the former, you will see three files: DocsInTopics.csv, Topics_Words.csv, and TopicsInDocs.csv. The first is a big file, which you can open in a spreadsheet program. It is arranged by topic, and then by the rank of each file within that topic. For example:

topicId  rank  docId  filename
1        1     5      /PATH/184_Roxborough_Drive-info.txt
1        2     490    /PATH/St_Josephs_College_School-info.txt
1        3     328    /PATH/Moulton_College-info.txt
1        4     428    /PATH/Ryerson_Polytechnical_Institute-info.txt

In the above, we see the documents for topic number 1 listed in declining order of relevance. Indeed, three of them are obviously educational institutions. The first (184 Roxborough) is the former home of Nancy Ruth, a feminist activist who helped found the Canadian Women’s Legal Education Fund. To view this information in another way, we can also explore the TopicsInDocs.csv file. At this point, however, our suspicion is that this is an education-related topic. Opening the Topics_Words.csv file confirms it: topic #1 is school, college, university, women, toronto, public, institute, opened, association, residence.
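
If you would rather inspect DocsInTopics.csv with a script than with a spreadsheet, the following Python sketch groups its rows by topic and keeps the best-ranked files for each. The column names (topicId, rank, docId, filename) and the sample rows come from the listing shown earlier; the comma delimiter, the presence of a header row, and the function name `top_documents` are our assumptions, so check them against your own file.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the listing shown earlier. The comma delimiter and
# the header row are assumptions about how the TMT writes this file.
SAMPLE = """topicId,rank,docId,filename
1,1,5,/PATH/184_Roxborough_Drive-info.txt
1,2,490,/PATH/St_Josephs_College_School-info.txt
1,3,328,/PATH/Moulton_College-info.txt
1,4,428,/PATH/Ryerson_Polytechnical_Institute-info.txt
"""


def top_documents(csv_file, n=3):
    """Map each topicId to the filenames of its n best-ranked documents."""
    by_topic = defaultdict(list)
    for row in csv.DictReader(csv_file):
        by_topic[int(row["topicId"])].append((int(row["rank"]), row["filename"]))
    # Sort each topic's documents by rank (1 is best) and keep the first n.
    return {
        topic: [name for _, name in sorted(docs)[:n]]
        for topic, docs in by_topic.items()
    }


result = top_documents(io.StringIO(SAMPLE))
for topic, names in result.items():
    print(topic, names)
```

To run this on real output, replace `io.StringIO(SAMPLE)` with a file opened from your output directory, for example `open("output_csv/DocsInTopics.csv", newline="")`.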

The TMT really shines, however, when you explore the HTML output. This lets you navigate all of this information in an easy-to-use manner. In the output_html folder, open up the file all_topics.html. It should open in your default browser. The results of our model are visualized below:

Topics visualized in an HTML Document.

This is similar to the Topics_Words.csv file described earlier; the difference is that each topic can be explored further. If we click on the first topic, the school-related one, we see our top-ranked documents from before.

The Constituent Parts of Our Education Topic

We can then click on each individual document: we get a snippet of the text, and also the various topics attached to each file. Each of those is hyperlinked as well, letting you explore the various topics and documents that comprise your model.

The MALLET GUI interface output for a single document, showing text and related topics.

From beginning to end, then, we can quickly go through all the stages of a topic model and have a useful graphical user interface to work with. While the tool can be limiting, and we prefer the versatility of the command line, it is an essential and useful component of our topic modelling toolkit. From classroom settings to complicated and sophisticated research projects, it can open new windows into your documents.

Source: http://www.themacroscope.org/?page_id=391