An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Building a topic model with MALLET

While the GTMT allows us to build a topic model quite quickly, it offers very little room for tweaking or fine-tuning. For more in-depth analysis and modeling, the current standard approach is to work directly with the topic modeling routines of the MALLET natural-language processing toolkit. There is an irony, of course, in naming the major topic modeling toolkit after a hammer, with all the caveats about the entire world looking like a nail once you have it installed.[1] Nevertheless, here we describe the most basic usage and how to get started with this tool.[2]

To install MALLET on any platform, please follow these steps:

  1. Go to the MALLET project page at http://mallet.cs.umass.edu/index.php and download MALLET. (As of this writing, we are working with version 2.0.7.) Unzip it in your home directory: that is, C:\ on Windows, or the directory with your username on OS X (it should have a picture of a house).
  2. You will also need the Java Development Kit, or JDK: not the regular Java found on most computers, but the one that lets you program things. You can find it at http://www.oracle.com/technetwork/java/javase/downloads/index.html. Install this on your computer.
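Before moving on, it can be worth confirming that the JDK, and not only the regular Java, is visible from the command line. What follows is a minimal sketch for OS X or Linux, assuming that a JDK installation makes the `javac` compiler available on your PATH. The helper name `check_tool` is our own invention for illustration and is not part of Java or MALLET.

```shell
# Report whether a command-line tool is installed and visible on the PATH.
# `check_tool` is a hypothetical helper name, not part of Java or MALLET.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: not found"
  fi
}

check_tool java    # the regular Java runtime
check_tool javac   # the compiler, which ships with the JDK
```

If `javac` is reported as not found, the JDK is probably missing or not yet on your PATH.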

If your computer runs OS X or Linux, you’re ready to go! On Windows systems, you have a few more steps:

  1. Unzip MALLET into your C: directory. This is important: it cannot be anywhere else. You will then have a directory called C:\mallet-2.0.7 or similar. For simplicity’s sake, rename this directory to just mallet.
  2. MALLET uses an environment variable to tell the computer where to find all the various components of its processes when it is running. It’s rather like a shortcut for the program. A programmer cannot know exactly where every user will install a program, so the programmer creates a variable in the code that will always stand in for that location. We tell the computer, once, where that location is by setting the environment variable. If you moved the program to a new location, you’d have to change the variable.

To create an environment variable in Windows 7, click on your Start Menu -> Control Panel -> System -> Advanced System Settings -> Environment Variables. Click New and type MALLET_HOME in the variable name box. It must be written exactly like this (all caps, with an underscore), since that is the shortcut that the programmer built into the program and all of its subroutines. Then type the exact path (location) of where you unzipped MALLET in the variable value box, e.g., c:\mallet
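The idea of a variable that stands in for a location can be tried out directly at a command line. The sketch below uses the OS X/Linux shell purely as an illustration; the steps above do not require it on those systems, and on Windows the variable is set through the Control Panel as described. The path `$HOME/mallet` is an assumed install location, so substitute your own.

```shell
# Illustration only: set an environment variable that stands in for a location.
# The path below is an assumption; use wherever you actually unzipped MALLET.
export MALLET_HOME="$HOME/mallet"

# Any program started from this shell can now look the location up by name.
echo "MALLET_HOME is set to: $MALLET_HOME"

# If MALLET were moved, only the variable would need to change.
if [ -d "$MALLET_HOME" ]; then
  echo "That directory exists."
else
  echo "No directory there yet; check the path."
fi
```

In the Windows Command Prompt the same variable is read back as %MALLET_HOME% rather than $MALLET_HOME.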

MALLET is run from the command line, also known as a Command Prompt. If you remember MS-DOS, or have ever played with a Unix computer Terminal (or have seen ‘hackers’ represented in movies or on television shows), this will be familiar. The command line is where you can type commands directly, rather than clicking on icons and menus.

On Windows, click on your Start Menu -> All Programs -> Accessories -> Command Prompt. On a Mac, open up your Applications -> Utilities -> Terminal.

You’ll get the command prompt window, which will have a cursor at something like C:\Users\username> on Windows or ~ username$ on OS X.

On Windows, type cd .. (that is: cd-space-period-period) to change directory. Keep doing this until you’re at C:\. On OS X, type cd on its own and you’ll be brought to your home directory.

Then type:

cd mallet

and you will be in the MALLET directory. Anything you type in the command prompt window is a command. There are commands like cd (change directory), and to see all the files in a directory you could type dir (Windows) or ls (OS X). You have to tell the computer explicitly that ‘this is a MALLET command’ when you want to use MALLET. You do this by telling the computer to grab its instructions from the MALLET bin, a subfolder in MALLET that contains the core operating routines. Type:

bin\mallet on Windows, or

./bin/mallet on OS X

at the prompt. If all has gone well, you should be presented with a list of MALLET commands – congratulations! If you get an error message, check your typing. Did you use the wrong slash? Did you set up the environment variable correctly? Is MALLET located at C:\mallet ?
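On OS X or Linux, the same troubleshooting questions can be asked of the computer directly. The following is only a sketch: `diagnose_mallet` is a hypothetical helper of our own, and `$HOME/mallet` is an assumed location that you should change to match your machine.

```shell
# Walk through the troubleshooting questions for a MALLET folder.
# `diagnose_mallet` is a hypothetical helper; pass it the folder you unzipped.
diagnose_mallet() {
  dir="$1"
  if [ ! -d "$dir" ]; then
    echo "missing folder: $dir"
    return 1
  fi
  if [ ! -f "$dir/bin/mallet" ]; then
    echo "no bin/mallet inside: $dir"
    return 1
  fi
  if ! command -v java >/dev/null 2>&1; then
    echo "java not found on PATH"
    return 1
  fi
  echo "looks ready: run ./bin/mallet from $dir"
}

# Assumed location; change it to match your own machine.
diagnose_mallet "$HOME/mallet" || echo "Fix the problem above and try again."
```

The checks run in the order a beginner is most likely to trip over them: the folder, then the launcher script inside it, then Java itself.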

For more instructions on using MALLET from the command line, on Windows and on Mac OS X, please see our online tutorial at The Programming Historian (http://programminghistorian.org/lessons/topic-modeling-and-mallet). There are many options for fine-tuning your results, and one can build chains of commands as well (such as removing stopwords, or filtering out numbers or leaving them in).[3]
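As a rough illustration of such a chain on OS X or Linux, the sketch below imports a folder of texts with stopwords removed and then trains a model. It assumes you are inside the MALLET directory and that `sample-data/web/en` is the sample corpus bundled with the download; the output file names and the choice of 20 topics are ours, and `run` is a hypothetical helper that prints each command and executes it only if MALLET is actually present.

```shell
# Sketch of a two-step chain: import a folder of texts, then train a model.
# Run from inside the MALLET folder. File names and the topic count are
# illustrative choices; sample-data/web/en is assumed to be the sample
# corpus bundled with the MALLET download.
MALLET="./bin/mallet"

run() {
  # Print each command; only execute it if MALLET is actually present.
  echo "+ $*"
  if [ -x "$MALLET" ]; then
    "$@"
  fi
}

# Step 1: import the texts, dropping common English stopwords.
run "$MALLET" import-dir --input sample-data/web/en \
    --output tutorial.mallet --keep-sequence --remove-stopwords

# Step 2: train 20 topics and write the key words for each to a text file.
run "$MALLET" train-topics --input tutorial.mallet --num-topics 20 \
    --output-topic-keys tutorial_keys.txt
```

On Windows the same two commands begin with bin\mallet instead of ./bin/mallet.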

[1] Originally pointed out by Ben Schmidt, “When you have a MALLET, everything looks like a nail” on his wonderful blog SappingAttention.com. See http://sappingattention.blogspot.ca/2012/11/when-you-have-mallet-everything-looks.html .

[2] We have previously published an online tutorial at programminghistorian.org to help the novice install and use MALLET, the most popular of the many different topic modeling programs available. This section republishes elements of that tutorial, but we recommend checking the online version in case of any upgrades or version changes.

[3] One thing to be aware of is that, since many of the tools we are about to discuss rely on Java, changes to the Java run-time environment and to the Java Development Kit (for instance, when Oracle periodically updates Java) can break the other tools. We have tested everything and know that these tools work with Java 7. If you find that the tools do not run, you should check which version of Java is on your machine. In a terminal window, type ‘java -version’ at the prompt. You should then see something like: java version "1.7.0_05". If you do not, it could be that you need to install a different version of Java.
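For readers on OS X or Linux, this version check can be sketched as a few lines of shell. The helper name `java_major` is hypothetical, and the sketch assumes the version line has the quoted form shown above; older releases report strings such as "1.7.0_05" (major version 7), while newer ones report strings such as "17.0.2".

```shell
# Pull the major Java version out of the first line that `java -version` prints.
# `java_major` is a hypothetical helper. Older releases report "1.7.0_05"
# (major version 7); newer ones report strings such as "17.0.2" (major 17).
java_major() {
  version=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
  case "$version" in
    1.*) echo "$version" | cut -d. -f2 ;;
    *)   echo "$version" | cut -d. -f1 ;;
  esac
}

# `java -version` writes to standard error, hence the 2>&1 redirection.
if command -v java >/dev/null 2>&1; then
  first_line=$(java -version 2>&1 | head -n 1)
  echo "Major Java version: $(java_major "$first_line")"
else
  echo "Java does not appear to be installed."
fi
```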

Source: http://www.themacroscope.org/?page_id=799