JSTOR, the database of academic papers, allows a certain amount of data mining of its collections through its ‘Data for Research’ gateway (http://dfr.jstor.org/). Using the ‘submit requests’ function, you can frame a search and have JSTOR pull together a dataset for your use. The data returned do not include the full texts of the articles found by your search, but rather various kinds of metadata, bigrams, trigrams, quadgrams, and word counts. You can download the data as XML or as CSV. Each download contains a citations.csv file that holds the original citation and the digital object identifier (DOI) for each article. The DOI is used as the name of the individual csv file in the bigram (or trigram, quadgram, or wordcounts) subfolder of the downloaded data.
We recommend downloading as CSV, since the various scripts and tools that we have been describing are predicated on reading the CSV format. Ben Marwick has provided a package for R that lets you do various kinds of data mining on Data for Research datasets, including word frequencies and topic modeling.[1] His instructions are very complete; the key thing when using this package is to remember to download as CSV and to request word counts and bigrams.
If you are not interested in the full suite of functions that Marwick provides in his package, he has also crafted a script that takes the downloaded csv files and turns them into individual text files, each containing the ‘bag of words’ from one article.[2] The resulting text files can then be used with MALLET in the usual way. To use the Stanford Topic Modeling Tool, you would run the script that takes those individual text files and appends them into a single csv table.
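To make the two transformations described above concrete, here is a minimal Python sketch. It is not Marwick's script, which is written in R. It assumes that each word-count file has a header row followed by rows of word and count (the header `WORDCOUNTS,WEIGHT` in the sample is an assumption), and the two-column layout of the combined table (an identifier, then the text) is likewise a hypothetical choice for illustration.

```python
import csv
import io


def bag_of_words(wordcount_csv_text):
    """Expand (word, count) rows into a space-separated bag of words."""
    reader = csv.reader(io.StringIO(wordcount_csv_text))
    next(reader, None)  # skip the header row (assumed to be present)
    words = []
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed rows
        word, count = row[0].strip(), int(row[1])
        words.extend([word] * count)  # repeat each word `count` times
    return " ".join(words)


def combine_to_table(documents):
    """Build one CSV table (id, text) from a dict of {doc_id: text}."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for doc_id in sorted(documents):
        writer.writerow([doc_id, documents[doc_id]])
    return buffer.getvalue()


# Hypothetical sample data; the header names are an assumption.
sample = "WORDCOUNTS,WEIGHT\nroman,3\npottery,2\n"
text = bag_of_words(sample)
print(text)
print(combine_to_table({"article-1": text}), end="")
```

Word order is lost in the process, which does not matter for topic modeling, since a bag-of-words model only uses how often each word occurs in a document.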
One thing to be aware of is that you might want to create a custom stopword list if, for instance, you are grabbing data from multiple articles in a single journal. If you grabbed every article over several years from the Journal of Archaeological Method and Theory, words like ‘archaeology’, ‘archaeological’, ‘method’, and ‘theory’ would probably not be useful, because they appear in nearly every article. To customize your stopword list, open the en.txt file in mallet-2.0.7\stoplists in Notepad++. Add your new words to the top of the list, one word per line. Save the file with a new name. Then, when you import the directory of your files in MALLET, pass options such as:
--input pathway\to\the\directory\with\the\files --output tutorial.mallet --keep-sequence --remove-stopwords
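The manual editing step can also be scripted. The following is a minimal sketch in Python, not part of MALLET itself: the function name `build_stoplist` and the sample words are hypothetical, and lowercasing the new words is an assumption made here so that they match lowercased tokens.

```python
def build_stoplist(default_text, extra_words):
    """Return stoplist text with the extra words at the top, one word per line."""
    existing = [w.strip() for w in default_text.splitlines() if w.strip()]
    # Lowercasing the new words is an assumption of this sketch.
    extras = [w.strip().lower() for w in extra_words]
    seen = set()
    merged = []
    for word in extras + existing:
        if word and word not in seen:  # drop blanks and duplicates
            seen.add(word)
            merged.append(word)
    return "\n".join(merged) + "\n"


# A tiny stand-in for the contents of en.txt (hypothetical sample).
default = "a\nabout\nthe\n"
custom = build_stoplist(default, ["archaeology", "archaeological", "method", "theory"])
print(custom, end="")
```

In practice you would read the text of en.txt, pass it to the function, and write the result to a file with a new name. To our knowledge, MALLET's import command accepts a `--stoplist-file` option for pointing at such a file; confirm this against the help output of your own installation before relying on it.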