An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Basic Text Mining: Word Clouds, their Limitations, and Moving Beyond Them

[In our third chapter, this opens things up - we will then move into more sophisticated text mining, including a basic intro to regular expressions, NER, compression distance, geocoding, and then a sidebar into more advanced techniques as a 'further reading' idea. We know that this is pretty basic stuff, but our experience shows that starting out basic helps with grasping concepts.]

Having large datasets does not mean that you need to jump into programming right away to extract meaning from them: far from it. There are three approaches that, while each has its limits, can shed light on your research question very quickly and easily. In many ways, these are “gateway drugs” into the deeper reaches of data visualization. In this section, then, we briefly explore word clouds (via Wordle), the concordance program AntConc, and the comprehensive data analysis suite Voyant Tools.

The simplest data visualization is a word cloud. In brief, word clouds are generated through the following process. First, a computer program takes a text and counts how often each word appears. In many cases, it will normalize the text to some degree, or at least give the user options: if “racing” appears 80 times and “Racing” appears 5 times, you may want the two to register as 85 occurrences of a single term. Of course, there may be other times when you do not, for example if there were a character named “Dog” as well as generic animals referred to as “dog.” You may also want to remove stop words, which in many cases add little to the final visualization. Second, after generating a word frequency list and incorporating these modifications, the program puts the words into order and begins to draw them. The word that appears most frequently is placed as the largest; the second most frequent is a bit smaller, the third most frequent a bit smaller than that, and so on, often down to dozens of words.
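
To make that process concrete, here is a minimal sketch in Python of the counting step just described. The file name war_and_peace.txt and the ten-word stop list are placeholders for illustration; they are not the actual files or stop lists that Wordle itself uses.

    # A minimal sketch of what a word-cloud generator does behind the scenes:
    # count words, fold case, drop common stop words, and rank the rest.
    import re
    from collections import Counter

    STOP_WORDS = {"the", "and", "of", "to", "a", "in", "that", "was", "he", "it"}

    with open("war_and_peace.txt", encoding="utf-8") as f:
        text = f.read().lower()          # fold "Racing" and "racing" together

    words = re.findall(r"[a-z']+", text) # crude tokenizer: letters and apostrophes
    counts = Counter(w for w in words if w not in STOP_WORDS)

    # The most frequent word would be drawn largest, the next a bit smaller, and so on.
    for word, count in counts.most_common(25):
        print(f"{word}\t{count}")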

Try to create one yourself using the website Wordle.net. You simply click on ‘Create,’ paste in a bunch of text, and see the results. For example, you could take the plain text of Leo Tolstoy’s novel War and Peace (a byword for an overly long book, with over half a million words) and paste it in: you would see major character names (Pierre, Prince, Natasha, who recur throughout), locations (Moscow), and themes (warfare, and nations such as France and Russia), and at a glance get a sense of what this novel might be about.

A Wordle.net visualization of ‘War and Peace.’
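
If you would rather generate such an image outside the browser, the third-party Python package wordcloud (installed with pip install wordcloud) does much the same thing as Wordle. This is a minimal sketch, assuming you have saved a plain-text copy of the novel locally as war_and_peace.txt (for example, from Project Gutenberg):

    # Build a Wordle-style image from a local plain-text file.
    # Assumes: pip install wordcloud, and a saved copy of the novel.
    from wordcloud import WordCloud, STOPWORDS

    with open("war_and_peace.txt", encoding="utf-8") as f:
        text = f.read()

    cloud = WordCloud(width=1200, height=800,
                      stopwords=STOPWORDS,        # drop "the", "and", and so on
                      background_color="white").generate(text)
    cloud.to_file("war_and_peace_cloud.png")      # write the image to disk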

Yet in such a visualization the main downside becomes clear: we lose context. Who are the protagonists? Who are the villains? As adjectives are separated from the concepts they modify, we lose the ability to derive meaning. For example, a politician may speak of “taxes” frequently, but from a word cloud alone it is difficult to tell whether the references are positive or negative.

With these shortcomings in mind, however, historians can find utility in word clouds. If we are concerned with change over time, we can trace how words evolved in historical documents. While this is fraught with issues – words change meaning over time, different terms are used to describe similar concepts, and we still face the issues outlined above – we can arguably still learn something from this.

Take the example of a dramatically evolving political party in Canada: the New Democratic Party (NDP), which occupies a space on the political spectrum similar to Britain’s Labour Party. While we understand that most of our readers are not Canadian, a bit of background will help with the example. The party had its origins in agrarian and labour movements, being formed in 1933 as the Co-operative Commonwealth Federation (CCF). Its defining and founding document was the ‘Regina Manifesto’ of that same year, drafted at the height of the Great Depression. Let’s visualize it as a word cloud:


A word cloud of the ‘Regina Manifesto,’ 1933.

Stop! What do you think this document was about? When you have a few ideas, read on for our interpretation.

At a glance, we argue that you can see the main elements of the political spirit of the movement. We see the urgency of the need to profoundly change Canada’s economic system in the boldness of “must,” an evocative call that change needed to come immediately. Other prominent words such as “public,” “system,” “economic,” and “power” speak to the critique of the economic system, while words such as “capitalist,” “socialized,” “ownership,” and “worker” speak to a socialist frame of analysis. You can piece together key components of the party’s platform from this visualization alone.

For historians, though, the important element comes in change over time. We need to keep in mind that word use might shift. Let’s take two other major documents from this same political tradition. In 1956, in the context of the Cold War, the CCF released its second major political declaration in Winnipeg. Again, as a word cloud:

The CCF’s Winnipeg Declaration, 1956.


New words appear, representing a new thrust: “opportunity,” “freedom,” “international,” “democratic,” “world,” “resources,” and even “equality.” Compared to the more focused, trenchant words found in the previous declaration, we see a different direction here. There is more focus on the international, Canada receives more attention than before, and, most importantly, words like “socialized” have disappeared. Indeed, the CCF was beginning to change its emphasis, backing away from overt calls for socialism. But the limitations of the word cloud also rear their head: take, for example, the word “private.” Is private good or bad? Is opportunity? Is freedom? Without context, we cannot know from the image alone. But the changing words are useful.

For a more dramatic change, let’s compare it to a modern platform. Today’s New Democratic Party grew out of the CCF in 1961 and continues largely as an opposition party (traditionally a third party, although in 2011 it was propelled to second-party Official Opposition status). By 2011, what did the party’s platform speak of?


The NDP’s platform, 2011.

Taxes, families, Canada (which keeps building in significance over the period), work, employment, funding, insurance, homes, and so forth. In three small images, we have seen a political party morph from an explicitly socialist party in 1933, through a waning of that socialism in the Cold War climate of 1956, to the mainstream political party that it is today.

Word clouds can tell us something. On his blog, digital historian Adam Crymble ran a quick study to see whether historians would be able to reconstruct the contents of documents from word clouds – could they look at a word cloud of a trial, for example, and correctly ascertain what the trial was about? He noted that while substantial guesswork is involved, “an expert in the source material can, with reasonable accuracy, reconstruct some of the more basic details of what’s going on.”1

Word clouds need to be used cautiously, but they are a useful entryway into the world of data visualization. Used as a complement to wider reading and other forms of inquiry, they offer a quick and easy first look at a body of text. Once we are using them, we are visualizing data, and it is only a matter of how. In the pages that follow, we move from this very basic stage to other basic techniques, including AntConc and Voyant Tools, before moving into more sophisticated methods involving text patterns (or regular expressions), spatial techniques, and programs that can detect significant patterns and phrases within your corpus.

AntConc

AntConc is an invaluable way to carry out some forms of textual analysis on datasets. While it does not scale to the largest datasets terribly well, if you have somewhere in the ballpark of 500 or even 1,000 newspaper-length articles you should be able to crunch the data and receive tangible results. AntConc can be downloaded from Dr. Laurence Anthony’s personal webpage.2 Anthony, a researcher in corpus linguistics among many other varied pursuits, created this software to carry out detailed textual analysis. Let’s take a quick tour.

Installation on all three operating systems is a snap: on OS X or Windows one downloads the executable directly, and on Linux the user needs to change the file permissions to allow it to run as an executable. Let’s explore a quick example to see what we can do with AntConc.

Once AntConc is running, you can import files by going to the File menu and clicking on either Import File(s) or Import Dir… In the screenshot below, I opened a directory containing plain-text files of Toronto heritage plaques. The first visualization panel is ‘Concordance.’ I type in the search term ‘York,’ the old name of Toronto (pre-1834), and visualize the results below:

AntConc’s Concordance view, showing the results of a search for ‘York.’



Later in this book, we will explore various ways that you could do this yourself using the Programming Historian – but, for the rest of your career, quick-and-dirty programs like this can get you to your results very quickly! In this case, we can see the various contexts in which York is being used: North York (a separate municipality until 1998), ties to New York state and city, various companies, other boroughs, and so forth. A simple keyword search for ‘York’ alone would return many plaques that do not fit our specific query; the concordance lets us see at a glance which results actually matter.
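
For readers curious about what such a concordance looks like under the hood, here is a minimal keyword-in-context sketch in Python. The directory name toronto_plaques and the five-word window are assumptions for illustration; AntConc’s own Concordance tool is far more capable.

    # Print each occurrence of a keyword with a few words of context on either side.
    import pathlib
    import re

    CORPUS_DIR = pathlib.Path("toronto_plaques")   # one plain-text file per plaque (assumed)
    KEYWORD = "york"
    WINDOW = 5                                     # words of context on each side

    for path in sorted(CORPUS_DIR.glob("*.txt")):
        words = re.findall(r"\w+", path.read_text(encoding="utf-8"))
        for i, word in enumerate(words):
            if word.lower() == KEYWORD:
                left = " ".join(words[max(0, i - WINDOW):i])
                right = " ".join(words[i + 1:i + 1 + WINDOW])
                print(f"{path.name}: ...{left} [{word}] {right}...")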

The other possibilities are even more exciting. The Concordance Plot traces where various keywords appear in each file, which is useful for seeing the overall density of a given term. For example, in the visualization below of newspaper articles, I was able to trace when frequent media references to ‘community’ on the old web-hosting service GeoCities declined:

The Concordance Plot of references to ‘community’ in newspaper coverage of GeoCities.

References were dense in 1998 and 1999, but declined dramatically by 2000 – and even more dramatically as that year went on. It turns out, upon some close reading, that this is borne out by the archival record: Yahoo! acquired GeoCities and discontinued the neighbourhoods and many of the internal community functions that had defined that online community.
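
A rough sense of what the Concordance Plot computes can be had in a few lines of Python: find where a term falls within each file, expressed as a fraction of the file’s length. The directory and the idea of one file per year of coverage are hypothetical assumptions for illustration.

    # Locate every hit for a term as a fraction of each file's length,
    # roughly what AntConc's "barcode" Concordance Plot draws.
    import pathlib
    import re

    TERM = "community"

    for path in sorted(pathlib.Path("geocities_articles").glob("*.txt")):
        words = [w.lower() for w in re.findall(r"\w+", path.read_text(encoding="utf-8"))]
        if not words:
            continue
        positions = [i / len(words) for i, w in enumerate(words) if w == TERM]
        ticks = ", ".join(f"{p:.2f}" for p in positions)
        print(f"{path.name}: {len(positions)} hits at {ticks}")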

Collocates are an especially fruitful realm of exploration. Returning to our Toronto plaque example, if we look for the collocates of ‘York’ we see several interesting results: “Fort” (referring to the military installation Fort York), “Infantry,” “Mills” (the area of York Mills), “radial” (referring to the York Radial Railway), and even slang such as “Muddy” (“Muddy York” being a Toronto nickname). With several documents, one could trace how collocates change over time: perhaps early documents refer to Fort York, and subsequently we see more collocates referring to North York? Finally, AntConc also provides options for overall word and phrase frequency, as well as specific n-gram searching.
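
The idea behind collocates is simple enough to sketch: for every occurrence of a node word, count the words that fall within a few words of it. Here is a minimal Python version; the toronto_plaques directory and the four-word span are assumptions, and AntConc’s own Collocates tool is considerably more sophisticated.

    # Count the words that appear within a small window around a node word.
    import pathlib
    import re
    from collections import Counter

    NODE = "york"
    SPAN = 4                                        # words to either side (assumed)
    collocates = Counter()

    for path in pathlib.Path("toronto_plaques").glob("*.txt"):
        words = [w.lower() for w in re.findall(r"\w+", path.read_text(encoding="utf-8"))]
        for i, word in enumerate(words):
            if word == NODE:
                window = words[max(0, i - SPAN):i] + words[i + 1:i + 1 + SPAN]
                collocates.update(window)

    for word, count in collocates.most_common(20):  # e.g. "fort", "mills", "north"
        print(word, count)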

A free, powerful program, AntConc deserves to be the next step beyond Wordle for many undergraduates. It takes textual analysis to the next level. Finally, let’s move to the last of the three tools we explore in this section: Voyant Tools. This set of tools takes some of the graphical sheen of Wordle and weds it to the sophisticated underlying textual analysis of AntConc.

Voyant Tools

With your appetite whetted, you might want another sophisticated way to explore large quantities of information. The suite of tools known as Voyant (previously known as Voyeur) provides this: sophisticated output from simple input. Growing out of the Hermeneuti.ca project, Voyant is an integrated textual analysis platform. Getting started is quick. Simply navigate to http://voyant-tools.org/ and either paste a large body of text into the box, provide a website address, or click on the ‘upload’ button to put your text(s) into the system.


The Voyant interface, current as of October 2013.

As you can see in the figure above, you have an array of basic visualization and text analysis tools at your disposal. Voyant works on a single document or on a larger corpus. For the former, just upload one file or paste the text in; for the latter, upload multiple files at that initial stage. After uploading, the workbench will appear as demonstrated above. For customization or more options, remember that for each of the smaller panes you can click on the ‘gear’ icon to get advanced options, including how you want to treat case (should upper and lower case be treated the same?) and whether you want to include or exclude common stop words.

With a large corpus, you can do the following things:

  • In the “Summary” box, track words that rise and fall. For example, with multiple documents uploaded in order of their year, you can see which words increase or decrease significantly over time (a rough sketch of this kind of calculation appears after this list).
  • For each individual word, see how its frequency varies over the length of the corpus. Clicking on a word in the text box will generate a line chart in the upper right. You can control for case.
  • For each individual word, you can also see the “Keyword-in-Context” display in the lower right-hand pane – by default, three words to the left and three to the right.
  • Track the distribution of a word by clicking on it and seeing where it is located within the document, along the left-hand side of the central text column.
  • Share a corpus and its visualizations through a unique URL to your uploaded corpus (via the disc icon in the top right of the interface).
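
As a rough illustration of the first item above, the same kind of rise-and-fall calculation can be sketched in a few lines of Python: compute a word’s relative frequency in each document, with the documents in chronological order. The file names here are hypothetical stand-ins for the three party documents discussed earlier; Voyant does this for you without any code.

    # Relative frequency of one word across documents ordered by year.
    import pathlib
    import re

    WORD = "canada"
    DOCS = ["1933_regina_manifesto.txt",       # hypothetical file names
            "1956_winnipeg_declaration.txt",
            "2011_ndp_platform.txt"]

    for name in DOCS:
        text = pathlib.Path(name).read_text(encoding="utf-8")
        words = [w.lower() for w in re.findall(r"\w+", text)]
        rate = words.count(WORD) / len(words) * 10000 if words else 0
        print(f"{name}: {rate:.1f} occurrences of '{WORD}' per 10,000 words")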

These are all useful ways to interpret documents, and the barrier to entry for this sort of textual analysis work is low. Voyant is ideal for smaller corpora of information or for classroom purposes.

Voyant, however, is best understood – like Wordle – as a “gateway drug” when it comes to textual analysis.3 It is hosted on McGill University servers and cannot be run on your home computer, which limits its ability to process very large datasets. The developers do offer an API to extract more in-depth information from the database, but in many of these cases the Programming Historian lessons can achieve your outcome more efficiently.

None of this, however, is to minimize the importance and utility of Voyant Tools, arguably the best text analysis portal in existence for historians. Even the most seasoned Big Data humanist can turn to Voyant for quick checks, or when dealing with smaller (yet still large) repositories. A few megabytes of textual data is no issue for Voyant, and the lack of programming expertise required is a good thing, even for old hands. We have several years of programming experience amongst us, and often use Voyant for both specialized and generalized inquiries: if a corpus is small enough, Voyant is the right tool to use.

References
  1. Adam Crymble, “Can We Reconstruct a Text from a Wordcloud?” 5 August 2013, Thoughts on Digital and Public History, http://adamcrymble.blogspot.ca/2013/08/can-we-reconstruct-text-from-wordcloud.html.
  2. http://www.antlab.sci.waseda.ac.jp/software.html
  3. As, in some other respects, is the suite of textual analysis software Many Eyes, created by IBM Research. Many Eyes’ main shortcoming is that unless you have a paid account, all of your data falls into the public domain: copyright concerns can make this problematic for many of our sources, unfortunately. However, for basic text analysis, Many Eyes is worth considering for both research and classroom purposes.

Source: http://www.themacroscope.org/?page_id=362