An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Basic Text Mining: Word Clouds, their Limitations, and Moving Beyond


Having large datasets does not mean that you need to jump into programming right away to extract meaning from them: far from it. There are three approaches that, while each has its limits, can shed light on your research question very quickly and easily. In many ways, these are “gateway drugs” into the deeper reaches of data visualization. In this section, then, we briefly explore word clouds (via Wordle), as well as the comprehensive data analysis suite Voyant Tools. There are alternatives, of course, that can do many of the same kinds of analysis, including Many Eyes (many-eyes.com), an IBM product. It is also very simple to use – one simply copies and pastes one’s text or numeric data into a text box on the website, and fairly smart algorithms take care of the rest – but the price for this simplicity is the transfer of intellectual property rights in your data to IBM. The data also becomes available to any other user. For these reasons we will not discuss it any further.

The simplest data visualization is a word cloud. In brief, word clouds are generated through the following process. First, a computer program takes a text and counts how frequently each word appears. In many cases, it will normalize the text to some degree, or at least give the user options: if “racing” appears 80 times and “Racing” appears 5 times, you may want them to register as a total of 85 occurrences of one term. Of course, there may be other times when you do not, for example if there is a character named “Dog” as well as generic animals referred to as “dog.” You may also want to remove stop words (very common words such as “the” and “of”), which add little to the final visualization in many cases. Second, after generating a word frequency list and incorporating these modifications, the program puts the words in order, sizes them by frequency, and begins to print them. The most frequent word is printed largest (and usually in the centre). The second most frequent is a bit smaller, the third most frequent a bit smaller than that, and so on through dozens of words. While, as we shall see, word clouds have strong critics, they are a very useful entryway into the world of basic text mining.
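The two steps described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the algorithm Wordle itself uses: the function names, the tiny stop-word list, and the linear scaling of font sizes are all assumptions made for the example.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list (an assumption for this sketch;
# real tools ship with much longer lists).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "at"}


def word_frequencies(text, lowercase=True, stop_words=STOP_WORDS):
    """Step one: count words, optionally folding case and dropping stop words."""
    if lowercase:
        # "Racing" and "racing" become the same term.
        text = text.lower()
    # Treat runs of letters (with internal apostrophes) as words.
    words = re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", text)
    return Counter(w for w in words if w.lower() not in stop_words)


def font_sizes(counts, top_n=50, min_size=10, max_size=72):
    """Step two: order words by frequency and scale a font size for each.

    Linear scaling between min_size and max_size is an assumption;
    a real word cloud program may scale differently.
    """
    top = counts.most_common(top_n)
    if not top:
        return []
    highest, lowest = top[0][1], top[-1][1]
    span = max(highest - lowest, 1)  # avoid dividing by zero
    return [
        (word, min_size + (max_size - min_size) * (count - lowest) / span)
        for word, count in top
    ]


# An invented sample sentence, used only to exercise the functions.
sample = "Racing is fun. The racing dog was racing at the track. Dog racing!"
counts = word_frequencies(sample)
case_sensitive_counts = word_frequencies(sample, lowercase=False)
sizes = font_sizes(counts, top_n=2)
```

Calling `word_frequencies` with `lowercase=False` keeps “Dog” and “dog” as separate terms, which is the choice you would make in the character-name case described above.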

Try to create one yourself using the website Wordle.net. You simply click on ‘Create,’ paste in a bunch of text, and see the results. For example, you could take the plain text of Leo Tolstoy’s novel War and Peace (a byword for an overly long book, with over half a million words) and paste it in: you would see major character names that recur throughout (Pierre, Prince, Natasha), locations (Moscow), and themes (warfare, and nations such as France and Russia), and at a glance get a sense of what this novel might be about (Figure 3.1). As English speakers, we have elected to use a translated version of the text.


[insert Figure 3.1 War and Peace as a word cloud]

Yet with such a visualization the main downside becomes clear: we lose context. Who are the protagonists? Who are the villains? As adjectives are separated from the concepts they describe, we lose the ability to derive meaning. For example, a politician may speak of “taxes” frequently, but from a word cloud it is difficult to learn whether the references are positive or negative. Even with these shortcomings in mind, however, historians can find utility in word clouds. If we are concerned with change over time, we can trace how word use evolved in historical documents. While this is fraught with issues – words change meaning over time, different terms are used to describe similar concepts, and we still face the issues outlined above – we can arguably still learn something from this.

Take the example of a dramatically evolving political party in Canada: the New Democratic Party (NDP), which occupies a similar space on the political spectrum to Britain’s Labour Party. We understand that most of our readers are not Canadian, so a little background will help with the example. The party had its origins in agrarian and labour movements, and was formed in 1932 as the Co-operative Commonwealth Federation (CCF). Its defining and founding document was the ‘Regina Manifesto’ of 1933, drafted at the height of the Great Depression. Let’s visualize it as a word cloud (Figure 3.2):


[insert Figure 3.2 The Regina Manifesto as word cloud]

Stop! What do you think that this document was about? When you have a few ideas, read on for our interpretation.

At a glance, we argue, you can see the main elements of the political spirit of the movement. We see the urgency of the need to profoundly change Canada’s economic system in the boldness of “must,” an evocative call that change needed to come immediately. Other main words such as “public,” “system,” “economic,” and “power” speak to the movement’s critique of the economic system, and words such as “capitalist,” “socialized,” “ownership,” and “worker” speak to a socialist frame of analysis. You can piece together key components of the platform from this visualization alone.

For historians, though, the important element is change over time. Remember, we need to keep in mind that the meanings of words might change. Let’s take two other major documents within this single political tradition. In 1956, in the context of the Cold War, the CCF released its second major political declaration in Winnipeg. Again, we visualize it as a word cloud (Figure 3.3):


[Insert Figure 3.3 The CCF’s Winnipeg Declaration of 1956, as word cloud]

New words appear, representing a new thrust: “opportunity,” “freedom,” “international,” “democratic,” “world,” “resources,” and even “equality.” Compared to the more focused, trenchant words found in the earlier manifesto, we see a different direction here. There is more focus on the international, Canada receives more attention than before, and, most importantly, words like “socialized” have disappeared. Indeed, the CCF here was beginning to change its emphasis, backing away from overt calls for socialism. But the limitations of the word cloud also rear their head: take, for example, the word “private.” Is private good or bad? Is opportunity good or bad? Is freedom good or bad? Without context, we cannot know from the image alone. But the changing words are useful.

For a more dramatic change, let’s compare it to modern platforms. Today’s New Democratic Party grew out of the CCF in 1961 and continues largely as an opposition party (traditionally a third party, although in 2011 it was propelled to second-party Official Opposition status). By 2012, what do party platforms speak of? Figure 3.4 gives us a sense:


[insert Figure 3.4 NDP platform for the 2012 Canadian General Election]

We see taxes, families, Canada (which keeps building in significance over the period), work, employment, funding, insurance, homes, and so forth. In three small images, we have seen a political party evolve from an explicitly socialist party in 1933, through a waning of that socialism in the Cold War climate of 1956, to the mainstream political party that it is today.
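The comparison we have just made by eye, looking for words that appear, disappear, or persist from one document to the next, can also be sketched in code. The following is a minimal illustration under stated assumptions: the function names and the small stop-word list are hypothetical, and the two short “documents” are invented placeholder sentences, not quotations from the actual manifestos.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "and", "in", "a", "of", "to"}


def top_terms(text, n=20, stop_words=STOP_WORDS):
    """Return the set of the n most frequent non-stop words in a text."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop_words]
    return {word for word, _ in Counter(words).most_common(n)}


def vocabulary_shift(earlier_text, later_text, n=20):
    """Compare the top terms of two documents.

    Returns three sorted lists: terms that appeared, terms that
    disappeared, and terms that persisted between the two texts.
    """
    earlier = top_terms(earlier_text, n)
    later = top_terms(later_text, n)
    return sorted(later - earlier), sorted(earlier - later), sorted(earlier & later)


# Invented placeholder sentences, NOT quotations from any real document.
earlier_doc = "socialized ownership must replace the capitalist system"
later_doc = "freedom and opportunity must grow in a democratic system"

appeared, disappeared, persisted = vocabulary_shift(earlier_doc, later_doc)
```

With real documents you would paste in the full plain text of each and choose `n` to match the number of words your word cloud displays. The same caution applies here as to the images: the lists tell you which words changed, not what those words meant to their authors.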

Thus, while they need to be used with caution, we believe that word clouds can tell us something. On his blog, digital historian Adam Crymble ran a quick study to see if historians would be able to reconstruct the contents of documents from these word clouds: could they look at a word cloud of a trial, for example, and correctly ascertain what the trial was about? He noted that while substantial guesswork is involved, “an expert in the source material can, with reasonable accuracy, reconstruct some of the more basic details of what’s going on.”[1] Word clouds also invert the traditional historical process: rather than looking at documents that we think may be important to our project and pre-existing thesis, we are looking at documents more generally to see what they might be about. With Big Data, it is sometimes important to let the sources speak to you, rather than looking at them with preconceptions of what you might find.

Word clouds need to be used cautiously. They do not explain context: you can see that “taxes,” for example, is mentioned repeatedly in a speech, but you would not be able to learn whether the author liked taxes, hated taxes, or was simply telling the audience about them. This is the biggest shortcoming of word clouds, but we still believe that, as a complement to wider reading and other forms of inquiry, they are a quick and easy entryway into the world of data visualization. Once we are using them, we are visualizing data, and it is only a matter of how. In the pages that follow, we move from this very basic stage to other basic techniques, including AntConc and Voyant Tools, before moving into more sophisticated methods involving text patterns (or regular expressions), spatial techniques, and programs that can detect significant patterns and phrases within your corpus.
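To make the missing-context problem concrete, here is a minimal sketch of one common remedy: listing each occurrence of a word together with the few words on either side of it (often called a “keyword in context” view). The function name and its `window` parameter are hypothetical, chosen for this example, and the short “speech” is an invented sentence rather than a real source.

```python
import re


def keyword_in_context(text, keyword, window=3):
    """List each occurrence of keyword with `window` words on either side."""
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


# An invented example sentence, NOT taken from any real speech.
speech = "We will lower taxes for families. High taxes hurt small business."
hits = keyword_in_context(speech, "taxes", window=2)

for left, word, right in hits:
    print(f"{left:>20} | {word} | {right}")
```

Where a word cloud would only report that “taxes” occurs twice, this view shows the neighbouring words, which is often enough to tell whether the speaker is for or against them. Note that this simple sketch ignores sentence boundaries, so a context window can run across a full stop.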



[1] Adam Crymble, “Can We Reconstruct a Text from a Wordcloud?” 5 August 2013, Thoughts on Digital and Public History, http://adamcrymble.blogspot.ca/2013/08/can-we-reconstruct-text-from-wordcloud.html.


Source: http://www.themacroscope.org/?page_id=633