An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Topic Modeling by Hand

Let’s look at the Gettysburg Address:

Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battlefield of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicate, we can not consecrate, we can not hallow this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.

This is a single document. How many topics are present, and what are their most significant words? Let us generate a topic model by hand. We have used this exercise in class as an introduction to text analysis more generally, but it can also highlight the differences in the ways computers ‘know’ something versus the ways historians ‘know’.

Take out some highlighters and mark up this passage (print out one of the readily available copies online if you do not want to sully this book). Use one colour to highlight words related to ‘war’ and another for words related to ‘governance’. You might find that some words get double-marked. As you do this, you are making some rather complicated inferences. What are ‘war’ words anyway? Have someone else do this exercise at the same time. Depending on your experience with 19th-century rhetoric, politics, or history, your marked-up pages will be subtly different. One person will see that ‘field’ perhaps ought to be part of the ‘war’ topic, while another will not make that connection. What makes the difference? Fundamentally, given our individual backgrounds, it comes down to a matter of probability.

Then, list the ‘war’ words down one side of a page, and the ‘governance’ words on the other. Add a ‘count’ column beside each list in which you record the number of times each word appears. You might like to do this in a spreadsheet, so you can sort your lists with the words that appear most often at the top. The Gettysburg Address, as reproduced here, has 271 words. The most frequently occurring word, ‘that’, is found 13 times, or roughly 5%. Add up your ‘counts’ column, and figure out the proportion of the total document that the words in each topic account for by dividing by 271. (Incidentally, if you visualize your results as a histogram, swapping out the bars in the chart for the words themselves, you have more or less created a word cloud.)
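If you would rather have the machine do this tallying, a minimal sketch in Python might look like the following. The text string is truncated here, and the ‘war’ list is only a placeholder for whatever you chose to highlight; paste in the full Address and your own lists.

    from collections import Counter
    import re

    # Paste the full text of the Gettysburg Address between the quotes.
    text = """Four score and seven years ago our fathers brought forth on this
    continent a new nation, conceived in liberty, and dedicated to the
    proposition that all men are created equal. ..."""

    # Lowercase the text and split it into words, stripping punctuation.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)

    print(len(words))             # total word count (271 for the full text)
    print(counts.most_common(5))  # most frequent words, e.g. ('that', 13)

    # The share of the document taken up by your hand-picked 'war' words.
    war_words = {"war", "battlefield", "dead", "brave", "fought"}  # your list here
    war_total = sum(counts[w] for w in war_words)
    print(war_total / len(words))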

Hollis Peirce, an undergraduate student at Carleton, created the following spreadsheet, working this out for himself (figure 4.1).

[Figure 4.1: Hollis Peirce’s marked-up copy of the Gettysburg Address]

His count looks like this:

War Words      Count
Dead           3
War            2
Gave           2
Living         2
Have           2
Engaged        1
Battlefield    1
Field          1
Final          1
Resting        1
Place          1
Lives          1
Live           1
Hallow         1
Ground         1
Brave          1
Struggled      1
Remember       1
Never          1
Forget         1
Fought         1
Nobly          1
Advanced       1
Honoured       1
Take           1
Cause          1
Died           1
Vain           1
Perish         1

Total: 35

35/271 ≈ 13%

This was Hollis’ hand-made topic model, but the magical, computational part – deciding what constituted a topic – was done in his head. We decided, a priori, that there were two topics in this document, and that they dealt specifically with ‘war’ and ‘governance’. We pored over the words, and fitted them probabilistically into one or the other topic (and sometimes, both). The finished list of words and their counts is the distribution-over-words that characterizes a topic; the proportion of the entire document that those words account for demonstrates the document’s topical composition. If we had several more documents, we could use the lists we have generated as a guide to colour-code, to mark up, these other documents. In the argot of topic modeling, this would be a ‘trained’ topic model (we use our intuition about the Gettysburg Address to find patterns in other documents). We can run the same process from scratch on each of our new documents, and then iterate again through our lists, to understand the latent or hidden structure of our corpus as a whole. We should point out that while ‘document’ in everyday use means a diary entry, a single speech, or an entire book, for the purposes of data mining a ‘document’ could be each paragraph within that book, or each chunk of 1,000 words.
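To make ‘distribution-over-words’ and ‘topical composition’ concrete, here is a tiny sketch in Python using a few of Hollis’s counts from the table above (turning the counts into probabilities is our illustration, not something his spreadsheet did):

    # A hand-made topic is just a list of words and their counts.
    # A few of Hollis's 'war' words, from the table above.
    war_counts = {"dead": 3, "war": 2, "gave": 2, "living": 2, "have": 2}

    total_war_words = 35     # the sum of all 29 'war' counts
    document_length = 271    # words in the Address as reproduced here

    # Distribution-over-words: each word's share of the topic.
    war_distribution = {word: count / total_war_words
                        for word, count in war_counts.items()}

    # Topical composition: how much of the document this topic accounts for.
    war_proportion = total_war_words / document_length

    print(war_distribution)          # dead is roughly 0.09, war roughly 0.06, ...
    print(round(war_proportion, 2))  # roughly 0.13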

When the computer does the work for us, it pays close attention to words that might appear in multiple documents. Ted Underwood asks us to think about the word ‘lead’, which might be part of a topic related to leadership (i.e., he took the lead in the charge) or part of a topic related to environmental contamination (i.e., lead in the pipes was responsible for the poisoning). How can we know the difference? That is, how can we encode our understanding of semantic differences and word usage in a series of steps for the computer to undertake? We ask the computer to figure out the probability that ‘lead’ belongs in a given topic, versus other topics. We also begin by assigning words to topics at random. Hollis already knew that some words were more likely to be about war than governance; the computer does not.

Instead, we instruct the computer to pick topics for us, and it begins with a series of blind guesses, assigning words to bins at random. The computer knows that a warehouse full of word bins exists, but it cannot see inside it. The topic model is the computer’s attempt at inferring the contents of each bin by looking at each document and working backwards to the topic bins from which it was likely drawn. The computer starts from the assumption that if several documents contain the same groups of words, those words likely form a ‘topic’. As the computer scans through the text over and over again, it reorganizes its initially random bins into closer and closer approximations of what it guesses the “real” topic bins must look like. Internally, the computer is optimizing for this problem: given a distribution of words over an entire collection of documents, what is the probability that this distribution of words within a document belongs to a particular topic?

This is a Bayesian approach to probability. Thomas Bayes was an 18th-century clergyman who dabbled in mathematics. He was interested in problems of conditional probability: the probability of something, in light of prior knowledge.[1] The formula which now bears Bayes’ name depends on assigning a prior probability, and then re-evaluating that probability in light of new evidence. As the computer goes through this process over and over for each word in the collection, it changes its assumptions about the distribution. In his book The Signal and the Noise, statistician Nate Silver examines the chances that you are being cheated on when you discover in your house a pair of underwear that does not belong to your partner.[2] To estimate the chances that you are being cheated on, you have to decide (or estimate) three conditions:

1. What are the chances that the underwear is there because you are being cheated on (call this ‘y’)?

2. What are the chances that the underwear is there because you are not being cheated on (call this ‘z’)?

3. And, before finding the underwear, what would you have estimated the chances to be that your partner was cheating on you (call this ‘x’, the prior probability)?

The formula is:

xy / (xy + z(1-x))

You can do the math for yourself. You can also feed your result back into the equation, changing your prior probability as new information comes to light.
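Here is that calculation as a short Python sketch; the particular numbers are placeholders of our own choosing, not Silver’s, and the second call simply shows how yesterday’s answer becomes today’s prior:

    def posterior(x, y, z):
        """Bayes' rule as written above: x is the prior probability of cheating,
        y the chance of the underwear turning up if you are being cheated on,
        z the chance of it turning up if you are not."""
        return (x * y) / (x * y + z * (1 - x))

    # Placeholder estimates: a 4% prior, a 50% chance of the evidence given
    # cheating, and a 5% chance of it given no cheating.
    x, y, z = 0.04, 0.5, 0.05

    p = posterior(x, y, z)
    print(round(p, 2))                   # roughly 0.29

    # Feed the result back in as the new prior when more evidence appears.
    print(round(posterior(p, y, z), 2))  # roughly 0.81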

Is this what topic modeling does? Yes, in essence, though the maths are a bit more complicated than this. Underwood writes,

“[...As we iterate our estimates, adjusting our probabilities, fitting words into topics, fitting topics across documents], a) words will gradually become more common in topics where they are already common. And also, b) topics will become more common in documents where they are already common. Thus our model will gradually become more consistent as topics focus on specific words and documents. But it can’t ever become perfectly consistent, because words and documents don’t line up in one-to-one fashion. […] the tendency for topics to concentrate on particular words and documents will eventually be limited by the actual, messy distribution of words across documents.

That’s how topic modeling works in practice. You assign words to topics randomly and then just keep improving the model, to make your guess more internally consistent, until the model reaches an equilibrium that is as consistent as the collection allows.”[3]
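A toy version of that loop, in Python, might look like the sketch below. This is a bare-bones, Gibbs-sampling style illustration of LDA, not the algorithm any particular tool ships with; the three mini-documents, the number of topics, and the alpha and beta smoothing values are all made-up placeholders.

    import random
    from collections import defaultdict

    random.seed(0)

    # Three made-up mini-documents; each is just a list of words.
    docs = [
        "war troops battlefield dead war".split(),
        "nation people government freedom nation".split(),
        "war dead soldiers nation government".split(),
    ]

    K = 2                    # the number of topics we tell the computer to find
    alpha, beta = 0.1, 0.01  # smoothing values, chosen arbitrarily here
    vocab = {w for doc in docs for w in doc}

    # Step 1: a blind guess -- assign every word to a topic at random.
    assignments = [[random.randrange(K) for _ in doc] for doc in docs]

    # Count tables the sampler keeps up to date.
    doc_topic = [defaultdict(int) for _ in docs]       # topics in each document
    topic_word = [defaultdict(int) for _ in range(K)]  # words in each topic bin
    topic_total = [0] * K
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = assignments[d][i]
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    # Step 2: sweep through the corpus over and over, re-guessing each word.
    for sweep in range(200):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Take this word out of the counts...
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # ...then score each topic: topics already common in this
                # document, and topics in which this word is already common,
                # are more likely to be chosen again.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * len(vocab))
                    for t in range(K)
                ]
                k = random.choices(range(K), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # The word 'bins' the sampler has settled on.
    for t in range(K):
        top = sorted(topic_word[t], key=topic_word[t].get, reverse=True)[:4]
        print("topic", t, ":", top)

Because the starting assignments are random, different runs can settle into somewhat different bins, which is one reason to inspect the output closely rather than treat any single run as definitive.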

There is a fundamental difficulty, however. When we began looking at the Gettysburg Address, Hollis was instructed to look for two topics that we had already named ‘war’ and ‘governance’. The computer does not know beforehand how many topics are present in a corpus, let alone what they might mean in human terms. In fact, we as the investigators have to tell the computer ‘look for two topics in this corpus of material’, at which point the machine will duly find two topics. At the moment of writing, there is no easily instantiated method to automatically determine the ‘best’ number of topics in a corpus, although this will no doubt be resolved. For the time being, the investigator has to try out a number of different scenarios to find out what’s best. This is not a bad thing, as it forces the investigator continually to confront (or even close-read) the data, the model, and the patterns that might be emerging.
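In practice, trying out different scenarios often just means fitting the same corpus with several different topic counts and comparing the results. Here is a minimal sketch using scikit-learn’s LDA implementation; the three sentences standing in for a corpus, and the particular topic counts tried, are placeholders of our own.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Placeholder corpus: in practice, read your documents from disk.
    documents = [
        "four score and seven years ago our fathers brought forth a new nation",
        "we are engaged in a great civil war testing whether that nation can endure",
        "government of the people by the people for the people shall not perish",
    ]

    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(documents)

    # Fit the same corpus with several different numbers of topics.
    for k in (2, 5, 10):
        lda = LatentDirichletAllocation(n_components=k, random_state=0)
        lda.fit(X)
        # Perplexity is one rough numerical check; close reading of the
        # resulting topics matters at least as much.
        print(k, "topics, perplexity:", round(lda.perplexity(X), 1))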

The late statistician George Box once wrote, “Essentially, all models are wrong, but some are useful.”[4] A topic model is a way of fitting semantic meaning against a large volume of text. The researcher has to generate many models against the same corpus until she finds one that is, in Box’s sense, useful.[5] We create topic models not to prove that our idea about phenomenon x in the past is true, but rather to generate new ways of looking at our materials, to deform them. In fact, there is a danger in using topic models as historical evidence: they are configurable and ambiguous enough that, no matter what you are looking for, you just might find it. Remember, a topic model is in essence a statistical model that describes the way topics are formed. It might not be the right model for your corpus. It is, however, a starting point, and the topics that it finds (or fails to find) should become a lens through which you look at your material, reading closely to understand this productive failure. Ideally, you would then re-run the model, tweaking it so that it better describes the kind of structure you believe exists. You generate a model to embody your instincts and beliefs about how the material you are working on was formed. Your model could represent ideas about syntax, or about the level of ‘token’ (n-grams of a particular length, for instance) that is appropriate to model. Then you use the algorithm to discover that structure in your real collection. And then repeat.[6]

If a historian ran a topic modeling tool on a series of political speeches, for example, it would return a list of topics and the keywords composing those topics. Each of these lists is a topic according to the algorithm. For political speeches, the lists might look like the following (a short sketch of how such lists are pulled out of a fitted model appears after the example):

1. Job Jobs Loss Unemployment Growth
2. Economy Sector Economics Stock Banks
3. Afghanistan War Troops Middle-East Taliban Terror
4. Election Opponent Upcoming President
5. … etc.
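Lists like these are read directly off a fitted model’s topic-word weights. Continuing the scikit-learn sketch above (lda and vectorizer are the objects defined there; the words that come out will depend entirely on your own corpus):

    import numpy as np

    # lda is a fitted LatentDirichletAllocation model and vectorizer the
    # CountVectorizer used to build its document-term matrix (see above).
    feature_names = vectorizer.get_feature_names_out()

    for topic_idx, weights in enumerate(lda.components_):
        # The five words carrying the most weight in this topic.
        top_words = [feature_names[i] for i in np.argsort(weights)[::-1][:5]]
        print("Topic", topic_idx, ":", " ".join(top_words))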

There are many dangers that face those who use topic modeling without fully understanding it.[7] For instance, we might be interested in word-use as a proxy for placement along a political spectrum. Topic modeling could certainly help with that, but we have to remember that the proxy is not in itself the thing we seek to understand – as Andrew Gelman demonstrates in his mock study of zombies using Google Trends.[8]

The tools we discuss below are organized in order of their progressive difficulty of use. The trade-off, of course, is that the easier a tool is to use, the less powerful (in certain regards) it tends to be. When we introduce topic modeling in our own classes, we like to begin with the ‘GUI Topic Modeling Tool’ to hook our students, and then, depending on the nature of the materials we’re working with, move to MALLET on the command line or the Stanford Topic Modeling Tool. Finally, in our own research, we find that we are more and more often using the various topic modeling packages available in the R statistical programming environment (which we access through RStudio), since R also allows us to manipulate and transform the results with (comparative) ease. Which tool to use, and how to use it, always depends on where you wish to go.



[1] Thomas Bayes and Mr. Price, “An Essay towards solving a Problem in the Doctrine of Chances,” Philosophical Transactions of the Royal Society of London 53 (1763): 370–418.

[2] Nate Silver, The Signal and the Noise: Why So Many Predictions Fail – but Some Don’t (New York: Penguin Press, 2012), 243–247.


[3] Ted Underwood, “Topic Modeling made just simple enough,” The Stone and the Shell, 7 April 2012, http://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/


[4] George E. P. Box and Norman R. Draper, Empirical Model-Building and Response Surfaces (Wiley, 1987), 424.

[5] When we describe topic modeling here, we have in mind the most commonly used approach, ‘Latent Dirichlet Allocation’ (LDA). There are many other possible algorithms and approaches, but most uses of topic modeling amongst digital humanists and historians treat LDA as synonymous with topic modeling. It is worth keeping in mind that there are other options, which might shed useful light on your problem at hand. A special issue of the Journal of Digital Humanities treats topic modeling across a variety of domains and is a useful jumping-off point for a deeper exploration of the possibilities: journalofdigitalhumanities.org. The LDA technique was not the first technique now considered topic modeling, but it is by far the most popular. The myriad variations of topic modeling have resulted in an alphabet soup of techniques and programs to implement them that might be confusing or overwhelming to the uninitiated; for the beginner it is enough to know that they exist. MALLET primarily utilizes LDA.


[6] David Blei, “Topic modeling and the digital humanities,” Journal of Digital Humanities, 2.1 (2012), http://journalofdigitalhumanities.org/2-1/topic-modeling-and-digital-humanities-by-david-m-blei/

[7] Scott Weingart, “The Myth of Text Analytics and Unobtrusive Measurement,” scottbot.net, 6 May 2012, http://www.scottbot.net/HIAL/?p=16713.

[8] Andrew Gelman, “‘How Many Zombies Do You Know?’ Using Indirect Survey Methods to Measure Alien Attacks and Outbreaks of the Undead,” arXiv:1003.6087 [physics], March 31, 2010, http://arxiv.org/abs/1003.6087/


Source: http://www.themacroscope.org/?page_id=791