
Topic Modeling by Hand

Let’s look at the Gettysburg Address:

Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battlefield of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicate, we can not consecrate, we can not hallow this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.

This is a single document. How many topics are present, and what are their most significant words? Take out some highlighters, and mark up this page. Use one colour to highlight words related to ‘war’ and another for words related to ‘governance’. You might find that some words get double-marked. As you do this, you are making some rather complicated inferences. What are ‘war’ words anyway? Have someone else do this exercise at the same time. Depending on your experience with 19th-century rhetoric, or politics, or history, your marked-up pages will be subtly different. One person will see that ‘field’ perhaps ought to be part of the ‘war’ topic, while another will not make that connection. What makes the difference? Fundamentally, given our individual backgrounds, it comes down to a matter of probability.

Then, list the ‘war’ words down one side of a page, and the ‘governance’ words on the other. Add a ‘count’ column beside your list in which you record the number of times each word appears. You might like to do this in a spreadsheet, so you can sort your lists with the most frequent words at the top. The Gettysburg Address, as reproduced here, has 271 words. The most frequently occurring word, ‘that’, is found 13 times, or roughly 5%. Add up your ‘counts’ column, and figure out the proportion of the total document that those words, those topics, account for by dividing by 271.
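If you would rather have a script do the tallying, here is a minimal sketch in Python. The filename gettysburg.txt and the short list of war words are our illustrative assumptions, not Hollis’ actual choices:

```python
from collections import Counter
import re

# 'gettysburg.txt' is an assumed filename holding the address as above
text = open("gettysburg.txt").read().lower()
words = re.findall(r"[a-z]+", text)

counts = Counter(words)
total = len(words)            # should be close to the 271 counted above

print(counts.most_common(5))  # 'that' should top the list
print(counts["that"] / total) # roughly 0.05

# Tally one hand-picked topic; this short list is illustrative only
war_words = {"war", "battlefield", "dead", "brave", "fought", "died"}
war_total = sum(counts[w] for w in war_words)
print(war_total / total)      # that topic's share of the document
```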

Hollis Pierce, an undergraduate student at Carleton, created the following spreadsheet, working this out for himself.

[Image: Hollis’ marked-up version]

His spreadsheet looks like this:

War Words Count
Dead 3
War 2
Gave 2
Living 2
Have 2
Engaged 1
Battlefield 1
Field 1
Final 1
Resting 1
Place 1
Lives 1
Live 1
Hallow 1
Ground 1
Brave 1
Struggled 1
Remember 1
Never 1
Forget 1
Fought 1
Nobly 1
Advanced 1
Honoured 1
Take 1
Cause 1
Died 1
Vain 1
Perish 1
Total 35

35/271 ≈ 13%

This was Hollis’ hand-made topic model, but the magical, computational part – deciding what constituted a topic – was done in his head. We decided, a priori, that there were two topics in this document. We pored over the words, and fitted them probabilistically into one or the other (and sometimes, both). The finished list of words and their counts is the distribution-over-words that characterizes a topic; the proportion of the entire document that those words account for describes the document’s composition. If we had several more documents, we could use the lists we’ve generated as a guide to colour-code, to mark up, those other documents. In the argot of topic modeling, this would be a ‘trained’ topic model (we use our intuition about the Gettysburg Address to find patterns in other documents). We could, however, run the same process from scratch on each of our new documents, and then iterate again through our lists, to understand the latent or hidden structure of our corpus as a whole. We should point out that while ‘document’ in everyday use means a diary entry, a single speech, or an entire book, for the purposes of data mining a document could be every paragraph within that book, or every 1,000 words.
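To see what the from-scratch version looks like when a computer does it, here is a minimal sketch using scikit-learn’s LatentDirichletAllocation, one of several tools that implement topic modeling. The four-‘document’ toy corpus is invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# An invented toy corpus; each string stands in for one 'document'
docs = [
    "the war dead fought on the battlefield",
    "government of the people by the people",
    "the nation endures the civil war",
    "a new birth of freedom for the nation",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)   # each document's composition

# The distribution-over-words that characterizes each topic
terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]
    print(f"topic {k}:", [terms[i] for i in top])

print(doc_topic.round(2))          # rows sum to 1: topic proportions
```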

When the computer does the work for us, it pays close attention to those words that might appear in multiple documents. Ted Underwood asks us to think about the word ‘lead’, which might be a verb, and thus part of a topic related to leadership (i.e., he took the lead in the charge), or it might be a noun, and thus part of a topic related to environmental contamination (i.e., lead in the pipes was responsible for the poisoning). How can we know the difference? That is, how can we encode our understanding of semantic differences and word usage in a series of steps for the computer to undertake? We ask the computer to figure out the probability that ‘lead’ belongs in a given topic, versus other topics. And we start by assigning words to topics at random. Hollis already knew that some words were more likely to be about war than governance; the computer does not, so it has to start with a blind guess. Ted Underwood puts it this way:

“For each possible topic Z we’ll multiply the frequency of this word type W in Z by the number of other words in the document D that already belong to Z”.

If you’ve read Nate Silver’s book The Signal and the Noise you’ll recognize that this is a Bayesian approach to probability. That is, as the computer goes through this process over and over for each word in the collection, it changes its assumptions about the distribution. Underwood goes on to say,

“As we do that, a) words will gradually become more common in topics where they are already common. And also, b) topics will become more common in documents where they are already common. Thus our model will gradually become more consistent as topics focus on specific words and documents. But it can’t ever become perfectly consistent, because words and documents don’t line up in one-to-one fashion. […] the tendency for topics to concentrate on particular words and documents will eventually be limited by the actual, messy distribution of words across documents.

That’s how topic modeling works in practice. You assign words to topics randomly and then just keep improving the model, to make your guess more internally consistent, until the model reaches an equilibrium that is as consistent as the collection allows.” (http://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/)
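To make Underwood’s description concrete, here is a toy sketch of that resampling loop, a simplified collapsed Gibbs sampler. The alpha and beta smoothing constants are our addition (a real sampler needs them to avoid zero probabilities), and real tools such as MALLET add many refinements:

```python
import random
from collections import defaultdict

def gibbs_lda(docs, K, iters=200, alpha=0.1, beta=0.01):
    """Toy collapsed Gibbs sampler for LDA. docs: list of token lists."""
    V = len({w for doc in docs for w in doc})

    # Start with a blind guess: every word gets a random topic
    z = [[random.randrange(K) for _ in doc] for doc in docs]
    doc_topic = [[0] * K for _ in docs]                # topic counts per document
    topic_word = [defaultdict(int) for _ in range(K)]  # word counts per topic
    topic_total = [0] * K
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Set this word aside, so we count only the *other* words
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Underwood's rule: the word's frequency in each topic,
                # times how common that topic already is in this document
                weights = [
                    (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
                    * (doc_topic[d][k] + alpha)
                    for k in range(K)
                ]
                t = random.choices(range(K), weights=weights)[0]
                z[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return doc_topic, topic_word

# Invented three-document usage example
docs = [s.split() for s in ("war dead fought battlefield war",
                            "people government people freedom",
                            "nation war dead nation")]
doc_topic, topic_word = gibbs_lda(docs, K=2)
```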

There is a fundamental difficulty, however. When we began looking at the Gettysburg Address, Hollis was instructed to look for two topics that we had already named ‘war’ and ‘governance’. When the computer looks for two topics, it does not know beforehand that there are two topics present, let alone what they might mean in human terms. In fact, we as the investigators have to tell the computer ‘look for two topics in this corpus of material’, at which point the machine will duly find two topics. At the time of writing, there is no easy way to determine the ‘best’ number of topics in a corpus. Instead, the investigator has to try out a number of different versions, examining the output at each step to see whether the model captures the thematic complexity of the corpus in a way that is useful for generating further insights.
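One rough way to support that trial-and-error numerically, continuing the scikit-learn sketch above, is to fit several candidate topic counts and compare scikit-learn’s built-in perplexity measure. Scoring on the training matrix, as here, is a shortcut; proper practice holds out documents, and a better score does not guarantee more interpretable topics, so reading the top words yourself still matters:

```python
# Continuing from the scikit-learn sketch above: try several topic counts
for k in (2, 3, 4, 5):
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(X)
    # Lower perplexity suggests a better statistical fit, nothing more
    print(k, round(lda.perplexity(X), 1))
```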

The late statistician George Box once wrote, “Essentially, all models are wrong, but some are useful”. A topic model is a way of fitting semantic meaning against a large volume of text. The researcher has to generate many models against the same corpus until she finds one that achieves Box’s utility. We create topic models not to prove that our idea about phenomenon x in the past is true, but rather to generate new ways of looking at our materials, to deform them.



Source: http://www.themacroscope.org/?page_id=47