An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Normalizing and Tokenizing Your Data

Imagine that you’re using a computer to explore the Old Bailey Online, trying to train it to find the things that you would find if you read it yourself. There are questions that you may take for granted that we need to lay out for the computer. For example:

  • Do you want to pay attention to case? If we are counting words, should we treat “hello” and “Hello” the same, or treat them differently? What if certain words are capitalized, such as those at the beginning of a chapter, or names? The Old Bailey Online, for example, capitalizes names at times (e.g., “BENJAMIN BOWSEY” at one point, but perhaps “Bowsey’s” elsewhere).
  • What about punctuation? Should “Benjamin’s” and “Benjamin” be treated the same, or differently?
  • Do you want to count common words? If you count all the words in a document, chances are that words like “the,” “it,” “and,” and so forth will appear frequently. This may occlude your analysis, and it may thus be best to remove the words: they do not tell you anything particular about the case itself.

In general, when counting words or doing more complicated procedures such as topic modeling, we go through these steps and decide to normalize, or make normal, all of the text. All text is made lower-case (a simple Python command), punctuation is stripped out, and common words are removed based on a stop-word dictionary (“the”, “it”, etc.).
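The three steps just described can be sketched in a few lines of Python. This is a minimal illustration rather than a recommended pipeline: the function name `normalize` and the tiny `STOP_WORDS` set are our own assumptions for the example, and a real project would use a much longer stop-word list suited to its sources.

```python
import string

# A tiny, illustrative stop-word list (an assumption for this sketch);
# real projects use a much longer list.
STOP_WORDS = {"the", "it", "and", "a", "of", "was"}

def normalize(text, stop_words=STOP_WORDS):
    """Lower-case the text, strip punctuation, and drop stop words."""
    lowered = text.lower()
    # Treat curly quotes and apostrophes as punctuation too.
    punctuation = string.punctuation + "‘’“”"
    stripped = lowered.translate(str.maketrans("", "", punctuation))
    # Split on whitespace and keep only words not in the stop list.
    return [word for word in stripped.split() if word not in stop_words]

words = normalize("The Prisoner, BENJAMIN BOWSEY, was indicted.")
print(words)
```

Note that stripping punctuation this way turns “Benjamin’s” into “benjamins”, which is not the same token as “benjamin”. Whether that matters is exactly the kind of decision raised in the list above.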

For historians, there are no simple answers: we work with diverse source bases, and thus a conceptual understanding of normalization is more important than knowing the specific code. The trick is to understand your documents and to be prepared to do this a few times. In some cases, the solution can be as simple as a “find and replace,” either using Python (as taught by the Programming Historian) or even in your favourite word processor.

Here are some places where normalization will be essential, from simple to more complicated. On a simple level, perhaps you are studying shifting mentions of newspapers. Take the Canadian newspaper, the Globe and Mail. If in a document it is mostly spelled Globe and Mail, but occasionally Globe & Mail, and even more infrequently G & M, you would want to capture all three of those under one category, called a lemma. A find and replace could help fold those together in a document. A slightly more complicated case would be currency: $20, twenty dollars, twenty bucks, $20.00, 20$, and other colloquial usages. If you are interested in financial flows, you do not want to miss these things, and a computer will miss them unless the text is normalized. Finally, on a more complicated level, comes stemming: reducing words down to their core concept, e.g. “buying”/“bought”/“buy” to “buy”, or “sold”/“selling”/“sell” to “sell”. (Strictly speaking, irregular forms such as “bought” and “sold” call for lemmatization, which looks words up in a dictionary, since a simple suffix-stripping stemmer will not map them to “buy” and “sell”.)
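As a rough sketch of the simplest case, the newspaper variants can be folded together with a pattern-based find and replace. The patterns and the function name `fold_variants` below are assumptions made for illustration; your own documents will dictate which variants need catching, and currency would require a longer list of patterns than is shown here.

```python
import re

# Hypothetical variant patterns, each mapped to one canonical spelling.
VARIANTS = [
    (re.compile(r"\bGlobe\s*(?:and|&)\s*Mail\b"), "Globe and Mail"),
    (re.compile(r"\bG\s*&\s*M\b"), "Globe and Mail"),
]

def fold_variants(text, variants=VARIANTS):
    """Replace every known variant with its canonical form."""
    for pattern, canonical in variants:
        text = pattern.sub(canonical, text)
    return text

folded = fold_variants("The Globe & Mail and the G & M agreed.")
print(folded)
```

Keeping the variants in a list makes it easy to add new ones as you discover them on a second or third pass through the documents.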

Tokenization is another key concept. It involves taking a sentence and breaking it down into its smallest individual parts, which can then be easily compared to each other. Let’s take a short phrase, as used in the Programming Historian, after basic normalization: “it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness.” If we turn the sentence into tokens, based on word boundaries, we get the following single tokens: “it”, “was”, “the”, “best”, “of”, “times”, “it”, “was”, “the”, “worst” … and so on.
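A minimal way to produce those tokens in Python is to pull out every run of word characters with a regular expression. This is only one of several possible approaches (splitting on whitespace would give the same result for this already-normalized phrase), and the variable names are our own.

```python
import re

phrase = ("it was the best of times it was the worst of times "
          "it was the age of wisdom it was the age of foolishness")

# \w+ matches each unbroken run of letters, digits, or underscores,
# which approximates splitting on word boundaries.
tokens = re.findall(r"\w+", phrase)

print(tokens[:10])
```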

Why does it matter that we’ve taken a straightforward sentence and turned it into a bunch of individual words? These are now easy to count, and we can tally up the number of times they appear: “times” twice, “best” once, and “was” four times. From this, one can apply a list of common words and remove them. More importantly, this stage is a crucial one for catching any normalization errors that might have crept in: does “can” appear frequently, as well as “t”? Maybe you’re breaking up “can’t” and creating two word tokens where only one should be, which may throw off your analysis. When a computer counts words, it tokenizes them before counting them. Tokenizing the text yourself helps ensure that you catch these niggling little errors. You may want to tokenize contractions so that “can’t” stays together. Similarly, you may want to treat a URL as http://macroscope.org rather than as “http,” “macroscope,” and “org.” With a large enough corpus, meaningful information can be retrieved by counting data. But if data are not normalized and tokenized, it is easy to miss something.
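The sketch below illustrates both points: tallying tokens, and choosing a tokenizer that keeps contractions and URLs whole. The pattern `TOKEN` and the function `tokenize` are assumptions written for this example, not a standard recipe. In particular, the URL rule simply grabs everything up to the next space, so a URL followed directly by a comma or full stop would keep that punctuation.

```python
import re
from collections import Counter

# Match a whole URL first; otherwise match a word that may contain
# internal apostrophes (straight or curly), as in "can't".
TOKEN = re.compile(r"https?://\S+|\w+(?:['’]\w+)*")

def tokenize(text):
    """Lower-case the text and return its tokens as a list."""
    return TOKEN.findall(text.lower())

phrase = ("it was the best of times it was the worst of times "
          "it was the age of wisdom it was the age of foolishness")
counts = Counter(tokenize(phrase))
print(counts["times"], counts["best"], counts["was"])

# A naive word-character tokenizer splits the contraction in two...
naive = re.findall(r"\w+", "can't")
# ...while the pattern above keeps contractions and URLs together.
careful = tokenize("I can't find http://macroscope.org today")
print(naive, careful)
```

Printing a frequency table like `counts` and scanning it for oddities such as a stray “t” is a quick way to spot tokenization problems before they distort an analysis.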

Source: http://www.themacroscope.org/?page_id=627