An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Delving into Big Data


So far we have been looking at the overall contours of why going digital matters. Now we want to turn to what you can do with all of this information. In this respect, textual analysis and basic visualizations need to be part of our toolkit. While we will go into depth about these terms later, we first want to provide a brief sketch of some of the basic techniques that underlie the rest of the book. As a basic and gentle introduction, these techniques include:

  • Counting Words: How often does a given word appear in a document? We can then move beyond that and see how often a word appears in dozens, or hundreds, or even thousands of documents, and establish change over time.
  • N-Grams, or Phrase Frequency: When counting words, we are technically counting unigrams: the frequency of strings of one word. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically go to any higher number. A bigram can be a combination of two characters (such as ‘bi’), two syllables, or two words. An example of a two-word bigram: “canada is.” A trigram is three words, a quadgram is four words, a fivegram is five words, and so on.
  • Keyword-in-Context: This is important. Imagine that you are looking for a specific term that is also a more general word, such as the Globe and Mail newspaper, colloquially referred to as the Globe. If you wanted to see how often that newspaper appeared, a search for Globe would capture all of the appearances you were looking for, but also others: globe also refers to three-dimensional models of the Earth, the Earth itself, other sphere-shaped objects, cities like Globe, Arizona, and a number of other newspapers around the world. So in the following case:
    he read the globe and mail it
    picked up a globe newspaper in toronto
    jonathan studied the globe in his parlour
    favourite newspaper the globe and mail smelled
    the plane to globe arizona was late

    In the middle, we have the keyword we are looking for (globe) and on the left and right we have the context. Without requiring sophisticated programming skills, we can see in this limited sample of five that three probably refer to the Globe and Mail, one is ambiguous (one could study a globe of the Earth or the newspaper in one’s parlour), and one is clearly referring to the city of Globe, Arizona.

  • Line Chart: Many of the visualizations used in this book and elsewhere, such as the Google Books n-gram viewer, rely on a simple line graph.
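The first three techniques in the list above can be sketched in a few lines of Python. This is only a minimal illustration, not a method prescribed by this book: the function names (tokenize, ngrams, kwic), the three-sentence sample text, and the very simple rule for splitting text into words are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


# Hypothetical sample text, loosely based on the globe examples above.
sample = (
    "He read the Globe and Mail. "
    "Jonathan studied the globe in his parlour. "
    "The plane to Globe, Arizona was late."
)
tokens = tokenize(sample)

# Counting words: how often does each unigram appear?
word_counts = Counter(tokens)

# N-grams: how often does each two-word phrase (bigram) appear?
bigram_counts = Counter(ngrams(tokens, 2))

# Keyword-in-context: show each "globe" with up to three words on either side.
# Punctuation is discarded, so context can run across sentence boundaries.
for left, key, right in kwic(tokens, "globe"):
    print(f"{left:>25} | {key} | {right}")
```

Changing the second argument of ngrams from 2 to 3 would count trigrams instead, and a larger window in kwic would show more context around each hit. Real sources would need more careful tokenization than this sketch uses.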

None of these terms is meant to be intimidating. Rather, they are a gentle introduction to some of the major issues that you may encounter as you move forward in this area of research. Copyright matters, open access matters, and basic visualization terms can help you make sense of what other digital humanists are doing.



Source: http://www.themacroscope.org/?page_id=615