An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Building the Historian’s Toolkit

(note to our CommentPress readers: this will come very early in the book)

Let’s begin to delve into things, shall we? We like the metaphor of a “toolkit” for historians beginning to tackle Big Data problems: an assortment of various software programs that can each shed new light on the past. Some of these programs have explicit tool-like names: MALLET, the MAchine Learning for LanguagE Toolkit; others, more methods-based names such as Voyant, the Stanford Topic Modelling Toolkit; others, more whimsical names such as Python, a programming language named after the 1970s British comedy series Monty Python’s Flying Circus.1 They all have two shared characteristics: they are free and open source, and they have all been fruitfully employed by historians in the past.

In Stephen King’s On Writing: A Memoir of the Craft, he tells us a story about his Uncle Oren. It’s an instructive story for historians. The screen door was broken. Uncle Oren lugged the enormous toolbox that had once belonged to his father to the door, found that only a small screwdriver was required, and fixed the door. In response to his young nephew’s question about why he didn’t just grab the screwdriver in the first place, Uncle Oren said, “I didn’t know what else I might find to do once I got out here, did I? It’s best to have your tools with you. If you don’t you’re apt to find something you didn’t expect and get discouraged.” Working with digital materials is a bit like King’s suggestions for writing. You want to have a variety of tools handy, in order to ask different kinds of questions, and to see different kinds of patterns.

The tools we use all presuppose different ways of looking at the world. Wolfram Alpha, for instance (the so-called ‘computational knowledge engine’), allows one to parse text automatically, assigning each name it finds a gender tag.2 But what of the name ‘Shawn’? Is that a male name, or a female name? Consider Shawn Graham (male) and Shawn Colvin (female). Who compiled the list of names for this tool? How were they tagged? Who decides that ‘Shawn’ is a male or female name? What about names that at one point were predominantly male (Ainsley) but are now more typically used by females? Alison Prentice, in her brave mea culpa ‘Vivian Pound was a Man?’, discusses how her research into the history of women members of the University of Toronto’s Physics Department revolved around the figure of Vivian Pound, whom she thought was a woman:3

“…it was certainly a surprise, and not a little humbling, to learn in the spring of 2000 that Pound, a physicist who earned a doctorate from the University of Toronto in 1913, was not a female of the species, as I had thought, but a male. In three essays on early twentieth-century women physicists published between 1996 and 1999, I had erroneously identified Vivian Pound not only as a woman, but as the first woman at the University of Toronto to earn a Ph.D. in physics.” (Ibid., 99)

In Prentice’s case, the tool – simple lists of names – led her to erroneous outcomes. Her work demonstrates how to reflect and build outwards from such reflection. This is an essential component of using digital tools (indeed, tools of any kind). In later chapters we will discuss in more detail how the worldviews built into our tools can lead us similarly astray.

Automatic Retrieval of Data

First, however, we need data. How can historians find Big Data repositories to work on? As with everything, this depends on your unique needs. We want to use a few different examples that scale in difficulty. At its easiest, the document is sitting there on one or two webpages and can be downloaded with a few clicks of your mouse, as in the case of the diary of John Adams, the second American president. With a little more effort, we can automatically access large collections from the exhaustive Internet Archive. And with some added effort and a few more lines of commands, we can access even larger collections from national institutions including the Library of Congress and Library and Archives Canada. Join us as we bring Big Data to your home computer.

Have you ever sat at your computer, looking at a long list of records held at a place like Library and Archives Canada, the British Library, or the Library of Congress, right-clicking and downloading each record, one painstaking record at a time? Or have you pulled up a query in the Internet Archive that brings up hundreds of results? There are tools to ease this sort of work. There is a bit of a learning curve, in that the first download might actually take longer than if you did it manually, but the savings are cumulative: the next collection will download far quicker, and from then on, you are saving time with every query.

One of the most powerful ways for a historian to download data is the free and open-source program wget, available for all major platforms. It allows you to set up automated, rule-based, and recursive downloads (recursive meaning that the same process is repeated in a similar way every time) that can speed up your data gathering by several orders of magnitude. If you ever find yourself staring at a list, right-clicking and downloading each individual file in a repetitious way, fear not: this is the situation that something like wget was designed for.

Unlike many other software programs that you may be familiar with, wget runs on a command-line interface. If you were a computer user in the 1980s or early 1990s, this will be familiar: MS-DOS was a command-line interface. If not, however, this may seem somewhat foreign. Most users today interact with computer systems through a graphical user interface, which lets you navigate files and programs through images (icons, pictures, and rendered text). Several of the tools discussed in this book use the command line. While this has a bit of a learning curve, it offers a degree of precision and – eventually – speed that compensates for the initial difficulty.
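To make the idea of a rule-based, recursive download a little more concrete, here is a minimal sketch in Python that only assembles a wget command and does not run it. The function name `build_wget_command`, the placeholder address at example.org, and the particular wait and rate values are our own illustrative assumptions; the wget options themselves (`--recursive`, `--no-parent`, `--wait`, `--limit-rate`) are standard ones.

```python
def build_wget_command(url, wait_seconds=2, rate_limit="20k"):
    """Assemble the arguments for a polite, recursive wget download.

    This helper is hypothetical: it only builds the command, it does not run it.
    """
    return [
        "wget",
        "--recursive",                 # follow links and fetch what they point to
        "--no-parent",                 # never climb above the starting directory
        f"--wait={wait_seconds}",      # pause between requests
        f"--limit-rate={rate_limit}",  # cap the bandwidth used
        url,
    ]

# example.org is a placeholder, not a real collection.
command = build_wget_command("http://example.org/collection/")
print(" ".join(command))
```

The printed line is what you would type at the command line yourself. The wait and rate limits are there as a courtesy to the institution hosting the records, so that an automated download does not overwhelm its server.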

<BEGIN SIDEBAR – IN THE FINAL VERSION, THIS WILL BE OFF TO THE SIDE OF THE PAGE>

Wget can be installed on your system relatively easily. Linux users will usually find it pre-installed. Mac users have a slightly more complicated procedure. From the App Store or from the Apple website itself, install ‘Xcode.’ Once it has installed, install the ‘Command Line Tools’ kit from the ‘Preferences’ tab in the program. With that complete, a popular package manager named Homebrew can help you install wget in a few lines of code. Open your ‘Terminal’ window and type the following to install Homebrew:

ruby -e "$(curl -fsSL https://raw.github.com/mxcl/homebrew/go)"

Check that Homebrew is configured correctly by typing:

brew doctor

Then install wget with the following command:

brew install wget

Homebrew will then download and install wget on your system.

For Windows users, the easiest way is to download WGET for Windows. Simply save the file (wget.exe) to your C:\Windows directory so that you can access it from anywhere on your system.4 Then it is as easy as opening up your command line (which is ‘cmd.exe’ and can be found by searching for that term through your Start Menu or finding it under ‘Accessories’). If you have saved wget.exe into the C:\Windows directory, typing wget at the prompt will work.

<END SIDEBAR>

How to Become a Programming Historian, A Gentle Introduction

While there is certainly less stigma than there used to be, the idea of a “programming historian” can still lead to a raised eyebrow or two, or an easy laugh line at times.5 Of course, as Chapter One discussed, historians have been actively programming since the 1970s as part of the first two waves of computational history. The difference with the third wave (open-source software, powerful computing, and big data) is that the barriers to entry have never been lower. This is most obvious with the open-source, open-access textbook the Programming Historian and its sequel, the aptly-named Programming Historian 2. In short, the Programming Historian 2 has a simple goal: to teach the basics of programming, drawing on historical examples and with an emphasis on bringing you up to speed quickly and practically. It is not a computer science course: it is instead a series of hands-on examples, focused on the what and how rather than the underlying architecture that drives programming languages. It’s also a challenge to conventional forms of publishing, being a “community-driven collaborative textbook,” soliciting submissions, constantly refining, and inviting comments at all stages.

In early May 2008, the first iteration of the book went online, co-authored by two historians from the University of Western Ontario, William J. Turkel and Alan MacEachern. They came at it with varying levels of experience: Turkel, a life-long programmer, and MacEachern, who began programming only on New Year’s Day 2008.6 Their argument for why historians need to program was both extensive and disarmingly simple. In short, as they put it, “if you don’t program, your research process will always be at the mercy of those who do … programming is for digital historians what sketching is for artists or architects: a mode of creative expression and a means of exploration.”7 This book, and its second iteration as a collaboratively authored textbook (at programminghistorian.org), is the easiest way for a historian to learn the basics of historical programming.

The Programming Historian 2 uses all open-source software: the Python programming language, the Komodo Edit editing environment, the Omeka digital exhibit platform, and the MAchine Learning for LanguagE Toolkit (MALLET). The most important lessons, in keeping with the textbook’s name, involve the Python programming language. Beginning with the simple steps of installing the language itself onto your system, be it Linux, OS X, or Windows, it then moves through increasingly complicated lessons: automatically retrieving data, breaking it down into constituent parts, counting words, and creating basic visualizations. In short, within a few hours a user moves from having no programming knowledge to being able to interact programmatically with the exhaustive holdings of the Old Bailey Online.

In our experience as researchers and teachers, the word “programming” occasionally raises hackles. Beyond practical concerns of not having sufficient digital fluency to feel up to the task, there is the issue that programming can seem antithetical to the humanistic tradition. The lessons contained within the Programming Historian 2, however, as well as those in this book, show that programming should fundamentally be understood as a creative undertaking. Humanistic assumptions underlie the decisions that you will be making, and programming gives you more control over the processes that produce the results that we then work into our narratives. Yes, it requires attention to detail (a misplaced bracket here or there can throw off your program until you find the error), but for the most part we believe that it is a rewarding experience. Indeed, there’s something to be said for a process that gives you immediate feedback (your program works or crashes!) as opposed to the long cycles of feedback that we and our students are used to.

In this section, we do not intend to replicate the Programming Historian 2, which is both freely available online and continually updated to incorporate changes in operating systems, plugins, and even best practices. Instead, we introduce you to what we consider to be the basic conceptual lessons needed by a humanist, and provide context for why these issues are important. We encourage you to follow the instructions provided online, install Python, and run through the very basic lessons discussed below.

Normalizing and Tokenizing Data

Imagine that you’re using a computer to explore the Old Bailey Online, trying to train it to find the things that you would find if you read the records yourself. There are questions that you may take for granted that we need to spell out for the computer. For example:

  • Do you want to pay attention to case? If we are counting words, should we treat “hello” and “Hello” the same, or treat them differently? What if certain words are capitalized, such as those at the beginning of a chapter, or names? The Old Bailey Online, for example, capitalizes names at times (e.g. “BENJAMIN BOWSEY” at one point, but perhaps “Bowsey’s” elsewhere).
  • What about punctuation? Should “Benjamin’s” and “Benjamin” be treated the same, or differently?
  • Do you want to count common words? If you count all the words in a document, chances are that words like “the,” “it,” “and,” and so forth will appear frequently. This may occlude your analysis, and it may thus be best to remove the words: they do not tell you anything particular about the case itself.

In general, when counting words or doing more complicated procedures such as topic modelling, we go through these steps and decide to “normalize,” or make normal, all of the text. Text is all made lower-case (a simple Python command), punctuation is stripped out, and common words are removed based on a list of stop words (“the”, “it”, etc.).
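The three normalization steps just described can be sketched in a few lines of Python. This is a minimal illustration rather than the Programming Historian’s own code: the function name `normalize`, the tiny stop-word set, and the sample sentence are all our own assumptions, and a real stop-word list would be far longer.

```python
import string

# A tiny, illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "it", "and", "a", "of", "was", "to", "in"}

def normalize(text, stop_words=STOP_WORDS):
    """Lower-case the text, strip punctuation, and drop stop words."""
    lowered = text.lower()
    # Strip ASCII punctuation plus the curly quotes common in transcribed sources.
    to_strip = string.punctuation + "‘’“”"
    stripped = lowered.translate(str.maketrans("", "", to_strip))
    words = stripped.split()
    return [word for word in words if word not in stop_words]

# A made-up sentence in the style of a trial record.
sample = "BENJAMIN BOWSEY was indicted, and Bowsey's case was heard in the court."
print(normalize(sample))
```

Note what happens to the possessive: with the apostrophe stripped, “Bowsey's” becomes “bowseys”, which will be counted separately from “bowsey”. Whether that matters is exactly the kind of decision the questions above ask you to make.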

For historians, there are no simple answers: we work with diverse source bases, and thus a conceptual understanding of normalization is more important than knowing the specific code. The trick is to understand your documents and to be prepared to repeat the process a few times. For some cases, the solution can be as simple as a “find and replace,” either using Python (it is taught by the Programming Historian) or even in your favourite word processor.

Here are some places where normalization will be essential, from simple to more complicated. On a simple level, perhaps you are studying shifting mentions of newspapers. Take the Canadian newspaper, the Globe and Mail. If in a document it is mostly spelled Globe and Mail, but occasionally Globe & Mail, and even more infrequently G & M, you would want to capture all three of those under one category, called a lemma. A find and replace could help fold those together in a document. A slightly more complicated case would be currency: $20, twenty dollars, twenty bucks, $20.00, 20$, and other colloquial usages. If you are interested in financial flows, you do not want to miss these things – and a computer will unless the text is normalized. Finally, on a more complicated level, come stemming and the related technique of lemmatization: reducing words down to their core form, e.g. “buying”/“bought”/“buy” to “buy”, or “sold”/“selling”/“sell” to “sell”.
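The newspaper example can be sketched as a find and replace using Python’s built-in `re` module. The pattern, the function name `fold_variants`, and the sample sentence are illustrative assumptions; a real project would need to check its own sources for the variants that actually occur.

```python
import re

# Variants of the newspaper's name that we assume appear in the documents.
VARIANTS = re.compile(r"\bGlobe\s*(?:and|&)\s*Mail\b|\bG\s*&\s*M\b")

def fold_variants(text, canonical="Globe and Mail"):
    """Replace every known variant with one canonical spelling."""
    return VARIANTS.sub(canonical, text)

# A made-up sentence containing all three spellings.
sample = ("The Globe & Mail reported it; the G & M later ran a correction, "
          "as did the Globe and Mail.")
folded = fold_variants(sample)
print(folded)
```

After folding, a simple count of “Globe and Mail” captures all three mentions. The currency case is harder for the same approach, because the variants mix digits, words, and slang, which is why it sits a step further along the scale of difficulty.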

Tokenization is another key concept. Let’s take the short phrase, as used in the Programming Historian after basic normalization: “it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness.” If we turn the sentence into tokens, based on word boundaries, we get the following single tokens: “it”, “was”, “the”, “best”, “of”, “times”, “it”, “was”, “the”, “worst” … and so on. These are easy to count, then, and we can tally up the number of times they appear: “times” twice, “best” once, “was” four times. From this, one can apply a list of common words and remove them. More importantly, this stage is a crucial one for catching any normalization errors that might have crept in: do “can” and “t” both appear frequently? Maybe you’re breaking up “can’t” and creating two word tokens where only one should be, which may throw off your analysis. With a large enough corpus, meaningful information can be retrieved by counting data. But if data are not normalized and tokenized, it is easy to miss something.
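As a minimal sketch of tokenizing and counting, the following Python uses only the standard library. The variable names and the two regular expressions are our own assumptions, chosen to show how a careless tokenizer splits “can't” in two; the sketch uses a straight apostrophe for simplicity.

```python
import re
from collections import Counter

text = ("it was the best of times it was the worst of times "
        "it was the age of wisdom it was the age of foolishness")

# Tokenize on whitespace (the text is already normalized) and tally the tokens.
tokens = text.split()
counts = Counter(tokens)
print(counts["times"], counts["best"], counts["was"])

# A tokenizer that keeps only runs of letters breaks "can't" into "can" and "t".
naive_tokens = re.findall(r"[a-z]+", "you can't count that")

# Allowing one internal apostrophe keeps the contraction together.
careful_tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", "you can't count that")
print(naive_tokens)
print(careful_tokens)
```

Scanning the most frequent tokens for oddities such as a stray “t” is a quick way to spot this kind of problem before it distorts the counts.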

References
  1. “Why is it Called Python?” Python documentation, last updated 31 July 2013, available online, http://docs.python.org/2/faq/general.html#why-is-it-called-python, accessed 31 July 2013.
  2. As discussed in Ian Milligan, “Quick Gender Detection Using Wolfram|Alpha,” 28 July 2013, ianmilligan.ca, available online, http://ianmilligan.ca/2013/07/28/gender-detection-using-wolframalpha/
  3. Alison Prentice, “Vivian Pound was a Man? The Unfolding of a Research Project,” Historical Studies in Education/Revue d’histoire de l’éducation, 13, 2 (2001): 99-112, available online at http://historicalstudiesineducation.ca/index.php/edu_hse-rhe/article/view/1860/1961
  4. For example, at “WGET for Windows (Win32),” available online, http://users.ugent.be/~bpuype/wget/, accessed 31 July 2013.
  5. Here, an anecdote: when Ian Milligan accepted an award at the Canadian Historical Association’s annual meeting, the idea that historians might need to program was met with a pretty good chuckle from the audience.
  6. William J. Turkel and Alan MacEachern, “The Programming Historian: About this Book,” May 2008, available online, http://niche-canada.org/member-projects/programming-historian/ch1.html, accessed 12 August 2013.
  7. Turkel and MacEachern, “Programming Historian, ch. 2,” May 2008, available online, http://niche-canada.org/member-projects/programming-historian/ch2.html, accessed 12 August 2013.

Source: http://www.themacroscope.org/?page_id=330