There are three issues of critical importance to understanding Big Data as a historian: the open access and open source movements, copyright, and what we mean by textual analysis. As discussed in the previous chapter, open source software lies at the heart of the third computational wave of history. Copyright looms large for historians working on 20th century topics: when accumulating large troves of data, researchers will almost certainly bump against these questions. Finally, textual analysis and basic visualizations form the heart of the lessons in this book. Yet, as we argue, chances are that you are already using textual analysis: you may just not be completely cognizant of it yet.

Many, although not all, of the tools discussed in this book are free software, or open source. While free software had been around since the 1980s, epitomized by projects such as GNU (a free operating system with a name stemming from “GNU’s Not Unix”) and its 1985 manifesto, 1998 saw it rise to the level of a movement.1 Christopher Kelty’s Two Bits carries out an indispensable anthropological and historical study of the movement, tracing its origins to the 1998 decision of Netscape to freely distribute the source code of its paradigm-breaking Netscape Communicator software. Netscape had been charging for its full-suite product, as opposed to Microsoft’s bundled browser, and the decision to release the source code was a path-breaking one.2 Defining open source can be tricky, as there are competing interests, but essentially to speak of “open source software” means adhering to the criteria laid out in the “Open Source Definition.”3 In short, open source software requires free redistribution, access to the source code, permission for derived works (modifications to the program), protection of the integrity of the authors’ source code, no discrimination, and general licenses with a few other restrictions.4

Open source software has tremendous, mostly positive, implications for digital historians, but there are some cautionary notes as well. The programs discussed in this book can be used for free, which has obvious benefits in an era of diminishing government budgets and austerity. More importantly, their open nature means that they are often modular and can be continually improved: one scholar may note an issue with a certain algorithm, and suggest a fix or implement one themselves. This lends itself well to experimentation, and a trip to the Digital Humanities annual conferences bears out the wisdom of this approach. For publicly-supported scholars, whether by virtue of being in a public university or holding funds from a granting council, it also offers a way to give something back to a community of practice or even the general public.

This ethos also informs other aspects of digital history. Digital history largely, but not completely, grew out of public history and is inspired by it. Websites and portals are created that allow the public to interact and engage with the past, and other scholars are beginning to call for the open source ethos to be applied to more scholarly practices: witness the rise of open access publishing. While open access may take an author-pays model (where the author pays the associated charges for his or her article) or be provided on a free or supported basis, it tries to extend the ethos of free access to the public. This book, for example, was inspired by this ethos in our decision to live-write it using the CommentPress platform, itself an open access WordPress theme and plugin.5

As with many movements, there is a darker side to some of this. Author-pays models can discriminate against junior scholars, those at non-research-intensive universities, and graduate students who may not have funds to support their research. Similarly, releasing free software is advantageous in many ways and perhaps even morally right for tenured and tenure-track faculty members who have generous salaries; it may be more of an imposition to ask the same of untenured and contingent researchers who have monetary need. These are all questions that digital historians, alongside the more specific research considerations discussed in this book, will have to ponder.

The second term that is an essential addition to a digital historian’s toolkit is copyright. This book does not purport to be a comprehensive guide to copyright law in your jurisdiction: it varies dramatically from country to country, and we are not lawyers. As historians, however, we want to provide context to the copyright wars and provide a call for our readers to pay considerable attention to this pressing issue.

Google emerged from the search engine battles of the early 2000s as the victor, controlling a dominant share of the user base for those who wanted to find information on the World Wide Web. Ironically, Tim Wu has suggested that copyright infringement may have been key to its success: its search index, first created in 1996, required a comprehensive copy of the World Wide Web. While it was an unsettled legal question at the time, nobody filed suit, so under prevailing legal norms this sort of mass copying for indexing is probably acceptable.6 In any case, Google was not yet out of the copyright weeds as it moved forward with its search empire.

Following its dominance of the World Wide Web search engine market, Google turned its attention to another large repository of knowledge: books. As early as late 2003, Google was approaching university librarians, scanning the first chapters of books, and quietly threading search results into users’ queries under the moniker Google Print. A year later, in December 2004, Google announced that it would enter into a partnership to digitize books held by Harvard University, Oxford University, the University of Michigan, Stanford University, and the New York Public Library.7 The project had several objectives: to democratize knowledge locked away in often-inaccessible elite institutions, to facilitate keyword searching throughout books as one could throughout the World Wide Web, and to gain users who could then be exposed to Google’s advertisements. Institutions participated to varying degrees: Michigan wanted its entire 7.4 million volume collection digitized, the NYPL digitized only out-of-copyright materials, and Harvard went forward with a 40,000 book trial. Books under copyright would only be available via a “snippet search.”8

This was a controversial project. France’s National Library raised concerns that this would give “prominence to Anglophone texts above those written in other languages,” due to the Anglophone nature of the five institutions.9 More importantly, many authors and publishers became agitated. Questions were asked by unsettled associations, such as the Association of American University Presses: how long and frequent were the snippets of copyrighted books, and would this have an effect on sales?10 As industry groups raised their voices in protest, including the large and powerful Association of American Publishers, Google suspended its project in August 2005, announcing a pause until at least November.11 The debate, playing out across the World Wide Web and in leading news organizations, was a fascinating one: it pitted the rights of authors and publishers, who produced knowledge, against the desires of readers, who wanted a quick and easy way to find relevant information. Google’s idea was to pause so that publishers could opt out of the system, whereas the publishers wanted an opt-in system.

The lawsuit came quickly. On 20 September 2005, three representative authors filed suit in Manhattan’s United States District Court on behalf of their peers and alleged “massive copyright infringement.”12 Google claimed “fair use”; the authors claimed that their copyright-protected work was being used for commercial purposes by the search engine giant. A month later, a joint lawsuit was filed by McGraw-Hill, Pearson Education, Penguin Group, Simon & Schuster and John Wiley & Sons. Google then resumed its digitization process.13 A sweeping settlement reached in October 2008 would have facilitated the digitization of out-of-print books, offered purchase options, and, crucially, allowed up to 20% of a book to be displayed as a free snippet and let universities subscribe to a service to view them all.14 Authors’ groups not part of the lawsuit, as well as individual authors, raised significant objections to this settlement. The settlement was subject to court review, and in February 2011 it was thrown out by Judge Chin.15

While Google was a commercial enterprise, universities also saw value in digitizing books: they could be preserved longer, the search index was useful, and scholars could derive significant Big Data benefits from having all this knowledge together in one place. With these intentions in mind, a group of universities (twelve in the American Midwest, plus the University of California system) announced in late 2008 that they would take volumes previously digitized by Google as well as some of their own and pool them into a collective trust: HathiTrust. It was an apt name, as the New York Times put it: “Hathi is Hindi for ‘elephant,’ an animal that is said to never forget.”16 They had lofty goals: “The partners aim to build a comprehensive archive of published literature from around the world and develop shared strategies for managing and developing their digital and print holdings in a collaborative way.”17 A similar lawsuit followed a few years later when the Authors Guild and other groups and individual authors launched a case over “unauthorized” scans. Instead of damages, however, they simply asked “that the books be taken off the HathiTrust servers and held by a trustee.”18

With the basic contours of the Google Print/Books and HathiTrust cases established, we should pause to discuss what this means for historians engaging in Big Data-related research. While media coverage understandably focused on the competing rights of authors and Google, these cases also lie at the heart of the ability of historians and others to conduct truly transformative, paradigm-shifting Big Data research. This argument was developed in an amicus curiae (“friend of the court”) filing made by digital humanities and legal scholars, written by Matthew Jockers, the aforementioned author of Macroanalysis, and two law professors, Matthew Sag and Jason Schultz. As they noted, the “court’s ruling in this case on the legality of mass digitization could dramatically affect the future of work in the Digital Humanities.”

Their highly-readable argument is essential reading for digital historians wondering about the future state of large-scale textual analysis as discussed in this book. After recounting several successful projects, such as the previously discussed Culturomics and the findings of Macroanalysis, the brief notes that these methods have either “inspired many scholars to reconceptualize the very nature of humanities research,” or, more simply, played the “role of providing new tools for testing old theories, or suggesting new areas of inquiry.”19 The authors then make a very powerful statement (emphasis is theirs):

None of this, however, can be done in the Twentieth-Century context if scholars cannot make nonexpressive uses of underlying copyrighted texts, which (as shown above) will frequently number in the thousands, if not millions. Given copyright law’s objective of promoting “the Progress of Science,”13 it would be perversely counterintuitive if the promise of Digital Humanities were extinguished in the name of copyright protection.

Without digital historians being active participants in copyright discussions, much of the promise of digital history will go unrealized for decades, if not centuries. Copyright restrictions have already had the effect of skewing digital historical research towards the 19th century: one recent author has characterized repositories like Google Books “as a place of scholarly afterlives, where forgotten authors and discarded projects are enjoying a certain reincarnation.”20 Digital investigations allow researchers to move beyond established canons of texts, discovering forgotten works through textual analysis. Unfortunately, while the 20th century is dominated by vast arrays of typewritten and born-digital sources, copyright rules have minimized the potential that these sources offer at present. In any case, copyright belongs in the pantheon of essential terms for digital historians.

Finally, textual analysis and basic visualizations need to be part of our toolkit. While we will go into depth on these, we want to provide a brief sketch of some of the basic techniques that underlie the rest of the book. These techniques include, as a very basic and gentle introduction:

  • Counting Words: How often does a given word appear in a document? We can then move beyond that and see how often a word appears in dozens, or hundreds, or even thousands of documents, and establish change over time.
  • N-Grams, or Phrase Frequency: When counting words, we are technically counting unigrams: the frequency of single-word strings. For phrases, we speak of n-grams: bigrams, trigrams, quadgrams, and fivegrams, although one could theoretically use any higher number. A bigram can be a combination of two characters, such as ‘bi,’ two syllables, or two words. An example of a two-word bigram: “canada is.” A trigram is three words, a quadgram is four words, a fivegram is five words, and so on.
  • Keyword-in-Context: This is important. Imagine that you are looking for a specific term that also has a broader meaning, such as the Globe and Mail newspaper, colloquially referred to as the Globe. If you wanted to see how often that newspaper appeared, a search for Globe would capture all of the appearances you were looking for, but also others: globe also refers to three-dimensional models of the Earth, the Earth itself, sphere-shaped objects in general, cities like Globe, Arizona, and a number of other newspapers around the world. So in the following case:

            he read the  globe  and mail it
            picked up a  globe  newspaper in toronto
   jonathan studied the  globe  in his parlour
favourite newspaper the  globe  and mail smelled
           the plane to  globe  arizona was late

In the middle, we have the keyword we are looking for (globe) and on the left and right we have the context. Without requiring sophisticated programming skills, we can see in this limited sample of five that three probably refer to the Globe and Mail, one is ambiguous (one could study a globe of the Earth or the newspaper in one’s parlour), and one is clearly referring to the city of Globe, Arizona.
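
A keyword-in-context listing like the one above can be produced with a few lines of code. What follows is a minimal sketch in Python, not a definitive implementation: the function names `tokenize`, `kwic`, and `format_kwic`, the three-word context window, and the sample sentence are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def kwic(text, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence of keyword."""
    tokens = tokenize(text)
    keyword = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


def format_kwic(hits):
    """Right-align the left context so that the keyword lines up in one column."""
    width = max((len(left) for left, _, _ in hits), default=0)
    return [f"{left:>{width}}  {kw}  {right}" for left, kw, right in hits]


# Hypothetical sample text, invented for this illustration.
sample = (
    "Yesterday he read the Globe and Mail. It was late when "
    "Jonathan studied the globe in his parlour."
)

for line in format_kwic(kwic(sample, "globe")):
    print(line)
```

The code only finds and aligns the occurrences. Deciding whether a given hit refers to the newspaper, a model of the Earth, or the city in Arizona is still left to the reader of the listing.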

  • Line Chart: Many of the visualizations used in this book and elsewhere, such as the Google Books n-gram viewer, rely on a simple line graph.
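
The counting techniques in this list can be sketched in Python. This is a hedged illustration under stated assumptions rather than the method used by any particular tool: the helper names (`tokenize`, `word_counts`, `ngrams`, `relative_frequency_by_year`, `plot_series`) and the tiny corpus are hypothetical, and the optional plotting function assumes the third-party matplotlib library is installed.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_counts(text):
    """Count unigrams: how often each single word appears."""
    return Counter(tokenize(text))


def ngrams(tokens, n):
    """Return every run of n consecutive word tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequency_by_year(docs_by_year, word):
    """Map each year to the share of that year's tokens that equal `word`."""
    series = {}
    for year, texts in sorted(docs_by_year.items()):
        tokens = [tok for text in texts for tok in tokenize(text)]
        series[year] = tokens.count(word.lower()) / len(tokens) if tokens else 0.0
    return series


def plot_series(series, word):
    """Draw the series as a simple line chart (assumes matplotlib is installed)."""
    import matplotlib.pyplot as plt

    years = list(series)
    plt.plot(years, [series[year] for year in years], marker="o")
    plt.xlabel("Year")
    plt.ylabel(f"Relative frequency of '{word}'")
    plt.show()


# Hypothetical toy corpus, invented for this illustration.
text = "Canada is large and Canada is cold"
counts = word_counts(text)
bigram_counts = Counter(ngrams(tokenize(text), 2))

docs_by_year = {
    1900: ["The long railway crosses the prairie toward Canada"],
    1950: ["Canada is a nation"],
}
series = relative_frequency_by_year(docs_by_year, "canada")
print(counts.most_common(2), bigram_counts[("canada", "is")], series)
```

Calling `plot_series(series, "canada")` would draw the line chart. The series uses relative rather than raw frequency so that years with more surviving text do not automatically appear to use a word more often.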

None of this is meant to be intimidating, but rather a gentle introduction to some of the major issues that you may encounter as you move forward in this area of research. Copyright matters, open access matters, and basic visualization terms can help you make sense of what other digital humanists are doing.

