There are three issues of critical importance to understanding Big Data as a historian: the open access and open source movements, copyright, and what we mean by textual analysis. As discussed in the previous chapter, open source software lies at the heart of the third computational wave of history. Issues of copyright loom large for historians working on 20th-century topics: when accumulating large troves of data, researchers will almost certainly bump up against these questions. Finally, textual analysis and basic visualizations form the core of the lessons in this book. Yet, as we argue, odds are that you are already using textual analysis: you may just not be fully cognizant of it yet. While these terms are not exhaustive (as we noted in our introduction, GIS is another area of research), we believe that these three areas help move you towards a broader understanding of the field of Big Data.
Many, although not all, of the tools discussed in this book are free software, or follow the closely related yet distinct open source philosophy. While free software has been around since the 1980s, epitomized by projects such as GNU (a free operating system with a name stemming from “GNU’s Not Unix”) and its 1985 manifesto, 1998 saw open source rise to the level of a movement. Christopher Kelty’s Two Bits carries out an indispensable anthropological and historical study of the movement, tracing its origins to the 1998 decision of Netscape to freely distribute the source code of its paradigm-breaking Netscape Communicator software (which would lead to today’s Mozilla Firefox browser). Netscape had been charging for its full-suite product, as opposed to Microsoft’s bundled browser, and the decision to release the source code was a path-breaking one. Defining open source can be tricky, as there are competing interests, but essentially to speak of “open source software” means adhering to the “Open Source Definition.” In short, open source software requires free redistribution, access to the source code, permission to create derived works (modifications to the program), protection of the integrity of the author’s source code, no discrimination against persons, groups, or fields of endeavor, and licenses that are general rather than tied to a specific product, along with a few other conditions.
Open source software has tremendous, mostly positive, implications for digital historians, but there are some cautionary notes to make as well. The programs discussed in this book can largely be used for free, with a few exceptions for full features in a handful of programs, which are made explicit. This has obvious benefits in an era of diminishing government budgets and austerity. More importantly, their open nature means that they are often modular and can be continually improved: one scholar may note an issue with a certain algorithm and either suggest a fix or implement it directly. This lends itself well to experimentation. For publicly supported scholars, whether by virtue of working at a public university or holding funding from a granting council, it also offers a way to give something back to a community of practice or even the general public.
This ethos also informs other aspects of digital history. Digital history largely, but not completely, grew out of public history and is inspired by its commitment to public engagement. Websites and portals are created that allow the public to interact and engage with the past, and other scholars are beginning to call for the open source ethos to be applied to more scholarly practices: witness the rise of open access publishing. While open access may take an author-pays model (where the author pays the associated charges for his or her article) or be provided on a free or supported basis, it seeks to extend the ethos of free access to the public. This book, for example, was inspired by this ethos in our decision to live-write it using the CommentPress platform, itself an open source WordPress theme and plugin.
As with many movements, there is a darker side to some of this. Author-pays models can discriminate against junior scholars, those at non-research-intensive universities, and graduate students who may not have funds to support their research. Similarly, releasing free software is advantageous in many ways and perhaps even morally right for tenured and tenure-track faculty members who have generous salaries; it may be more of an imposition to ask of untenured and contingent researchers who have monetary need. Furthermore, the demands on researcher time are substantial. Scholars may not have the necessary “start-up” funds to get into this world, and to this we add that they may not have the requisite “start-up” time either. The learning curve can be steep, and above all, digital tools and rhetoric require an investment of time. These are all questions that digital historians, alongside the more specific research considerations discussed in this book, will have to consider.
An essential addition to a digital historian’s toolkit is an understanding of the issues surrounding copyright. This book does not purport to be a comprehensive guide to copyright law in your jurisdiction: the law varies dramatically from country to country, and we are not lawyers. As historians, however, we want to provide context for the copyright wars and to call on our readers to pay considerable attention to this pressing issue. We do so by using the more focused example of Google Books and the ensuing legal imbroglio.
Google emerged from the search engine battles of the early 2000s as the victor, controlling a dominant share of the user base for those who wanted to find information on the World Wide Web. Ironically, Tim Wu has suggested that copyright infringement may have been key to its success: its search index, first created in 1996, required a comprehensive copy of the World Wide Web. While this was an unsettled legal question at the time, nobody filed suit, so under prevailing legal norms this sort of mass copying for indexing is now probably acceptable. In any case, Google would not be out of the copyright weeds as it moved forward with its search empire.
Following its dominance of the World Wide Web search engine market, Google turned its attention to another large repository of knowledge: books. As early as late 2003, Google was approaching university librarians, scanning the first chapters of books, and quietly threading search results into the queries of users under the moniker Google Print. A year later, in December 2004, Google announced that it would enter into a partnership to digitize books held by Harvard University, Oxford University, the University of Michigan, Stanford University, and the New York Public Library. Google Print would achieve several objectives: it would democratize knowledge locked away in often-inaccessible elite institutions, it would facilitate keyword searching throughout books as one could throughout the World Wide Web, and Google would also gain users who could then be exposed to its advertisements. Institutions joined to varying degrees: Michigan wanted its entire 7.4 million volume collection digitized, the NYPL allowed only out-of-copyright materials to be digitized, and Harvard went forward with a 40,000 book trial. Books under copyright would only be available via a “snippet search.”
This was a controversial project. France’s National Library raised concerns that this would give “prominence to Anglophone texts above those written in other languages,” due to the Anglophone nature of the five institutions. More importantly, many authors and publishers became agitated. Questions were asked by worried organizations, such as the Association of American University Presses: how long and how frequent would the snippets of copyrighted books be, and would this have an effect on sales? As industry groups raised their voices in protest, including the large and powerful Association of American Publishers, Google suspended its project in August 2005, announcing a pause until at least November. The debate, playing out across the World Wide Web and in leading news organizations, was a fascinating one: it set the rights of authors and publishers, who produced knowledge, against the desires of readers, who wanted a quick and easy way to find relevant information. Google’s idea was to slow things down so publishers could opt out of the system, whereas the publishers wanted an opt-in system.
The lawsuit came quickly. On 20 September 2005, three representative authors filed suit in Manhattan’s United States District Court on behalf of their peers and alleged “massive copyright infringement.” Google claimed “fair use”; the authors claimed that their copyright-protected work was being used for commercial purposes by the search engine giant. A month later, McGraw-Hill, Pearson Education, Penguin Group, Simon & Schuster and John Wiley & Sons filed a joint lawsuit. Google then resumed its digitization process. In October 2008, a sweeping settlement was reached that would facilitate the digitization of out-of-print books, provide purchase options, and, crucially, allow up to 20% of a book to be displayed as a free snippet and let universities subscribe to a service to view them all. Authors’ groups not part of the lawsuit, as well as individual authors, raised significant objections to this settlement. The settlement was subject to court review, and in March 2011 Judge Chin threw it out.
While Google was a commercial enterprise, universities also saw value in digitizing books: they could be preserved longer, the search index was useful, and scholars could derive significant Big Data benefits from having all this knowledge together in one place. With these intentions in mind, a group of universities (twelve in the American Midwest, plus the University of California system) announced in late 2008 that they would take volumes previously digitized by Google as well as some of their own and pool them into a collective trust: HathiTrust. It was an apt name, as the New York Times put it: “Hathi is Hindi for ‘elephant,’ an animal that is said to never forget.” They had lofty goals: “The partners aim to build a comprehensive archive of published literature from around the world and develop shared strategies for managing and developing their digital and print holdings in a collaborative way.” A similar lawsuit followed a few years later, when the Authors Guild and other groups and individual authors launched a case over “unauthorized” scans. Instead of damages, however, they simply asked “that the books be taken off the HathiTrust servers and held by a trustee.”
With the basic contours of the Google Print/Books and HathiTrust cases established, we should pause to discuss what this means for historians engaging in Big Data-related research. While media coverage understandably focused on the competing rights of authors and Google, these cases also lie at the heart of the ability of historians and others to conduct truly transformative, paradigm-shifting Big Data research. This argument was developed in an amicus curiae (“friend of the court”) filing made by digital humanities and legal scholars, written by Matthew Jockers, the aforementioned author of Macroanalysis, and two law professors, Matthew Sag and Jason Schultz. As they noted, the “court’s ruling in this case on the legality of mass digitization could dramatically affect the future of work in the Digital Humanities.”
Their highly readable argument is essential reading for digital historians wondering about the future state of large-scale textual analysis as discussed in this book. After recounting several successful projects, such as the previously discussed Culturomics and the findings of Macroanalysis, the brief notes that these methods have both “inspired many scholars to reconceptualise the very nature of humanities research” and, more simply, played the “role of providing new tools for testing old theories, or suggesting new areas of inquiry.” The authors then make a very powerful statement (emphasis is theirs):
None of this, however, can be done in the Twentieth-Century context if scholars cannot make nonexpressive uses of underlying copyrighted texts, which (as shown above) will frequently number in the thousands, if not millions. Given copyright law’s objective of promoting “the Progress of Science,” it would be perversely counterintuitive if the promise of Digital Humanities were extinguished in the name of copyright protection.
Unless digital historians become active participants in copyright discussions, much of the promise of digital history will go unrealized for decades, if not centuries. Copyright restrictions have already had the effect of skewing digital historical research towards the 19th century: one recent author has characterized repositories like Google Books “as a place of scholarly afterlives, where forgotten authors and discarded projects are enjoying a certain reincarnation.” Digital investigations allow researchers to move beyond established canons of texts, discovering forgotten works through textual analysis. Unfortunately, while the 20th century is dominated by vast arrays of typewritten and born-digital sources, copyright rules have minimized the potential that these sources offer at present. In any case, copyright belongs in the pantheon of essential terms for digital historians.
The differences largely have to do with philosophy; free software has a much more significant political point, as opposed to the license-focused open source approach. For more on what these differences may imply in your own research, see Richard Stallman, “Why Open Source misses the point of Free Software,” https://www.gnu.org/philosophy/open-source-misses-the-point.html.
David Vise, “Google to Digitize Some Library Collections; Harvard, Stanford, New York Public Library Among Project Participants,” Washington Post, 14 December 2004, E05. Accessed via Lexis|Nexis. See also http://www.bio-diglib.com/content/2/1/2.
For more information on HathiTrust itself, see http://www.hathitrust.org/partnership. The New York Times article is found at http://bits.blogs.nytimes.com/2008/10/13/an-elephant-backs-up-googles-library/?_r=0. See also the overview at Heather Christenson, “HathiTrust,” Library Resources & Technical Services, Vol. 55, No. 2 (2011): 93-102.
See Matthew L. Jockers, Matthew Sag, Jason Schultz, “Brief of Digital Humanities and Law Scholars as Amici Curiae in Authors Guild v. Google,” Social Science Research Network, 3 August 2012, http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2102542, accessed 26 July 2013.
Paula Findlen, “How Google Rediscovered the 19th Century,” Chronicle of Higher Education: The Conversation Blog, 22 July 2013, http://chronicle.com/blogs/conversation/2013/07/22/how-google-rediscovered-the-19th-century/, accessed 13 August 2013.