We have entered an era of Big Data. As IBM noted in 2012, “90% of the data in the world today has been created in the last two years alone.”1 Yet while Big Data is often explicitly framed as a problem for the future, it has already presented fruitful opportunities for the study of the past. The most obvious place where this is true is archived copies of the publicly accessible Internet. The advent of the World Wide Web in 1991 had potentially revolutionary effects on human communication and organization, and its archiving presents a tremendous body of non-commercialized public speech. There is a lot of it, however, and large-scale methodologies will be needed to explore it.
Above, we have seen how the digital humanities have developed and flourished. Putting this into historical context, if the first wave of computational history emerged out of humanities computing, and the second wave developed around textual analysis (and H-Net, Usenet, and GIS), we believe that we are now on the cusp of a third revolution in computational history. Three main factors make this possible: decreasing storage costs, with particular implications for historians; the power of the Internet and cloud computing; and the rise of open-source tools.
Significant technological advances in how much information can be stored herald a new era of historical research and computing. In short, we can retain more of the information produced every day, and the ability to retain information has been keeping up with the growing amount of generated data. As author James Gleick argued:
The information produced and consumed by humankind used to vanish – that was the norm, the default. The sights, the sounds, the songs, the spoken word just melted away. Marks on stone, parchment, and paper were the special case. It did not occur to Sophocles’ audiences that it would be sad for his plays to be lost; they enjoyed the show. Now expectations have inverted. Everything may be recorded and preserved, at least potentially.2
This has been made possible by Kryder’s Law, a corollary to Moore’s Law (which held that the number of transistors on a microchip would double every two years).3 Mark Kryder argued, based on past practice, that storage density will double approximately every eleven months. While it is risky to extrapolate past trends into the future, the fact remains that storage has been getting cheaper over the last ten years, and this has enabled the storage and, it is hoped, the long-term digital preservation of invaluable historical resources.
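To get a feel for what a doubling period of that length implies, the arithmetic can be sketched in a few lines of Python. This is only an illustration of compound doubling under the eleven-month figure quoted above; the function name `growth_factor` is our own, and the output is an extrapolation of the rule, not a prediction.

```python
def growth_factor(months, doubling_period_months=11.0):
    """Return how many times larger a quantity becomes after `months`,
    assuming it doubles once every `doubling_period_months`."""
    return 2 ** (months / doubling_period_months)


# Illustrative only: what the eleven-month rule would imply if it held steadily.
for years in (1, 5, 10):
    print(f"{years:>2} years: roughly {growth_factor(years * 12):,.0f} times the density")
```

Changing `doubling_period_months` shows how sensitive any such projection is to the assumed rate, which is one reason to treat these figures with caution.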
We store more than we ever did before, and increasingly have an eye on this digital material to make sure that future generations will be able to fruitfully explore it: this is the field of digital preservation. The creation or generation of data and information does not, obviously, in and of itself guarantee that it will be kept. We have already lived through aspects of a Digital Dark Age, and fears that our current era may turn out to be inaccessible to the future have prompted considerable research in the field.4 Perhaps the best illustration that digital creation does not automatically guarantee future preservation was the effort, and the media sensation, surrounding the very first website, launched in 1991 at the European Organization for Nuclear Research (CERN). As CERN’s web manager explained in 2013:
The earliest known version of the very first website is out there, somewhere, on an outdated disk drive. “Maybe someone is using it as a paperweight,” says Dan Noyes, the web manager for the communication group at the European Organisation for Nuclear Research, known as Cern, in Switzerland. “We do know there’s a disk drive from 1990 that was sent to a conference in Santa Clara and went missing. Ideally we’d like to get that. We want the earliest iterations we can get.”5
Reading that disk drive would require specialized techniques. Many people still have cartons of the “floppy” disks that were ubiquitous into the late 1990s, now unreadable without special effort. Early textual documents produced in WordPerfect, as recently as ten or fifteen years ago, are hard to open under some operating systems and soon may slip beyond the veil. Raising the issue, the Economist noted in September 2012 that its first website launched in March 1994: “Eighteen months later, it was reconfigured and brought in-house. All records of the original website were subsequently lost. So much for the idea that the internet never forgets. It does.”6
It is not hardware that is the problem. As Internet Archive founder Brewster Kahle pointed out in a recent documentary, that is an easily surmountable problem, solved by backups and data migration. Software is the more vexing issue.7 For websites, some of that has been mitigated by the adoption of international standards and by a body of literature on keeping data continually accessible.8 The issue is discussed, researched, and raised in social media forums, academic journals, and popular blogs hosted by institutions such as the British Library and the American Library of Congress.
The bigger problem, of course, is people: influential individuals, with positions of power within corporations, who do not recognize the historical obligation placed upon them. Exhibit A in the pantheon of bad historical citizens is the global Internet corporation, Yahoo! The largest collection of websites ever created by Internet users was Geocities.com. Geocities opened in 1994, became fully accessible by the next year, and allowed anybody with a rudimentary knowledge of web development to create their own webpage with a then-whopping personal allotment of fifteen megabytes.9 A large co-operative community ensued, with Community Leaders helping others get their websites started. Archivist Jason Scott reinforces the significance of this data set:
At a time when full-color printing for the average person was a dollar-per-printed-page proposition and a pager was the dominant (and expensive) way to be reached anywhere, mid 1990s web pages offered both a worldwide audience and a near-unlimited palette of possibility. It is not unreasonable to say that a person putting up a web page might have a farther reach and greater potential audience than anyone in the history of their genetic line.10
In 1999, Yahoo! purchased Geocities (at the time the third-most visited website on the World Wide Web) for three billion dollars. Over the next few years, however, the digital revolution continued: web development became more accessible, and Geocities became an object of scorn, known for garish backgrounds, outdated development, and ubiquitous “under construction” signs. But still, an astounding collection of user-generated content remained online. A case can be made that this was the single largest repository of social historical resources ever generated, and it sat on the public-facing World Wide Web (unlike Facebook). It would be an unparalleled big data resource for future historians.
In 2011, humanity created 1.8 zettabytes of information. This was not an outlier, but part of an ongoing trend: from 2006 until 2011, the amount of data expanded by a factor of nine.11 This data takes a variety of forms, some accessible and some inaccessible to historians. In the latter camp, we have walled gardens and proprietary networks such as Facebook, corporate databases, server logs, security data, and so forth. Barring a leak or the intervention of forward-thinking individuals, historians may never be able to access that data. Yet in the former camp, even if it is smaller than the latter one, we have a lot of information: YouTube (seeing 72 hours of video uploaded every single minute); the hundreds of millions of tweets sent every day over Twitter; the blogs, ruminations, comments, and thoughts that make up the publicly-facing and potentially archivable World Wide Web.12 Beyond the potentialities of the future, however, we are already in an era of archives that dwarf previously conceivable troves of material.
The Internet Archive was launched in 1996 and is presently the world’s biggest archive by an order of magnitude. Founded by Brewster Kahle in a climate of fears around digital preservation, the Archive aims to preserve the entire publicly-accessible World Wide Web. Websites were seen as particularly vulnerable to being lost forever due to their exceptionally short life span, and there were fears that historians would not be able to study the period between the rise of personal computing and the eventual archival solution.13 Kahle, an Internet entrepreneur instrumental in the development of both the Wide Area Information Servers (WAIS) architecture and the Internet website and traffic ranking firm Alexa Internet, had an ambitious vision that was largely executed.14 The Internet Archive accumulates its information through web crawlers, which visit a page and download it before generating a list of all the links on that page and following each one in turn, downloading those pages, generating further lists of links, and so on ad nauseam. It has grown rapidly. After beginning in July 1996, the Archive had two terabytes by May 1997, fourteen terabytes by March 2000, and now sits at a whopping ten petabytes of information.15 There are other large collections, some centered around national institutions such as the British Library or the American Library of Congress, but the Internet Archive looms even larger. A symbol of big data, it exemplifies the sheer quantity of data now available to researchers.
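The crawling loop described above (download a page, list its links, follow each one in turn) can be sketched in Python. This is a minimal illustration and not the Internet Archive’s own crawler: the names `LinkExtractor`, `crawl`, and `TOY_WEB` are our own, and the "web" here is a small hypothetical set of pages held in memory so that the example runs without a network connection.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl: download a page, list its links, follow each in turn.

    `fetch` is any function mapping a URL to HTML text (or None if missing).
    Returns a dict of {url: html} for every page that was downloaded.
    """
    archive = {}
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # the page no longer exists; nothing to keep
        archive[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return archive


# A tiny hypothetical "web" held in memory so the sketch runs offline.
TOY_WEB = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.org/a.html": '<a href="/b.html">B</a> <a href="/">home</a>',
    "http://example.org/b.html": '<a href="/missing.html">gone</a>',
}

pages = crawl("http://example.org/", TOY_WEB.get)
print(sorted(pages))
```

The `seen` set is what keeps the crawler from looping forever when pages link back to one another, and `max_pages` is a simple stand-in for the limits a real crawler would need.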
All of this means that historical information is being preserved at an ever-increasing rate. A useful way to understand all of this, and put it into perspective, is to compare it with the largest existing collection of conventional analog sources in the world: the American Library of Congress (which, in some respects, is roughly tied in size of collection with its counterpart in England, the British Library). Simply walking along its 838 miles, or 1,349 kilometers,16 of shelving would take weeks – without even stopping to open up a single book. It has long loomed large in the minds of information theorists. Indeed, when the father of information theory, Claude Shannon, thought about “information” he placed the Library of Congress at the top of his logarithmic list: “the largest information stockpile he could think of.”17 That was 1949, however. Thanks to the Internet Archive, those exhaustive shelves no longer represent the pinnacle of information storage.
By comparison, the Internet Archive’s Wayback Machine – which has only been collecting information since 1996, it is worth underlining – now dwarfs this collection. Comparing miles of shelves to the Internet Archive’s 11,000 hard drives is an “apples versus oranges” issue, but it can be done as a rough thought experiment. Trying to put a firm data figure on an analog collection is difficult: a widely distributed figure is that the Library of Congress’s print collection amounts to 10 terabytes. That is too low, and a more accurate figure is somewhere in the ballpark of 200 to 250 terabytes (if one digitized each book at 300 DPI, leading to a rough figure of eighty megabytes per book).18 As a petabyte is a thousand terabytes, the Archive’s ten petabytes come to 10,000 terabytes; if we take the upper figure of 250 terabytes for the Library, we arrive at a ratio of roughly 1:40. The Wayback Machine continues to grow, and grow, and grow, and historians are now confronted with historical sources on an entirely new order of magnitude.
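The thought experiment above is simple enough to check in a few lines of Python. The sketch below uses only the figures quoted in the text (ten petabytes for the Internet Archive, 200 to 250 terabytes for the Library of Congress’s print collection); the function name `archive_to_library_ratio` is our own, and the result is as rough as the estimates that go into it.

```python
TB_PER_PB = 1000  # a petabyte is a thousand terabytes

INTERNET_ARCHIVE_TB = 10 * TB_PER_PB  # ten petabytes, as cited in the text


def archive_to_library_ratio(library_tb, archive_tb=INTERNET_ARCHIVE_TB):
    """Return how many times larger the archive is than the library estimate."""
    return archive_tb / library_tb


# The two ends of the ballpark range cited in the text.
for loc_tb in (200, 250):
    ratio = archive_to_library_ratio(loc_tb)
    print(f"Library of Congress at {loc_tb} TB -> 1:{ratio:.0f}")
```

Taking the lower estimate of 200 terabytes widens the gap further, so the 1:40 figure is the more conservative of the two.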
The data that we now have as historians, and will have in the future, did not come out of nowhere, of course: instead, it is the product of conscious policies and institutions put in place in the 1990s and beyond. As noted, the Internet Archive began archiving the World Wide Web in 1996 and made the results available through its Wayback Machine in 2001. The idea for the Internet Archive arose out of a specific context. By the mid-1990s, digital preservation was becoming an increasingly pressing problem: digital records were becoming far more numerous; files were beginning to combine different objects within the same file (e.g. image and text); and the storage media they were being stored on were changing from floppy disks to CD-ROMs. Software development threatened the continued accessibility of information, and these new storage media had different standards of longevity and storage. Technological challenges aside, copyright was also emerging as a very real issue in preserving the digital past.19
For all these challenges, however, forward-thinking individuals had begun to realize what all this data could potentially represent for historians. As Michael Lesk explained:
We do not know today what Mozart sounded like on the keyboard … What will future generations know of our history? … But digital technology seemed to come to the rescue, allowing indefinite storage without loss. Now we find that digital information too, has its dark side.20
The dark side stemmed from the very same qualities that made digital technology potentially so profitable: the long-term retention of information, the preservation of image and sound. Data does not simply preserve itself. Preservation would require work, continuous work, but it would become possible. The question was what to do: the life span of websites was measured in months, rather than years, and information was simply being erased.
It was in this context that Brewster Kahle created the Internet Archive, an ambitious project to preserve all of this information. While previous preservation projects had focused on specific events and individuals, such as American Presidents and elections in the case of the Smithsonian Institution’s projects, this would be an indiscriminate preservation of everyday activity on the Internet. It has grown dramatically since its inception, and in 2012 the Internet Archive celebrated reaching the ten-petabyte mark. The size of this archive, a profound human accomplishment, needs to be underlined: this tremendous data collection is a leading example of a new wave of big data, with significant potential to rework aspects of the humanities.
The shift towards widespread digital storage, preserving information longer and conceivably storing the records of everyday people on an ever more frequent basis, represents a challenge to accepted standards of inquiry, ethics, and the role of archivists. How should historians respond to the transitory nature of historical sources, be it the hastily deleted personal blogs held by MySpace or the destroyed websites of Geocities? How can we even use large repositories such as the more than two million messages sent over USENET in the 1980s alone? Do we have ethical responsibilities to website creators who may have had an expectation of privacy, or at the least had no sense that they were formally publishing their webpage in 1996? These are all questions that we, as professionals, need to tackle. They are, in a word, disruptive.
It is important to pause briefly, however, and situate this claim of a revolutionary shift due to ever-bigger data sets in its own historical context. Humanists have long grappled with medium shifts and earlier iterations of this Big Data moment, which we can perhaps stretch back to the objections of Socrates to the written word itself.21 As the printing press and bound books replaced earlier forms of scribed scholarly transmission, a similar medium shift threatened existing standards of communication. Martin Luther, the German priest, argued that “the multitude of books [were] a great evil;” this sixteenth-century sentiment was echoed as well by Edgar Allan Poe in the nineteenth century and Lewis Mumford as recently as 1970.22 Bigger is certainly not better, at least not inherently, but it should equally not be dismissed out of hand.
If it is not new, however, it certainly is different today – at least in terms of scale. The Internet Archive dwarfs the Library of Congress, and it has only been in existence since 1996. By late spring 2013, Google Books had thirty million books scanned in its repository.23 Media are mixed together, from video, images, and audio files to printed text.
It is also different in terms of scope. Archivists have different approaches to their craft than does the Internet Archive. The Library of Congress is predominantly print, whereas the Internet Archive has considerable multimedia holdings, which include videos, images, and audio files. The LOC is, furthermore, a more curated collection, whereas the Internet Archive draws from a wider range of producers. The Internet Archive offers the advantages and disadvantages of being a more democratic archive. For example, a Geocities site created by a 12-year-old Canadian in the mid-1990s might be preserved by the Internet Archive, whereas it almost certainly would not be saved by a national archive. This difference in size and collections management, however, is at the root of the historian’s changing toolkit. If we make use of these large data troves, we can access a new and broad range of historical subjects.
This big data, however, is only as useful as the tools that we have to interpret it. Luckily, two interrelated trends make interpretation possible: more powerful personal computers and, more significantly, accessible open-source software to make sense of all this data. Prices continue to fall, and computers continue to get more powerful: significantly for research involving large datasets, the amount of Random Access Memory, or RAM, that computers have continues to increase. Information loaded into RAM can be manipulated and analyzed very quickly. Even humanities researchers with limited research budgets can now use computers that would have been prohibitively expensive only a few years ago.
It is, however, the ethos and successes of the open-source movement that have given digital historians and the broader field of the digital humanities wind in their sails. Open-source software is a transformative concept that moves beyond simply “free” software: an open-source license means that the code that drives the software is freely accessible, and users are welcome to delve through it, make whatever changes they see fit, and distribute the original or their altered version as they wish. Notable open-source projects include the Mozilla Firefox browser, the Linux operating system, the Zotero reference-management software system developed by George Mason University’s Center for History and New Media (CHNM), the WordPress and Drupal website Content Management System (CMS) platforms, and the freely-accessible OpenOffice productivity suite.
For humanists, then, carrying out large-scale data analysis no longer requires a generous salary or expense account. Increasingly, it does not even require potentially expensive training. Take, by way of introduction, the Programming Historian 2. An open-source textbook dedicated to introducing computational methods to humanities researchers, the book itself is written with the open-source WordPress platform. It is a useful introduction, as well, to the potential offered by these programs. They include several that we saw in our preface:
- Python: An open-source programming language, freely downloadable, that allows you to do very powerful textual analysis and manipulation. It can help download files, turn text into easily-digestible pieces, and then provide some basic visualizations.
- Komodo Edit: An open-source editing environment, allowing you to write your own code, edit it, and quickly pinpoint where errors might have crept in.
- Wget: A program, run on the command line, that lets you download entire repositories of information. Instead of right-clicking on link after link, wget can quickly download an entire directory or website to your own computer.
- Search Engine and Clustering: The Apache Software Foundation, a developer community that releases software under open-source licenses, provides a number of incredible tools: an Enterprise-level search engine that you can index your own sources in, giving you sophisticated search on your home computer; software that clusters those sources into relevant chunks; as well as other tools to improve computation, automatically recognize documents, and facilitate the automated machine learning of sources.
- MALLET: The MAchine Learning for LanguagE Toolkit provides an entire package of open-source tools, notably topic modeling, which takes large quantities of information and finds the ‘topics’ that appear in them.
These tools are just the tip of the iceberg, and represent a significant change. Free tools, with open-source documentation written for and by humanists, allow us to unlock the potential inherent in big data.
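As a small taste of what turning text into “easily-digestible pieces” can look like in Python, the sketch below splits a passage into words and counts the most frequent ones. It is a minimal illustration using only the standard library, not a lesson from the Programming Historian; the function name `word_frequencies` and the sample sentence are our own, and the tokenizing rule (lowercase letters and apostrophes) is a deliberate simplification.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=5):
    """Lowercase the text, split it into words, and return the top_n
    most common words as (word, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)


# A made-up sample sentence, for illustration only.
sample = "The web remembers, the web forgets, and the archive tries to keep the web."
for word, count in word_frequencies(sample, top_n=3):
    print(f"{word}: {count}")
```

On a real source one would usually go further, for instance by removing very common words such as “the” before counting, but the basic pattern of tokenizing and tallying stays the same.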
Big data represents a key component of this third wave of computational history. By this point, you should have an understanding of what we mean by big data, and some of the technical opportunities we have to open it. The question remains, however: what can this do for humanities researchers? What challenges and opportunities does it present, beyond the short examples provided at the beginning of the chapter? In the next section, we explore the emergence of this new era of big data.
References
1. See, for example, http://www-01.ibm.com/software/data/bigdata/.
2. James Gleick, The Information: A History, A Theory, A Flood (Vintage, 2012).
3. Chip Walter, “Kryder’s Law,” Scientific American, July 25, 2005, http://www.scientificamerican.com/article.cfm?id=kryders-law.
4. “We Need to Act to Prevent a Digital ‘Dark Age’ | Innovation Insights | Wired.com,” Innovation Insights, accessed July 5, 2013, http://www.wired.com/insights/2013/05/we-need-to-act-to-prevent-a-digital-dark-age/; “The Internet Archaeologists Digging up the Digital Dark Age,” Irish Times, accessed July 5, 2013, http://www.irishtimes.com/news/technology/the-internet-archaeologists-digging-up-the-digital-dark-age-1.1381962.
5. “The Internet Archaeologists Digging up the Digital Dark Age.”
6. “Difference Engine: Lost in Cyberspace,” The Economist, September 1, 2012, http://www.economist.com/node/21560992.
7. Jonathan Minard, Internet Archive, Online Video, 2013, http://vimeo.com/59207751.
8. Library of Congress, “WARC, Web ARChive File Format,” DigitalPreservation.Gov, accessed May 29, 2013, http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml; Stephan Strodl, Peter Paul Beran, and Andreas Rauber, “Migrating Content in WARC Files,” accessed May 29, 2013, http://www.ltu.se/cms_fs/1.83925!/file/Migrating_Content_in_WARC_Files.pdf; “WARC, Web ARChive File Format,” accessed April 22, 2013, http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml; “IIPC Framework Working Group: The WARC File Format (Version 0.16),” accessed April 22, 2013, http://archive-access.sourceforge.net/warc/warc_file_format-0.16.html.
9. The Archive Team Geocities Snapshot (Part 1 of 8) : Free Download & Streaming : Internet Archive, accessed July 5, 2013, http://archive.org/details/2009-archiveteam-geocities-part1.
10. “Unpublished Article on Geocities,” ASCII by Jason Scott, accessed July 5, 2013, http://ascii.textfiles.com/archives/2402.
11. John Gantz and David Reinsel, “Extracting Value from Chaos” (IDC iView, June 2011), http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf.
12. Twitter, “Total Tweets Per Minute | Twitter Developers,” Twitter.com, November 2012, https://dev.twitter.com/discussions/3914; YouTube, “Statistics – YouTube,” YouTube, May 29, 2013, http://www.youtube.com/yt/press/statistics.html.
13. Steve Meloan, “No Way to Run a Culture,” Wired, February 13, 1998, http://web.archive.org/web/20000619001705/http://www.wired.com/news/culture/0,1284,10301,00.html.
14. WAIS was an early solution for navigating the Internet, helping users find relevant documents and information quickly. Indeed, foreshadowing what was to come, WAIS was initially seen as an early way to access public domain electronic books. As one reporter explained in 1994, “[i]t reminds me of Fahrenheit 451, only instead of a utopia where each person commits a book to memory for life, here books are committed to bytes forever.” See Delilah Jones, “Browsing on the Internet,” St. Petersburg Times (Florida), 11 December 1994, 7D. [accessed via LexisNexis]
15. Internet Archive, “The Internet Archive: Building an ‘Internet Library’,” 20 May 2000, Internet Archive, 2000, http://web.archive.org/web/20000520003204/http://www.archive.org/; Internet Archive, “80 Terabytes of Archived Web Crawl Data Available for Research.”
16. Library of Congress, “Fascinating Facts – About the Library (Library of Congress),” 2013, http://www.loc.gov/about/facts.html.
17. Gleick, The Information.
18. Nicholas Taylor, “Transferring ‘Libraries of Congress’ of Data,” The Signal: Digital Preservation Blog, July 11, 2011, http://blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data/; Leslie Johnston, “How Many Libraries of Congress Does It Take?,” The Signal: Digital Preservation Blog, March 23, 2012, http://blogs.loc.gov/digitalpreservation/2012/03/how-many-libraries-of-congress-does-it-take/; Leslie Johnston, “A ‘Library of Congress’ Worth of Data: It’s All In How You Define It,” The Signal: Digital Preservation Blog, April 25, 2012, http://blogs.loc.gov/digitalpreservation/2012/04/a-library-of-congress-worth-of-data-its-all-in-how-you-define-it/.
19. Michael Lesk, “Preserving Digital Objects: Recurrent Needs and Challenges,” Lesk.com, 1995, http://www.lesk.com/mlesk/auspres/aus.html.
20. Ibid.
21. See the Phaedrus dialogue by Plato, available in English translation at http://classics.mit.edu/Plato/phaedrus.html
22. Gleick, The Information.
23. Robert Darnton, “The National Digital Public Library Is Launched!,” The New York Review of Books, April 25, 2013, http://www.nybooks.com/articles/archives/2013/apr/25/national-digital-public-library-launched/.