An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Putting Big Data to Good Use: An Overview

For 239 years, accused criminals (initially Londoners, subsequently all those accused in major trials) stood for justice before magistrates at the Old Bailey. Thanks to a popular fascination with crime amongst Londoners, the events were recorded in the Proceedings. Published initially as pamphlets, later as bound books and journals replete with advertisements, and subsequently as official publications, these sources provide an unparalleled look into the administration of justice and the lives of ordinary people in seventeenth-, eighteenth-, and nineteenth-century England (their publication ceased in 1913).1 While there are gaps, the Proceedings would become an exhaustively large source for historians who sought to answer questions of social and legal history in the twentieth and twenty-first centuries.

The Old Bailey records are certainly, by the standards set out above, big data. They comprise 127 million words, cover 197,000 trials, and have been transcribed to a high standard by two typists working simultaneously to reduce error rates.2 As a research team put it: “it is one of the largest bodies of accurately transcribed historical text currently available online. It is also the most comprehensive thin slice view of eighteenth and nineteenth-century London available online.” Tackling a dataset of this size requires specialized tools. Once digitized, it was made available to the public through keyword searches; big data methodologies, however, offered new opportunities to make sense of this very old historical material.

The Data Mining with Criminal Intent project sought to do that. A multinational team, including scholars from the United States, Canada, and the United Kingdom, it sought to create “a seamlessly linked digital research environment” for working with these court records. Pulling their gaze back from individual trials and records, the team made new and relevant discoveries. Using a plugin for the open-source reference and research management software Zotero, Fred Gibbs at George Mason University developed a means to look at specific cases (for instance, those pertaining to “poison”) and look for commonalities. For example, “drank” was a common word, as was “coffee”; conversely, the word “ate” was conspicuously absent. In another development, closely related trials could be brought up: by comparing differences between documents (using Normalized Compression Distance, a measure built on the same standard tools that compress files on your computer), one can get the database to suggest trials that are structurally closely related to the one a user is currently viewing. And, most importantly, taking this material and visualizing it allowed the research team to discover new findings, such as a “significant rise in women taking on other spouses when their husbands had left them,” or the rise of plea bargaining around 1825, based on the figure below showing a significant shift in the length of trials in that time period.

[Figure: length of Old Bailey trials over time, showing the shift around 1825]
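The Normalized Compression Distance mentioned above can be sketched in a few lines. This is a minimal illustration, not the project's own code: it assumes `zlib` as the compressor, and the three "trials" and the query are short invented texts, not real transcriptions from the Proceedings. The function names `ncd` and `most_similar` are hypothetical.

```python
import zlib


def compressed_size(text: str) -> int:
    """Number of bytes the text occupies after zlib compression."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance: lower values mean more shared structure."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def most_similar(query: str, candidates: dict) -> str:
    """Return the key of the candidate text with the smallest distance to query."""
    return min(candidates, key=lambda name: ncd(query, candidates[name]))


# Invented toy texts for illustration only (assumption, not archival data).
trials = {
    "poison_a": "the prisoner put arsenic in the coffee and the deceased "
                "drank the coffee and was taken ill that night",
    "poison_b": "the prisoner put poison in the tea and the deceased "
                "drank the tea and was taken ill that evening",
    "theft": "the prisoner stole a silver watch and two handkerchiefs "
             "from the shop of the prosecutor in Cheapside",
}
query = ("the prisoner put arsenic in the coffee and the deceased "
         "drank the coffee and died")

if __name__ == "__main__":
    for name, text in trials.items():
        print(name, round(ncd(query, text), 3))
    print("closest:", most_similar(query, trials))
```

The idea is that two texts sharing many phrases compress better together than apart, so the combined size grows only a little. On very short texts the compressor's fixed overhead distorts the scores, so the measure is more trustworthy on full trial transcripts than on snippets like these.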

All of these tools were put online, and researchers can now access the Old Bailey both through the traditional portal and through a dedicated Application Programming Interface, or API. This allows them to use programs like Zotero, their own programming language of choice, or visualization tools such as the freely accessible Voyant Tools to access the database. The project thus leaves a legacy for future researchers, enabling them to “distantly read” the trials and to make sense of that exhaustive dataset of 127 million words.

The high quality of the Old Bailey dataset is an outlier, however: most researchers will not have ready access to nearly perfectly transcribed databases like the criminal records. There is still joy to be found in less-than-perfect records, and the Trading Consequences project is one example of this. Bringing together cutting-edge work in Optical Character Recognition (OCR) algorithms, leading experts in Natural Language Processing (NLP), and innovative historians in Canada and the United Kingdom, this project produced previously unknown findings about the trans-Atlantic trade. Moving beyond anecdotes and case studies, the team drew on over six million British parliamentary paper pages, almost four million documents from Early Canadiana Online, and smaller series of letters, British documents (‘only’ 140,010 images, for example), and other correspondence.3

An automated process would take a document, turn the image into accessible text, and then ‘mark it up’: London, for example, would be marked as a ‘location,’ grain would be marked as a ‘commodity,’ and so forth. The interplay between trained programmers, linguists, and historians was especially fruitful here: for example, the database kept picking up the location “Markham” (a suburb north of Toronto, Canada), and the historians were able to point out that the entries actually referred to Clements Robert Markham, a British official culpable in smuggling. As historians develop technical skills, and computer scientists develop humanistic skills, fruitful collaborative undertakings can develop. Soon, historians of this period will be able to contextualize their studies with an interactive, visualized database of global trade in the 19th century.

Beyond court and trading records, census records have long been a staple of computational inquiry. In Canada, for example, there are two major and ambitious projects underway that use computers to read large arrays of census information. The Canadian Century Research Infrastructure project, funded by federal government infrastructure funding, draws on five censuses of the entire country in an attempt to provide a “new foundation for the study of social, economic, cultural, and political change.”4 Simultaneously, francophone researchers at the Université de Montréal are reconstructing the European population of Quebec in the seventeenth and eighteenth centuries, drawing heavily on parish registers.5 This form of history harkens back to the first wave of computational research, discussed later in this chapter, but shows some of the potential available to historians computationally querying large datasets.

Any look at textual digital history would be incomplete without a reference to the Culturomics Project and Google Ngrams.6 Originally co-released as an article and an online tool, the project saw a team collaborate to develop a process for analyzing the millions of books that Google has scanned and applied OCR to as part of its Google Books project. This project indexed word and phrase frequency across over five million books, enabling researchers to trace the rise and fall of cultural ideas and phenomena through targeted keyword and phrase searches and their frequency over time. The result is an uncomplicated but powerful look at a few hundred years of book history. One often unspoken tenet of digital history is that very simple methods can produce incredibly compelling results, and the Google Ngrams tool exemplifies this idea. In terms of sheer data, this is the most ambitious and certainly the most widely accessible (and publicized) Big History project in existence. Ben Zimmer used this free online tool to show when the United States began being discussed as a singular entity rather than as a plurality of many states by charting when people stopped saying “The United States are” in favor of “The United States is”:7

[Figure: Google Ngrams chart comparing “The United States are” with “The United States is” over time]

This is a powerful finding, both confirmatory of some research and suggestive of future paths that could be pursued. There are limitations, of course, with such a relatively simple methodology: words change meaning over time, there are OCR errors, and searching only on words or search phrases can occlude the surrounding context of a word. Some of the hubris around Culturomics can rankle historians, but taken on its own merits, the Culturomics project and the n-gram viewer have done wonders for popularizing this model of Big Digital History and have become recurrent features in the popular press, academic presentations, and lectures.
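The kind of phrase counting that underlies an “are” versus “is” comparison can be sketched as follows. This is a simplified illustration under stated assumptions, not the Culturomics pipeline: the tokenizer is a bare regular expression, the function names are hypothetical, and the tiny corpus consists of invented sentences standing in for scanned books.

```python
import re
from collections import Counter


def ngrams(text: str, n: int) -> list:
    """Lowercase the text, split it into word tokens, and return all n-word phrases."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def phrase_share_by_year(corpus: dict, phrase: str) -> dict:
    """For each year, the share of all n-grams of that length that equal the phrase."""
    n = len(phrase.split())
    target = phrase.lower()
    shares = {}
    for year, texts in sorted(corpus.items()):
        counts = Counter()
        for text in texts:
            counts.update(ngrams(text, n))
        total = sum(counts.values())
        shares[year] = counts[target] / total if total else 0.0
    return shares


# Invented sentences for illustration only (assumption, not real book data).
toy_corpus = {
    1820: ["The United States are a confederation.",
           "The United States are at peace."],
    1900: ["The United States is a great power.",
           "The United States is at peace."],
}

shares_are = phrase_share_by_year(toy_corpus, "The United States are")
shares_is = phrase_share_by_year(toy_corpus, "The United States is")

if __name__ == "__main__":
    for year in sorted(toy_corpus):
        print(year, round(shares_are[year], 3), round(shares_is[year], 3))
```

Dividing by the total number of phrases in each year matters because far more books survive from later decades, so raw counts alone would make almost every phrase appear to rise. The sketch also makes the limitations noted above concrete: an OCR error in any one of the four words means the phrase is simply not counted.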

Culturomics also presents historians with some professional cautionary notes. The authorship list of the paper and the tool was extensive: thirteen individuals and the Google Books team. There were mathematicians, computer scientists, scholars from English literature, and psychologists. However, there were no historians on the list. This is suggestive of the degree to which historians had not yet fully embraced digital methodologies, an important issue given the growing significance of digital repositories, archives, and tools. Given the significance of Culturomics and its historical claims, this omission did not go unnoticed. Writing in the American Historical Association’s professional newspaper, Perspectives, then-AHA President Anthony Grafton tackled this issue. Where were the historians, he asked in his column, when this was a historical project conducted by a team of doctoral holders from across numerous disciplines?8 To this, project leaders Erez Lieberman Aiden and Jean-Baptiste Michel responded in a comment, noting that while they had approached historians and used some in an advisory role, no historians met the “bar” for meriting inclusion in the author list: every one of the listed authors had “directly contributed to either the creation of the collection of written texts (the ‘corpus’), or to the design and execution of the specific analyses we performed.” As for why, they were clear:

The historians who came to the [project] meeting were intelligent, kind, and encouraging. But they didn’t seem to have a good sense of how to yield quantitative data to answer questions, didn’t have relevant computational skills, and didn’t seem to have the time to dedicate to a big multi-author collaboration. It’s not their fault: these things don’t appear to be taught or encouraged in history departments right now. (Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Anthony Grafton, “Loneliness and Freedom.”)

This is a serious indictment. To some degree it is an overstatement, as the other projects discussed here attest. Yet there is a kernel of truth to it, in that such computational approaches are not yet in the mainstream of the profession. This is part of the issue that this book aims to address.

Textual analysis is not the be-all and end-all of digital history work, as a project like ORBIS: The Stanford Geospatial Network Model of the Roman World vividly demonstrates. ORBIS, developed by a team of researchers at Stanford University, allows users to explore how the Roman Empire was stitched together by roads, animals, rivers, and the sea.9 Taking the Empire, mostly circa 200 CE, Walter Scheidel, Elijah Meeks and Jonathan Weiland’s creation allows visitors to understand the realm as a geographic totality: it is a comprehensive model. As they explained in a white paper: “The model consists of 751 sites, most of them urban settlements but also including important promontories and mountain passes, and covers close to 10 million square kilometers (~4 million square miles) of terrestrial and maritime space. 268 sites serve as sea ports. The road network encompasses 84,631 kilometers (52,587 miles) of road or desert tracks, complemented by 28,272 kilometers (17,567 miles) of navigable rivers and canals.”10 Wind and weather are calculated, and the variability and uncertainty of certain travel routes and options are well provided for. In all, 363,000 “discrete cost outcomes” are made available. A comprehensive, “top down” vision of the Roman Empire is provided, drawing upon conventional historical research. Through an interactive graphical user interface, a user (be they a historian, a lay person, or a student) can decide their start, destination, and month of travel; whether they want the journey to be the fastest, cheapest, or shortest; and then the specifics of how they want to travel on rivers or by road (from foot, to rapid march, to horseback, to ox cart, and beyond). Academic researchers are now able to begin substantially finer-grained research into the economic history of antiquity.

The ORBIS team has moved beyond the scholarly monograph into an accessible and comprehensive study of antiquity. Enthusiastic reactions from media outlets as diverse as the Atlantic, the Economist, the technology blog Ars Technica, and ABC all demonstrate the potential for this sort of scholarship to reach new audiences.11 ORBIS takes big data, in this case an entire repository of information about how long it would take to travel from point A to point B (remember, 363,000 cost outcomes), and turns it into a modern-day Google Maps for antiquity.
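The choice between a fastest, cheapest, or shortest journey can be illustrated with a small network search. The text here does not say how ORBIS computes its routes; the sketch below uses Dijkstra's shortest-path algorithm only because it is a standard approach to this kind of query. The sites `A` to `D`, the travel modes, and every distance, duration, and price are invented for illustration, and the function names are hypothetical.

```python
import heapq


def build_graph(edges):
    """Turn (site, site, mode, km, days, cost) rows into an undirected graph."""
    graph = {}
    for a, b, mode, km, days, cost in edges:
        weights = {"mode": mode, "shortest": km, "fastest": days, "cheapest": cost}
        graph.setdefault(a, []).append((b, weights))
        graph.setdefault(b, []).append((a, weights))
    return graph


def best_route(graph, start, goal, priority):
    """Dijkstra's algorithm, minimizing the weight named by priority.

    priority is one of "shortest" (km), "fastest" (days), or "cheapest" (cost).
    Returns (total, path), or None if the goal cannot be reached.
    """
    queue = [(0.0, start, [start])]
    settled = set()
    while queue:
        total, node, path = heapq.heappop(queue)
        if node == goal:
            return total, path
        if node in settled:
            continue
        settled.add(node)
        for neighbour, weights in graph.get(node, []):
            if neighbour not in settled:
                heapq.heappush(
                    queue, (total + weights[priority], neighbour, path + [neighbour])
                )
    return None


# Invented toy network (assumption): one direct road and two indirect routes.
toy_edges = [
    ("A", "D", "road", 100, 5, 60),
    ("A", "B", "sea", 120, 1, 40),
    ("B", "D", "sea", 120, 1, 40),
    ("A", "C", "river", 90, 4, 10),
    ("C", "D", "river", 90, 4, 10),
]
toy_graph = build_graph(toy_edges)

if __name__ == "__main__":
    for priority in ("shortest", "fastest", "cheapest"):
        print(priority, best_route(toy_graph, "A", "D", priority))
```

Keeping three weights on each link means the same network answers all three questions, and the best route differs depending on which one is asked. A model that also accounts for month of travel, as described above, would need weights that vary by season, which this sketch leaves out.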

From textual analysis explorations in the Old Bailey, to the global network of 19th century commodities, to large collections of Victorian novels or even millions of books within the Google Books database, or the travel networks that made up the Roman Empire, thoughtful and careful employment of big data has much to offer historians today. Each of these projects, representative samples of a much larger body of digital humanities work, demonstrates the potential that new computational methods can offer to scholars and the public. Now that we have seen some examples of the current state of the field, let us briefly travel back to 1946 and the first emergence of the digital humanities. Compared to the data centres of the Old Bailey online or the scholars of Stanford University, the field had unlikely beginnings in an Italian pontifical university.

  1. “The Proceedings – Publishing History of the Proceedings – Central Criminal Court,” April 2013, http://www.oldbaileyonline.org/static/Publishinghistory.jsp.
  2. Dan Cohen et al., “Data Mining with Criminal Intent: Final White Paper,” August 31, 2011, http://criminalintent.org/wp-content/uploads/2011/09/Data-Mining-with-Criminal-Intent-Final1.pdf.
  3. Trading Consequences, personal communications. Citations will be updated when publications appear.
  4. For more information, see the Canadian Century Research Infrastructure Project website at http://www.ccri.uottawa.ca/CCRI/Home.html.
  5. See the Programme de recherche en démographie historique at http://www.geneology.umontreal.ca/fr/leprdh.htm.
  6. Michel et al., “Quantitative Analysis of Culture Using Millions of Digitized Books,” Science 331:6014, January 14, 2011, and http://books.google.com/ngrams.
  7. Ben Zimmer, “Bigger, Better Google Ngrams: Brace Yourself for the Power of Grammar,” TheAtlantic.Com, 18 October 2012, http://www.theatlantic.com/technology/archive/2012/10/bigger-better-google-ngrams-brace-yourself-for-the-power-of-grammar/263487/.
  8. Anthony Grafton, “Loneliness and Freedom,” AHA Perspectives, March 2011, available online, http://www.historians.org/perspectives/issues/2011/1103/1103pre1.cfm.
  9. Walter Scheidel, Elijah Meeks, and Jonathan Weiland, “ORBIS: The Stanford Geospatial Network Model of the Roman World,” 2012, http://orbis.stanford.edu/#.
  10. Walter Scheidel, Elijah Meeks, and Jonathan Weiland, “ORBIS: The Stanford Geospatial Network Model of the Roman World,” May 2012, http://orbis.stanford.edu/ORBIS_v1paper_20120501.pdf.
  11. “Travel Across the Roman Empire in Real Time with ORBIS,” Ars Technica, accessed June 25, 2013, http://arstechnica.com/business/2012/05/how-across-the-roman-empire-in-real-time-with-orbis/; “London to Rome, on Horseback,” The Economist, accessed June 25, 2013, http://www.economist.com/blogs/gulliver/2012/05/business-travel-romans; Rebecca J. Rosen, “Plan a Trip Through History With ORBIS, a Google Maps for Ancient Rome,” The Atlantic, May 23, 2012, http://www.theatlantic.com/technology/archive/2012/05/plan-a-trip-through-history-with-orbis-a-google-maps-for-ancient-rome/257554/.

Source: http://www.themacroscope.org/?page_id=246