For 239 years, accused criminals (initially Londoners, subsequently all Britons accused of major crimes) stood for justice before magistrates at the Old Bailey. Thanks to a popular fascination with crime amongst Londoners, the events were recorded in the Proceedings: initially as pamphlets, later as bound books and journals replete with advertisements, and subsequently as official publications. These sources provide an unparalleled look into the administration of justice and the lives of ordinary people in seventeenth-, eighteenth-, and nineteenth-century England (their publication ceased in 1913). While there are gaps, this is an exhaustively large source for historians who, using customary methods, have sought to answer questions of social and legal history in the twentieth and twenty-first centuries.
The Old Bailey records are, by the standards of historians and most humanists, big data. They comprise 127 million words, cover 197,000 trials, and have been transcribed to a high standard by two typists working simultaneously to reduce error rates. Computers can read this amount of information quickly, but it would take years for a single scholar to read it all (and by the end they would probably have forgotten half of what they had read). As the research team put it: “it is one of the largest bodies of accurately transcribed historical text currently available online. It is also the most comprehensive thin slice view of eighteenth and nineteenth-century London available online.” Tackling a dataset of this size, however, requires specialized tools. Once digitized, the records were made available to the public through keyword searches. Big data methodologies, however, offered new opportunities to make sense of this very old historical material.
The Data Mining with Criminal Intent project sought to do just that. A multinational project team, including scholars from the United States, Canada, and the United Kingdom, set out to create “a seamlessly linked digital research environment” for working with these court records. Pulling their gaze back from individual trials and records, the team made new and relevant discoveries. Using a plugin (a small program or component that adds functionality to a software program) for the open-source reference and research management software Zotero (http://www.zotero.org/), Fred Gibbs at George Mason University developed a means to gather specific cases (e.g., those pertaining to “poison”) and look for commonalities. For example, “drank” was a common word, and close to it, the word “coffee;” conversely, the word “ate” was conspicuously absent. In another development, closely related trials could be brought up: by comparing differences between documents (using Normalized Compression Distance, which relies on the same standard tools that compress files on your computer), one can get the database to suggest trials that are structurally similar to the one a user is currently viewing. And, most importantly, visualizing this material allowed the research team to make new findings, such as a “significant rise in women taking on other spouses when their husbands had left them,” or the rise of plea bargaining around 1825, based on the figure below (figure 1.1) showing a dramatic shift in the length of trials in that period.
[Insert Figure 1.1: The length of trials, in words, in Old Bailey Proceedings. Copyright Tim Hitchcock and William J. Turkel, 2011. Used with permission.]
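Normalized Compression Distance, mentioned above, requires nothing more exotic than an off-the-shelf compressor. A minimal sketch in Python using the standard-library zlib module (the “trial” texts here are invented stand-ins, not actual Proceedings transcripts):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: values near 0 suggest the two
    texts share much structure; values near 1 suggest they share little."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Invented stand-in "trials": two similar thefts, one assault.
theft_a = b"the prisoner was indicted for stealing a silver watch " * 20
theft_b = b"the prisoner was indicted for stealing a linen shirt " * 20
assault = b"he did strike and wound the victim upon the highway " * 20

# A structurally similar trial scores a smaller distance.
similar = ncd(theft_a, theft_b)
different = ncd(theft_a, assault)
```

Ranking trials by this distance against the one currently on screen is, in essence, how a database can “suggest” structurally similar trials.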
All of these tools were put online, and researchers can now access the Old Bailey both through the traditional portal, as before, and through a dedicated Application Programming Interface, or API. An API is a set of rules or specifications that lets different software programs talk to each other. In the case of the Old Bailey, the API allows researchers to use programs like Zotero, or their programming language of choice, or visualization tools such as the freely accessible Voyant Tools, to access the database. The project thus leaves a legacy to future researchers, enabling them to point their macroscope toward the trials and make sense of that exhaustive dataset of 127 million words.
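Programmatic access of this kind usually amounts to assembling a query URL and parsing the structured response. The sketch below only builds the URL; the base address and parameter names are illustrative assumptions, not the documented Old Bailey interface, so consult the project’s API documentation for the real endpoint:

```python
from urllib.parse import urlencode

# NOTE: hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "https://www.oldbaileyonline.org/obapi/ob"

def build_query(terms, start=0, count=10):
    """Assemble a hypothetical search URL from a list of term strings."""
    params = [("term{}".format(i), t) for i, t in enumerate(terms)]
    params += [("start", start), ("count", count)]
    return BASE_URL + "?" + urlencode(params)

url = build_query(["offence_poisoning"], count=5)
```

A script like this is the glue between the archive and tools such as Zotero or Voyant: the URL is fetched, and the returned trial identifiers or texts are handed on for analysis.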
Findings are sure to emerge over the next few years, but one eye-opening study – “The Civilizing Process in London’s Old Bailey” by Sara Klingenstein, Tim Hitchcock, and Simon DeDeo – has already made a provocative argument that required computational intervention. Drawing on the massive Old Bailey dataset, the researchers used an innovative method of creating “bags of words” out of related words from Roget’s Thesaurus and were able to convincingly demonstrate a major shift that occurred in the early nineteenth century: violent crimes began to be both discussed and treated differently. Before that time, while the crime itself was mentioned, the violence within it was not a discrete category. The work helps confirm the hypothesis of others, which sees a “civilizing process” as a:
deep-rooted and multivalent phenomenon that accompanied the growing monopoly of violence by the state, and the decreasing acceptability of interpersonal violence as part of normal social relations. Our work [in this article] is able to track an essential correlate of this long-term process, most visibly in the separation of assault and violent theft from nonviolent crimes involving theft and deception.
As the New York Times explained in its coverage of the study, the routine nature of violence in the 1700s gave way, by the 1820s, to an emphasis on “containing violence, a development reflected not just in language but also in the professionalization of the justice system.” Distant reading enabled historians to pull their gaze back from the individual trials and to consider the 197,000 trials as a whole. The “civilizing process” study will probably, in our view, emerge as one of the defining stories of how big data can reveal new things that were not possible without it.
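The “bags of words” technique behind the study can be sketched simply: group related words into thematic categories, then reduce each trial transcript to counts over those categories, so that trials become comparable even when they share no exact vocabulary. The two tiny categories below are invented stand-ins; the actual study derived its word lists from the hierarchy of Roget’s Thesaurus:

```python
# Invented stand-in categories; the study built its lists from
# Roget's Thesaurus rather than hand-picking words like this.
CATEGORIES = {
    "violence": {"stab", "beat", "strike", "wound", "blood"},
    "theft": {"steal", "stole", "goods", "shillings", "pocket"},
}

def bag_of_words(text: str) -> dict:
    """Reduce a transcript to counts over thematic categories."""
    counts = {cat: 0 for cat in CATEGORIES}
    for word in text.lower().split():
        for cat, vocab in CATEGORIES.items():
            if word in vocab:
                counts[cat] += 1
    return counts
```

Tracking how such category counts shift across decades of trials is what let the researchers see violence emerging as a discrete category of discussion.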
The high quality of the Old Bailey dataset makes it an outlier, however: most researchers will not have ready access to nearly perfectly transcribed databases like these criminal records. Even digitized newspapers, which on the face of it would seem to be excellent resources, are not without serious issues at the level of the OCR. There is still joy to be found, however, in less-than-perfect records. The Trading Consequences project is one example of this. Bringing together leading experts in Natural Language Processing (NLP) and innovative historians in Canada and the United Kingdom, this project confirmed and enhanced existing understandings of trade by tracing the relationships between commodities and global locations. The team adapted the Edinburgh Geoparser (which is to be publicly released) to find and locate place names in text files, and then identified commodities that were mentioned in relationship with those place names. One useful example the project gave us was that of cinchona bark, a raw material used to make quinine (an anti-malarial drug) – useful because it allowed the team to check its findings against the secondary literature on the research question. They confirmed via computational analysis that until the 1850s the bark was closely associated with Peruvian forests as well as markets and factories in London, England; by the 1860s, it had shifted towards locations in India, Sri Lanka, Indonesia, and Jamaica. By drawing on thousands of documents, changes can be quickly spotted. Even more promisingly, the material can be put online in database form, enabling historians to test their own hypotheses and findings. There are several ways to access the database, including searches on specific commodities and locations, and it can all be found at http://tradingconsequences.blogs.edina.ac.uk/access-the-data/. For historians interested in this form of work, the project’s white paper is of considerable utility.
Moving beyond anecdotes and case studies, the team drew on over six million British parliamentary paper pages, almost four million documents from Early Canadiana Online, and smaller series of letters, British documents (‘only’ 140,010 images, for example), and other correspondence. This sort of work presages the growing importance of linked open data, which points towards the increasing trend of putting information online in a format that other computer programs can read. In this way, Early Canadiana Online can speak to files held in another repository, allowing us to harness the combined potential of numerous silos of knowledge.
An automated process would take a document, turn the image into accessible text, and then ‘mark it up’: London, for example, would be marked as a ‘location,’ grain as a ‘commodity,’ and so forth. The interplay between trained programmers, linguists, and historians was especially fruitful here: for example, the database kept picking up the location “Markham” (a suburb north of Toronto, Canada), and the historians were able to point out that the entries actually referred to a British official, culpable in smuggling, Clements Robert Markham. As historians develop technical skills, and computer scientists develop humanistic skills, fruitful collaborative undertakings can develop. Soon, historians of this period will be able to contextualize their studies with an interactive, visualized database of global trade in the 19th century. Yet, while collaborative outcomes are laudable when dealing with big questions, we still believe there is a role for the sole researcher or small group. We personally find that exploratory research can be fruitfully carried out by a sole researcher, or, in our case, three of us working on a problem together. No one size of team fits all questions.
Beyond court and trading records, census records have long been a staple of computational inquiry. In Canada, for example, two major and ambitious projects are underway that use computers to read large arrays of census information. The Canadian Century Research Infrastructure project, funded by federal government infrastructure funding, draws on five censuses of the entire country in an attempt to provide a “new foundation for the study of social, economic, cultural, and political change.” Simultaneously, researchers at the Université de Montréal are reconstructing the European population of Quebec in the seventeenth and eighteenth centuries, drawing heavily on parish registers. This form of history harkens back to the first wave of computational research, discussed later in this chapter, but shows some of the potential available to historians computationally querying large datasets.
Any look at textual digital history would be incomplete without a reference to the Culturomics Project and Google Ngrams. A team based at Harvard University (the composition of which is discussed shortly) developed a process for analyzing the millions of books that Google has scanned and run through OCR as part of its Google Books project, releasing the results simultaneously as an article and an online tool. This project indexed word and phrase frequency across over five million books, enabling researchers to trace the rise and fall of cultural ideas and phenomena through targeted keyword and phrase searches and their frequency over time. The result is an uncomplicated but powerful look at a few hundred years of book history. One often unspoken tenet of digital history is that very simple methods can produce incredibly compelling results, and the Google Ngrams tool exemplifies this idea. In terms of sheer data, this is the most ambitious and certainly the most widely accessible (and publicized) Big History project in existence. Ben Zimmer used this free online tool to show when the United States began being discussed as a singular entity rather than as a plurality of many states, by charting when people stopped saying “The United States are” in favor of “The United States is” (figure 1.2):
[Insert Figure 1.2: “The United States are” versus “The United States is”]
This is a powerful finding, both confirmatory of some research and suggestive of future paths that could be pursued. There are limitations, of course, to such a relatively simple methodology: words change in meaning and typographical characteristics over time (compare the frequency of beft with the frequency of best to see the impact of the “medial s”), there are OCR errors, and searching only on words or phrases can occlude the surrounding context of a word. Some of the hubris around Culturomics may rankle some historians, but taken on its own merits, the Culturomics project and the n-gram viewer have done wonders for popularizing this model of big digital history, and have become recurrent features in the popular press, academic presentations, and lectures.
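Underneath the n-gram viewer is little more than counting: for each year, divide occurrences of a phrase by the total number of words published that year. A toy sketch of that per-year relative frequency calculation (the two-sentence “corpus” is invented for illustration):

```python
def relative_frequency(corpus: dict, phrase: str) -> dict:
    """corpus maps year -> list of texts; returns year -> phrase
    occurrences per word, a crude stand-in for the n-gram viewer."""
    target = phrase.lower().split()
    n = len(target)
    result = {}
    for year, texts in corpus.items():
        hits, total = 0, 0
        for text in texts:
            words = text.lower().split()
            total += len(words)
            hits += sum(1 for i in range(len(words) - n + 1)
                        if words[i:i + n] == target)
        result[year] = hits / total if total else 0.0
    return result

# Invented miniature corpus echoing Zimmer's singular/plural question.
corpus = {
    1800: ["the united states are a federation of states"],
    1900: ["the united states is one indivisible nation"],
}
```

The real tool adds tokenization, smoothing, and a five-million-book corpus, but the core computation is this simple, which is part of why the results are so easy to interrogate.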
Culturomics also presents historians with some professional cautionary notes, as Ian Milligan noted in a Journal of the Canadian Historical Association article. The authorship list of the paper and the tool was extensive: thirteen individuals and the Google Books team. There were mathematicians, computer scientists, scholars from English literature, and psychologists. There were, however, no historians on the list. This is suggestive of the degree to which historians had not then yet fully embraced digital methodologies, an important issue given the growing significance of digital repositories, archives, and tools. Given the significance of Culturomics and its historical claims, the omission did not go unnoticed. Writing in the American Historical Association’s professional newspaper, Perspectives, then-AHA President Anthony Grafton tackled this issue. Where were the historians, he asked in his column, when this historical project was conducted by a team of doctorate holders from across numerous disciplines? To this, project leaders Erez Lieberman Aiden and Jean-Baptiste Michel responded in a comment, noting that while they had approached historians and used some in an advisory role, no historians met the “bar” for inclusion in the author list: every one of the project participants had “directly contributed to either the creation or the collection of written texts (the ‘corpus’), or to the design and execution of the specific analyses we performed.” As for why, they were clear:
The historians who came to the [project] meeting were intelligent, kind, and encouraging. But they didn’t seem to have a good sense of how to yield quantitative data to answer questions, didn’t have relevant computational skills, and didn’t seem to have the time to dedicate to a big multi-author collaboration. It’s not their fault: these things don’t appear to be taught or encouraged in history departments right now. (emphasis added)
To some degree this is an overstatement, as the previous examples in this chapter illustrate. Historians are doing some amazing things with data. Yet there is a kernel of truth to it (still, as of 2014), in that the teaching of digital history perspectives and methods is not yet in the mainstream of the profession, though this is certainly changing: witness American Historical Association panels on both teaching and doing digital history, as well as the emergence of exciting and well-received books such as History in the Digital Age. This is part of the issue that this book aims to address.
While the n-gram database is a good way to begin to understand and think about textual analysis, other important resources exist for humanities scholars: Bookworm, a refined version of the n-gram database, as well as the JSTOR text-mining tool Data for Research. To use Bookworm, visit http://bookworm.culturomics.org/. There, we see that size is not everything. Its “open library” search functionality ‘only’ searches a little over a million books (the openlibrary.org collection), but it draws those books from the pre-1923 corpus; this gives it functionality not available to a tool that incorporates copyrighted works. Consider the following search, depicted in figure 1.3:
[Insert Figure 1.3: Bookworm search of ‘taxes’]
The frequency of taxes increases, with initial valleys and peaks from 1750 through 1830, and then a steady ascent up to the end of the database in 1923. In the n-gram database, this would be all we could see. But thanks to the granularity of Bookworm, we can see where discussion of taxes has been increasing. If we narrow our search so that it considers only books published in the United States, we see this (figure 1.4):
[Insert Figure 1.4: Bookworm search of ‘taxes’, for books published in the United States]
The peak here occurs much earlier – around the American Revolution – and the frequency only begins to trend upwards again much later in the 19th century. We are seeing more refined results, and the overall contours become clearer when we look at taxes in the United Kingdom (figure 1.5):
[Insert Figure 1.5: Bookworm search of ‘taxes’, for books published in the United Kingdom]
While this occludes context at a glance, because the included books are not under copyright we can move from the distant reading level of relative word frequency to individual books by clicking on a given year. We discover that the peak year is 1776, the year of the American Revolution, and that discussion of taxes played a central role in it. By clicking on each year, readers can be taken to the texts themselves: in this case, discussions of the Wealth of Nations, political pamphlets published in England concerning taxation, published speeches, and government reports. There are other options: users can discover the percentage of books that a word features in, for example, or, in the American context, drill down to the level of the author’s state. Bookworm thus serves the distant reading purpose of looking at overall trends, but also the close reading purpose of finding individual works. It is more versatile than we have space to do justice to here: as of writing (2014) it has similar features for scientific literature (arXiv), bills passed in the United States Congress, the Chronicling America newspaper database, and the Social Science Research Network.
As the datasets made available in Bookworm indicate, there is much value in extracting word and phrase frequencies from the body of academic literature. As historians, we often find ourselves writing historiographies – a process which involves, in part, understanding the trends in the literature that has come before. One of the major repositories we use as researchers is JSTOR (Journal STORage). Recognizing the value of distant reading and textual analysis to researchers, JSTOR has made it possible for anybody – even members of the general public who would not normally have access to the collections – to carry out queries.
An example can help show the utility. Visit http://dfr.jstor.org/. You do not even need a subscription to JSTOR to use this tool! Let’s take two sub-disciplines of history that have seen a relative waning and booming of fortunes, in terms of scholarly interest, over the last few decades. In the search engine, we search first for “labor history.” Then, knowing that there is an alternative English-language spelling, we do a second search for “labour history.” Searches stack on top of each other, so we are seeing the combined results. The default results are a list of citations, so to generate visualizations we click on the ‘graph’ icon (figure 1.6).
[Insert Figure 1.6: ‘graph’ icon from DFR.JSTOR.ORG]
Below, we see, between 1960 (when the first mentions occur) and the present, the combined relative frequency of the phrases “labor history” and “labour history” (figure 1.7):
[Insert Figure 1.7: ‘labor history’ and ‘labour history’ search results in articles in JSTOR, via dfr.jstor.org]
Remember that JSTOR is not a perfectly comprehensive database: many journals are not included in it. Even so, this is a useful starting point. We see labour history rising to a peak in the late 1970s and early 1980s, with another surprising bump in the late 1990s. It then declines throughout the 2000s. If we want to generate a list of citations, we can download them all by clicking ‘submit a dataset request’ in the upper menu bar.
[Insert Figure 1.8: ‘environmental history’ search results in articles in JSTOR, via dfr.jstor.org]
Interesting – a subdiscipline ascendant! But what if this reflects the use of “environmental history” in other disciplines: environmental history not as a historical subdiscipline, but in the literature reviews of scientific or technical papers? We can refine our data further. In the left column, we see a list of search criteria. We click on ‘discipline,’ then select ‘history,’ and look at the results (figure 1.9):
[Insert Figure 1.9: ‘environmental history’ search results, filtered by ‘discipline’, in JSTOR, via dfr.jstor.org]
Aha! The results show a decline within history from the mid-2000s onwards. Anecdotally, this is borne out by the North American job market as well. This data must be used with caution, but it is an additional resource to add to your digital humanities toolkit, and a useful contextualizing tool for historiographical papers.
Textual analysis is not the be-all and end-all of digital history work, as a project like ORBIS: The Stanford Geospatial Network Model of the Roman World vividly demonstrates. ORBIS, developed by Walter Scheidel and Elijah Meeks at Stanford University, allows users to explore Roman understandings of space as temporal distances and linkages, dependent on stitching together roads, vehicles, animals, rivers, and the sea. Taking the Empire as it stood circa 200 AD, ORBIS allows visitors to understand that world as a geographic totality: it is a comprehensive model. As Scheidel and Meeks explained in a white paper, the model consists of 751 sites, most of them urban settlements but also including important promontories and mountain passes, and covers close to 10 million square kilometers (~4 million square miles) of terrestrial and maritime space. 268 sites serve as seaports. The road network encompasses 84,631 kilometers (52,587 miles) of road or desert tracks, complemented by 28,272 kilometers (17,567 miles) of navigable rivers and canals. Wind and weather are taken into account, and the variability and uncertainty of particular travel routes and options are modeled as well, yielding 363,000 “discrete cost outcomes.” The result is a comprehensive, “top down” vision of the Roman Empire (fitting how we are used to understanding space), drawing upon conventional historical research. Through an interactive graphical user interface, a user – be they a historian, a lay person, or a student – can choose a start point, a destination, the month of travel, whether they want their journey to be the fastest, cheapest, or shortest, how they want to travel, and specifics such as whether they want to travel on rivers or by road (from foot, to rapid march, to horseback, to ox cart, and beyond). Academic researchers can now begin substantially more fine-grained research into the economic history of antiquity.
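At its core, a route query of this kind is a shortest-path search over a weighted network: sites are nodes, and each road, river, or sea link carries a cost in days or denarii. A toy sketch using Dijkstra’s algorithm, with real Roman place names but invented travel times, since the actual ORBIS cost data is far richer:

```python
import heapq

def fastest_route(network, origin, destination):
    """Dijkstra's algorithm: returns (path, total travel time in days)."""
    best = {origin: 0.0}   # cheapest known cost to each site
    prev = {}              # back-pointers for path reconstruction
    queue = [(0.0, origin)]
    visited = set()
    while queue:
        cost, site = heapq.heappop(queue)
        if site in visited:
            continue
        visited.add(site)
        if site == destination:
            break
        for neighbour, days in network.get(site, {}).items():
            new_cost = cost + days
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                prev[neighbour] = site
                heapq.heappush(queue, (new_cost, neighbour))
    path = [destination]
    while path[-1] != origin:
        path.append(prev[path[-1]])
    return path[::-1], best[destination]

# Invented travel times (days) over a tiny fragment of the network.
network = {
    "Londinium": {"Gesoriacum": 1.0},
    "Gesoriacum": {"Lugdunum": 6.0, "Massilia": 9.0},
    "Lugdunum": {"Massilia": 4.0},
    "Massilia": {"Roma": 5.0},
}
```

Swapping the per-edge weights (time, expense, distance; season-dependent sea costs) is what lets a user ask for the fastest, cheapest, or shortest journey over the same network.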
The ORBIS project has moved beyond the scholarly monograph into an accessible and comprehensive study of antiquity. Enthusiastic reactions from media outlets as diverse as the Atlantic, the Economist, the technology blog Ars Technica, and the ABC television network all demonstrate the potential for this sort of scholarship to reach new audiences. ORBIS takes big data, in this case an entire repository of information about how long it would take to travel from point A to point B (remember, 363,000 cost outcomes), and turns it into a modern-day Google Maps for antiquity.
From textual analysis explorations in the Old Bailey, to the global network of 19th-century commodities, to large collections of Victorian novels or even the millions of books within the Google Books database, to the travel networks that made up the Roman Empire, the thoughtful and careful employment of big data has much to offer historians today. Each of these projects, a representative sample of a much larger body of digital humanities work, demonstrates the potential that new computational methods offer to scholars and the public. None of them would be possible, of course, if information professionals had not done such a fantastic job of collecting all this material – in a theme we will return to, librarians and archivists often do yeoman’s work in curating, collecting, and preserving these traces of the past in aggregate form for us. Now that we have seen some examples of the current state of the field, let us briefly travel back to 1946 and the first emergence of the digital humanities. Compared to the data centres of the Old Bailey Online or the scholars of Stanford University, the field had unlikely beginnings.
Dan Cohen et al., “Data Mining with Criminal Intent: Final White Paper,” August 31, 2011, http://criminalintent.org/wp-content/uploads/2011/09/Data-Mining-with-Criminal-Intent-Final1.pdf.
The best explanation of this is probably the Piled Higher and Deeper (PhD Comics) video explaining British historian Adam Crymble’s work. See “Big Data + Old History,” YouTube video, 6 September 2013, http://www.youtube.com/watch?v=tp4y-_VoXdA.
For an incredible overview of this, see Tim Hitchcock, “Big Data for Dead People: Digital Readings and the Conundrums of Positivism,” Historyonics Blog, 9 December 2013, http://historyonics.blogspot.ca/2013/12/big-data-for-dead-people-digital.html.
Sara Klingenstein, Tim Hitchcock, and Simon DeDeo, “The Civilizing Process in London’s Old Bailey,” Proceedings of the National Academy of Sciences 111, no. 26 (1 July 2014): 9419–24. Available online at http://www.pnas.org/content/111/26/9419.full.
Sandra Blakeslee, “Computing Crime and Punishment,” The New York Times, June 16, 2014, http://www.nytimes.com/2014/06/17/science/computing-crime-and-punishment.html.
Ian Milligan, “Illusionary Order: Online Databases, Optical Character Recognition, and Canadian History, 1997–2010,” Canadian Historical Review 94, no. 4 (December 2013): 540–569.
Ewan Klein et al., “Trading Consequences: Final White Paper,” March 2014, http://tradingconsequences.blogs.edina.ac.uk/files/2014/03/DiggingintoDataWhitePaper-final.pdf.
Remember, to do a phrase search, use quotation marks around your phrase. Otherwise, the search will return results where the words are merely in close proximity, which can skew your results. See https://books.google.com/ngram/info.
Aptly put in Ted Underwood, “Wordcounts are Amazing,” The Stone and the Shell Research Blog, 20 February 2013, http://tedunderwood.com/2013/02/20/wordcounts-are-amazing/.
Ben Zimmer, “Bigger, Better Google Ngrams: Brace Yourself for the Power of Grammar,” TheAtlantic.com, 18 October 2012, http://www.theatlantic.com/technology/archive/2012/10/bigger-better-google-ngrams-brace-yourself-for-the-power-of-grammar/263487/.
Ian Milligan, “Mining the ‘Internet Graveyard’: Rethinking the Historians’ Toolkit,” Journal of the Canadian Historical Association 23, no. 2 (2012): 21–64.
History in the Digital Age, ed. Toni Weller (Routledge, 2013). See also increasingly prominent venues such as Global Perspectives on Digital History, http://gpdh.org/.
Walter Scheidel, Elijah Meeks, and Jonathan Weiland, “ORBIS: The Stanford Geospatial Network Model of the Roman World,” May 2012, http://orbis.stanford.edu/ORBIS_v1paper_20120501.pdf.
“Travel Across the Roman Empire in Real Time with ORBIS,” Ars Technica, accessed June 25, 2013, http://arstechnica.com/business/2012/05/how-across-the-roman-empire-in-real-time-with-orbis/; “London to Rome, on Horseback,” The Economist, accessed June 25, 2013, http://www.economist.com/blogs/gulliver/2012/05/business-travel-romans; Rebecca J. Rosen, “Plan a Trip Through History With ORBIS, a Google Maps for Ancient Rome,” The Atlantic, May 23, 2012, http://www.theatlantic.com/technology/archive/2012/05/plan-a-trip-through-history-with-orbis-a-google-maps-for-ancient-rome/257554/.
The interested reader should follow http://www.digitalhumanitiesnow.org and @dhnow on Twitter to be kept abreast of current developments and projects in digital history. The resource is a machine-human collaboration meant to surface the best recent digital humanities work, and is itself a leveraging of big data that has only recently become possible.