In this chapter, we begin by discussing what big data means for humanities researchers. “How big is big,” we rhetorically ask: big data for literature scholars might mean a hundred novels, for historians it might mean an entire array of 19th-century shipping rosters, and for archaeologists it might mean every bit of data generated by several seasons of field survey, excavation, and study – the materials that don’t go into the Geographic Information System (GIS). For us, “big is in the eye of the beholder.” If it is more data than you could conceivably read yourself in a reasonable amount of time, or if it requires computational intervention to make new sense of it, it is big enough! This opens up an opportunity to begin the book by introducing some cutting-edge projects, many funded by the Digging into Data grants.
With the current state of affairs thus surveyed, we turn our eyes to the historical context of this scholarly moment. We begin with the original Digital Humanities project of Father Busa and his Index Thomisticus, and then continue into a discussion of how the shift from ‘humanities computing’ to the ‘digital humanities’ is underway, and why it matters.
Following this, we discuss the broader implications of an “era of big data.” Here we see the joys of abundance, but also the dangers of information overload. Using information from the Internet Archive, Google, and national archival organizations from the United States, Great Britain, and Canada, as well as a host of other projects, we discuss the contours of this challenge and opportunity.
Not all data is of equal use and accessibility, and here we run into issues of copyright (more fully discussed in Chapter Two). Alluding to some of the discussion of copyright that will follow, we note that scholars of the 18th and 19th centuries have tremendous opportunities due to the ever-growing number of open-access records.
We conclude with the argument that the computational side of what we think of as ‘digital history’ or ‘digital archaeology’ offers tremendous opportunity for scholars.
With the “big picture” established in Chapter One, this chapter discusses what the Digital Humanities moment can offer to scholars. We begin by establishing a baseline of knowledge, focusing on a trio of critical terms: open access, copyright, and what we mean by textual analysis. The latter segues into an important observation: humanities scholars are already engaged in much of this work, as ProQuest newspaper databases and Google searches have already influenced our scholarship. We can see this in several recent publications: an American Historical Review article showing how many historians are slow to adopt tools, and the negative impact this has on historiography; a Canadian Historical Review article about the tremendous impact that digital resources have had on the literature; and a recent Ithaka S+R consulting report about the impact of new technologies on the profession.
This chapter is then a series of do-it-yourself, hands-on lessons, using general conceptual terms but with concrete links to accessible tools where appropriate (conscious of the risk of dating the book, much of this can be discussed at a level not tied to any specific programming language).
The first lesson will be how you can find data. But once you find it, what can you do with it? The second lesson builds up your digital toolkit. Using off-the-shelf, openly accessible resources, such as the Programming Historian and the Python programming language, you can quickly normalize and tokenize your text. These are not magic bullets, of course. Cautionary notes abound, including those relating to the limitations of how sources were constructed, especially in terms of Optical Character Recognition (OCR) and sampling bias.
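The normalizing and tokenizing step described above can be sketched in a few lines of Python using only the standard library. The sample sentence and the short stopword list here are invented for illustration; a real project would use a fuller list, such as the one bundled with NLTK:

```python
import re

# A short stopword list for illustration only; real projects would use a
# fuller list (e.g. the one that ships with NLTK).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "that", "it", "is"}

def tokenize(text):
    """Normalize: lowercase the text, strip punctuation, split into tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stopwords(tokens):
    """Drop common function words that carry little analytical weight."""
    return [t for t in tokens if t not in STOPWORDS]

raw = "The ships arrived in the harbour, and the rosters were updated."
tokens = remove_stopwords(tokenize(raw))
print(tokens)  # ['ships', 'arrived', 'harbour', 'rosters', 'were', 'updated']
```

The same two functions scale unchanged from a single sentence to a corpus of scanned newspapers – which is exactly where the OCR and sampling-bias caveats begin to matter.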
Provocatively, we conclude the chapter by asking: in light of the widespread social histories we can now create of modern times through data mining, what implications does this hold for historians and archaeologists, given our currently very selective process of studying the pre-digital past? We do so by discussing the “great unread” (as Moretti puts it) of big data sets. Furthermore, we note the renewed importance of librarians and archivists, noting the emergence of several international standards, and how adopting open standards and making data accessible opens up tremendous opportunities for a moment of “linked data.” To make this more tangible, we will use an example that shows how openly accessible data allows for truly transformative scholarship.
After discussing generalities and overall issues, Chapter Three – accessible to a senior undergraduate reader – introduces readers to several critical Digital Humanities data mining and visualization tools. By this point, readers understand how to obtain and wrangle data. We now begin to discuss: what can you do with it?
We begin with a simple word cloud (such as those generated by Wordle). While word clouds are often overused and must be handled with caution, as discussed here, they are also a “gateway drug” for further data visualizations. Beyond Wordle, we then move into open-source options for visualizing. These include a series of vignettes, including Keyword in Context (KWIC) displays, word maps, and Excel tools that move us from simple data sets to node visualizations. Spatial visualizations are similarly introduced, building off the British project Locating London’s Past and using the free Google Earth platform as a quick introduction.
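A Keyword in Context display is simple enough to build by hand, which makes it a good first vignette. A minimal sketch, using an invented sample sentence:

```python
def kwic(tokens, keyword, window=2):
    """Return each occurrence of `keyword` with `window` tokens of
    context on either side -- the classic Keyword in Context display."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append(f"{left} [{keyword}] {right}")
    return hits

tokens = "the old port saw ships daily and the ships carried tea".split()
for line in kwic(tokens, "ships"):
    print(line)
# port saw [ships] daily and
# and the [ships] carried tea
```

Lining up every occurrence of a word with its immediate neighbours is often the fastest way to see how a term is actually being used across a large corpus.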
These lessons will be platform independent and couched in general terms, rather than specific technical details. This will prevent the book from unduly aging. We will, however, have a few ‘sidebars’ that introduce more free technical applications: one introducing the Software Environment for the Advancement of Scholarly Research (SEASR) platform, another the web-based Voyant Tools suite, and another the Metadata Offer New Knowledge (MONK) platform. These sidebars will contain links to our GitHub repository where appropriate.
Building on our demonstrated expertise in teaching users how to use the MAchine Learning for LanguagE Toolkit (MALLET) suite of resources, as seen in the Programming Historian 2 and the Journal of Digital Humanities, this chapter fleshes out the technical side with an in-depth exploration of what topic modelling offers to historians – and how it works.
We begin small, with a single, short, famous piece of text: Abraham Lincoln’s Gettysburg Address. Using simple spreadsheet software as our sole piece of technology, we take the user from the speech in plain text, to breaking it into tokens and removing stopwords, to performing a full topic model on it. Given the considerable difficulty of comprehending topic modelling and its underlying algorithm (Latent Dirichlet Allocation), this will teach the core concept. By using simple software, we also ensure that this section will not become dated. We point our readers to the ‘data buffet’ analogy of Matt Jockers and the ‘farmer’s market’ analogy of Lisa Rhody, but working a (comparatively and highly simplified) topic model out by hand grounds the reader in the conceptual framework of the technique.
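For readers who want to see the machinery rather than work it in a spreadsheet, the same by-hand exercise can be expressed as a bare-bones collapsed Gibbs sampler for Latent Dirichlet Allocation. This is a highly simplified sketch, not MALLET, and the four toy ‘documents’ below are invented to echo two themes of the Gettysburg Address:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.01, iters=200, seed=42):
    """A bare-bones collapsed Gibbs sampler for LDA, small enough to trace
    by hand. `docs` is a list of token lists, already stopword-filtered."""
    random.seed(seed)
    V = len({w for d in docs for w in d})              # vocabulary size
    ndk = [[0] * n_topics for _ in docs]               # doc -> topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic -> word counts
    nk = [0] * n_topics                                # tokens per topic
    z = []                                             # topic of each token
    for d, doc in enumerate(docs):                     # random initial topics
        z.append([random.randrange(n_topics) for _ in doc])
        for w, k in zip(doc, z[d]):
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                            # unassign this token
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Resample in proportion to (doc likes topic) x (topic likes word).
                weights = [(ndk[d][t] + alpha) *
                           (nkw[t].get(w, 0) + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                r = random.uniform(0, sum(weights))
                k, acc = 0, weights[0]
                while r > acc and k < n_topics - 1:
                    k += 1; acc += weights[k]
                z[d][i] = k                            # reassign and recount
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    # Report up to three top words per topic, by assignment count.
    return [sorted((w for w in nkw[t] if nkw[t][w] > 0),
                   key=nkw[t].get, reverse=True)[:3] for t in range(n_topics)]

docs = [["liberty", "nation", "dedicated", "liberty", "equal"],
        ["battlefield", "war", "dead", "battlefield", "brave"],
        ["nation", "liberty", "dedicated", "equal"],
        ["war", "dead", "brave", "battlefield"]]
print(lda_gibbs(docs))
```

Even at this toy scale, the two count tables – which topics a document uses, and which words a topic prefers – are exactly the quantities the spreadsheet exercise tracks by hand, which is why the by-hand version transfers so directly to real tools.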
Once our readers are confident in a complete understanding of the methodology, we then take them through some more targeted examples – demonstrating the insights that can be gleaned through this sophisticated methodology. Examples will be drawn from the now-canonical work of Robert Nelson (the Mining the Dispatch project) and Cameron Blevins’s work on Martha Ballard’s diary, as well as examples we build from public domain sources like the diaries of William Lyon Mackenzie King (Canada’s wartime Prime Minister). In particular, one example that demonstrates the immense power of the method is explicitly archaeological. We will build a topic model of the data contained in the Portable Antiquities Scheme Database (tens of thousands of records), where we will imagine that each parish in the UK is a ‘document’ and the items and their descriptions recovered in those parishes form the individual ‘tokens’. A sidebar will discuss David Mimno’s topic modelling of the contents of a house from Pompeii. We will analyze and visualize this data in this and subsequent chapters. The tokenized data itself (from all of our own exemplars) will be available in our repository for the reader to play with, to run their own models on, to explore and to visualize. We will encourage readers to come to their own conclusions about what the resulting topic models might say. We conclude with various beautiful visualizations that help bring everything together.
By this point, readers should have a good sense of the general and specific principles of data gathering and analysis. This chapter addresses one of the key fears that scholars have with big data: that it could mean losing the trees for the forest. We argue that network analysis allows historians to connect the micro and macro, situating individual actors within a complex interconnected ecology. Network analysis has been one of the most fruitful ways that topic models have been visualized, making it a useful discussion at this point. For historians, network analysis fruitfully explores concepts of space and time.
The chapter begins with a basic breakdown of the concepts and vocabulary of network analysis, which is a unique and transformative visualization and relational modelling technique (building upon Chapter Four). We discuss the technological features and limits of data structures for encoding networks, including matrices, edge lists, node/edge attribute lists, and dynamic representations. All of this can be performed within a spreadsheet program (especially, but not limited to, the free NodeXL plugin for MS Excel), and we will walk our readers through a rudimentary example to bring it all together.
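The relationship between the two basic data structures – an edge list and an adjacency matrix – can be shown in a few lines. This sketch uses an invented correspondence network; the names are hypothetical, chosen only to suggest an early American letters collection:

```python
# Hypothetical correspondence network: who wrote to whom (an edge list).
edges = [("Abigail", "John"), ("John", "Thomas"),
         ("Abigail", "Thomas"), ("John", "Benjamin")]

# Convert the edge list into an adjacency matrix -- the same grid of
# 0s and 1s a spreadsheet user would lay out by hand.
nodes = sorted({n for pair in edges for n in pair})
index = {n: i for i, n in enumerate(nodes)}
matrix = [[0] * len(nodes) for _ in nodes]
for a, b in edges:
    matrix[index[a]][index[b]] = 1
    matrix[index[b]][index[a]] = 1   # undirected: mirror each entry

# Degree centrality at its simplest: how many correspondents each person has.
degree = {n: sum(matrix[index[n]]) for n in nodes}
print(degree)  # {'Abigail': 2, 'Benjamin': 1, 'John': 3, 'Thomas': 2}
```

Summing a row of the matrix gives that person’s degree, which is precisely the row-total formula a reader would enter in Excel or NodeXL.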
We then explore more detailed topics and the opportunities they offer. These include explicit networks, such as those drawn from correspondence collections, as well as ‘derived’ networks, where relationships are deduced from statistical analysis of (for instance) word structures, discourses, or in the case of archaeology, measurements of artifact similarity. Within sidebars, we also introduce several popular tools for creating, visualizing, and analyzing networks along with sample datasets and workflows. Examples will include archaeological databases, early-modern letters, and historical document co-occurrences of words. Of course, no tool is perfect. We conclude with a discussion of the perils and potentials of network analysis for historians.
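The ‘derived’ network idea can be made concrete with a small sketch: here, sites become nodes, and an edge is inferred whenever two artifact assemblages are sufficiently similar. The site names, assemblages, and the 0.3 threshold are all invented for illustration, and Jaccard similarity stands in for whatever similarity measure a real study would justify:

```python
# Hypothetical artifact assemblages per site.
sites = {
    "SiteA": {"coin", "brooch", "amphora"},
    "SiteB": {"coin", "brooch", "bead"},
    "SiteC": {"flint", "bead"},
}

def jaccard(a, b):
    """Overlap of two assemblages: |intersection| / |union|."""
    return len(a & b) / len(a | b)

threshold = 0.3   # an illustrative cut-off, not a canonical value
names = sorted(sites)
# Keep an edge only where similarity clears the threshold.
derived_edges = [(x, y, round(jaccard(sites[x], sites[y]), 2))
                 for i, x in enumerate(names)
                 for y in names[i + 1:]
                 if jaccard(sites[x], sites[y]) >= threshold]
print(derived_edges)  # [('SiteA', 'SiteB', 0.5)]
```

The same pattern – measure pairwise similarity, then threshold – underlies derived networks of word structures and discourses as well; only the similarity measure changes.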
Chapter Six serves as both a conclusion and an emphatic argument that the digital humanities can spur significant changes in how we approach our research questions. Using a series of binary pairs, such as discovery versus justification, close versus distant reading, and the micro versus the macroscope, we ultimately argue that we are in the midst of an epistemological shift from single-author, single-perspective works to the multiple perspectives required by big data. While we believe that there is still considerable room for traditional methodologies, which will continue to make up the lion’s share of research approaches, we argue that digital techniques should become an essential part of the undergraduate and graduate training experience. We hope that, having moved through the previous five chapters, readers will have found something useful and will thus agree.
We then conclude the book with a rumination on what may come next, and a final reflection on the role played by data mining, textual analysis, topic modelling, and networks in history and archaeology. In our closing paragraphs, we return to our scholar from the first page. How has her workflow changed? What has she discovered?