An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Why We’re All Digital Now

You do not have to be a self-declared Digital Historian to be a digital historian. Indeed, almost all historians are digital today. While there are undoubtedly a few laggards, most historians use search engines to find sources (both online and in conventional print archives). A smaller yet still significant number take digital photographs at archives to review when they get home. Perhaps most significantly, historians increasingly rely on online databases to find relevant journal and newspaper articles: from JSTOR to ProQuest to regional or national undertakings like the Canadian province of Ontario’s Scholars Portal database. The previous three keywords all apply to these practices: copyright influences what is and is not available (especially important for newspapers) and to whom; some databases are open access and others are not (as paywalls remind us); and – critically – computational textual analysis underpins these search databases. Our research is being shaped every day by these forces. Once a historian begins to conceptualize her or his everyday activities as part of the digital turn, the leap towards the wholesale manipulation and exploration of digital materials is not so great after all.

Beyond anecdote, we have increasingly sophisticated data on how historians are responding to this digital turn. In late 2012, the consulting firm ITHAKA S+R (owned by the parent company of the JSTOR database) published an exhaustive report on historians’ research practices. Its summary shows the widespread impact of digital methods, beyond the explicitly defined Digital Historians:

Even if the impact of computational analysis and other types of new research methods remains limited to a subset of historians, new research practices and communications mechanisms are being adopted widely, bringing with them both opportunities and challenges.[1]

Historians begin with Google, find sources online, work with online databases, go to archives and take digital photographs, and digitally view these representations of analog materials on their home or office computers. Despite the antimodernist presentation of the historian in stock images and in parts of the popular consciousness, historians have been no less affected by the digital turn than anyone else. The issue at hand is that while historians have adopted digital tools, to some degree they have adopted them uncritically.

Search engines need to be considered critically. As the authors of the important Web Dragons: Inside the Myths of Search Engine Technology note, we need to complicate our understanding of the Web as “the universal key to information access” for a number of reasons, including the fact that the “rich get richer.” Websites that are linked to by more people go up in the search rankings, and as people use the search rankings to find websites, this effect compounds; it can be hard for a new site to break in.[2] Writing in 2007, after discussing why many websites receive no links and how they become hidden, the authors noted one mitigating factor: “fortunately, we do not all favour the same dragon [search engine].”[3] While this is true, Google is now the uncontested victor of the search engine wars, holding a whopping 67.1% share of the search engine market.[4] Ted Underwood, a literary scholar at the University of Illinois, has argued that:

The internal mathematics of full-text search also has more in common with data mining than with bibliographic retrieval. If I do a title search for Moby-Dick, the results are easy to scan. But in full-text search, there are often too many matches for the user to see them all. Instead, the algorithm has to sort them according to some measure of relevance. Relevance metrics are often mathematically complex; researchers don’t generally know which metric they’re using; in the case of web search, the metric may be proprietary.[5]

In short, if you Google even something seemingly specific like “Canadian history,” you will receive some 273,000,000 results. You are probably not going to read result 100, let alone result 270,000. What makes something get onto that first or second page? These are the questions that scholars must begin to ask as we adopt digital tools on a widespread scale.
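
To make the idea of a relevance metric more concrete, here is a minimal sketch in Python of one classic scoring scheme, TF-IDF (term frequency weighted by inverse document frequency). It is an illustration only: the function names and the four sample documents are invented for this example, and it should not be read as how Google or any particular database ranks results, since those metrics are far more elaborate and often proprietary.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def rank(query, documents):
    """Return document indices ordered from most to least relevant by TF-IDF."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scored = []
    for index, tokens in enumerate(tokenized):
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if counts[term]:
                tf = counts[term] / len(tokens)           # how prominent is the term here?
                idf = math.log(n_docs / doc_freq[term])   # how rare is it overall?
                score += tf * idf
        scored.append((score, index))

    # Highest score first; ties fall back to the original order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored]


# Hypothetical sample documents, invented for illustration.
documents = [
    "canadian history of the fur trade",
    "history of rome",
    "canadian history and canadian identity",
    "gardening tips",
]
print(rank("canadian history", documents))
```

Even in this toy version, the ordering depends on choices the searcher never sees: how words are split, how document length is discounted, and how rarity is weighted. A different but equally defensible formula could put a different document on the first page.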

Consider the problem of using online newspaper databases.[6] They seem orderly, advanced, and comprehensive. Instead of using a microfilm reader to navigate an old newspaper, one logs into a newspaper database through a library portal. A keyword search for a particular event, person, or cultural phenomenon brings up a list of results. While date-by-date searching is also available, it seems clunky and slow; keyword searching, however, offers something new, something potentially transformative. Each result is often broken down by date, page number, and section, and a further click-through brings you to the entire page, scanned at a decently high resolution, with search terms highlighted for convenience. The surrounding context of the page, advertisements, and the original layout are all preserved. Previously impossible or implausible research projects can now be approached, especially when they involve wide swaths of social or cultural terrain.

In the Canadian context, this is a particularly important issue. In 2002, Canada became the first country to have its two major newspapers digitized: the Toronto Star (Canada’s largest circulation newspaper) was fully searchable before even the New York Times, followed by the Globe and Mail (by some accounts the ‘paper of record’ for Canada).[7]

Research done by Ian Milligan demonstrated that, since 2002, researchers have disproportionately cited what they find online. Drawing on a collection of every history dissertation uploaded to ProQuest between 1997 and 2010, he discovered dramatic changes: between 1998 and 2010, citations to the Toronto Star increased by 991%, while citations to other newspapers saw only minor increases or even decreases. This has had two dramatic impacts. First, in a country with strong regional identities and histories, it had the effect of centralizing studies towards the metropoles and away from the peripheries – to the extent that events in smaller cities with their own regional newspapers were recounted using Toronto-based newspapers.

Second, and perhaps most importantly, search engines have a necessarily skewing effect on our research, something that is true from Google searches, to searching within electronic books, through to the Toronto Star online. The problem of poor optical character recognition (OCR) in these databases makes this a pressing issue. Something like newspaper digitization is both simple and complicated. Let us use the case of the Toronto Star as a pertinent example. Digitization was carried out from microfilm originals at a speed of roughly one million newspaper pages per month. From the microfilm, each individual page was produced as a decently high-resolution PDF document, averaging approximately 700KB. Every page was then run through OCR, producing a text file of the text found; when a user enters a search term, the query is matched against that text file. Once a match is found, the PDF is made available to the user.
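
The pipeline just described (page image, OCR text, keyword match, image returned) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the database vendor’s actual implementation: the file names, the sample text, and the `build_index` and `search` functions are all hypothetical.

```python
def build_index(ocr_pages):
    """Map each word in the OCR text to the set of page files containing it."""
    index = {}
    for page_file, text in ocr_pages.items():
        for word in text.lower().split():
            word = word.strip(".,;:!?\"'()")  # drop surrounding punctuation
            if word:
                index.setdefault(word, set()).add(page_file)
    return index


def search(index, term):
    """Return the page files whose OCR text contains the exact term."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical pages: file names and text are invented for illustration.
ocr_pages = {
    "page_0001.pdf": "Relief camp strike spreads to Regina.",
    "page_0002.pdf": "Relief camp str1ke leaders meet.",  # OCR misread "strike"
}
index = build_index(ocr_pages)
print(search(index, "strike"))  # only page_0001.pdf is found
print(search(index, "relief"))  # both pages are found
```

The user only ever sees the page image, where both headlines read “strike” perfectly well. The search, however, runs against the hidden text layer, so the second page silently drops out of the results.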

However, OCR means that some data is lost. OCR technology is developed primarily for commercial markets and users: the massive digitization and transcription of large arrays of typewritten documents, often in corporate, legal, and governmental settings. Applying this technology to historical documents is tricky, as we are taking a tool and applying it to a closely related yet far from identical task. In a comprehensive article, a team of three researchers has outlined the major problems facing OCR routines as they tackle historical documents and newspapers.[8] These include non-standard fonts (historical newspapers did not use standard typefaces); printing noise (the small errors betraying the printer’s hand on the actual paper); unequal line and word spacing; line-break hyphenation (when a word breaks across lines, which is frequent in narrow historical columns, it is lost unless the algorithm rejoins it); and the inherent data loss of medium transformation.
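
Line-break hyphenation is the easiest of these problems to demonstrate. The following Python sketch shows a keyword search missing a word that was split across two lines, and a naive repair that rejoins the fragments. The sample text and the `rejoin_hyphenated` function are hypothetical, and the repair is deliberately simplistic: it would also wrongly fuse a genuine compound such as “well-known” if that compound happened to break at the hyphen.

```python
import re


def rejoin_hyphenated(text):
    """Join word fragments split by a hyphen at the end of a line."""
    return re.sub(r"(\w+)-[ \t]*\n[ \t]*(\w+)", r"\1\2", text)


# Hypothetical OCR output from a narrow newspaper column.
raw = "The unem-\nployed marched\non Ottawa."

print("unemployed" in raw.split())                     # False: the word is split in two
print("unemployed" in rejoin_hyphenated(raw).split())  # True: the fragments are rejoined
```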

We use this extended example for two main reasons. First, we want to encourage readers to think about how Big Data (in the form of newspaper and journal databases) already structures their research activities. Yes, there is something distinctive about explicitly using computational methods to tackle new or existing research questions, but computation now underpins all facets of our scholarly experience. Second, we want to illustrate some of the pitfalls that all scholars now have to look out for. We need to consider the underlying structure of databases: how they were constructed, what assumptions went into them, and how the data is formatted. As historians working with digitized primary sources, we need to always consider the quality of the text: how was it constructed? Was it double-blind entered, like the Old Bailey Online? Or was it produced by commercial OCR algorithms, intended for law firms and applied to a task for which they were not designed? We need to cite the format that we use: if we are citing a newspaper article from a LexisNexis database search, it should be cited as such; a book from Google Books, viewed in snippet form, needs to be treated and documented differently than the full e-book or the physical copy itself. Citing the digital matters, both for intellectual honesty and for recognizing the algorithms that underlie these databases.

In any event, almost all historians (if not all) are in some ways digital. We have amazing amounts of information at our fingertips, via Google and JSTOR, and that information is delivered to us through a series of complicated mathematical and linguistic algorithms. With that in mind, going down the digital path should not seem daunting, but rather an opportunity to do what we do better. In the next section, we discuss the toolkit that historians can build to help them become explicitly digital historians, and bring some of the power of Big Data into their own hands.

[1] Roger C. Schonfeld and Jennifer Rutner, “Supporting the Changing Research Practices of Historians,” ITHAKA S+R, 7 December 2012, http://www.sr.ithaka.org/research-publications/supporting-changing-research-practices-historians, accessed 30 July 2013.

[2] Ian H. Witten, Marco Gori, Teresa Numerico, Web Dragons: Inside the Myths of Search Engine Technology (San Francisco: Morgan Kaufmann, 2007), 182-183.

[3] Ibid., 185.

[4] Jennifer Slegg, “Google’s Market Share Drops as Bing Passes 17%,” Search Engine Watch, 21 May 2013, http://searchenginewatch.com/article/2269591/Googles-Search-Market-Share-Drops-as-Bing-Passes-17, accessed 31 July 2013.

[5] Ted Underwood, “Theorizing Research Practices We Forgot to Theorize Twenty Years Ago,” Representations, Vol. 127, No. 1 (Summer 2014), 65.

[6] This problem has been addressed in depth by one of the co-authors, in Ian Milligan, “Illusionary Order: Online Databases, Optical Character Recognition, and Canadian History, 1997-2010,” Canadian Historical Review, Vol. 94, No. 4 (December 2013).

[7] For background on the project, Cold North Wind has a fairly detailed website. Please visit “About Paper of Record,” https://paperofrecord.hypernet.ca/default.asp (accessed 21 June 2012).

[8] Maya R. Gupta, Nathaniel P. Jacobson, Eric K. Garcia, “OCR Binarization and Image Pre-Processing for Searching Historical Documents,” Pattern Recognition: The Journal of the Pattern Recognition Society, 40 (2007), 389.

Source: http://www.themacroscope.org/?page_id=617