An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Clustering Data to Find Powerful Patterns with Overview


Journalists have recently taken to big data and data visualization in response to the massive data dumps occasioned by such things as WikiLeaks or the Snowden revelations. The Knight Foundation, for instance, has been promoting the development of new tools to help journalists and communities deal with the deluge. Past winners of the Knight News Challenge have included crowd-mapping applications such as Ushahidi (ushahidi.com) and document-analysis platforms such as DocumentCloud (documentcloud.org). Indeed, some historians have used these applications in their own work, and others could usefully repurpose these projects to their own ends.[1]

A recent project to emerge from the Knight News Challenge is ‘Overview’. Overview has some affinities with topic modeling, discussed in the following chapter, and so we suggest it here as a more user-friendly approach to exploring themes within your data.[2] Overview can be installed for free on your own computer.[3] However, if your data permits it (i.e. there are no privacy concerns), you can go to overviewproject.org, upload your materials to their servers, and begin exploring.

Overview explores word patterns in your text using a rather different process than topic modeling. It compares the occurrence of words in every pair of documents. ‘If a word appears twice in one document, it’s counted twice… we multiply the frequencies of corresponding words, then add up the results’.[4] (The technical phrase for the word weighting involved is ‘term frequency-inverse document frequency’.) Documents are then grouped together by a clustering algorithm based on these similarity scores. The result is a categorization of your documents into folders, sub-folders, and sub-sub-folders. Let’s say that we are interested in the ways historical sites and monuments are recorded in the city of Toronto. We upload the full text of these historical plaques (614 texts) into Overview (figure 3.8).
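The ‘multiply the frequencies of corresponding words, then add up the results’ step can be illustrated with a few lines of Python. This is a minimal sketch of the general technique and not Overview’s own code: the three sample texts are invented for illustration, the function names (`tokenize`, `tfidf_vectors`, `similarity`) are hypothetical, and the particular weighting formula (raw count times the logarithm of the inverse document frequency, scaled to unit length) is one common variant that we assume here rather than the one Overview necessarily uses.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?'\"()"

def tokenize(text):
    """Lowercase a text and split it into words, dropping edge punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]

def tfidf_vectors(docs):
    """Turn each document into a dict of word -> TF-IDF weight (unit length)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # a word appearing twice is counted twice
        vec = {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
        length = math.sqrt(sum(v * v for v in vec.values()))
        if length > 0:
            vec = {w: v / length for w, v in vec.items()}
        vectors.append(vec)
    return vectors

def similarity(a, b):
    """Multiply the weights of corresponding words, then add up the results."""
    return sum(weight * b[word] for word, weight in a.items() if word in b)

# Invented sample texts, standing in for plaque transcriptions.
plaques = [
    "The first church built on this street served the town.",
    "The first school on this street was built beside the church.",
    "The glacier transported clay and ice to the river valley years ago.",
]

vectors = tfidf_vectors(plaques)
sim_church_school = similarity(vectors[0], vectors[1])
sim_church_glacier = similarity(vectors[0], vectors[2])
print("church/school:", round(sim_church_school, 3))
print("church/glacier:", round(sim_church_glacier, 3))
```

The two texts about buildings share several weighted words and so score higher than the pair that shares only ‘the’, which appears in every text and therefore carries no weight. A clustering algorithm would then group documents on the basis of scores like these.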

[insert Figure 3.8 The Overview interface sorting the text of historical plaques from Toronto]

At the broadest level of similarity, Overview divides the historical plaques into the following groups:

‘church, school, building, toronto, canada, street, first, house, canadian, college’ (545 plaques)

‘road, john_graves, humber, graves_simcoe, lake, river, trail, plant’ (41 plaques)

‘community’ with ‘italian, north_york, lansing, store, shepard, dempsey, sheppard_avenue’ (13 plaques)

‘years’ with ‘years_ago, glacier, ice, temperance, transported, found, clay, excavation’ (11 plaques)

This grouping is suggestive. There appears to be a large division between what might be termed ‘architectural history’ (first school, first church, first street) and ‘social history’. Why does this division exist? That would be an interesting question to explore.

Overview is great for visualizing patterns of similar word use across sets of documents. Once you’ve examined those patterns and assigned tags to them, Overview can export your texts with those descriptive labels as a CSV file, that is, as a table. Thus, you could use a spreadsheet program to create bar charts of the count of documents with an ‘architecture’ tag or a ‘union history’ tag, or ‘children’, ‘women’, ‘agriculture’, etc. We might wonder how plaques concerned with ‘children’, ‘women’, ‘agriculture’, ‘industry’, etc. might be grouped, so we could use Overview’s search function to identify these plaques by searching for a word or phrase and applying that word or phrase as a tag to everything that is found. One could then visually explore the way various tags correspond with particular folders of similar documents.[5]
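If you prefer a script to a spreadsheet, counting documents per tag takes only a few lines of Python. The sketch below rests on assumptions: we suppose the exported table has a column named `tags` in which each document’s tags are separated by commas, which may not match the layout of your own export, and the three sample rows are invented for illustration. Adjust the hypothetical `tag_column` parameter to whatever your file actually uses.

```python
import csv
import io
from collections import Counter

# Invented sample standing in for an exported table; a real export
# may use different column names (the 'tags' column is an assumption).
SAMPLE_EXPORT = """id,title,tags
1,First Methodist Church,architecture
2,Humber River Trail,"agriculture,industry"
3,Temperance Hall,"architecture,women"
"""

def count_tags(csv_file, tag_column="tags"):
    """Count how many documents carry each tag."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        for tag in row[tag_column].split(","):
            tag = tag.strip()
            if tag:  # skip documents with no tags
                counts[tag] += 1
    return counts

tag_counts = count_tags(io.StringIO(SAMPLE_EXPORT))

# A rough text-only bar chart, most frequent tag first.
for tag, n in tag_counts.most_common():
    print(f"{tag:15s} {'#' * n} {n}")
```

To run this against a real file, replace `io.StringIO(SAMPLE_EXPORT)` with an opened file, for example `open("export.csv", newline="", encoding="utf-8")`, where `export.csv` is a placeholder for your own filename.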

Using tags in this fashion, and comparing them with the visual structure presented by Overview, is a dialogue between close and distant reading. Overview thus does what it set out to do: it provides a quick and relatively painless way to get a broad sense of what is going on within your documents.


[1] See Shawn Graham, Guy Massie, and Nadine Feuerherm, “The HeritageCrowd Project: A Case Study in Crowdsourcing Public History,” in Writing History in the Digital Age, eds. Jack Dougherty and Kristen Nawrotzki (Ann Arbor: University of Michigan Press, 2013), available at http://dx.doi.org/10.3998/dh.12230987.0001.001.

[2] Jonathan Stray has written an excellent piece on using Overview as part of a ‘data journalism’ workflow, many points of which are appropriate to the historian. See Jonathan Stray, “You Got the Documents. Now What? – Learning – Source: An OpenNews Project”, Source, 14 March 2014. https://source.opennews.org/en-US/learning/you-got-documents-now-what/

[3] The documentation for Overview may be found at http://overview.ap.org/; the software itself can be downloaded at https://github.com/overview/overview-server/wiki/Installing-and-Running-Overview

[4] http://overview.ap.org/blog/2013/04/how-overview-can-organize-thousands-of-documents-for-a-reporter/

[5] As in this example: http://overview.ap.org/blog/2013/07/comparing-text-to-data-by-importing-tags/


Source: http://www.themacroscope.org/?page_id=641