|
An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Basic Scraping: Getting Your Data

Previous section: How To Become A Programming Historian, a Gentle Introduction

There are a number of ways to grab existing big historical data. Sometimes, if we are lucky, the data that we wish to explore will already be provided to us in a format that we can use. Consider this fruitful example of a scholar sharing his research. Peter Holdsworth, an MA student at Carleton University, uploaded all of his raw data as Excel spreadsheets to the research data sharing website Figshare.[1] Figshare occupies an important niche in the ecosystem of scholarly publishing, especially as historians begin to grapple with larger and larger datasets that cannot possibly be described or effectively communicated in our traditional print journals. Figshare provides a unique digital identifier for every file that a researcher chooses to share on the site, which can then be cited. Here, for instance, is Holdsworth’s network dataset from his research:

Holdsworth, Peter (2013): Holdsworth 1902 Organizational Network. figshare.
http://dx.doi.org/10.6084/m9.figshare.727770

Figshare also provides an ‘application programming interface’ (API), which allows other websites or software to access the files held in Figshare (one could write a program that visualizes the location of all data concerned with a particular geographic region, for instance). Using APIs is a kind of ‘scraping’ in the sense that an API provides a formal protocol for sharing, reusing, and remixing data automatically. Using APIs is not overly difficult (the Programming Historian explains how to use the API of the Zotero reference and research environment), but for now we want to explore ways of obtaining information automatically using free tools.
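If you are curious what using an API actually looks like, the short Python sketch below asks Figshare for the metadata it holds about Holdsworth’s dataset and prints a few fields. This is only an illustration, not the Programming Historian lesson mentioned above: the endpoint path and the JSON field names are our assumptions about Figshare’s public API at the time of writing, so check Figshare’s own API documentation before relying on them.

# A minimal sketch: fetch the metadata Figshare holds for one shared dataset.
# Assumes Figshare's public API serves article metadata as JSON at the URL below;
# check the current Figshare API documentation before relying on this path.
import json
import urllib.request

ARTICLE_ID = "727770"  # the numeric tail of Holdsworth's DOI, 10.6084/m9.figshare.727770
url = "https://api.figshare.com/v2/articles/" + ARTICLE_ID

with urllib.request.urlopen(url) as response:
    record = json.load(response)

# Print a few fields a historian might care about.
print(record.get("title"))
print(record.get("doi"))
for f in record.get("files", []):
    print(f.get("name"), f.get("download_url"))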

Wget is a useful tool when the information we want is well formed and well organized as a series of pages in a static website (see sidebar above). Many sites are arranged in such a way that we can retrieve the information and use it for our own analysis. Imagine visiting the website of a historical society that has transcribed a series of newspaper articles onto a single webpage. The page is well laid out: every date is clearly offset from the text, the text is nicely arranged for reading… we might want to scrape that information into a spreadsheet file (comma or tab separated values). We could ‘save as’ the page to our desktop and manually grab everything, cutting and pasting. That would be an enormous waste of effort. Since the browser already knows how to display the information for us to read, we can use those same tags (the HTML markup for headers, tables, and so on) to get the computer to put the information into a table for us automatically. The Programming Historian has a number of lessons explaining how to do this in Python. Here, we describe a way of using three different pieces of software to achieve the same thing. While we would certainly recommend putting ‘learn to do this in Python’ on your to-do list, when working with students or colleagues it is often easier to point to useful browser-based tools that can do the job.

The first tool that we sometimes use is called ‘Outwit Hub’, available at http://www.outwit.com/. We will revisit it in Chapter Four when we discuss topic modelling, and show how you can use it to grab specific data from an American primary source in order to visualize it. It is not freeware, but it does come in a basic, free version that is quite useful, if occasionally a bit limited.

Outwit Hub, like many other programs that find information on the Web, works by parsing the ‘soup’ of HTML markup tags to download elements of interest. HTML, the Hypertext Markup Language, dictates how a webpage should appear. Think of when you edit, or ‘mark up’, a document on paper: when something should be in italics, maybe you circle it and write ‘italics’; if something should be emphasized, you underline it. This is what a markup language does. An HTML document contains two things: the content itself, and the instructions for how that content should be displayed. So you might have a document like:

<HTML>
      <body>
            <h1>The diary of Professor Smith.</h1>
            <h2>13 December 2013</h2>
                  <em>This is the diary entry.</em>
            <h2>14 December 2014</h2>
                  <em>This is a second entry.</em>
      </body>
</HTML>

This document would appear like so:


The diary of Professor Smith.

13 December 2013

This is the diary entry.

14 December 2014

This is a second entry.

So in this case, while the markup isn’t visible to the naked eye, the dates are all encapsulated within the <h2> (or header, level 2) tags, the title is within the <h1> tag, and the entries are all within the emphasis, or <em>, tags (which browsers typically render as italics). So if we wanted to quickly get a list of the dates, we would point Outwit Hub at the <h2> tags, to get the information between <h2> and </h2>.
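To make that concrete, here is a minimal Python sketch that does by hand what we are about to ask Outwit Hub to do: pull out everything sitting between <h2> and </h2> in Professor Smith’s diary. You do not need it to follow along with Outwit Hub; it simply shows the same logic in code, and it assumes you have installed the third-party BeautifulSoup library (pip install beautifulsoup4).

# A minimal sketch: list the dates held inside <h2> tags in the diary page above.
# Requires the third-party BeautifulSoup library: pip install beautifulsoup4
from bs4 import BeautifulSoup

page = """
<html>
  <body>
    <h1>The diary of Professor Smith.</h1>
    <h2>13 December 2013</h2>
      <em>This is the diary entry.</em>
    <h2>14 December 2014</h2>
      <em>This is a second entry.</em>
  </body>
</html>
"""

soup = BeautifulSoup(page, "html.parser")
dates = [h2.get_text() for h2 in soup.find_all("h2")]
print(dates)  # ['13 December 2013', '14 December 2014']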

Download and install the free version. Outwit Hub looks rather like a web browser; that’s because it contains one within it. To use it, simply put the address of the website you are interested in into the address bar. Outwit will load the page. You then click on the ‘source’ button in the menu on the left side of the screen. This shows you the underlying markup for the page. Scroll down through this material, taking note of the tags that embrace the kinds of information that you are interested in, for instance,

<h3> {the date of the entry} </h3>.

Here is some example data that you might see:

<h3>15 January 1945</h3>
<p>It was a cold day in New York City today. I wish I had a coat.</p>
<h3>16 January 1945</h3>
<p>I am still cold. I really wish I had a coat.</p>

In the following example, we will use Outwit Hub to extract the dates (in this case, 15 and 16 January 1945) as well as the entries, relating to our poor friend being cold in New York City. We could then make a spreadsheet out of it: one column with the dates, a second with the entries themselves. This gives some structure to otherwise seemingly unstructured data, which will let computational tools interact with it. The above HTML code should seem familiar! It is a bit different from the first example we gave, but similar enough to begin to parse.
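For comparison, here is the same step expressed as a short Python sketch rather than as an Outwit Hub scraper: it pairs each <h3> date with the <p> entry that follows it and writes the result to a two-column CSV file. Again, this is only an illustration of what the tool does behind the scenes (it assumes BeautifulSoup is installed, as above); you can skip it and carry on with Outwit Hub below.

# A minimal sketch: turn the diary fragment above into a two-column CSV file.
# Requires BeautifulSoup (pip install beautifulsoup4); csv is in the standard library.
import csv
from bs4 import BeautifulSoup

page = """
<h3>15 January 1945</h3>
<p>It was a cold day in New York City today. I wish I had a coat.</p>
<h3>16 January 1945</h3>
<p>I am still cold. I really wish I had a coat.</p>
"""

soup = BeautifulSoup(page, "html.parser")
rows = []
for h3 in soup.find_all("h3"):
    entry = h3.find_next("p")  # the paragraph immediately following each date
    rows.append([h3.get_text(), entry.get_text()])

with open("diary.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "entry"])  # header row
    writer.writerows(rows)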

In Outwit Hub, then, click on the ‘scrapers’ item in the menu on the left side of the screen. In the free version, you are limited in the number of scrapers you may design and in the number of rows of material they will scrape (the maximum is one hundred). A new subwindow opens on the screen. Note some of the options here. One is ‘Apply if page url contains’; it should already be filled with the URL of the site you went to. If not, paste it in here.

Below that is a table that allows you to construct your scraper. Double click in the first blank field under ‘Marker Before’. Here you put the first part of the tag that embraces the information you want, e.g. <h3>, the date tag in our example above. Double click in the next column over (under ‘Marker After’) and put in the closing tag, e.g. </h3>. If you then clicked ‘execute’, the bottom part of the screen would fill with a single-column table listing the dates. In your scraper, you could add a second criterion (say, everything embraced by <p> and </p>, the paragraph tags), and when you hit the ‘execute’ button your result would contain both the date and the text for that date as a nicely formed table. You then hit the ‘export’ button at the bottom of the screen to save your ‘catch’ (as Outwit Hub terms it) as an .xls or .csv file or other output. Our diary has been painlessly exported! We will return to Outwit Hub in Chapter Four in more detail. It requires some tinkering because every website is different, but in general, if there is some structure in the website, Outwit can grab it for you.

What is a CSV file, you may be asking? You have probably heard of XLS files: that is the proprietary format that a program like Microsoft Excel uses to store a spreadsheet. A CSV file is a ‘comma-separated values’ file, and it is just that: a plain-text file in which each line is a row and the values within that row are separated by commas. Consider the following spreadsheet of data:

31-Mar-14    I am looking forward to catching up with my colleagues Scott and Shawn.
01-Apr-14    It was April Fools’ Day! My good colleagues fooled me good.


In CSV format it would be:

31-Mar-14, I am looking forward to catching up with my colleagues Scott and Shawn.
01-Apr-14, It was April Fools’ Day! My good colleagues fooled me good.

It is plain text, separated by commas, which lets a computer program read it and know where fields should be broken.
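To see how a program knows where those breaks fall, here is a minimal Python sketch that reads a CSV file like the one above with the built-in csv module (the file name entries.csv is just an assumption for the example). The module splits each line on its commas, and it also handles the trickier case where a value itself contains a comma, so long as that value is wrapped in quotation marks.

# A minimal sketch: read a small CSV file and split each line into its fields.
# Assumes the two lines above have been saved in a file called entries.csv.
import csv

with open("entries.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        date, entry = row[0], row[1]
        print(date, "->", entry.strip())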

Now, Outwit Hub can do much more than this simple scrape, but this, in essence, is what the program does. You ‘train’ it to recognize the data you are interested in, a step that we will discuss below, and then you set it to grab that data. There are far more sophisticated things you could set it to do. You could give it a particular pattern to look for over a number of sites, or within x number of links of a site, or a particular kind of data file; you can set it to guess automatically at the kind of data a site might have; you can adjust the rate at which it asks websites for data (you do not want to overtax someone else’s server). The full documentation of what the program can do is provided on the website, and through numerous video tutorials provided by users.

A second option is still in ‘beta’ (undergoing a public release with the expectation of finding bugs or issues with the program) but as of writing seems to hold much promise. Called ‘import.io’, it is similarly a stand-alone piece of software (with related plugins for the Chrome browser) that allows sophisticated crawling and scraping of data from websites and APIs. It allows not just the scraping of data, but the ability to mix data together from various sources and APIs, to share that data via a unique URI (uniform resource identifier), and to visualize real-time streams of data. The Chrome plugin will alert the user if a scraper for a particular website already exists (install it and then visit the Guardian newspaper online, for example). Import.io works as a stand-alone browser. Like Outwit Hub, you have to train it (or you can have it guess, with mixed results) on the kind of data that you are looking for in a page. The browser provides much contextual help and step-by-step instructions to guide the user through the process of building and executing a scraper, so this might be a good option for researchers new to scraping.[2] Import.io is currently free, but it may develop ‘premium’ features. Its creators write that they will always have a free tool available for scraping and exporting data.[3]

Finally, the Chrome browser web store has a number of plugins for scraping websites that require varying degrees of fluency with a query language called ‘XPath’. The easiest to use is simply called ‘Scraper’.[4] Scraper, once installed, provides a ‘Scrape Similar’ contextual menu item (install it, then right-click on any piece of text on a webpage to bring up the contextual menu). When you click this, a window opens that shows you the results of what Scraper grabbed, as well as an XPath query window showing the pattern of data that it searched for. XPath is a query language for HTML and XML documents; in essence, a query is a list of the nested tags that describe the data you are looking for. Chrome has a built-in ‘inspect element’ tool, again available from the contextual menu. When you click on this, an inspector window opens at the bottom of the browser showing the underlying source code. You can then right-click on the element in the source and select ‘copy XPath’, and paste the result into the XPath query window. Scraper is integrated with Google Drive, so anything you collect can be sent instantly to a Google spreadsheet (and thence exported to a CSV or Excel file).
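If you would like to see what an XPath query looks like outside the browser, here is a minimal Python sketch using the third-party lxml library (pip install lxml). The query //h3/text() means ‘the text inside every <h3> tag, wherever it appears in the document’, which is exactly the kind of pattern that Scraper builds for you when you choose ‘Scrape Similar’.

# A minimal sketch: run XPath queries against the 1945 diary fragment from earlier.
# Requires the third-party lxml library: pip install lxml
from lxml import html

page = """
<h3>15 January 1945</h3>
<p>It was a cold day in New York City today. I wish I had a coat.</p>
<h3>16 January 1945</h3>
<p>I am still cold. I really wish I had a coat.</p>
"""

tree = html.fromstring(page)
dates = tree.xpath("//h3/text()")    # the text of every <h3> in the document
entries = tree.xpath("//p/text()")   # likewise for every <p>
print(dates)
print(entries)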

Splitting a CSV file into separate text files

Sometimes it might be necessary to break one of these csv files into separate text files, in order to do the next stage of your analysis. For instance, imagine that you had scraped John Adams’ diaries into a single csv file, but what you really wanted to know was the way his views on governance changed over time. This might be a question for which topic modeling (see chapter four) could be well suited; many topic modeling tools require each document (here, a diary entry) to be in its own text file.

How could we do this? One route to explore, when we don’t know exactly how to achieve what we’re trying to do, is to examine the question and answer site StackOverflow, available at http://stackoverflow.com/. It is a repository of questions posed and answered by users, and is a great place to learn the basics of programming. Chances are that if you have run into a problem, somebody else has run into it as well and discussed it on this site. There is no need to reinvent the wheel when it comes to digital work.

We went to StackOverflow and typed our question into its search box: “how do I break an excel spreadsheet into single files for each row?” We found that a user named Eke had asked a similar question: “how do I write each Excel row to new .txt file with ColumnA as file name?” This captures what we’re trying to do, so let’s look at that discussion.

If you examine his question on StackOverflow, you will see that Eke has precisely described what he is trying to achieve, cited other threads on the forum that seem relevant to his question (even if they are not exactly what he wants), and given examples of his data and his ideal output. Go visit the discussion at http://stackoverflow.com/questions/15554099/write-each-excel-row-to-new-txt-file-with-columna-as-file-name/15665756#15665756 or, for those who want to avoid long links, try http://bit.ly/1yweKHz. This is a very good example of how to get the most out of this (or indeed any) forum. We provide this example so you can see how we learn to tackle complicated problems as well!

If we examine the thread, we see a lot of back and forth with various users as suggestions are made and Eke reports back his results. A good citizen of the site, Eke also posts his complete code once he finds the solution that works best for him. You do not need to reinvent the wheel: the code is already here!

Let’s copy and paste that code. To use this script, we have to make a new macro in Excel, and then edit it. In Excel, find the Macros button for your version (it may be hidden under the ‘Tools’ menu in more recent versions of the spreadsheet suite), click ‘view macros’, and then click ‘create’. (Sometimes the ‘create’ button is grayed out. If this is true for you, click ‘record new macro’ instead, hit start and then stop, and the new macro will appear on the list. You can then click ‘edit’.) A new window opens with ‘Microsoft Visual Basic’ in the title. You can now copy and paste Eke’s code into this window. You don’t need to hit save, as anything you do in this window will save automatically.

Go back over to your normal Excel spreadsheet that contains your table of data. Click on Macros, select the one now listed as ‘SaveRowsasTXT’, and the macro will automatically copy and paste each row into a new worksheet, save it as a txt file, close that new worksheet, and iterate down to the next row. If you get an out-of-range error, make sure your worksheet in Excel is named ‘Sheet1’, so that this line in the script is correct:

set wsSource = ThisWorkbook.Worksheets("Sheet1")

Note the line that reads

filePath = "C:\Users\Administrator\Documents\TEST\"

You will want to change everything between the quotation marks to point to the location where you want your separate files stored (note that Mac users will use a slightly different file path).[5]
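If Excel is not to hand, the same job can be done with a short Python script. The sketch below is not Eke’s macro: it is our own illustration, and it assumes a file called diary.csv with the date in the first column and the entry text in the second, writing each row into its own text file named after the date.

# A minimal sketch: write each row of a two-column CSV into its own text file,
# using the first column (the date) as the file name. Not Eke's macro; just an
# illustration, assuming diary.csv holds a date column and an entry column.
import csv
import os

os.makedirs("entries", exist_ok=True)  # folder to hold the individual text files

with open("diary.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    # If your file has a header row, skip it first with: next(reader)
    for row in reader:
        date, entry = row[0].strip(), row[1].strip()
        safe_name = date.replace("/", "-").replace(" ", "_")  # avoid awkward characters in file names
        with open(os.path.join("entries", safe_name + ".txt"), "w", encoding="utf-8") as out:
            out.write(entry)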

Whichever tool you use, or whether you build one from scratch, being able to scrape websites for information is a key skill for digital historical research. When using free software, be certain to read the terms of service and understand the implications of what you are doing.

Next section: Normalizing and Tokenizing Data


[1] Holdsworth, Peter, ‘Author page’, Figshare, http://figshare.com/authors/Peter_Holdsworth/402385.

[2] A useful tutorial by Mike Klaczynski scrapes movie titles from a database: https://www.tableausoftware.com/public/blog/2013/08/importio-2088. Keep in mind, however, that some of these steps (or perhaps all) could become obsolete if Import.io changes materially once it is formally launched (that is, no longer in ‘beta’).

[3] See http://import.io/pricing for more information.

[4] A tutorial by Jens Finnas gives detailed instructions at http://dataist.wordpress.com/2012/10/12/get-started-with-screenscraping-using-google-chromes-scraper-extension/. Also note that Finnas maintains a useful list of resources for ‘Data Journalism’ at http://datajournalistik.se/ (in Swedish).

[5] We describe in later chapters the statistical programming language R, and how to accomplish various text-mining tasks within it. Once you become familiar with it, you will see that you can accomplish many of these kinds of ‘helper’ tasks within R itself, which is advantageous since you can do all of your analyses within the one environment. For instance, Ben Marwick has written a ‘script’ for R (in essence, a recipe of commands) to separate the rows in a single Excel file into separate text files, which is available at https://gist.github.com/benmarwick/9278490. We thank Ben for writing and contributing this (and several other) pieces of R code. Eke’s code is also available at http://themacroscope.org/2.0/code/. Try it yourself on John Adams’ Diary (which you can download at http://themacroscope.org/2.0/datafiles). Incidentally, having the CSV broken into separate files arranged in chronological order (so, sort them by date written before running the macro) means that when the zipped folder of these files is uploaded to Voyant Tools, you can explore trends over time, reading from left (older) to right (newer), in the basic Voyant Tools interface.


Source: http://www.themacroscope.org/?page_id=625