An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Automatic Retrieval of Data


First, however, we need data. How can historians find Big Data repositories to work on? As with everything, this depends on your particular needs. We will use a few examples that scale in difficulty. At its easiest, the document sits on one or two webpages and can be downloaded with a few clicks of your mouse, as in the case of the diary of John Adams, the second American president. With a little more effort, we can automatically access large collections from the exhaustive Internet Archive. And with some added effort and a few more lines of commands, we can access even larger collections from national institutions, including the Library of Congress and Library and Archives Canada. Join us as we bring Big Data to your home computer.

Have you ever sat at your computer, looking at a long list of records held at a place like Library and Archives Canada, the British Library, or the Library of Congress, right-clicking and downloading each one, record by painstaking record? Or have you pulled up a query in the Internet Archive that brings up hundreds of results? Perhaps not, but chances are that you have Googled for historical sources relating to your research interests, used online finding aids, and perhaps wondered how to access them in a quicker, more straightforward way. There are tools to ease this sort of work. There is a bit of a learning curve, in that the first download might actually take longer than doing it manually, but the savings are cumulative: the next collection will download far more quickly, and from then on you save time with every query.

One of the most powerful ways for a historian to download data is the free and open-source program wget, which is available for Linux, Mac, and Windows. It lets you set up some rules and then downloads data automatically according to them. If you ever find yourself staring at a list, right-clicking and downloading each individual file over and over, fear not: this is exactly the situation that wget was designed for.

Unlike many other software programs that you may be familiar with, wget runs on a command-line interface. If you were a computer user in the 1980s or early 1990s, this will be familiar: MS-DOS was a command-line interface. If not, however, this may seem somewhat foreign. Most users today interact with computer systems through a graphical user interface, which lets you navigate files and programs through images (icons, pictures, and rendered text). Several of the tools discussed in this book use the command line. While this has a bit of a learning curve, it offers a degree of precision and, eventually, speed that compensates for the initial difficulty.

In this section, we want to show you how to install wget on your system using the command line. While we do not provide this level of detail for every program in this handbook, we think that this can help demystify things. Playing on the command line is not just for ‘hackers’: with some introduction, you can begin functioning there as well. Let’s begin. Because the steps differ from one operating system to another, you will see separate instructions below. There is also another guide to using the command line at the Programming Historian, at http://programminghistorian.org/lessons/intro-to-bash.

Wget can be installed relatively easily on your system. Most Linux distributions include it by default. Mac users have a slightly more complicated procedure (if you are a Windows user, please skip ahead to the paragraph beginning ‘For Windows users’). From the App Store or from the Apple website itself, install ‘Xcode.’ Once it has installed, install the ‘Command Line Tools’ kit from the ‘Preferences’ tab in the program. With that complete, a popular package manager named Homebrew can help you install wget in a few lines of code. Open your ‘Terminal’ window, which is by default found in the ‘Utilities’ folder within your ‘Applications,’ and type the following to install Homebrew:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

It might ask you for your password. Don’t worry, this is a normal stage in installing software (you often have to type in your password when installing software in your normal graphical operating system). The installer asks for administrator rights itself when it needs them, so run the command as written rather than putting sudo at the start of it. Once it finishes, check that Homebrew is working properly by typing:

brew doctor

Then install wget with the following command:

brew install wget

Homebrew will download wget and install it on your system.
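One way to confirm that the installation worked on a Mac or Linux system is to ask the shell whether it can find wget. The following is a minimal sketch rather than a required step; the variable name wget_status is simply one we made up for illustration.

```shell
# Ask the shell whether a program called wget is on the search path.
if command -v wget >/dev/null 2>&1; then
    wget_status="installed"
else
    wget_status="missing"
fi

# Report the result in plain language.
echo "wget is $wget_status"
```

If the message reports that wget is missing, rerun the install command and read any error messages it prints.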

For Windows users, the easiest way is to download wget for Windows. Simply save the file (wget.exe) to your C:\Windows directory so that you can run it from anywhere on your system.[1] Then it is as easy as opening up your command line (which is ‘cmd.exe’ and can be found by searching for that term through your Start Menu or finding it under ‘Accessories’). If you have saved wget.exe in the C:\Windows directory, typing wget at the prompt will work.

You can open the command line within any directory in Windows by holding the shift key and right-clicking in the folder. A contextual menu box will open. Select ‘open command window here’. This can save you some hassle, as it is usually easier to navigate your files and folders using Windows Explorer than by typing the change directory command, cd.
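If you do end up moving around by typing, only a handful of commands are needed. The sketch below is written for the Mac or Linux Terminal, and the folder name wget-practice is a made-up example. In the Windows command line, mkdir and cd behave in much the same way, although mkdir has no -p option and pwd is not available there.

```shell
# Create a practice folder (the -p option avoids an error if it already exists).
mkdir -p wget-practice

# Move into the folder and print its full path.
cd wget-practice
pwd

# Move back up one level to the folder you started in.
cd ..
```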

All together now (Mac, Windows, and Linux): before we turn back to the main body of the lesson, since we have gone so far as to install wget, let’s quickly use it. For more instructions, you might want to check out Ian Milligan’s “Automated Downloading with Wget” lesson at the Programming Historian at http://programminghistorian.org/lessons/automated-downloading-with-wget. But in short, the command is:

wget <any modifiers> <the site or page you want to download>

Let us try it on the example of the Macroscope page pertaining to this lesson. Type:

wget "http://www.themacroscope.org/?page_id=330"

When you run this command, wget saves the page in your current directory. Open the downloaded file with your web browser (using the File -> Open command in your browser), and you will see a copy of that page! We can do more advanced things with this, such as mirroring an entire website or downloading entire arrays of documents. We will return to wget later.
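As a taste of those more advanced uses, here is a hedged sketch that wraps two common wget invocations in small shell functions for the Mac or Linux Terminal. The function names save_page and mirror_politely are our own inventions, as are the particular wait time and speed limit; the options themselves (-O, -r, --no-parent, -w, and --limit-rate) are standard wget options.

```shell
# Save a single page under a file name of your choosing.
# Usage: save_page <url> <output file>
save_page() {
    wget -O "$2" "$1"
}

# Download a page and the pages linked beneath it, gently.
# -r            follow links recursively
# --no-parent   never climb above the starting folder
# -w 2          wait two seconds between requests
# --limit-rate  cap the download speed so the server is not overloaded
# Usage: mirror_politely <url>
mirror_politely() {
    wget -r --no-parent -w 2 --limit-rate=20k "$1"
}
```

For example, save_page "http://www.themacroscope.org/?page_id=330" macroscope.html would store the lesson page as macroscope.html. Quoting the URL keeps the shell from trying to interpret the question mark.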



[1] For example, at “WGET for Windows (Win32),” http://users.ugent.be/~bpuype/wget/, accessed 31 July 2013.


Source: http://www.themacroscope.org/?page_id=621