An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Quickly Extracting Tables from PDFs

Much of the historical data that we find online and would like to examine comes in the cumbersome form of the PDF, the ‘portable document format’. Governments especially love the PDF because it can be generated quickly in response to freedom of information requests, and because it preserves the look and layout of the original paper documents. Sometimes a PDF is little more than an image; the text shown is just a pattern of dark and light dots. Other times, there is a hidden layer of machine-readable text that one can select with the mouse by clicking, dragging, and copying. When we are dealing with tens, hundreds, or thousands of pages of PDFs, copying by hand quickly stops being a feasible workflow. Journalists have this same problem, and have developed many tools that the historian may wish to incorporate into her own workflow. Recently, the ‘data journalist’ Jonathan Stray has written about the various free and paid tools that can be wrangled to extract meaningful data from thousands of PDFs at a time.[1] One in particular that Stray mentions is called ‘Tabula’, which can be used to extract tables of information from PDFs, such as may be found in census documents.

Tabula is open source and runs on all the major platforms. You simply download it from http://tabula.nerdpower.org/, install it, and then double-click on the icon; it loads up inside your browser at address[2]. If for some reason it does not appear, try typing that address into the address bar directly. Once Tabula is running, you load your PDF into it. When the PDF appears, draw boxes around the tables that you are interested in grabbing. Tabula will then extract each table cleanly, allowing you to download it as a CSV or tab-separated file, or paste it directly into something else.
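Once you have an export in hand, you may want to read it programmatically rather than in a spreadsheet. The following is a minimal sketch using only Python's standard `csv` module; the function name `read_export` and the sample rows are invented for illustration and do not come from Tabula or from any real table. It shows that the same table parses identically whether you chose the CSV or the tab-separated download.

```python
import csv
import io


def read_export(text, delimiter=","):
    """Parse exported table text into a list of dicts keyed by the header row.

    `read_export` is a hypothetical helper, not part of Tabula.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


# Invented placeholder data standing in for a downloaded export.
csv_text = "category,count\nA,12\nB,7\n"
tsv_text = "category\tcount\nA\t12\nB\t7\n"

rows_from_csv = read_export(csv_text)
rows_from_tsv = read_export(tsv_text, delimiter="\t")

for row in rows_from_csv:
    print(row["category"], row["count"])
```

With a real download you would read the file's contents (for example with `open(path, newline="", encoding="utf-8").read()`) and pass them to the same function. Note that every value arrives as a string, so counts need converting before you do arithmetic on them.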

For instance, say you’re interested in the data that Gill and Chippindale compiled on Neolithic Cycladic figurines and the art market.[3] If you have access to the database JSTOR, you can find it here: http://www.jstor.org/stable/506716. There are a lot of charts, so it is a good example to play with. You would like to grab those tables of data, perhaps to compile them with data from other sources and perform some sort of meta-study of the antiquities market.
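Compiling extracted tables with data from other sources can be sketched in a few lines. The example below is only an illustration under stated assumptions: the helper names `combine_tables` and `to_csv`, the source labels, the column names, and every value are invented, and none of them are taken from the article. The idea is to tag each row with where it came from, so that provenance survives the merge, and to tolerate sources whose columns differ.

```python
import csv
import io


def combine_tables(tables):
    """Merge {source_label: csv_text} into one list of rows.

    Each row gains a 'source' field recording which table it came from.
    """
    combined = []
    for source, text in tables.items():
        for row in csv.DictReader(io.StringIO(text)):
            record = dict(row)
            record["source"] = source
            combined.append(record)
    return combined


def to_csv(rows):
    """Write rows to CSV text, using the union of all column names."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Invented placeholder tables; the second has an extra 'year' column.
tables = {
    "article_table": "collection,figures\nX,3\n",
    "other_source": "collection,figures,year\nY,5,1990\n",
}

combined = combine_tables(tables)
merged_text = to_csv(combined)
print(merged_text)
```

Cells that a source does not supply are left empty rather than guessed, which keeps the gaps visible when you later analyse the combined table.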

Download the paper, open it in your PDF reader, and then feed it into Tabula. Let’s look at table 2 from the article in your PDF reader. You could just highlight this table in your PDF reader and hit ctrl+c to copy it, but when you paste that into your spreadsheet, you would get everything in a single column. For a small table, maybe that’s not such a big issue. But let’s look at what you get with Tabula. With Tabula running, you go to the PDF you are interested in and draw a bounding box around the table. Release the mouse and you will be presented with a preview that you can then download as a CSV. You can also drag a selection box around every table in the document and hit download just the one time.
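Before trusting a downloaded table, it can be worth checking that every row has as many cells as the header, since any automated extraction from a PDF may occasionally split or merge a column. What follows is a rough sketch of such a check, not a feature of Tabula; the function name `find_ragged_rows` and the sample data are invented for illustration.

```python
import csv
import io


def find_ragged_rows(text):
    """Return (line_number, cell_count) for rows whose width differs from the header.

    Line numbers are 1-based and count the header as line 1.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    expected = len(rows[0])
    return [
        (line_number, len(row))
        for line_number, row in enumerate(rows[1:], start=2)
        if len(row) != expected
    ]


# Invented placeholder export in which the second data row lost a cell.
sample = "site,figures,provenance\nA,4,known\nB,2\nC,1,unknown\n"

problems = find_ragged_rows(sample)
print(problems)
```

Any rows the check reports are the ones to compare by eye against the original PDF page.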

Since you can copy directly to the clipboard, you can paste directly into a Google Drive spreadsheet (thus taking advantage of all the visualization options that Google offers) or into something like Raw from Density Design, exploring the data and its patterns through a variety of quickly generated visualizations.[4]

[1] Stray, “You Got the Documents. Now What?”

[2] Since it is open source, you can make and maintain your own copy, in the event that the original ‘Tabula’ website goes offline. Indeed, this is a habit you should get into. (This is called ‘forking’ on GitHub; you create an account on GitHub, log in, then go to the repository you wish to copy. Click ‘fork’ and you’ve got your own copy!)

[3] David W. Gill and Christopher Chippindale. ‘Material and Intellectual Consequences of Esteem for Cycladic Figures’. American Journal of Archaeology 97, no. 4 (October 1993): 601.

[4] Another open source project, ‘Raw’ lets you paste your data into a box on a webpage and then render that data using a variety of different kinds of visualizations. You can find it at http://app.raw.densitydesign.org/. Raw does not send any data over the Internet; it performs all calculations and visualizations within your browser, so your data stays secure. It is possible (but not easy) to install Raw locally on your own machine, if you wish. Follow the links on the Raw website to its GitHub code repository.

Source: http://www.themacroscope.org/?page_id=647