|
An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Cleaning Data with Open Refine

1 Leave a comment on paragraph 1 0 Previous section: Manipulating Text with the Power of Regular Expressions

2 Leave a comment on paragraph 2 0 Open Refine is a powerful tool that originated within Google. Since 2012, it has become a free, open source tool under continual community development. It allows users to get a quick overview of their data, find the messy bits, and start to transform the data into a useable format for further analyses. At the Open Refine webpage, there are numerous tutorials, background information, and manuals that we strongly recommend any historian to explore. Here, we continue our example from above and use Open Refine to clean up the data that we extracted using the regex (like in our section above about installing Voyant-Tools locally, Open Refine runs as a server on your own computer).

3 Leave a comment on paragraph 3 0 Download Open Refine from http://openrefine.org/download.html. Follow the installation instructions. Start Open Refine by double clicking on its icon. This will open a new browser window, pointing to http://127.0.0.1:3333. This location is your own computer, so even though it looks like it’s running on the Internet, it isn’t. The ‘3333’ is a ‘port’, meaning that Open Refine is running much like a server, serving up a webpage via that port to the browser.

4 Leave a comment on paragraph 4 0 Start a new project by clicking on the ‘create project’ tab on the left hand side of the screen. Click on ‘choose files’ and select the CSV file created in the previous section. This will give you a preview of your data. Name your project in the box on the top right side and then click ‘create project’. It may take a few minutes.

5 Leave a comment on paragraph 5 0 3.10 create project open refine

[insert Figure 3.10 The regex search results imported into Open Refine]

7 Leave a comment on paragraph 7 0 Once your project has started, one of the columns that should be visible in your data is ‘sender’. Click on the arrow to the left of “Sender” in OpenRefine and select Facet->Text Facet. Do the same with the arrow next to “Recipient”. A box will appear on the left side of the browser showing all 189 names listed as senders in the spreadsheet (Figure 3.10).

8 Leave a comment on paragraph 8 0 The spreadsheet itself is nearly a thousand rows, so immediately we see that, in this correspondence collection, some names are used multiple times. You may also have noticed that many of the names suffered from errors in the text scan (OCR or Optical Character Recognition errors), rendering some identical names from the book as similar, but not the same, in the spreadsheet. For example the recipient “Juan de Dios Cafiedo” is occasionally listed as “Juan de Dios CaAedo”. Any subsequent analysis will need these errors to be cleared up, and OpenRefine will help fix them.

9 Leave a comment on paragraph 9 0 Within the “Sender” facet box on the left-hand side, click on the button labeled “Cluster”. This feature presents various automatic ways of merging values that appear to be the same.[1]  Play with the values in the drop-down boxes and notice how the number of clusters found change depending on which method is used. For example, the first ‘method’ that it will suggest is Key Collision, which provides two suggestions. You might then want to attempt the Nearest Neighbor approach, which will find several more as well. Try out the two different distance functions as well. Keep playing. Because the methods are slightly different, each will present different matches that may or may not be useful. If you see two values which should be merged, e.g. “Ashbel Smith” and “. Ashbel Smith”, check the box to the right in the ‘Merge’ column and click the ‘Merge Selected & Re-Cluster’ button below.

10 Leave a comment on paragraph 10 0 Go through the various cluster methods one-at-a-time, including changing number values, and merge the values that appear to be the same. “Juan de Dios CaAedo” clearly should be merged with “Juan de Dios Cafiedo”, however “Correspondent in Tampico” probably should not be merged with “Correspondent at Vera Cruz.” If you are not an expert of this period, use your best judgement. By the end, you should have reduced the number of unique Senders from 189 to around 150. Repeat these steps with Recipients, reducing unique Recipients from 192 to about 160. To finish the automatic cleaning of the data, click the arrow next to “Sender” and select ‘Edit Cells->Common transforms->Trim leading and trailing whitespace’. Repeat for “Recipient”. The resulting spreadsheet will not be perfect, but it will be much easier to clean by hand than it would have been before taking this step (figure 3.11). Click on ‘export’ at the top right of the window to get your data back out as a .csv file. This is a versatile file format that we have seen before. However, Open Refine can also transform data for specific applications like Gephi.

11 Leave a comment on paragraph 11 0 3.11-1 openrefine

12 Leave a comment on paragraph 12 0 3.11-2 openrefine

13 Leave a comment on paragraph 13 0 [insert Figure 3.11 The output of using Open Refine to clean our data] 

14 Leave a comment on paragraph 14 0 We discuss the network analysis and visualization tool Gephi in more depth in later chapters, but right now you will see how easy it can be to visualize a network using Gephi. Before we talk about how to import data for Gephi, there is one more step to do in Open Refine.

15 Leave a comment on paragraph 15 0 In order to get this data into Gephi, we will have to rename the “Sender” column to “source” and the “Recipient” column to “target”. In the arrow to the left of Sender in the main OpenRefine window, select Edit column->Rename this column, and rename the column “source”. Now in the top right of the window, select ‘Export->Custom tabular exporter’. Notice that “source”, “target”, and “Date” are checked in the content tab; uncheck “Date”, as it will not be used in Gephi. Go to the download tab and change the download option from ‘Tab-separated values (TSV)’ to ‘Comma-separated values (CSV)’ and press download.[2] The file will likely download to your automatic download directory.

16 Leave a comment on paragraph 16 0 Gephi is an increasingly popular piece of network visualization and analysis software. This next section is not meant to be an in-depth introduction to Gephi (that comes later!) but rather illustrates the final step in the process that began when we started looking at the correspondence of the Republic of Texas. Download and install Gephi from http://gephi.github.io. Now open Gephi by double clicking its icon. Click ‘new project’. The middle pane of the interface window is the ‘Data Laboratory’, where you can interact with your network data in spreadsheet format. This is where we can import the data cleaned up in Open Refine.

17 Leave a comment on paragraph 17 0 In the Data Laboratory, select ‘Import Spreadsheet’. Press the ellipsis ‘…’ and locate the csv you just saved in the download directory. Make sure that the Separator is listed as ‘Comma’ and the ‘As table’ is listed as ‘Edges table’. Press ‘Next’, then ‘Finish’.

18 Leave a comment on paragraph 18 0 Your data should load up. Click on the ‘overview’ tab and you will be presented with a tangled network graph. You can save your work here and return to it after reading chapter 5! Don’t worry; we will explain what to do with it then.

19 Leave a comment on paragraph 19 0 Congratulations. You’ve downloaded a body of historical materials, used regular expressions to parse this material, trimming it to remove extraneous information that wasn’t of use to you, used Open Refine to clean it further, and loaded it into a software package for further analysis and visualization. With these skills, you will be able to handle very big historical data indeed! On the website for this book (under ‘Data, Code & More’), we have also include some further exercises and tutorials exploring some further things one could do with this data, such as parsing the text for place names (‘Named Entity Recognition’), or visualizing the pattern of correspondence over time as a dynamic network.

20 Leave a comment on paragraph 20 0 Next section: Quickly Extracting Tables from PDFs


21 Leave a comment on paragraph 21 0 [1] This is described in more detail here https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth.

22 Leave a comment on paragraph 22 0 [2] It’s worth pointing out that once you have cleaned data in CSV or TSV format, your data can be imported into a variety of other tools, or be ready for other kinds of analysis. Many online visualization tools like Raw (http://app.raw.densitydesign.org/ ) and Palladio  (http://palladio.designhumanities.org/ ) accept and expect data in this format.

Page 39

Source: http://www.themacroscope.org/?page_id=645