An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Regex

A regular expression (also called regex) is a powerful tool for finding and manipulating text. At its simplest, a regular expression is just a way of looking through texts to locate patterns. A regular expression can help you find every line that begins with a number, or every instance of an email address, or every use of a word even if there are slight variations in how it’s spelled. As long as you can describe the pattern you’re looking for, regular expressions can help you find it. Once you’ve found your patterns, they can then help you manipulate your text so that it fits just what you need.

This section will explain how to take a book scanned and available on the Internet Archive, “Diplomatic correspondence of the Republic of Texas,” and manipulate the raw text into just the right format, so that you can clean the data and load it into the Gephi network visualization package as a correspondence network. (The data cleaning and network stages will appear in later chapters using the same data.) In this section, you’ll start with a simple unstructured index of letters, use regular expressions, and turn the text into a spreadsheet that can be edited in Excel.

Regular expressions can look pretty complex, but once you know the basic syntax and vocabulary, simple ‘regexes’ will be easy. Regular expressions can often be used right inside the ‘Find and Replace’ box in many text and document editors, like Microsoft Word, Notepad++, or TextWrangler. You type the regular expression in the search bar, press ‘find’, and any words that match the pattern you’re looking for will appear on the screen.

Let’s say you’re looking for all the instances of “cat” or “dog” in your document. When you type the vertical bar on your keyboard (it looks like |, shift+backslash on Windows keyboards), that means ‘or’ in regular expressions. So, if your query is dog|cat and you press ‘find’, it will show you the first time either dog or cat appears in your text. (As Ben points out, while MS Word has some basic search and replace functions, other programs for editing text take full advantage of regular expressions in their search and replace tools.)

If you want to replace every instance of either “cat” or “dog” in your document with the word “animal”, you would open your find-and-replace box, put dog|cat in the search query, put animal in the ‘replace’ box, hit ‘replace all’, and watch your entire document fill up with references to animals instead of dogs and cats.

The astute reader will have noticed a problem with the instructions above; simply replacing every instance of “dog” or “cat” with “animal” is bound to create problems. Simple searches don’t differentiate between letters and spaces, so every time “cat” or “dog” appear within words, they’ll also be replaced with “animal”. “catch” will become “animalch”; “dogma” will become “animalma”; “certificate” will become “certifianimale”. In this case, the solution appears simple; put a space before and after your search query, so now it reads:

 dog | cat

With the spaces, “animal” replaces “dog” or “cat” only in those instances where they’re definitely complete words; that is, when they’re separated by spaces.
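
If you would rather try this outside a text editor, here is a minimal sketch in Python, whose built-in re module understands the same vertical bar. The sample sentence is invented for illustration, and we put the spaces back in the replacement text, which is our own assumption about what you would want.

```python
import re

# Invented sample sentence, not taken from the Texas correspondence.
text = "The dog and the cat both catch dogma daily"

# Plain alternation matches the letters anywhere, even inside longer words.
naive = re.sub(r"dog|cat", "animal", text)

# Padding the query with spaces only matches words with a space on each side.
# The replacement carries its own spaces so the words do not run together.
padded = re.sub(r" dog | cat ", " animal ", text)

print(naive)
print(padded)
```

The first result mangles “catch” and “dogma”; the second leaves them alone.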

The even more astute reader will notice that this still does not solve our problem of replacing every instance of “dog” or “cat”. What if the word comes at the beginning of a line, so it does not follow a space? What if the word is at the end of a sentence or a clause, and thus followed by a punctuation mark? Luckily, in the language of regex, you can represent the beginning or end of a word using special characters.

\<

means the beginning of a word. In some programs, like TextWrangler, this is used instead:

\b

So if you search for \<cat (or, in TextWrangler, \bcat) it will find “cat”, “catch”, and “catsup”, but not “copycat”, because your query searched for words beginning with “cat”. For patterns at the end of a word, you would use:

\>

or in TextWrangler,

\b

again. The remainder of this walk-through imagines that you are using Notepad++, but if you’re using TextWrangler, keep this quirk in mind. If you search for

cat\>

it will find “cat” and “copycat”, but not “catch”, because your query searched for words ending with “cat”.

Regular expressions can be mixed, so if you wanted to find only the whole word “cat”, no matter where in the sentence, you’d search for

\<cat\>

which would find every instance. And, because all regular expressions can be mixed, if you searched for

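\<(cat|dog)\>

and replaced all with “animal”, you would have a document that replaced all instances of “dog” or “cat” with “animal”, no matter where in the sentence they appear. The parentheses matter here: without them, \<cat|dog\> reads as ‘words beginning with “cat”, or words ending with “dog”’, and would also catch “catch” and “bulldog”.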
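
Here is a small sketch of the same word-boundary searches in Python. Note that Python’s re module, like TextWrangler, writes both the start and the end of a word as \b and does not understand \< or \>. The word list is made up, and the \w* (any run of letters, digits, or underscores) is our addition so that the whole word is returned rather than just the “cat” part.

```python
import re

# Invented word list for illustration.
words = "cat catch catsup copycat dog bulldog dogma"

# Words beginning with "cat": a boundary, then "cat", then the rest of the word.
starts_with_cat = re.findall(r"\bcat\w*", words)

# Words ending with "cat": any leading letters, then "cat", then a boundary.
ends_with_cat = re.findall(r"\w*cat\b", words)

# Whole words only: a boundary on both sides of either alternative.
whole_words = re.sub(r"\b(cat|dog)\b", "animal", words)

print(starts_with_cat)
print(ends_with_cat)
print(whole_words)
```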

You can also search for variations within a single word using parentheses. For example, if you were looking for instances of “gray” or “grey”, instead of the search query

gray|grey

you could type

gr(a|e)y

instead. The parentheses signify a group, and like the order of operations in arithmetic, regular expressions read the parentheses before anything else. Similarly, if you wanted to find instances of either “that dog” or “that cat”, you would search for:

(that dog)|(that cat)

Notice that the vertical bar | can appear either inside or outside the parentheses, depending on what you want to search for.
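
As a quick check, the two placements of the vertical bar can be tried in Python. This is only a sketch on an invented sentence; we use re.finditer and group(0) so that each result is the full matched text rather than the contents of the parentheses.

```python
import re

# Invented sentence containing both spellings and both phrases.
sentence = "The gray cat and the grey dog saw that dog chase that cat."

# Bar inside the parentheses: only the middle letter varies.
spellings = [m.group(0) for m in re.finditer(r"gr(a|e)y", sentence)]

# Bar outside the parentheses: either whole phrase may match.
phrases = [m.group(0) for m in re.finditer(r"(that dog)|(that cat)", sentence)]

print(spellings)
print(phrases)
```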

The period character . in regular expressions directs the search to just find any character at all. For example, if we searched for:

d.g

the search would return “dig”, “dog”, “dug”, and so forth.

Another special character from our cheat sheet, the plus +, instructs the program to find one or more of the previous character. If we search for

do+g

it would return any words that looked like “dog”, “doog”, “dooog”, and so forth. Putting a parenthetical group before the plus makes the search look for repetitions of whatever is in the parentheses; for example, querying

(do)+g

would return “dog”, “dodog”, “dododog”, and so forth.

Combining the plus ‘+’ and period ‘.’ characters can be particularly powerful in regular expressions, instructing the program to find any amount of any characters within your search. A search for

d.+g

for example, might return “dried fruits are g”, because the string begins with “d” and ends with “g”, and has various characters in the middle. Searching for simply “.+” will yield query results that are entire lines of text, because you are searching for any character, and any amount of them.
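
The period and plus examples above can be tested with a short Python sketch. The helper full_matches is a hypothetical name of our own, and the candidate words are made up; it simply keeps the candidates that a pattern matches from start to end.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

# The period stands for exactly one character of any kind.
any_char = full_matches(r"d.g", ["dig", "dog", "dug", "dg", "doog"])

# The plus asks for one or more of the preceding character.
one_or_more_o = full_matches(r"do+g", ["dg", "dog", "doog", "dooog"])

# After a group, the plus repeats the whole group.
repeated_group = full_matches(r"(do)+g", ["dog", "dodog", "dododog", "doog"])

# Together, .+ grabs as much as it can before the final "g".
greedy = re.search(r"d.+g", "dried fruits are good").group(0)

print(any_char, one_or_more_o, repeated_group, greedy, sep="\n")
```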

Parentheses in regular expressions are also very useful when replacing text. The text within a pair of parentheses forms what’s called a group, and the software you use to search remembers which groups you queried in order of their appearance. For example, if you search for

(dogs)( and )(cats)

which would find all instances of “dogs and cats” in your document, your program would remember “dogs” is group 1, “ and ” is group 2, and “cats” is group 3. Notepad++ remembers them as “\1”, “\2”, and “\3” for each group respectively.

If you wanted to switch the order of “dogs” and “cats” every time the phrase “dogs and cats” appeared in your document, you would type

(dogs)( and )(cats)

in the ‘find’ box, and

\3\2\1

in the ‘replace’ box. That would replace the entire string with group 3 (“cats”) in the first spot, group 2 (“ and ”) in the second spot, and group 1 (“dogs”) in the last spot, thus changing the result to “cats and dogs”.
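
Python’s re.sub uses the same backslash-number notation for remembered groups, so the swap can be sketched like this (the sample sentence is our own):

```python
import re

# Invented sentence containing the phrase to be reordered.
text = "It is raining dogs and cats."

# Groups 1, 2 and 3 are written back in reverse order.
swapped = re.sub(r"(dogs)( and )(cats)", r"\3\2\1", text)

print(swapped)
```

The r in front of each string tells Python to leave the backslashes alone so that the regex engine sees them.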

The vocabulary of regular expressions is pretty large, but there are many cheat sheets for regex online (one that we sometimes use is http://regexlib.com/CheatSheet.aspx ). To help, we’ve included a workflow for searching using regular expressions that draws from the cheat sheet, to provide a sense of how you would use it to form your own regular expressions.

Let’s begin. Point your browser to this document.

Full text: http://archive.org/stream/diplomaticcorre33statgoog/diplomaticcorre33statgoog_djvu.txt

Copy the text into Notepad++ or TextWrangler.[1] Remember to save a spare copy of your file before you begin. This is very important, because you’re going to make mistakes that you won’t be sure how to fix. Now delete everything but the index where it has the list of letters. Look for this in the text, and delete everything that comes before it:

That is, you’re looking for the table of letters, starting with ‘Sam Houston to J. Pinckney Henderson’. (There are, before we clean them, approximately 2000 lines’ worth of letters indexed!) Notice there are a lot of features that we are not interested in at the moment: page numbers, headers, footers, or categories. We’re going to use regular expressions to get rid of them. What we want to end up with is a spreadsheet that looks kind of like:

Sender | Recipient | Date

We are not really concerned about dates for this example, but they might be useful at some point so we’ll still include them. We’re eventually going to use OpenRefine to fix things, but for now, that’s what we’re looking for.

Scroll down through the text; notice there are many lines which don’t include a letter, because they’re either header info, or blank, or some other extraneous text. We’re going to get rid of all of those lines. We want to keep every line that looks like this:

Sender to Recipient, Month, Date, Year, Page

This is a fairly convoluted process, so first we’ll outline exactly what we are going to do, and then walk you through how to do it. We start by finding every line that looks like a reference to a letter, and put a little tilde (one of these: ~) at the beginning of it so we know to save it for later. Next, we get rid of all the lines that don’t start with tildes, so that we’re left only with the relevant text. After this is done, we begin to format the remaining text by putting commas in appropriate places, so we can import it into a spreadsheet and do further edits there.

There are lots of ways we can do this, but for the sake of clarity we’re going to just delete every line that doesn’t have the word “to” in it (as in sender TO recipient). In Notepad++ press ctrl-f or search->find to open the find dialogue box.[2] In that box, go to the ‘Replace’ tab, and select the radio button for ‘Regular expression’ at the bottom of the search box. In TextWrangler, hit command+f to open the find and replace dialogue box. Tick the ‘grep’ radio button (which tells TextWrangler that we want to do a regex search) and the ‘wraparound’ button (which tells TextWrangler to search everywhere).

Remember from earlier that there’s a way to see if the word “to” appears in full. Type

\<to\>

in the search bar. In TextWrangler, remember, we would look for \bto\b. This will find every instance of the word “to” (and not, for instance, also ‘potato’).

Recall we don’t just want to find “to”, but the entire line that contains it. You learned earlier that the query “.+” returns any amount of text, no matter what it says. If your query is

.+\<to\>.+

your search will return every line which includes the word “to” in full, no matter what comes before or after it, and none of the lines which don’t.

As mentioned earlier, what we want to do is add a tilde ~ before each of the lines that look like letters, so we can save them for later. This involves the find-and-replace function, and a query like the one before, but without the trailing .+ and with parentheses around it, so it looks like

(.+\<to\>)

and each line, from its start through the word “to”, is placed within a parenthetical group. In the ‘replace’ box, enter

~\1

which just means replace the matched text with itself (group 1), placing a tilde before it. In short, that’s:

STEP ONE:

Find: (.+\<to\>)

in TextWrangler: (.+\bto\b)

Replace: ~\1

Click ‘Replace All’.
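
Step one can also be sketched in Python, again writing the word boundaries as \b. Only the Sam Houston line below comes from the index; the other lines are invented stand-ins for headers and further letters.

```python
import re

# Sample index: the Houston line is from the text, the rest is invented.
index_text = "\n".join([
    "Correspondence hitherto unpublished",
    "Sam Houston to J. Pinckney Henderson, December 31, 1836 51",
    "",
    "First Sender to Second Recipient, January 1, 1840 99",
])

# Group 1 runs from the start of a line through the whole word "to";
# the replacement writes a tilde followed by that same group.
marked = re.sub(r"(.+\bto\b)", r"~\1", index_text)

print(marked)
```

The first line is left unmarked because the “to” in “hitherto” is not a word on its own.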

After running the find-and-replace, you should note your document now has most of the lines with tildes in front of them, and a few which do not. The next step is to remove all the lines which do not include a tilde. The search string to find all lines which don’t begin with tildes is

\n[^~].+

A \n at the beginning of a query searches for a new line, which means it’s going to start searching at the first character of each new line. However, given the evolution of computing, it may well be that this won’t quite work on your system. Linux-based systems use \n for a new line, while Windows often uses \r\n, and older Macs just use \r. Since this will likely cause much frustration, your safest bet will be to save a copy of what you are working on, and then experiment to see what gives you the best result. In most cases, this will be:

\r\n[^~].+

Within a set of square brackets [] the caret ^ means search for anything that isn’t within these brackets; in this case, the tilde ~. The .+ as before means search for all the rest of the characters in the line as well. All together, the query returns any full line which does not begin with a tilde; that is, the lines we did not mark as looking like letters. To reiterate,

STEP TWO:

Find: \r\n[^~].+

Replace:

By finding all \r\n[^~].+ and replacing it with nothing, you effectively delete all the lines that don’t look like letters. What you’re left with is a series of letters, and a series of blank lines. We need to remove those surplus blank lines. The find-and-replace query for that is:

STEP THREE:

Find: \n\r

In TextWrangler: ^\r

Replace:
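
If the line-ending differences give you trouble, one way around them is to convert everything to \n first. The Python sketch below does that and then carries out steps two and three. It is not the same query as the one above: we anchor on the start of each line with ^ and the re.MULTILINE flag instead of matching the line break itself, and the sample text (apart from the Houston line) is invented.

```python
import re

# Invented sample with Windows line endings; tildes mark the lines to keep.
raw = (
    "Correspondence hitherto unpublished\r\n"
    "~Sam Houston to J. Pinckney Henderson, December 31, 1836 51\r\n"
    "\r\n"
    "12 PAGE HEADER\r\n"
    "~First Sender to Second Recipient, January 1, 1840 99\r\n"
)

# Normalize \r\n and bare \r to \n so one pattern works everywhere.
text = raw.replace("\r\n", "\n").replace("\r", "\n")

# Step two: empty out every line whose first character is not a tilde.
emptied = re.sub(r"^[^~\n].*$", "", text, flags=re.MULTILINE)

# Step three: squeeze runs of line breaks down to one, then trim the ends.
cleaned = re.sub(r"\n{2,}", "\n", emptied).strip("\n")

print(cleaned)
```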

Now that all the extraneous lines have been deleted, it’s time to format the text document into something you can import into and manipulate with Excel as a csv, or comma-separated values, file. A csv is a text file which spreadsheet programs like Microsoft Excel can read, where every comma denotes a new column, and every line denotes a new row.

To turn this text file into a spreadsheet, we’ll want to separate it out into one column for sender, one for recipient, and one for date, each separated by a single comma. Notice that most lines have extraneous page numbers attached to them; we can get rid of those with regular expressions. There’s also usually a comma separating the month-date and the year, which we’ll get rid of as well. In the end, the first line should go from looking like:

~Sam Houston to J. Pinckney Henderson, December 31, 1836 51

to

Sam Houston, J. Pinckney Henderson, December 31 1836

such that each data point is in its own column.

Start by removing the page number after the year and the comma between the year and the month-date. To do this, first locate the year on each line by using the regex:

[0-9]{4}

As a cheat sheet shows, [0-9] finds any digit between 0 and 9, and {4} will find four of them together. Now extend that search out by appending .+ to the end of the query; as seen before, it will capture the entire rest of the line. The query

[0-9]{4}.+

will return, for example, “1836 51”, “1839 52”, and “1839 53” from the first three lines of the text. We also want to capture the comma preceding the year, so add a comma and a space before the query, resulting in

, [0-9]{4}.+

which will return “, 1836 51”, “, 1839 52”, etc.

The next step is making the parenthetical groups which will be used to remove parts of the text with find-and-replace. In this case, we want to remove the comma and everything after the year, but not the year or the space before it. Thus our query will look like:

(,)( [0-9]{4})(.+)

with the comma as the first group “\1”, the space and the year as the second “\2”, and the rest of the line as the third “\3”. Given that all we care about retaining is the second group, the find-and-replace will look like this:

STEP FOUR:

Find: (,)( [0-9]{4})(.+)

Replace: \2    #that’s backslash-two; group 2 already begins with the space
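
To see what each piece of the step four query picks out, here is a short Python sketch run on the first letter in the index. The variable names are our own.

```python
import re

line = "~Sam Houston to J. Pinckney Henderson, December 31, 1836 51"

# The year plus everything after it on the line.
year_and_rest = re.search(r"[0-9]{4}.+", line).group(0)

# The same, with the comma and space in front of the year included.
comma_year_rest = re.search(r", [0-9]{4}.+", line).group(0)

# Keep only group 2 (the space and the year); drop the comma and page number.
trimmed = re.sub(r"(,)( [0-9]{4})(.+)", r"\2", line)

print(year_and_rest)
print(comma_year_rest)
print(trimmed)
```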


The next step is easy: find the tildes we added at the beginning of each line and replace them with nothing, which deletes them.

STEP FIVE:

Find: ~

Replace:

Finally, to separate the sender and recipient by a comma, find all instances of the word “to” and replace them with a comma. Although we used \< and \> (in TextWrangler, \b) to denote the beginning and end of a word earlier in the lesson, we don’t need to do that here. All we need to find is the word and the space preceding it, “ to”, and replace it with a comma “,”.

STEP SIX:

Find:  to       #remember to include a space in front of “to”

Replace: ,     #a comma with no spaces around it

(You don’t type the # and the comment after it in the search and replace boxes. Hashes are often used to comment out code, and so, following Ben’s suggestion, I replaced my annotation with hashes to make this easier to read. SG, June 25th)
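
Steps five and six look like this as a Python sketch. One caution: a bare “ to” would also match the start of a word such as “ told”, so as a guard of our own (not part of the steps above) we add \b after it and stop after the first replacement on the line.

```python
import re

line = "~Sam Houston to J. Pinckney Henderson, December 31 1836"

# Step five: delete the tilde by replacing it with nothing.
no_tilde = line.replace("~", "")

# Step six: swap " to" for a comma. The \b and count=1 are our additions.
row = re.sub(r" to\b", ",", no_tilde, count=1)

print(row)
```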

You may notice that some lines still do not fit our criteria. Line 22, for example, reads “Abner S. Lipscomb, James Hamilton and A. T. Bumley, AugUHt 15, ”. It has an incomplete date; these we don’t need to worry about for our purposes. More worrisome are lines like 61, “Copy and summary of instructions United States Department of State, ”, which include none of the information we want. We can get rid of these lines in Excel.

The only non-standard lines we need to worry about with regular expressions are the ones with more than 2 commas, like line 178, “A. J. Donelson, Secretary of State [Allen,. arf interim], December 10 1844”. Notice that our second column, the name of the recipient, has a comma inside of it. If you were to import this directly into Excel, you would get four columns, one for sender, two for recipient, and one for date, which would break any analysis you would then like to run. Unfortunately these lines need to be fixed by hand, but happily regular expressions make finding them easy. The query:

.+,.+,.+,

will show you every line with more than 2 commas, because it finds any line that has any set of characters, then a comma, then any other set, then another comma, and so forth.

STEP SEVEN:

Find: .+,.+,.+,

After using this query, just find each occurrence (there will be 15 of them), and replace the appropriate comma with another character that signifies it was there, like a semicolon. While you’re searching, you may find some other lines, like 387, “Barnard E. Bee, James Treat, April 28, 1»40 665”, which are still not quite perfect. If you see them, go ahead and fix them by hand so they fit the proper format, deleting the lines that are not relevant. Finally, there will be snippets of text left over at the bottom of the file. Highlight these and delete them.
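
The step seven query can be used in Python to flag the lines that need fixing by hand. This sketch checks just two rows quoted in this section; the list name too_many_commas is our own.

```python
import re

rows = [
    "Sam Houston, J. Pinckney Henderson, December 31 1836",
    "A. J. Donelson, Secretary of State [Allen,. arf interim], December 10 1844",
]

# A row is flagged when it holds at least three commas, each one preceded
# by at least one other character.
too_many_commas = [row for row in rows if re.search(r".+,.+,.+,", row)]

print(too_many_commas)
```

Only the Donelson row is flagged, because the bracketed note inside the recipient’s name contains a third comma.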

At the top of the file, add a new line that says “Sender, Recipient, Date”. These will be the column headers.

Go to file->save as, and save the file as correspondence.csv.
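
Before opening the file in Excel, you can check that every row really has three columns. The sketch below uses Python’s built-in csv module on a two-line sample held in memory, so that it runs without the file; to check your own file you would open correspondence.csv instead. The skipinitialspace option drops the space that follows each comma.

```python
import csv
import io

# A two-line stand-in for correspondence.csv: the header and the first letter.
csv_text = (
    "Sender, Recipient, Date\n"
    "Sam Houston, J. Pinckney Henderson, December 31 1836\n"
)

# Parse the text into rows of columns, ignoring the space after each comma.
rows = list(csv.reader(io.StringIO(csv_text), skipinitialspace=True))

# Collect any row that does not have exactly three columns.
bad_rows = [row for row in rows if len(row) != 3]

print(rows)
print(bad_rows)
```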

Congratulations! You have used regular expressions to extract and clean data. This skill alone will save you valuable time.


[1] Notepad++ (for Windows) can be downloaded at http://notepad-plus-plus.org/ . TextWrangler (for Mac) can be found at http://www.barebones.com/products/textwrangler/

[2] Available at http://notepad-plus-plus.org/ if you haven’t already installed it. On Mac, try TextWrangler: http://www.barebones.com/products/textwrangler/


Source: http://www.themacroscope.org/?page_id=521