An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

8,000 Canadians – adding new columns to a network file

Once data has been loaded into Gephi, it may become apparent that it would be useful to split the data by some new attribute. For instance, a network of 8,000 individuals spanning three centuries might obscure patterns of interest. Here is how we tease this network apart into three separate networks.

The first problem is to extract date information from the bibliographical essays. One could go to the ‘data laboratory’ pane in Gephi, add a new column called ‘century’, and enter all of the data manually, but this would be a very long process indeed. Instead, let’s extract those dates semi-automatically. We create a new batch script containing a series of commands that iterate over every file in our folder. User ‘Aacini’ suggests the following script to do so:

@echo off
setlocal EnableDelayedExpansion
if exist result.csv del result.csv
for %%f in (*.txt) do (
    set i=0
    for /F "delims=" %%l in (%%f) do (
        set /A i+=1
        set line!i!=%%l
    )
    echo %%f, !line3!, !line5!, !line7! >> result.csv
)

We open our text editor, paste this in, and save it with a .bat file extension (in Windows). The script loops over each .txt file in the folder and writes the file’s name and the contents of its third, fifth, and seventh lines to a .csv file called ‘result’. When we look at our .txt files, we see that the date information (born, died) is usually in the very first line of the file. So we rewrite the script:

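@echo off
rem Delayed expansion lets !line1! be read inside the loop body.
setlocal EnableDelayedExpansion
if exist result.csv del result.csv
for %%f in (*.txt) do (
    rem Reset the counter and clear line1 so an empty file does not reuse the previous value.
    set i=0
    set "line1="
    rem usebackq plus the quotes lets file names that contain spaces be read.
    for /F "usebackq delims=" %%l in ("%%f") do (
        set /A i+=1
        set line!i!=%%l
    )
    rem Write the file name and its first line as one row of result.csv.
    echo %%f, !line1! >> result.csv
)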

We save this in the same folder as the .txt files, then run it by double-clicking its name in the explorer window. After a brief moment, ‘result.csv’ appears. When we open this file in Notepad++ or Excel, each row holds a file name and the contents of the first line of that file. Using regex search and replace, we can remove the alphabetic and punctuation characters, leaving just the dates. We create a new column called ‘Id’ (the capital I is important) and a new column called ‘Period’, which contains the dates. The values under Id have to be identical to the values under Id in the Gephi file, because Gephi uses them to match each row to an existing node. We save this csv, and in Gephi, under the data laboratory, we click on ‘import spreadsheet’. Making sure to leave ‘force nodes to be created as new ones’ unchecked, we click ok, and we now have a new column in Gephi holding the dates for each individual. On the main overview panel in Gephi we can set up filters that display only those nodes that fall between 1700 and 1800, for instance.
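For readers who are not on Windows, or who would rather avoid the manual search and replace, the extraction and clean-up steps described above can be roughly sketched in Python. This is only an illustration under stated assumptions, not the procedure we used: the function names and the output file name ‘periods.csv’ are hypothetical, dates are assumed to appear as four-digit years in the first line of each file, and each node’s Id in Gephi is assumed to equal the file name without its .txt extension.

```python
import csv
import re
from pathlib import Path

# Assumption: dates in the first line are written as four-digit years.
YEAR = re.compile(r"\b\d{4}\b")


def extract_years(text):
    """Return every four-digit number in text, in order of appearance."""
    return YEAR.findall(text)


def first_line(path):
    """Return the first line of a text file, or '' if the file is empty."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.readline().strip()


def build_period_table(folder, out_name="periods.csv"):
    """Write one row per .txt file: its Id and the years from its first line."""
    folder = Path(folder)
    rows = []
    for path in sorted(folder.glob("*.txt")):
        years = extract_years(first_line(path))
        # Assumption: the Gephi node Id is the file name without ".txt".
        rows.append([path.stem, "-".join(years)])
    with open(folder / out_name, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["Id", "Period"])  # capital-I 'Id', to match Gephi
        writer.writerows(rows)
    return rows
```

Calling `build_period_table(".")` from the folder that holds the .txt files would write ‘periods.csv’ there. If the Ids in the Gephi file follow some other pattern, the line that builds each row is the place to change.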

For the purposes of the quick and dirty analysis in this chapter, we fudged by not recording specific dates, using instead the labels ‘17th century or earlier’, ‘18th century’, and ‘19th century’ (the data we are working with contain no ‘purely’ 20th century individuals). Periodisation carries huge implications, as discussed by Matthew Jockers in Macroanalysis. We encourage readers to consider the lesson on ‘Cleaning Data with OpenRefine’ in The Programming Historian 2 for an alternative approach to cleaning and extracting data.
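The coarse labelling described above could be sketched along these lines. The cut-off years and the choice to label an individual by the first year found in the ‘Period’ value are assumptions made purely for illustration; both function names are hypothetical, and a different periodisation would simply change the boundaries.

```python
import re


def century_label(year):
    """Map a year to one of the three coarse period labels."""
    # Assumption: boundaries fall at 1700 and 1800.
    if year < 1700:
        return "17th century or earlier"
    if year < 1800:
        return "18th century"
    return "19th century"


def label_period(period):
    """Label a Period value such as '1712-1788' by its first year.

    Returns '' when no four-digit year is present, so that such rows
    can be checked by hand.
    """
    match = re.search(r"\b\d{4}\b", period)
    if match is None:
        return ""
    return century_label(int(match.group()))
```

Labelling by the first year means that someone born in 1695 who was active throughout the 1700s lands in ‘17th century or earlier’, which is exactly the kind of decision that periodisation forces.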


Source: http://www.themacroscope.org/?page_id=72