An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Networks in Practice

This section shows how to take the conceptual framework of network analysis and apply it in practice. A historical network analysis will often require a similar series of steps: (1) deciding on a dataset, (2) encoding, collecting, or cleaning the data, (3) importing the data into a network analysis package, (4) analyzing the data, (5) visualizing the data, (6) interpreting results, and (7) drawing conclusions. The framework is not universal: the process usually involves repeating steps, and some steps may be omitted or added depending on circumstance. Steps 4 and 5, while covered to a small extent in this book, are extremely open-ended, and historians who wish to learn more are encouraged to delve into the tool of their choice using the Further Reading section of this chapter. The steps we take the most time explaining usually take the least time in practice; less than 5% of the time spent on a project will be spent analyzing and visualizing data. Most time will be spent on collecting, cleaning, and interpreting.

Picking a Dataset

A common question from historians interested in network analysis is how they will know which data will be useful to collect, and how much they should collect. The answer depends on the project, of course, but unfortunately it is often difficult to tell at the outset of a project which data, if any, will be most useful for a network analysis. Many network analyses lead to dead ends, and the more experience you have, the earlier you will begin to notice when a network analysis might not be going anywhere.

There are two camps of network analysis which are relevant in creating a dataset: exploratory and hypothesis-driven. Exploratory network analysis is based around the idea that the network is important, but in as-yet unknown ways. Hypothesis-driven network analysis requires a preconceived idea about the world which the network analysis will either reinforce or discredit. For historical network analysis these two camps are often blurred, but for the purposes of data creation, the distinction is useful.

Exploratory network analysis is by far the more difficult for creating data. If you do not yet have a strong concept of what may be of interest in a network, you will often be at the mercy of what data is most readily available. A good many network-oriented digital history projects begin this way; a historian is made aware of a pre-existing network dataset, which already comes with a set of pre-defined categories, and they explore that dataset in order to find whatever of interest may arise from it. This is not a flawed approach, but it can be an extremely limiting one. The creation of these datasets is rarely driven by an interest in historical networks, and the categories available may only be able to explore a limited set of questions.

Hypothesis-driven network analysis, although perhaps more foreign to historians from its formal perspective, will reduce the likelihood of an analysis reaching a dead end. The first step is to operationalize a historiographic claim. The claim might be that popularity among early electric blues musicians was deeply influenced by their connections to other musicians rather than to record labels. The claim helps constrain the potential dataset in time and scope, and lends itself to the collection of certain data. First you need to pick nodes and edges: these will be musicians and record labels connected by recordings, contracts, and collaborations. You must operationalize popularity; perhaps this can be done through the sale of records, or the frequency with which certain songs were played on the radio. You can operationalize the strength of connections as well, perhaps as a function of the number of times two musicians recorded together or recorded with a particular label. Because it might be difficult to tell whether a connection to a musician or a label is more important, it would be useful to include the time of events in the data gathered. With all these data, you could then check whether those connected with certain people or labels tended to become more popular, using path lengths, or whether certain musicians were instrumental in introducing new musicians to record labels, by looking at triadic closure. Going into an analysis with a preformed hypothesis or question will make both the data gathering and analysis steps much easier.

Software

It is important to decide on a software package to use before embarking on a network study, because it can dictate which data you plan on collecting and how you plan on collecting it. Thankfully, you are not locked in to any particular software choice; file formats are convertible to one another, although the process can occasionally be occult, and sometimes an analysis requires the use of more than one tool.

UCINET

We do not recommend UCINET for most historians; it is not free software and can have a steep learning curve. However, for those who are already familiar with matrix mathematics, UCINET can be more intuitive than the other options. It is Windows-only, has a lot of advanced features, and has a wide user base among social network analysts. We do not recommend UCINET for creating visualizations.

Pajek

Pajek is a free program for network analysis with more features and algorithms than any other non-command-line tool. We do not recommend it for those starting out with network analysis, as it can be difficult to learn; however, those in need of algorithms they can find in no other tool might find Pajek suits their needs. Pajek is Windows-only, has a wide user base, and newer versions can scale to fairly large networks. We do not recommend Pajek for creating visualizations. The Pajek format is an industry standard, and most tools can read or create it.

NWB & Sci2

The Network Workbench and the Sci2 Tool (both developed in collaboration with one of the authors of this textbook) are similar free tools for the manipulation of data, the analysis of networks, and the creation of visualizations. The first focuses on network analysis, and the second focuses on scientometrics (citation analysis and other similar goals). These tools have more features than the others with regard to data preprocessing, especially in the creation of networks out of both unstructured and structured data, but fewer specifically analytic tools than UCINET or Pajek. They are particularly useful for converting between file formats, they run on all platforms, and they are slightly easier to use than Pajek and UCINET, though not as easy to use as NodeXL or Gephi.

NodeXL

NodeXL is a free plugin for Microsoft Excel, and is Windows-only. We recommend NodeXL for historians who are familiar with Excel and are just beginning to explore network analysis. The plugin is less feature-rich than the others in this list; however, it does make entering and editing data extremely easy, and works very well for small datasets. NodeXL also provides some unique visualizations and the ability to import data from social networks; it does not scale well to networks of more than a few thousand nodes and edges.

Gephi

Gephi is quickly becoming the tool of choice for network analysts who do not need quite the suite of algorithms offered by Pajek or UCINET. Although it does not have the data entry or preprocessing features of NWB, Sci2, or NodeXL, it is easy to use (eclipsed in this only by NodeXL), can analyze fairly large networks, and creates beautiful visualizations. The development community is also extremely active, with improvements being added constantly. We recommend Gephi for the majority of historians undertaking serious network analysis research.

Networks online: d3.js, gexf-js, and sigma.js

When it comes to network visualizations, an element of interactivity is at the top of most researchers’ wish lists. Up until recently, the only truly great options for online, interactive network visualizations came out of the Stanford Visualization Group under Jeffrey Heer and including the work of Mike Bostock. This team is responsible for a number of widely used visualization infrastructures, including the prefuse toolkit, a Java-based framework for creating visualizations; flare, an ActionScript (Adobe Flash) library for creating visualizations; and protovis, a JavaScript library for creating visualizations. The team’s most recent venture, d3.js, is a highly flexible JavaScript-based framework for developing novel visualizations, and is currently the industry standard for interactive, online visualizations. Mike Bostock is now a graphics editor at the New York Times, and the Stanford Visualization Group has moved to the University of Washington and is now the Interactive Data Lab.

d3.js is a complex library, and it can be difficult for beginners, even those familiar with coding, to use effectively. A number of libraries have been created as a layer around d3.js which attempt to ease the process of creating visualizations, including vega and NVD3. All of these libraries require some knowledge of coding; however, a little effort in learning them can be rewarded by highly customized interactive networks online. For those who do not want to code a visualization themselves, there are a few options for creating interactive online visualizations using Gephi. Seadragon web export, Sigmajs Exporter, and Gexf-JS Web Viewer are all plugins available through the Gephi marketplace for creating such visualizations.

Data in Abstract

Network data matches network theory: its basic components are nodes and edges. There are three ways these are generally represented: as matrices, as adjacency lists, and as node & edge lists. Each has its own strengths and weaknesses, and they will be discussed below.

The same example network will be used in all three descriptions: a network of exchange between four fictional cities. The data types will be used to show how network data can have varying degrees of detail.

Matrices

Although not in most historians’ toolboxes, the matrix is an extremely useful representation for small networks, and it happens to double as a simple network visualization. Below is a network of trade between four fictional cities: Netland, Connectia, Graphville, and Nodopolis. A ‘0’ is placed when there is no trade route between cities, and a ‘1’ if there is.

             Netland   Connectia   Graphville   Nodopolis
Netland      -         1           0            1
Connectia    -         -           1            1
Graphville   -         -           -            0
Nodopolis    -         -           -            -

From this matrix, we can infer that Connectia trades with Netland, Graphville, and Nodopolis; and Nodopolis trades with Netland. Notice only the upper triangle of the matrix is filled in; the diagonal (shaded) is left unfilled because it is not meaningful for cities to trade with themselves, and the lower triangle is left unfilled because in a symmetrical undirected network, any information in that corner would be redundant and identical. The network is unweighted, because all we know is whether trade exists between two cities (represented by 0 or 1), but we do not know the amount of trade.
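The upper-triangle convention can be sketched in a few lines of Python. This is only an illustration using plain lists; the names `cities`, `upper`, `symmetrize`, and `neighbors` are our own and do not come from any network package.

```python
# Cities in the same order as the rows and columns of the matrix.
cities = ["Netland", "Connectia", "Graphville", "Nodopolis"]

# Upper triangle only: upper[i][j] is meaningful when j > i.
# None marks the cells that the matrix leaves unfilled.
upper = [
    [None, 1, 0, 1],
    [None, None, 1, 1],
    [None, None, None, 0],
    [None, None, None, None],
]

def symmetrize(tri):
    """Mirror the upper triangle to get a full undirected matrix."""
    n = len(tri)
    full = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            full[i][j] = full[j][i] = tri[i][j]
    return full

def neighbors(full, names, city):
    """Return the cities that share a trade route with `city`."""
    row = full[names.index(city)]
    return [names[j] for j, cell in enumerate(row) if cell]

full = symmetrize(upper)
print(neighbors(full, cities, "Connectia"))
print(neighbors(full, cities, "Nodopolis"))
```

Mirroring the triangle makes the redundancy explicit: every route appears twice in the full matrix, once in each direction, which is exactly why the lower triangle can be left out when the network is undirected.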

             Netland   Connectia   Graphville   Nodopolis
Netland      -         $10mil      0            $4mil
Connectia    -         -           $2mil        $4mil
Graphville   -         -           -            0
Nodopolis    -         -           -            -

This matrix is identical to the previous, but it now represents a weighted network. It shows $10 million in trade between Connectia and Netland, $2 million between Connectia and Graphville, $4 million between Connectia and Nodopolis, and $4 million between Netland and Nodopolis. This representation could be extended even further.

Source (rows) / Target (columns)

             Netland   Connectia   Graphville   Nodopolis
Netland      -         $6mil       0            $1mil
Connectia    $4mil     -           $1mil        $3mil
Graphville   0         $1mil       -            0
Nodopolis    $3mil     $1mil       0            -

The matrix now represents a directed, weighted network of trade between cities. The directionality means the trade relationships between cities can be represented asymmetrically, broken up into their constituent parts. Directed networks require filling both the upper and lower triangles of the matrix. Source and Target are the network terms of choice for the nodes that do the sending and those that do the receiving, respectively. In the above matrix, Netland (the source) sends $6 million to Connectia (the target) and $1 million to Nodopolis (the target). Netland (the target) receives $4 million from Connectia (the source) and $3 million from Nodopolis (the source).
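Reading rows as sources and columns as targets can be checked with a short Python sketch. The values are the ones in the directed matrix, in millions of dollars; the diagonal is stored as 0 here purely for convenience (the matrix itself leaves it unfilled), and the helper names are hypothetical.

```python
cities = ["Netland", "Connectia", "Graphville", "Nodopolis"]

# Rows are sources, columns are targets; values are millions of dollars.
trade = [
    [0, 6, 0, 1],  # sent by Netland
    [4, 0, 1, 3],  # sent by Connectia
    [0, 1, 0, 0],  # sent by Graphville
    [3, 1, 0, 0],  # sent by Nodopolis
]

def sent(matrix, i):
    """Total sent by node i: the sum of its row."""
    return sum(matrix[i])

def received(matrix, j):
    """Total received by node j: the sum of its column."""
    return sum(row[j] for row in matrix)

def undirected_total(matrix, i, j):
    """Collapse both directions of a pair into one undirected weight."""
    return matrix[i][j] + matrix[j][i]

n = cities.index("Netland")
c = cities.index("Connectia")
print(sent(trade, n), received(trade, n))
print(undirected_total(trade, n, c))
```

Adding the two directions of a pair recovers the weights of the undirected matrix: $6 million plus $4 million gives the $10 million recorded between Netland and Connectia.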

If we wanted to extend this representation even further, we could create multiple parallel matrices. Parallel matrices could represent time slices, so each matrix represents trade between cities in a subsequent year. Alternatively, parallel matrices could represent different varieties of trade, e.g., people, money, and goods. This is one method to encode a multiplex network.

Matrices were, at one point, the standard way to represent networks. They are fairly easy to read, do not take up much space, and many network analysis algorithms are designed using matrix mathematics. As networks have become larger, however, it is becoming more common to represent them in adjacency lists or node/edge lists. The matrix is still readable by most network software, and some programs are optimized for use with matrices, particularly UCINET and, to a lesser extent, NodeXL.

Adjacency Lists

The adjacency list is a simple replacement for the matrix, and a bit easier when it comes to data entry. Like a matrix, it can be used to represent many varieties of networks.

Netland     Connectia
Netland     Nodopolis
Connectia   Graphville
Connectia   Nodopolis

This adjacency list represents the same network as the first matrix. It is undirected and unweighted. Adding weights is as easy as adding an additional column of data.

                         Weight
Netland     Connectia    $10mil
Netland     Nodopolis    $4mil
Connectia   Graphville   $2mil
Connectia   Nodopolis    $4mil

Adjacency lists can have any number of additional columns for every additional edge trait. For example, if this were a multiplex network encoding different varieties of trade, there might be two additional columns: one for goods, another for people. Each additional column could be filled with numerical values. Alternatively, columns could be used to encode the type of a tie. In the Florentine family network, a column for “type of relationship” could be filled in as “trade”, “marriage”, or “both”.

Adjacency lists can also be used to represent directed, asymmetric networks, as below.

Source       Target       Weight
Netland      Connectia    $6mil
Netland      Nodopolis    $1mil
Connectia    Graphville   $1mil
Connectia    Nodopolis    $3mil
Connectia    Netland      $4mil
Graphville   Connectia    $1mil
Nodopolis    Netland      $3mil
Nodopolis    Connectia    $1mil

This is equivalent to the final matrix. In network visualizations, the direction of an edge is represented as an arrow going from the source to the target.
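The equivalence between the directed matrix and the directed adjacency list can be sketched as a small conversion. This is an illustration only: the function name `matrix_to_adjacency` is hypothetical, weights are in millions of dollars, and the rows come out in matrix order rather than the order used in the table.

```python
import csv
import io

cities = ["Netland", "Connectia", "Graphville", "Nodopolis"]

# Directed, weighted matrix: rows are sources, columns are targets.
trade = [
    [0, 6, 0, 1],
    [4, 0, 1, 3],
    [0, 1, 0, 0],
    [3, 1, 0, 0],
]

def matrix_to_adjacency(matrix, names):
    """List every non-zero cell as a (source, target, weight) row."""
    return [
        (names[i], names[j], weight)
        for i, row in enumerate(matrix)
        for j, weight in enumerate(row)
        if weight
    ]

rows = matrix_to_adjacency(trade, cities)

# Write the rows as comma-separated text, one edge per line.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Source", "Target", "Weight"])
writer.writerows(rows)
print(buffer.getvalue())
```

The adjacency list only stores the eight edges that exist, while the matrix stores all sixteen cells. For large, sparse networks that difference is the main reason lists have displaced matrices.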

Node & Edge Lists

We recommend this data structure for historians embarking on a network analysis. It is widely used, makes manual data entry easy, and allows additional information to be appended to nodes, rather than just to edges. The one downside of node & edge lists is that they require more initial work, as they involve the creation of two separate tables: one for nodes, and one for edges.

Nodes

ID   Label
1    Graphville
2    Nodopolis
3    Connectia
4    Netland

Edges

4    3
4    2
3    1
3    2

This node & edge list is equivalent to the first adjacency list and the first matrix. Notice particularly that nodes are now given unique IDs that are separate from their labels; this becomes useful if, for example, there are multiple cities with the same name. The two tables below show how to add weights and directionality, as well as additional attributes to individual nodes.

Nodes

ID   Label        Population   Country
1    Graphville   700,000      USA
2    Nodopolis    250,000      Canada
3    Connectia    1,000,000    Canada
4    Netland      300,000      USA

Edges

Source   Target   Weight
4        3        $6mil
4        2        $1mil
3        1        $1mil
3        2        $3mil
3        4        $4mil
1        3        $1mil
2        4        $3mil
2        3        $1mil

Although node & edge lists require more initial setup, they pay off in the end for their ease of data entry and flexibility. Unfortunately, not all software interprets these data structures in the exact same way, so it will be necessary to convert the gathered data into a format that the program can actually read.
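One way to see what the separate node table buys you is to hold both tables in Python and ask a question that needs node attributes. The sketch below uses the values from the two tables, with weights in millions of dollars; the dictionary layout and the helpers `cross_border` and `dangling` are our own assumptions, not a standard format.

```python
# Node table: ID -> attributes.
nodes = {
    1: {"label": "Graphville", "population": 700_000, "country": "USA"},
    2: {"label": "Nodopolis", "population": 250_000, "country": "Canada"},
    3: {"label": "Connectia", "population": 1_000_000, "country": "Canada"},
    4: {"label": "Netland", "population": 300_000, "country": "USA"},
}

# Edge table: (source ID, target ID, weight in millions of dollars).
edges = [
    (4, 3, 6), (4, 2, 1), (3, 1, 1), (3, 2, 3),
    (3, 4, 4), (1, 3, 1), (2, 4, 3), (2, 3, 1),
]

def cross_border(nodes, edges):
    """Edges whose source and target lie in different countries."""
    return [
        (nodes[s]["label"], nodes[t]["label"], w)
        for s, t, w in edges
        if nodes[s]["country"] != nodes[t]["country"]
    ]

def dangling(nodes, edges):
    """Edge endpoints that do not appear in the node table."""
    return sorted({n for s, t, _ in edges for n in (s, t) if n not in nodes})

print(cross_border(nodes, edges))
print(dangling(nodes, edges))
```

A check like `dangling` is worth running before any conversion, since an edge that points to a missing ID is one of the easiest mistakes to make when the two tables are entered by hand.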

File Formats

There are many file formats for networks, and most are instantiations of the three main data structures covered in the previous section. The sections below describe them in some detail, using the four fictional cities as examples.

UCINET

UCINET’s data format is fairly simple plaintext, but also fairly difficult to edit by hand. The most common format used by UCINET is the “full matrix,” which includes a list of node labels defined before the matrix is written out. The matrix is assumed to have the same nodes in the same order on the horizontal and the vertical, as below.
dl N = 4
format = fullmatrix
labels:
netland, connectia, graphville, nodopolis
data:
0 1 0 1
0 0 1 1
0 0 0 0
0 0 0 0

Take a moment and notice how this is the same data structure as the first matrix example in the previous section, although the exact specifications of the file format are unique to UCINET. The first line declares the number of nodes (4), the second declares the format, the third and fourth declare the labels, and subsequent lines describe the edges. The file extension is *.DL.
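Because the layout is so regular, a file like this is easier to generate than to type. The following Python sketch reproduces only the layout shown in this section (node count, format line, labels, data); it has not been checked against the full UCINET specification, and `to_dl_fullmatrix` is a hypothetical helper name.

```python
def to_dl_fullmatrix(labels, matrix):
    """Return DL full-matrix text laid out like the example in this section."""
    lines = [
        f"dl N = {len(labels)}",
        "format = fullmatrix",
        "labels:",
        ", ".join(labels),
        "data:",
    ]
    # One matrix row per line, cells separated by single spaces.
    lines += [" ".join(str(cell) for cell in row) for row in matrix]
    return "\n".join(lines) + "\n"

labels = ["netland", "connectia", "graphville", "nodopolis"]
matrix = [
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
dl_text = to_dl_fullmatrix(labels, matrix)
print(dl_text)
```

Writing the result to a file ending in .DL is left out, since the only point here is the correspondence between the matrix data structure and the text of the file.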

Pajek, NWB, & Sci2

The standard file formats for Pajek and NWB/Sci2 are similar to one another, as they are both node and edge lists encoded in a single plain text file. The Pajek file format (*.net) calls nodes vertices, directed edges arcs, and undirected edges edges.

*Vertices 4
1 "Graphville"
2 "Nodopolis"
3 "Connectia"
4 "Netland"
*Edges
4 3
4 2
3 1
3 2

If the edges were directed, the subsection would be *Arcs instead of *Edges. Weight can be added to each edge by simply adding a number equivalent to the edge weight at the end of the line featuring that edge.
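The rules just described (a vertex list, then *Edges or *Arcs, with an optional trailing weight) can be sketched as a small Python writer. It covers only the features mentioned in this section; `to_pajek` is a hypothetical helper, and we assume node IDs are the consecutive integers 1 to N, as in the example.

```python
def to_pajek(labels, edges, directed=False):
    """Build Pajek .net text.

    labels: dict mapping integer node ID -> label
    edges:  list of (source, target) or (source, target, weight) tuples
    """
    lines = [f"*Vertices {len(labels)}"]
    for node_id in sorted(labels):
        lines.append(f'{node_id} "{labels[node_id]}"')
    # Directed edges go under *Arcs, undirected ones under *Edges.
    lines.append("*Arcs" if directed else "*Edges")
    for edge in edges:
        # A third element, if present, is written as the edge weight.
        lines.append(" ".join(str(part) for part in edge))
    return "\n".join(lines) + "\n"

labels = {1: "Graphville", 2: "Nodopolis", 3: "Connectia", 4: "Netland"}
pajek_text = to_pajek(labels, [(4, 3), (4, 2), (3, 1), (3, 2)])
print(pajek_text)

# The same helper with directed, weighted edges (millions of dollars).
pajek_arcs = to_pajek(labels, [(4, 3, 6), (3, 4, 4)], directed=True)
print(pajek_arcs)
```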

The NWB/Sci2 file format (*.nwb) is quite similar, although it requires more declarations of network types and variables. It also requires a declaration of each variable type (int = integer, string = string of text, etc.). An example of an NWB file that has both additional node and edge information would look like this.
*Nodes     4
id*int     label*string    population*float     country*string
1    "Graphville"    700000     "USA"
2    "Nodopolis"     250000     "Canada"
3    "Connectia"     1000000    "Canada"
4    "Netland"       300000     "USA"
*DirectedEdges  8
source*int target*int weight*float
4    3    6000000
4    2    1000000
3    1    1000000
3    2    3000000
3    4    4000000
1    3    1000000
2    4    3000000
2    3    1000000

These file formats are very sensitive to small errors, so it is usually best not to edit them directly.

GEXF

Gephi’s XML-based file format, GEXF, is saved in plain text as *.gexf. It is fairly verbose, but allows quite a bit of detail about a network to be saved.
<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
<graph mode="static" defaultedgetype="undirected">
<nodes>
<node id="0" label="Graphville" />
<node id="1" label="Nodopolis" />
<node id="2" label="Connectia" />
<node id="3" label="Netland" />
</nodes>
<edges>
<edge id="0" source="3" target="2" />
<edge id="1" source="3" target="1" />
<edge id="2" source="2" target="0" />
<edge id="3" source="2" target="1" />
</edges>
</graph>
</gexf>

These files are easy to read, but should not be edited directly. It is much easier to edit Gephi files from within Gephi’s data manager.
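When a GEXF file has to be produced outside Gephi, generating it with an XML library is safer than typing it. The sketch below uses Python's standard `xml.etree.ElementTree` module to build the same structure as the example; `build_gexf` is a hypothetical helper, it handles only labels and undirected edges, and it numbers nodes and edges from 0 so that every ID is unique.

```python
import xml.etree.ElementTree as ET

def build_gexf(labels, edges):
    """Build a minimal static, undirected GEXF tree.

    labels: list of node labels; a node's ID is its position in the list
    edges:  list of (source ID, target ID) pairs
    """
    gexf = ET.Element(
        "gexf", {"xmlns": "http://www.gexf.net/1.2draft", "version": "1.2"}
    )
    graph = ET.SubElement(
        gexf, "graph", {"mode": "static", "defaultedgetype": "undirected"}
    )
    nodes_el = ET.SubElement(graph, "nodes")
    for node_id, label in enumerate(labels):
        ET.SubElement(nodes_el, "node", {"id": str(node_id), "label": label})
    edges_el = ET.SubElement(graph, "edges")
    for edge_id, (source, target) in enumerate(edges):
        ET.SubElement(
            edges_el,
            "edge",
            {"id": str(edge_id), "source": str(source), "target": str(target)},
        )
    return gexf

root = build_gexf(
    ["Graphville", "Nodopolis", "Connectia", "Netland"],
    [(3, 2), (3, 1), (2, 0), (2, 1)],
)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The library takes care of quoting and closing tags, which removes the most common source of the small errors that make these files unreadable.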

NodeXL

NodeXL files are not saved in plaintext; instead, they are saved in the default Microsoft Excel format. Entering data directly into this format is easier than in any of the others, as it simply requires opening up the file in Excel and editing or adding as necessary.

How to Read Network Visualizations

Network visualizations can be complex and difficult to read. Nodes and edges are not always represented as dots and lines, and even when they are, the larger the network, the more difficult they are to decipher. The reasons behind visualizing a network can differ, but in general, visualizations of small networks are best at allowing the reader to understand individual connections, whereas visualizations of large networks are best for revealing global structure.

Network visualizations, much like network analysis, may or may not add insight depending on the context. A good rule of thumb is to ask a network-literate friend reading the final product whether the network visualization helps them understand the data or the narrative any more than the prose alone. It often will not. We recommend not including a visualization of the data solely for the purpose of revealing the complexity of the data at hand, as it conveys little information, and feeds into a negative stereotype of network science as an empty methodology.

Matrix Diagrams

Matrix diagrams tend to be used more by computational social scientists than traditional social network analysts. They are exact, colorized versions of the matrix data structure, and are good for showing community patterns in medium-to-large networks. They do not suffer from the same clutter as force-directed visualizations, but they also do not lend themselves to be read at the scale of individual actors.

[Figure: matrix diagram of character interactions in Les Misérables]

Figure [xxx] is a matrix visualization of character interactions in Victor Hugo’s Les Misérables. We made this visualization using Excel to reinforce the fact that matrix visualizations are merely data structures that have been colored and zoomed out. Each column is a character in the book, as is each row, and the list of character names is in the same order horizontally and vertically. A cell is shaded red if the character from that row interacted with the character from that column; it is shaded white if they did not. Note that only one of the matrix’s triangles is filled, because the network is symmetric.

We performed community detection on the network, and ordered the characters based on whether they were in a community together. That is why some areas are thick with red cells, and others are not; each triangular red cluster represents a community of characters that interact with one another. The vertical columns which feature many red cells are main characters who interact with many other characters in the book. The long column near the left-hand side, for example, is the character interactions of Jean Valjean.

Matrix diagrams can be extended to cover asymmetric networks by using both the matrix’s upper and lower triangles. Additional information can be encoded in the intensity of a color (signifying edge weight) or the hue of the shaded cell (indicating different categories of edges).
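The effect of ordering rows and columns by community can be sketched with a toy example in Python. The four nodes, their edges, and their community labels below are invented for illustration and have nothing to do with the Les Misérables data; the sketch also assumes community membership is already known and does not perform community detection itself.

```python
def reorder_by_community(names, matrix, community):
    """Reorder rows and columns so members of a community sit together."""
    order = sorted(
        range(len(names)), key=lambda i: (community[names[i]], names[i])
    )
    new_names = [names[i] for i in order]
    new_matrix = [[matrix[i][j] for j in order] for i in order]
    return new_names, new_matrix

def render(matrix):
    """Draw the matrix as text: '#' for a filled cell, '.' for an empty one."""
    return "\n".join(
        "".join("#" if cell else "." for cell in row) for row in matrix
    )

# Hypothetical network: A interacts with C, and B interacts with D.
names = ["A", "B", "C", "D"]
matrix = [
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
]
community = {"A": 0, "C": 0, "B": 1, "D": 1}

new_names, new_matrix = reorder_by_community(names, matrix, community)
print(new_names)
print(render(new_matrix))
```

Before reordering, the filled cells are scattered; afterwards they gather into blocks along the diagonal, which is the same effect that produces the dense clusters in a matrix diagram.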

Tree Layouts

Networks sometimes fall into one of a few special categories, and some of those categories have corresponding visualizations. Trees are one such special category, recognizable by their hierarchical structure. They are networks with a single root node which branches out to a few child nodes; those in turn have more child nodes, and the nodes at the very end are called leaf nodes. Tree networks cannot deviate from this structure.

Visualizations of tree layouts are likely the most familiar to those not versed in network science. Family tree diagrams, corporate organization charts, and other common visualizations make use of this layout.

[Figure: tree layout of the Wedgwood/Darwin family]

Figure [xxx] is an example of a tree layout of the Wedgwood/Darwin family. Notice that nodes (in this case people) do not have multiple parents; in the formal definition of a tree network, parents can have multiple children, but children can only have one parent. For similar reasons, although Charles Darwin married his first cousin Emma Wedgwood, they could not be connected in this visualization. This visualization was created using d3.js.

Trees have the advantage of being easier to read than most other network visualizations, but the disadvantage of being fairly restrictive in what they can visualize.
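The single-parent rule is easy to test in code, which helps explain why a marriage between cousins cannot be drawn in a tree. The Python sketch below is a minimal check, not a full graph library; `is_tree` is a hypothetical helper, and the handful of family members used is a small subset we assume for illustration, not necessarily the nodes in the figure.

```python
def is_tree(edges):
    """Check (parent, child) pairs against the tree rules:
    one root, one parent per child, every node reachable from the root."""
    parents = {}
    nodes = set()
    for parent, child in edges:
        nodes.update((parent, child))
        if child in parents:
            return False  # this child already has a parent
        parents[child] = parent
    roots = nodes - set(parents)
    if len(roots) != 1:
        return False
    # Walk down from the root to rule out detached cycles.
    children = {}
    for child, parent in parents.items():
        children.setdefault(parent, []).append(child)
    seen, stack = set(), list(roots)
    while stack:
        node = stack.pop()
        seen.add(node)
        stack.extend(children.get(node, []))
    return seen == nodes

family = [
    ("Josiah Wedgwood", "Susannah Wedgwood"),
    ("Josiah Wedgwood", "Josiah Wedgwood II"),
    ("Susannah Wedgwood", "Charles Darwin"),
    ("Josiah Wedgwood II", "Emma Wedgwood"),
]
print(is_tree(family))

# Recording a child under both Charles and Emma gives it two parents.
with_child = family + [
    ("Charles Darwin", "William Darwin"),
    ("Emma Wedgwood", "William Darwin"),
]
print(is_tree(with_child))
```

The second check fails because the child appears under two parents. A tree layout has to drop one of those links, which is the restriction described above.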

Force-Directed Layouts

Force-directed layouts are the visualizations most people think of when they think of network analyses. An example is in Figure [xxx], below.

[Figure: force-directed layout of a subset of the Florentine families network]

This visualization, created using the Sci2 tool, represents a subset of the Florentine families network. Force-directed layouts like this one attempt to reduce the number of edges that cross one another while simultaneously bringing more directly-connected nodes closer together. They do this by modeling the network as though it were a physics problem, as though each edge were a spring and each node a unit connecting various springs together. The computer simulates this system, letting the springs bounce around until each one is as little stretched as it possibly can be. At the same time, the nodes repel each other, like magnets of the same polarity, so the nodes do not appear too close together. Eventually, the nodes settle into a fairly legible graph like Figure [xxx]. This algorithm is an example of how force-directed layouts work, but they do not all use springs and magnets, and often they are a great deal more complex than described.

There are a few important takeaways from this algorithm. The first is that the layout is generally stochastic; there is an element of randomness that will orient the nodes and edges slightly differently every time it is run. The second is that the traditional spatial dimensions (vertical and horizontal) that are so often meaningful in visualizations have no meaning here. There is no x or y axis, and spatial distance from one node to another is not inherently meaningful. For example, had Figure [xxx] been laid out again, the Acciaiuoli family could just as easily have been closer to the Pazzi than the Salviati family, as opposed to in this case where the reverse is true. To properly read a force-directed network visualization, you need to retrain your visual understanding such that you are aware that it is edges, not spatial distance, which mark nodes as closer or farther away.
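The springs-and-magnets description, and the randomness that comes with it, can be sketched as a toy simulation in Python. This is not the algorithm used by Sci2 or Gephi; the force formulas, step sizes, and the function name `force_layout` are all assumptions chosen for brevity, and the four-family chain is used only as sample input rather than as the data behind the figure.

```python
import math
import random

def force_layout(nodes, edges, steps=300, seed=None):
    """Toy spring embedder: edges pull like springs, all nodes repel."""
    rng = random.Random(seed)
    # Random starting positions are the source of the layout's randomness.
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}
    for _ in range(steps):
        push = {n: [0.0, 0.0] for n in nodes}
        # Every pair of nodes repels, more strongly when close together.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
                dist = max(math.hypot(dx, dy), 1e-6)
                f = 1.0 / dist ** 2
                push[a][0] += f * dx / dist
                push[a][1] += f * dy / dist
                push[b][0] -= f * dx / dist
                push[b][1] -= f * dy / dist
        # Each edge acts as a spring with a resting length of 1.
        for a, b in edges:
            dx, dy = pos[b][0] - pos[a][0], pos[b][1] - pos[a][1]
            dist = max(math.hypot(dx, dy), 1e-6)
            f = dist - 1.0  # positive when stretched, negative when squeezed
            push[a][0] += f * dx / dist
            push[a][1] += f * dy / dist
            push[b][0] -= f * dx / dist
            push[b][1] -= f * dy / dist
        # Move each node a small, capped distance along its net force.
        for n in nodes:
            fx, fy = push[n]
            size = math.hypot(fx, fy)
            if size > 0:
                move = min(0.05 * size, 0.1)
                pos[n][0] += move * fx / size
                pos[n][1] += move * fy / size
    return {n: tuple(p) for n, p in pos.items()}

families = ["Acciaiuoli", "Medici", "Salviati", "Pazzi"]
marriages = [("Acciaiuoli", "Medici"), ("Medici", "Salviati"), ("Salviati", "Pazzi")]

first = force_layout(families, marriages, seed=1)
again = force_layout(families, marriages, seed=1)
other = force_layout(families, marriages, seed=2)
print(first == again, first == other)
```

Fixing the seed reproduces a layout exactly, while a different seed places the same nodes at different coordinates even though the edges are unchanged. That is the practical meaning of the warning that position on the page carries no information of its own.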

This style of visualization becomes more difficult to read as a network grows. Larger instantiations have famously been called “spaghetti-and-meatball visualizations” or “giant hairballs”, and it can be impossible to discern any particular details. Still, in some cases, these very large-scale force-directed networks can be useful in discerning patterns at-a-glance.

Other Visualizations

Matrix, Tree, and Force-Directed visualizations are the three most common network visualizations, but they are by no means the only options. A quick search for chord diagrams, radial layouts, arc layouts, hive plots, circle packs, and others will reveal a growing universe of network visualizations. Picking which is appropriate in what situation can be more of an art than a science.

We recommend that, where possible, complex network visualizations should be avoided altogether. It is often easier, and more meaningful for a historical narrative, to simply provide a list of the most well-connected nodes, or, e.g., a scatterplot showing the relationship between connectivity and vocation. If the question at hand can be more simply answered with a traditional visualization which historians are already trained to read, it should be.
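A list of the most well-connected nodes takes only a few lines to produce. The Python sketch below counts connections (degree) in the undirected four-city example from earlier in this chapter; `degree_ranking` is a hypothetical helper, and degree is only one of several possible measures of being well connected.

```python
from collections import Counter

def degree_ranking(edges):
    """Rank nodes by number of connections, ties broken alphabetically."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))

routes = [
    ("Netland", "Connectia"),
    ("Netland", "Nodopolis"),
    ("Connectia", "Graphville"),
    ("Connectia", "Nodopolis"),
]

for city, connections in degree_ranking(routes):
    print(city, connections)
```

A short ranked list like this can be dropped straight into a table in the text, and it tells the reader which city sits at the center of the trade network without asking them to decode a diagram.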


Source: http://www.themacroscope.org/?page_id=424