Principles of Information Visualization
Reading history through a macroscope will involve visualization. Visualization is a method of deforming, compressing, or otherwise manipulating data in order to see it in new and enlightening ways. A good visualization can turn hours of careful study into a flash of insight, or can convey a complex narrative in a single moment. Visualizations can also lie, confuse, or otherwise misrepresent if used poorly. This section will introduce historians to types of visualizations, why they might want to use them, and how to use them effectively. It will also show off some visualizations which have been used effectively by historians.
A 13th-century Korean edition of the Buddhist canon contains over 52 million characters across 166,000 pages. Lewis Lancaster describes the traditional analysis of this corpus as follows:
The previous approach to the study of this canon was the traditional analytical one of close reading of specific examples of texts followed by a search through a defined corpus for additional examples. When confronted with 166,000 pages, such activity had to be limited. As a result, analysis was made without having a full picture of the use of target words throughout the entire collection of texts. That is to say, our scholarship was often determined and limited by externalities such as availability, access, and size of written material. In order to overcome these problems, scholars tended to seek for a reduced body of material that was deemed to be important by the weight of academic precedent.
As technologies advanced, the old limitations were no longer present; Lancaster and his team worked to create a search interface that would allow historians to see the evolution and use of glyphs over time, effectively allowing them to explore the entire text all at once. No longer would historians need to selectively pick which areas of this text to scrutinize; they could quickly see where in the corpus their query was most often used, and go from there.
This approach to distant reading–seeing where in a text the object of inquiry is densest–has since become so common as to no longer feel like a visualization. Amazon’s Kindle has a search function called X-Ray which allows the reader to search for a series of words, and see the frequency with which those words appear in a text over the course of its pages. In Google’s web browser, Chrome, searching for a word on a webpage highlights the scroll bar on the right-hand side such that it is easy to see the distribution of that word’s use across the page.
The use of visualizations to show the distribution of words or topics in a document is an effective way of getting a sense for the location and frequency of your query in a corpus, and it represents only one of the many uses of information visualization. Uses of information visualization generally fall into two categories: exploration and communication.
When first obtaining or creating a dataset, visualizations can be a valuable aid in understanding exactly what data are available and how they interconnect. In fact, even before a dataset is complete, visualizations can be used to recognize errors in the data collection process. Imagine you are collecting metadata from a few hundred books in a library collection, making note of the publisher, date of publication, author names, and so on. A few simple visualizations, made easily in software like Microsoft Excel, can go a long way in pointing out errors. In the chart below, it is easy to see that whoever entered the data on book publication dates accidentally typed “1909” rather than “1990” for one of the books.
Similarly, visualizations can be used to get a quick understanding of the structure of data being entered, right in the spreadsheet. The visualization below, of salaries at a university, makes it trivial to spot which department’s faculty have the highest salaries, and how those salaries are distributed. It utilizes basic functions in recent versions of Microsoft Excel.
More complex datasets can be explored with more advanced visualizations, and that exploration can be used for everything from getting a sense of the data at hand, to understanding the minute particulars of one data point in relation to another. The visualization below, ORBIS, allows the user to explore transportation networks in the Ancient Roman world. This particular display shows the most likely route from Rome to Constantinople under a certain set of conditions, but the user is invited to tweak those conditions, or the starting and ending cities, to whatever best suits their own research questions.
Exploratory visualizations like this one form a key part of the research process when analyzing large datasets. They sit particularly well as an additional layer in the hermeneutic process of hypothesis formation. You may begin your research with a dataset and some preconceptions of what it means and what it implies, but without a well-formed thesis to be argued. The exploratory visualization allows you to notice trends or outliers that you may not have noticed otherwise, and those trends or outliers may be worth explaining or discussing in further detail. Careful historical research of those points might reveal even more interesting directions worth exploring, which can then be folded into future visualizations.
Once the research process is complete, visualizations still have an important role to play in translating complex data relationships into easily digestible units. The right visualization can replace pages of text with a single graph and still convey the same amount of information. For example, the visualization below, created by Ben Schmidt, shows the frequency with which certain years are mentioned in the titles of history dissertations. The visualization clearly shows that the great majority of dissertations cover years after 1750, with spikes around the American Civil War and the World Wars. While my description of the chart does describe the trends accurately, it does not convey the sheer magnitude of difference between earlier and later years as covered by dissertations, nor does it mention the sudden drop in dissertations covering periods after 1970.
Visualizations in publications are often, but not always, used to improve a reader’s understanding of the content being described. It is also common for visualizations to be used to catch the eye of readers or peer reviewers, to make research more noticeable, memorable, or publishable. In a public world that values quantification so highly, visualizations may lend an air of legitimacy to a piece of research which it may or may not deserve. We will not comment on the ethical implications of such visualizations, but we do note that such visualizations are increasingly common and seem to play a role in successfully passing peer review, receiving funding, or catching the public eye. Whether the ends justify the means is a decision we leave to our readers.
Types of Visualizations
Up until this point, we have used the phrase information visualization without explaining it or differentiating it from other related terms. We remedy that here: information visualization is the mapping of abstract data to graphic variables in order to make a visual representation. We use these representations to augment our abilities to read data; we cannot hope to intuit all relationships in our data by memory and careful consideration alone, and visualizations make those relationships more apparent.
An information visualization differs from a scientific visualization in the data it aims to represent, and in how that representation is instantiated. Scientific visualizations maintain a specific spatial reference system, whereas information visualizations do not. Visualizations of molecules, weather, motors, and brains are all scientific visualizations because they each already have a physical instantiation, and their visual form is preserved in the visualization. Bar charts, scatter plots, and network graphs, on the other hand, are all information visualizations, because they lay out in space data which do not have inherent spatiality. An infographic is usually a combination of information and scientific visualizations embedded in a very explicit narrative and marked up with a good deal of text.
These types are fluid, and some visualizations fall between categories. Most information visualizations, for example, contain some text, and any visualization we create is imbued with the narrative and purpose we give it, whether or not we realize we have done so. A truly “objective” visualization, where the data speak for themselves, is impossible. Our decisions on how to encode our data and which data to present deeply influence the understanding readers take away from a visualization.
Visualizations also vary between static, dynamic, and interactive. Experts in the area have argued that the most powerful visualizations are static images with clear legends and a clear point, although that may be changing with increasingly powerful interactive displays which give users impressive amounts of control over the data. Some of the best modern examples come from the New York Times visualization team. Static visualizations are those which do not move and cannot be manipulated; dynamic visualizations are short animations which show change, either over time or across some other variable; interactive visualizations allow the user to manipulate the graphical variables themselves in real-time. Often, because of change blindness, dynamic visualizations may be confusing and less informative than sequential static visualizations. Interactive visualizations have the potential to overload an audience, especially if the controls are varied and unintuitive. The key is striking a balance between clarity and flexibility.
There is more to visualization than bar charts and scatter plots. We are constantly creating new variations and combinations of visualizations, and have been for hundreds of years. An exhaustive list of all the ways information has been or can be visualized would be impossible; however, we will attempt to explain many of the more common varieties. Our taxonomy is influenced by visualizing.org, a website dedicated to cataloging interesting visualizations, but we take examples from many other sources as well.
Statistical Charts & Time Series
Statistical charts are likely those that will be most familiar to any audience. When visualizing for communication purposes, it is important to keep in mind which types of visualizations your audience will find legible. Sometimes the most appropriate visualization for the job is the one that is most easily understood, rather than the one that most accurately portrays the data at hand. This is particularly true when representing many abstract variables at once: it is possible to create a visualization with color, size, angle, position, and shape all representing different aspects of the data, but it may become so complex as to be illegible.
The above visualization is a basic bar chart of the number of non-fiction books held in some small collection, categorized by genre. One dimension of data is the genre, which is qualitative, and each genre is compared along a second dimension, number of books, which is quantitative. Data with two dimensions, one qualitative and one quantitative, usually are best represented as bar charts such as this.
Sometimes you want to visualize data as part of a whole, rather than in absolute values. In these cases, with the same qualitative/quantitative split in data, most will immediately rely on pie charts such as the one below. This is often a poor choice: pie charts tend to be cluttered–especially as the number of categories increases–and people have a difficult time interpreting the area of a pie slice.
The same data can be rendered as a stacked bar chart, which produces a visualization with much less clutter. This chart also significantly decreases the cognitive load on the reader, who merely needs to compare bar lengths rather than try to internally calculate the area of a slice of pie.
When there are two quantitative variables to be represented, rather than a quantitative and a qualitative, the visualization most often useful is the line graph or scatterplot. Volumes of books in a collection ordered by publication year, for example, can be expressed with the year on the horizontal axis (x-axis) and the number of books on the vertical axis (y-axis). The line drawn between each (x,y) point represents our assumption that the data points are somehow related to each other, and an upward or downward trend is somehow meaningful.
We could replace the individual lines between years with a trend line, one that shows the general upward or downward trend of the data points over time. This reflects our assumption that not only are the year-to-year changes meaningful, but that there is some underlying factor that is causing the total number of volumes to shift upward or downward across the entire timespan. In this case, it seems that, on average, the number of books in the collection decreases as publication dates approach the present day, which can easily be explained by the lag in time it might take before the decision is made to purchase a book for the collection.
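As a minimal sketch of how such a trend line might be computed, the following fits an ordinary least-squares line through a set of (year, volumes) points. Least squares is only one common way of drawing a trend line, and we do not claim it is the method behind the figure; the function name `trend_line` and the sample counts are invented for illustration.

```python
def trend_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical counts of volumes by publication year (not real data).
years = [1950, 1960, 1970, 1980, 1990, 2000]
volumes = [30, 28, 25, 27, 20, 14]

slope, intercept = trend_line(years, volumes)
# A negative slope means fewer volumes for more recent publication years.
print(f"{slope:.3f} volumes per year")
```

The sign of the slope is what the trend line communicates at a glance; whether a straight line is a reasonable summary of the data is a separate judgment that the visualization itself cannot make for you.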
Scatterplots have the added advantage of being amenable to additional dimensions of data. The scatterplot below compares three dimensions of data: genre (qualitative), number of volumes of each genre in the collection (quantitative), and average number of pages per genre (quantitative). It shows us, for example, that the collection contains quite a few biographies, and biographies have far fewer pages on average than reference books. The scatterplot also shows us that it is fairly useless; there are no discernible trends or correlations between any of the variables, and no new insights emerge from viewing the visualization.
The histogram is a visualization that is both particularly useful and extremely deceptive for the unfamiliar. It appears to be a vertical bar chart, but instead of the horizontal axis representing categorical data, a histogram’s horizontal axis usually also represents quantitative data, sub-divided in a particular way. Another way of saying this is that in a bar chart, the categories can be moved left or right without changing the meaning of the visualization, whereas in a histogram, there is a definite order to the bars. For example, the figure below represents the histogram of grade distributions in a college class. It would not make sense for the letter grades to be in any order but the order presented below. Additionally, histograms always represent the distribution of certain values; that is, the height of the bar can never represent something like temperature or age, but instead represents the frequency with which some value appears. In the visualization below, bar height represents the frequency with which students in a college course get certain grades.
This histogram shows that the distribution of students’ grades does not follow a true bell curve, which would have as many As as Fs in the class. This is not surprising for anyone who has taught a course, but it is a useful visualization for representing such divergences from expected distributions.
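The two defining properties of a histogram described above, a fixed order of bins and bar heights that count how often a value occurs, can be sketched in a few lines. This is an illustration only: the grade list is invented, and `grade_histogram` and `GRADE_ORDER` are names we made up for the example.

```python
from collections import Counter

# The bins have a fixed order; unlike bar chart categories, they cannot be shuffled.
GRADE_ORDER = ["A", "B", "C", "D", "F"]


def grade_histogram(grades):
    """Return (grade, frequency) pairs in grade order, including empty bins."""
    counts = Counter(grades)
    return [(grade, counts.get(grade, 0)) for grade in GRADE_ORDER]


# Hypothetical grades for one class (not real data).
grades = list("AAABBBBBCCCCDDF")

for grade, count in grade_histogram(grades):
    # Bar length is the frequency of the grade, never the grade's own value.
    print(f"{grade} {'#' * count}")
```

Keeping empty bins in the output matters: a grade that nobody received is still part of the distribution, and dropping it would distort the shape.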
Despite their seeming simplicity, these very basic statistical visualizations can be instigators for extremely useful analyses. The visualization below shows the changing frequency of the use of “aboue” and “above” (spelling variations of the same word) in English printed text from 1580 to 1700. Sam Kaislaniemi noted in a blog post how surprising it is that the spelling variation seems to have changed so drastically in a period of two decades. This instigated further research, leading to an extended blog post and research into a number of other datasets from the same time period.
Basic maps may be considered scientific visualizations, because latitude and longitude form a pre-existing spatial reference system to which most geographic visualizations conform exactly. However, as content is added to a map, it may gain a layer or layers of information visualization.
One of the most common geographic visualizations is the choropleth, where bounded regions are colored and shaded to represent some statistical variable. Common uses for choropleths include representing population density or election results. The visualization below, created by Mike Bostock, colors counties by unemployment rate, with darker counties having higher unemployment. Choropleth maps should be used for ratios and rates rather than absolute values; otherwise, larger areas may be disproportionately colored darker due merely to the fact that there is more room for people to live.
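A short worked example may make the point about rates concrete. The county names and figures below are invented, and `unemployment_rate` is a helper written for this sketch rather than part of any mapping library.

```python
def unemployment_rate(unemployed, labor_force):
    """Share of the labor force that is unemployed, as a percentage."""
    return 100 * unemployed / labor_force


# Hypothetical counties: (unemployed people, labor force). Not real data.
counties = {
    "Large County": (40_000, 800_000),
    "Small County": (3_000, 20_000),
}

rates = {
    name: unemployment_rate(unemployed, labor_force)
    for name, (unemployed, labor_force) in counties.items()
}

for name, (unemployed, _) in counties.items():
    print(f"{name}: {unemployed} unemployed, rate {rates[name]:.1f}%")
```

Shaded by absolute count, the large county would appear far darker; shaded by rate, the small county is the one that stands out. The rate is the value a choropleth should encode.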
For some purposes, choropleths provide insufficient granularity for representing density. In the 1850s, a cholera outbreak in London left many concerned and puzzled over the origin of the epidemic. Dr. John Snow created a dot density map (below) showing the location of cholera cases in the city. The visualization revealed that most cases were around a single water pump, suggesting the outbreak was due to a contaminated water supply.
For representing absolute values on maps, you should instead consider using a proportional symbol map. The map below, created by Mike Bostock, shows the populations of some of America’s largest cities. These visualizations are good for directly comparing absolute values to one another, when geographic region size is not particularly relevant. Keep in mind that often, even if you plan on representing geographic information, the best visualizations may not be on a map. In this case, unless you are trying to show that the higher density of populous areas is in the Eastern U.S., you may be better served by a bar chart, with bar heights representative of population size. That is, the latitude and longitude of the cities are not particularly important in conveying the information we are trying to get across.
Data that continuously change throughout geographic space (e.g. temperature or elevation) require more complex visualizations. The most common in this case are known as isopleth, isarithmic, or contour maps, and they represent gradual change using adjacent, curving lines. Note that these visualizations work best for data which contain smooth transitions. The example topographic map below uses adjacent lines to show gradual changes in elevation; the closer together the lines, the more rapidly the elevation changes.
Geographic maps have one feature that sets them apart from most other visualizations: we know them surprisingly well. While few people can label every U.S. state or European country on a map accurately, we know the shape of the world enough to take some liberties with geographic visualizations that we cannot take with others. Cartograms are maps which distort the basic spatial reference system of latitude and longitude in order to represent some statistical value. They work because we know what the reference is supposed to look like, so we can immediately intuit how cartogram results differ from the “base map” we are familiar with. The cartogram below, created by M.E.J. Newman, distorts state sizes by their population, and colors the states by how they voted in the 2008 U.S. presidential election. It shows that, although a greater area of the United States may have voted Republican, those areas tended to be quite sparsely populated.
In the humanities, map visualizations will often need to be of historical or imagined spaces. While there are many convenient pipelines to create custom data overlays of maps, creating new maps entirely can be a grueling process with few easy tools to support it. It is never as simple as taking a picture of an old map and scanning it into the computer; the aspiring cartographer will need to painstakingly match points on an old scanned map to their modern latitude and longitude, or to create new map tiles entirely. The visualizations below are two examples of such efforts: the first is a reconstructed map of the ancient world which includes aqueducts, defense walls, sites, and roads by Johan Åhlfeldt with Pelagios, and the second is a reconstructed map of Tolkien’s Middle Earth by Emil Johansson. Both are examples of extremely careful humanistic work which involved both additional data layers and changes to the base map.
Hierarchies & Trees
The most common forms of visualization for this type of data are vertical and horizontal trees. The horizontal tree below, made in D3.js, shows the children and grandchildren of Josiah Wedgwood. These visualizations are extremely easy for most people to read, and have been used for many varieties of hierarchical data. Trees have the advantage of being more legible than most other network visualizations, but the disadvantage of being fairly restrictive in what they can visualize.
Another form of hierarchical visualization, called a radial tree, is often used to show ever-branching structures, as in an organization. The radial tree below, a 1924 organization chart taken from Wikipedia, emphasizes how power in the organization is centralized in one primary authority. It is important to remember that stylistic choices can deeply influence the message taken from a visualization. Horizontal and radial trees can represent the same information, but the former emphasizes change over time, whereas the latter emphasizes the centrality of the highest rung on the hierarchy. Both are equally valid, but they send very different messages to the reader.
One of the more recently popular hierarchical visualizations is the treemap, designed by Ben Shneiderman. Treemaps use nested rectangles to display hierarchies, the areas of which represent some quantitative value. The rectangles are often colored to represent a third dimension of data, either categorical or quantitative. The visualization below is of Washington, D.C.’s budget in 2013, separated into governmental categories. Rectangles are sized proportionally to the amount of money received per category in 2013, and colored by the percentage that amount had changed since the previous fiscal year.
Networks & Matrices
Network visualizations can be complex and difficult to read. Nodes and edges are not always represented as dots and lines, and even when they are, the larger the network, the more difficult they are to decipher. The reasons behind visualizing a network can differ, but in general, visualizations of small networks are best at allowing the reader to understand individual connections, whereas visualizations of large networks are best for revealing global structure.
Network visualizations, much like network analysis, may or may not add insight depending on the context. A good rule of thumb is to ask a network-literate friend reading the final product whether the network visualization helps them understand the data or the narrative any more than the prose alone. It often will not. We recommend not including a visualization of the data solely for the purpose of revealing the complexity of the data at hand, as it conveys little information, and feeds into a negative stereotype of network science as an empty methodology.
Matrix diagrams tend to be used more by computational social scientists than traditional social network analysts. They are exact, colorized versions of the matrix data structure discussed in section [xxx], and are good for showing community patterns in medium-to-large networks. They do not suffer from the same clutter as force-directed visualizations, but they also do not lend themselves to be read at the scale of individual actors.
This figure is a matrix visualization of character interactions in Victor Hugo’s Les Misérables. We made this visualization using Excel to reinforce the fact that matrix visualizations are merely data structures that have been colored and zoomed out. Each column is a character in the book, as is each row, and the list of character names is in the same order horizontally and vertically. A cell is shaded red if the character from that row interacted with the character from that column; it is shaded white if they did not. Note that only one of the matrix’s triangles is filled, because the network is symmetric.
We performed community detection on the network, and ordered the characters based on whether they were in a community together. That is why some areas are thick with red cells, and others are not; each triangular red cluster represents a community of characters that interact with one another. The vertical columns which feature many red cells are main characters who interact with many other characters in the book. The long column near the left-hand side, for example, is the character interactions of Jean Valjean.
Matrix diagrams can be extended to cover asymmetric networks by using both the matrix’s upper and lower triangles. Additional information can be encoded in the intensity of a color (signifying edge weight) or the hue of the shaded cell (indicating different categories of edges).
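To underline that a matrix diagram is just a colored data structure, here is a minimal sketch that builds one triangle of a symmetric interaction matrix and prints it as text. The function name `interaction_matrix` is our own, and the list of interactions is a small illustrative sample rather than the dataset behind the figure.

```python
def interaction_matrix(names, interactions):
    """Build the lower triangle of a symmetric 0/1 interaction matrix."""
    index = {name: position for position, name in enumerate(names)}
    size = len(names)
    matrix = [[0] * size for _ in range(size)]
    for first, second in interactions:
        i, j = index[first], index[second]
        # The network is symmetric, so one triangle carries all the information.
        row, column = max(i, j), min(i, j)
        matrix[row][column] = 1
    return matrix


# Rows and columns share the same character order.
names = ["Valjean", "Javert", "Cosette", "Marius", "Fantine"]

# Illustrative interactions only; not the full network.
interactions = [
    ("Valjean", "Javert"),
    ("Valjean", "Cosette"),
    ("Valjean", "Marius"),
    ("Valjean", "Fantine"),
    ("Cosette", "Marius"),
]

matrix = interaction_matrix(names, interactions)
for name, row in zip(names, matrix):
    # '#' plays the role of a red cell, '.' of a white one.
    print(f"{name:<8} " + " ".join("#" if cell else "." for cell in row))
```

For an asymmetric network, one would drop the `max`/`min` step and fill `matrix[i][j]` directly, so that both triangles are used; replacing the 1 with a number would encode edge weight.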
This visualization, created using the Sci2 tool, represents a subset of the Florentine families network. Force-directed layouts like this one attempt to reduce the number of edges that cross one another while simultaneously bringing more directly-connected nodes closer together. They do this by modeling the network as though it were a physics problem, as though each edge were a spring and each node a unit connecting various springs together. The computer simulates this system, letting the springs bounce around until each one is as little stretched as it possibly can be. At the same time, the nodes repel each other, like magnets of the same polarity, so the nodes do not appear too close together. Eventually, the nodes settle into a fairly legible graph like the figure above. This algorithm is an example of how force-directed layouts work, but they do not all use springs and magnets, and often they are a great deal more complex than described.
There are a few important takeaways from this algorithm. The first is that the layout is generally stochastic; there is an element of randomness that will orient the nodes and edges slightly differently every time it is run. The second is that the traditional spatial dimensions (vertical and horizontal) that are so often meaningful in visualizations have no meaning here. There is no x or y axis, and spatial distance from one node to another is not inherently meaningful. For example, had Figure [xxx] been laid out again, the Acciaiuoli family could just as easily have been closer to the Pazzi than the Salviati family, as opposed to in this case where the reverse is true. To properly read a force-directed network visualization, you need to retrain your visual understanding such that you are aware that it is edges, not spatial distance, which mark nodes as closer or farther away.
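The springs-and-magnets description can be sketched directly. What follows is a deliberately naive version written for this section, not the algorithm used by Sci2 or any other tool: the function `force_directed_layout`, its parameters, and their default values are all assumptions, and the edges among the family names are chosen for illustration rather than taken from the figure.

```python
import math
import random


def force_directed_layout(nodes, edges, iterations=200, seed=None,
                          spring_length=1.0, spring_k=0.1,
                          repulsion_k=0.05, step=0.1):
    """Return {node: [x, y]} after a simple spring-and-repulsion simulation."""
    rng = random.Random(seed)
    # Random starting positions are the source of the layout's stochasticity.
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    for _ in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}
        # Every pair of nodes repels, like magnets of the same polarity.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = max(math.hypot(dx, dy), 0.01)  # avoid division by zero
                push = repulsion_k / dist ** 2
                force[a][0] += push * dx / dist
                force[a][1] += push * dy / dist
                force[b][0] -= push * dx / dist
                force[b][1] -= push * dy / dist
        # Every edge acts as a spring pulling toward its resting length.
        for a, b in edges:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            dist = max(math.hypot(dx, dy), 0.01)
            pull = spring_k * (dist - spring_length)
            force[a][0] += pull * dx / dist
            force[a][1] += pull * dy / dist
            force[b][0] -= pull * dx / dist
            force[b][1] -= pull * dy / dist
        # Move each node a small step in the direction of its net force.
        for n in nodes:
            pos[n][0] += step * force[n][0]
            pos[n][1] += step * force[n][1]
    return pos


nodes = ["Acciaiuoli", "Medici", "Salviati", "Pazzi"]
# Illustrative edges only.
edges = [("Acciaiuoli", "Medici"), ("Medici", "Salviati"), ("Salviati", "Pazzi")]

first_run = force_directed_layout(nodes, edges, seed=1)
second_run = force_directed_layout(nodes, edges, seed=2)
for name in nodes:
    print(name, [round(c, 2) for c in first_run[name]],
          [round(c, 2) for c in second_run[name]])
```

Running the layout with two different seeds gives two different sets of coordinates for the same network, which is the practical meaning of both takeaways: the positions themselves carry no information, only the edges do.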
This style of visualization becomes more difficult to read as a network grows. Larger instantiations have famously been called “spaghetti-and-meatball visualizations” or “giant hairballs”, and it can be impossible to discern any particular details. Still, in some cases, these very large-scale force-directed networks can be useful in discerning patterns at-a-glance.
Matrix and force-directed visualizations are the two most common network visualizations, but they are by no means the only options. A quick search for chord diagrams, radial layouts, arc layouts, hive plots, circle packs, and others will reveal a growing universe of network visualizations. Picking which is appropriate in what situation can be more of an art than a science.
We recommend that, where possible, complex network visualizations should be avoided altogether. It is often easier and more meaningful for a historical narrative to simply provide a list of the most well-connected nodes, or, e.g., a scatterplot showing the relationship between connectivity and vocation. If the question at hand can be more simply answered with a traditional visualization which historians are already trained to read, it should be.
Small Multiples & Sparklines
Small multiples and sparklines are not exactly different types of visualization from those already discussed, but they represent a unique way of presenting visualizations that can be extremely compelling and effective. They embody the idea that simple visualizations can be more powerful than complex ones, and that multiple individual visualizations can often be more easily understood than one incredibly dense visualization.
Small multiples are exactly what they sound like: the use of multiple small visualizations adjacent to one another for the purposes of comparison. They are used in lieu of animations or one single extremely complex visualization attempting to represent the entire dataset. The visualization below, by Brian Abelson of OpenNews, is of cold- and warm-weather anomalies in the United States since 1964. Cold weather anomalies are in blue, and warm weather anomalies are in red. This visualization is used to show increasingly extreme warm weather due to global warming.
Sparklines, a term coined by Edward Tufte, are tiny line charts with no axis or legend. They can be used in the middle of a sentence, for example to show a changing stock price over the last week, which will show us general upward or downward trends, or in small multiples to compare several values. Microsoft Excel has a built-in sparkline feature for just such a purpose. The figure below is a screenshot from Excel, showing how sparklines can be used to compare the frequency of character appearances across different chapters of a novel.
The sparklines above quickly show Carol as the main character, and that two characters were introduced in Chapter 3, without the reader needing to look at the numbers in the rest of the spreadsheet.
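Sparklines need not come from Excel. As a rough sketch, the following draws text sparklines from Unicode block characters; it approximates the idea with tiny bars rather than lines. The function `sparkline` is our own, the chapter counts are invented, and the names other than Carol are hypothetical. Values are assumed to lie between `low` and `high`.

```python
BARS = "▁▂▃▄▅▆▇█"  # eight block characters of increasing height


def sparkline(values, low=None, high=None):
    """Map each value to a block character scaled between low and high."""
    low = min(values) if low is None else low
    high = max(values) if high is None else high
    if high == low:
        # A flat series has no shape to show; draw it along the baseline.
        return BARS[0] * len(values)
    scale = (len(BARS) - 1) / (high - low)
    return "".join(BARS[round((value - low) * scale)] for value in values)


# Hypothetical appearances per chapter, chapters 1 through 5.
appearances = {
    "Carol": [9, 8, 10, 9, 7],
    "Dmitri": [0, 0, 4, 5, 3],
    "Esther": [0, 0, 2, 6, 4],
}

for name, counts in appearances.items():
    # A shared scale (0 to 10) keeps the rows comparable with one another.
    print(f"{name:<8}{sparkline(counts, low=0, high=10)}")
```

The shared scale is a design choice: if each row were scaled to its own minimum and maximum, the rows would show the shape of each character's arc but would no longer reveal which character appears most.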
Choosing the Right Visualization
There is no right visualization. A visualization is a decision you make based on what you want your audience to learn. That said, there are a great many wrong visualizations. Using a scatterplot to show average rainfall by country is a wrong decision; using a bar chart is a right one. Ultimately, your choice of which type of visualization to use is determined by how many variables you are using, whether they are qualitative or quantitative, how you are trying to compare them, and how you would like to present them. Creating an effective visualization begins by choosing from one of the many appropriate types for the task at hand, and discarding inappropriate types as necessary. Once you have chosen the form your visualization will take, you must decide how you will create the visualization: what colors will you use? What symbols? Will there be a legend? The following sections cover these steps.
Once a visualization type has been chosen, the details may seem either self-evident or negligible. Does it really matter what color or shape the points are? In short, yes: it matters just as much as the choice of visualization type. And once you know how to use the various types of visual encoding effectively, you can design new forms of visualization that suit your needs perfectly. The art of visual encoding lies in matching data variables to graphic variables appropriately. Graphic variables include the color, shape, or position of objects in the visualization, whereas data variables are the things being visualized (e.g. temperature, height, age, country name, etc.).
Scales of Measure
The most important aspect of choosing an appropriate graphic variable is to know the nature of your data variables. Although the form data might take will differ from project to project, it will likely conform to one of five varieties: nominal, relational, ordinal, interval, or ratio.
Nominal data, also called categorical data, is a completely qualitative measurement. It represents different categories or labels or classes. Countries, people’s names, and different departments in a university are all nominal variables. They have no intrinsic order, and their only meaning is in how they differentiate from one another. We can put country names in alphabetical order, but that order does not say anything meaningful about their relationships to one another.
Relational data is data on how nominal data relate to one another. It is not necessarily quantitative, although it can be. Relational data requires some sort of nominal data to anchor it, and can include friendships between people, the existence of roads between cities, and the relationship between a musician and the instrument she plays. This type of data is usually, but not always, visualized in trees or networks. Quantitative aspects of relational data may include the length of a phone call between two people or the distance between two cities.
Ordinal data is that which has inherent order, but no inherent degree of difference between what is being ordered. The first, second, and third place winners in a race are on an ordinal scale, because we do not know how much faster first place was than second; only that one was faster than the other. Likert scales, commonly used in surveys (e.g. strongly disagree / disagree / neither agree nor disagree / agree / strongly agree), are an example of commonly-used ordinal data. Although order is meaningful for this variable, the fact that it lacks any inherent magnitude makes ordinal data a qualitative category.
Interval data is data which exists on a scale with meaningful quantitative magnitudes between values. It is like ordinal data in that the order matters, and additionally, the difference between the first and second values on the scale is the same as the difference between the second and third. Longitude, temperature in Celsius, and dates all exist on an interval scale.
Ratio data is data which, like interval data, has a meaningful order and a constant scale between ordered values, but additionally it has a meaningful zero value. The zero point of the AD (or CE) calendar is not mathematically meaningful; there is nothing physically special about it except our collective decision to start counting there. Thus, date is on an interval scale. Compare this to weight, age, or quantity; having no weight is physically meaningful and different both in quantity and kind to having some weight above zero.
Having a meaningful zero value allows us to use calculations with ratio data that we could not perform on interval data. For example, if one box weighs 50 lbs and another 100 lbs, we can say the second box weighs twice as much as the first. However, we cannot say a day that is 100°F is twice as hot as a day that is 50°F, and that is due to 0°F not being an inherently meaningful zero value.
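The box and temperature comparison can be checked with a few lines of arithmetic. The sketch below, in Python, re-expresses the two Fahrenheit readings in kelvins, a temperature scale whose zero is physically meaningful, and compares the resulting ratio with the naive one. The function name fahrenheit_to_kelvin is ours; the conversion itself is the standard one.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert Fahrenheit (interval scale) to kelvins (ratio scale, true zero)."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15

# Weight is already on a ratio scale, so dividing the raw numbers is meaningful.
weight_ratio = 100 / 50

# Dividing raw Fahrenheit readings also gives 2.0, but that number is an
# artifact of where the Fahrenheit scale happens to put its zero.
naive_temperature_ratio = 100 / 50

# Re-expressed on a scale with a meaningful zero, the ratio is far smaller.
kelvin_ratio = fahrenheit_to_kelvin(100) / fahrenheit_to_kelvin(50)

print("weight ratio:", weight_ratio)
print("naive Fahrenheit ratio:", naive_temperature_ratio)
print("kelvin ratio:", round(kelvin_ratio, 3))
```

In kelvins the hotter day comes out at roughly 1.1 times the cooler one, not twice, which is why ratios should only be read off variables that sit on a ratio scale.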
The nature of each of these data types will dictate which graphic variables may be used to visually represent them. The following section discusses several possible graphic variables, and how they relate to the various scales of measure.
Graphic Variable Types
Graphic variables are any of those visual elements that are used to systematically represent information in a visualization. They are building blocks. Length is a graphic variable; in bar charts, longer bars are used to represent larger values. Position is a graphic variable; in a scatterplot, a dot’s vertical and horizontal placement are used to represent its x and y values, whatever they may be. Color is a graphic variable; in a choropleth map of United States voting results, red is often used to show states that voted Republican, and blue for states that voted Democratic.
Unsurprisingly, some graphic variable types are better than others in representing different data types. Position in a 2D grid is great for representing quantitative data, whether it be interval or ratio. Area or length is particularly good for showing ratio data, as size also has a meaningful zero point. These have the added advantage of having a virtually unlimited number of discernible points, so they can be used just as easily for a dataset of 2 or 2 million. Compare this with angle. You can conceivably create a visualization that uses angle to represent quantitative values, as in the figure below. This is fine if you have very few, incredibly varied data points, but you will eventually reach a limit beyond which minute differences in angle are barely discernible. Some graphic variable types are fairly limited in the number of potential variations, whereas others have much wider range.
Most graphic variables that are good for fully quantitative data will work fine for ordinal data, although in those cases it is important to include a legend making sure the reader is aware that a constant change in a graphic variable is not indicative of any constant change in the underlying data. Changes in color intensity are particularly good for ordinal data, as we cannot easily tell the magnitude of difference between pairs of color intensity.
The three color variables (hue, saturation, and value) should be used to represent different data types. Except in one circumstance, discussed below, hue should only ever be used to represent nominal, qualitative data. People are not well-equipped to understand the quantitative difference between e.g. red and green. In a bar chart showing the average salary of faculty from different departments, hue can be used to differentiate the departments. Saturation and value, on the other hand, can be used to represent quantitative data. On a map, saturation might represent population density; in a scatterplot, saturation of the individual data points might represent somebody’s age or wealth. The one time hue may be used to represent quantitative values is when you have binary diverging data. For example, a map may show increasingly saturated blues for states which lean more Democratic, and increasingly saturated reds for states which lean more Republican. Besides this special case of two opposing colors, it is best to avoid using hue to represent quantitative data.
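As a rough illustration of the diverging case, the Python sketch below maps a single number to a color: the sign chooses between the two opposing hues, and the magnitude sets the saturation. The input convention (a "lean" between -1 for fully Democratic and +1 for fully Republican) and the function name lean_to_hex are assumptions made for this example, not a standard.

```python
def lean_to_hex(lean):
    """Map a lean in [-1, 1] to a hex color on a blue-white-red diverging scale.

    -1 is fully saturated blue, 0 is white, and +1 is fully saturated red.
    The sign picks the hue; the magnitude sets the saturation.
    """
    lean = max(-1.0, min(1.0, float(lean)))  # clamp out-of-range input
    strength = abs(lean)
    # The two channels that fade toward zero as saturation rises.
    faded = int(round(255 * (1.0 - strength)))
    if lean >= 0:
        r, g, b = 255, faded, faded  # white toward red
    else:
        r, g, b = faded, faded, 255  # white toward blue
    return f"#{r:02x}{g:02x}{b:02x}"

for lean in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(f"{lean:+.1f} -> {lean_to_hex(lean)}")
```

Only two hues ever appear, so the reader is never asked to judge how much "more" green is than red; all quantitative information is carried by saturation.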
Shape is good for nominal data, although only if there are under half a dozen categories. You will see shape used on scatterplots when differentiating between a few categories of data, but shapes run out quickly after triangle, square, and circle. Patterns and textures can also be used to distinguish categorical data; these are especially useful if you need to distinguish between categories on something like a bar chart, but the visualization must be printed in black & white.
Relational data is among the most difficult to represent. Distance is the simplest graphic variable for representing relationships (closer objects are more closely related), but that variable can get cumbersome quickly for large datasets. Two other graphic variables to use are enclosure (surrounding related items with an enclosing line) and line connections (connecting related items directly via a solid line). Each has its strengths and weaknesses, and a lot of the art of information visualization comes in learning when to use which variable.
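The pairings discussed in this section can be collected into a small lookup table. The Python sketch below is only a paraphrase of the guidance above, offered as a starting point rather than a rulebook; the names SUGGESTED_VARIABLES and suggest_variables are hypothetical.

```python
# Pairings paraphrased from the discussion above; a starting point, not a rulebook.
SUGGESTED_VARIABLES = {
    "nominal": ["hue", "shape", "pattern or texture"],
    "ordinal": ["color intensity (saturation or value)", "position", "length"],
    "interval": ["position", "saturation", "value"],
    "ratio": ["position", "length", "area", "saturation", "value"],
    "relational": ["distance", "enclosure", "line connection"],
}

def suggest_variables(scale):
    """Return candidate graphic variables for a scale of measure."""
    key = scale.strip().lower()
    if key not in SUGGESTED_VARIABLES:
        known = ", ".join(sorted(SUGGESTED_VARIABLES))
        raise ValueError(f"unknown scale {scale!r}; expected one of: {known}")
    # Return a copy so callers cannot alter the table by accident.
    return list(SUGGESTED_VARIABLES[key])

print(suggest_variables("Ordinal"))
print(suggest_variables("relational"))
```

A table like this cannot capture the exceptions (hue for binary diverging data, for instance), so it is best treated as a checklist to consult before making a considered choice.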
Cognitive and Social Aspects of Visualization
Luckily for us, there are a few gauges for choosing between visualization types and graphic variables that go beyond the merely aesthetic. Social research has shown various ways in which people process what they see, and that research should guide our decisions in creating effective information visualizations.
Roughly a tenth of all men and fewer than a hundredth of all women have some form of color blindness. There are many varieties of color blindness; some people see completely in monochrome, others have difficulty distinguishing between red and green, or between blue and green, or other combinations besides. To compensate, visualizations may encode the same data in multiple variables. Traffic lights present a perfect example of this; most of us are familiar with the red, yellow, and green color scheme, but for those people who cannot distinguish among these colors, top-to-bottom position sends the same signal. If you need to pick colors that are safely distinguishable by most audiences, a few online services can assist. One popular service is ColorBrewer (http://colorbrewer2.org/), which allows you to create a color scheme that fits whatever set of parameters you may need.
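The traffic-light idea, encoding the same data in more than one graphic variable, can be sketched in Python as follows. Each category receives both a color and a shape, so a reader who cannot tell the colors apart can still rely on the shapes. The three hex codes are taken from ColorBrewer's qualitative "Dark2" scheme; the shape names, the department names, and the function redundant_encoding are illustrative assumptions.

```python
# Three colors from ColorBrewer's qualitative "Dark2" scheme. Check on the
# ColorBrewer site that the scheme you choose is marked colorblind safe.
COLORS = ["#1b9e77", "#d95f02", "#7570b3"]
# Shape names are illustrative; use whatever your plotting tool calls them.
SHAPES = ["circle", "square", "triangle"]

def redundant_encoding(categories):
    """Give each category both a color and a shape, so either cue alone suffices."""
    categories = list(dict.fromkeys(categories))  # drop duplicates, keep order
    limit = min(len(COLORS), len(SHAPES))
    if len(categories) > limit:
        raise ValueError(f"only {limit} distinct color and shape pairs are defined")
    return {
        cat: {"color": color, "shape": shape}
        for cat, color, shape in zip(categories, COLORS, SHAPES)
    }

# Hypothetical department names, used only for illustration.
encoding = redundant_encoding(["History", "Physics", "Music"])
for dept, style in encoding.items():
    print(dept, style["color"], style["shape"])
```

The function refuses to go beyond three categories on purpose: distinguishable shapes and safely distinguishable colors both run out quickly, and it is better to fail loudly than to reuse a pairing.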
In 2010, Randall Munroe conducted a massive online survey asking people to name the colors they were randomly presented. The results showed that women disproportionately named colors more specifically than men, such that where a woman might have labeled a color as neon green, a man might have just named it green. This does not imply women can more easily differentiate between colors, as some occasionally suggest, although part of the survey results definitely does show the disproportionate number of men who have some form of color blindness. Beyond sex and gender, culture also plays a role in the interpretation of colors in a visualization. In some cultures, death is signified by the color black; in others, it is signified by white. In most cultures, both heat and passion are signified by the color red; illness is often, but not always, signified by yellow. Your audience should influence your choice of color palette, as readers will always come to a visualization with preconceived notions of what your graphic variables imply.
Gestalt psychology is a century-old practice of understanding how people perceive patterns. It attempts to show how we perceive separate visual elements as whole units: how we organize what we see into discernible objects. Among the principles of Gestalt psychology are:
- Proximity. We perceive objects that are close together as being part of a single group.
- Similarity. Objects that are visually similar (e.g. the same color) will be perceived as part of a single group.
- Closure. We tend to fill in the blanks when there is missing information; if we see a box of which two corners are absent, we will still see the box as a single unit, rather than two disjointed line segments.
These and other Gestalt principles can be used to make informed decisions regarding graphic variables. Knowing which patterns tend to lead to perceptions of continuity or discontinuity is essential in making effective information visualizations.
At a more fine-grained level, when choosing between equally appropriate graphic variables, research on preattentive processing can steer us in the right direction. We preattentively process certain graphic variables, meaning we can spot differences in those variables almost instantly, typically in under about 200 to 250 milliseconds. Color is a preattentively processed graphic variable, and thus in the visualization below, you will very quickly spot the dot that does not belong. That the processing is preattentive implies you do not need to actively search for the difference to find it.
Size, orientation, color, density, and many other variables are preattentively processed. The issue comes when you have to combine multiple graphic variables and, in most visualizations, that is precisely the task at hand. When combining graphic variables (e.g. shape and color), what was initially preattentively processed often loses its ease of discovery. Research into preattentive processing can then be used to show which combinations are still useful for quick information gathering. One such combination is spatial distance and color. In the visualization below, you can quickly pick out both the two spatially distinct groups and the two distinct colors.
Another important limitation of human perception to keep in mind is change blindness. When people are presented with two pictures of the same scene, one after the other, and the second picture is missing some object that was in the first, it is surprisingly difficult to discern what has changed between the two images. The same holds true for animated or dynamic visualizations. We have difficulty holding in our minds the information from previous frames, and so while an animation seems a noble way of visualizing temporal change, it is rarely an effective one. Replacing an animation with small multiples, or some other static visualization, will improve the reader’s ability to notice specific changes over time.
Making an Effective Visualization
If choosing the data to go into a visualization is the first step, picking a general form the second, and selecting appropriate visual encoding the third, the final step for putting together an effective information visualization is in following proper aesthetic design principles. This step will help your visualization be both effective and memorable. We draw inspiration for this section from Edward Tufte’s many books on the subject, and Angela Zoss’s excellent online guide to Information Visualization (http://guides.library.duke.edu/content.php?pid=355157).
One seemingly obvious principle that is often not followed is to make sure your visualization is high resolution. The smallest details and words on the visualization should be crisp, clear, and completely legible. In practice, this means saving your graphics in large resolutions or creating your visualizations as scalable vector graphics. Keep in mind that most projectors in classrooms still do not have as high a resolution as a piece of printed paper, so creating a printout for students or attendees of a lecture may be more effective than projecting your visualization on a screen.
Another important element of visualizations that is often left out is the legend, which describes each graphic variable in detail and explains how those graphic variables relate to the underlying data. Most visualization software does not automatically create legends, and so they become a neglected afterthought. A good legend means the difference between a pretty but undecipherable picture and an informative scholarly visualization. Adobe Photoshop and Illustrator, as well as the free Inkscape and Gimp, are all good tools for creating legends.
A good rule of thumb when designing visualizations is to increase your data:ink ratio as much as possible. Maximize data, minimize ink. Extraneous lines, bounding boxes, and other design elements can distract from the data being presented. The figures below compare two charts that are identical except for the amount of extraneous ink.
A related rule is to avoid chartjunk at all costs. Chartjunk is the collection of artistic flourishes that newspapers and magazines stick in their data visualizations to make them more eye-catching: a man blowing over in a heavy storm next to a visualization of today’s windy weather, or a house crumbling down to represent the collapsing housing market. Chartjunk may catch the eye, but it ultimately distracts from the data being presented, and readers will take more time to digest the information in front of them.
Stylized graphical effects can be just as distracting as chartjunk. “Blown out” pie charts where the pie slices are far apart from one another, 3D bar charts, and other stylistic quirks that Excel provides are poor window dressing and can actually decrease your audience’s ability to read your visualization. In a 3D tilted pie chart, for example, it can be quite difficult to visually estimate the area of each pie slice. The tilt makes the pie slices in the back seem smaller than those in the front, and the 3D aspect confuses readers about whether they should be estimating area or volume.
While not relevant for every visualization, it is important to remember to label your axes and to make sure each axis is scaled appropriately. Particularly, the vertical axis of bar charts should begin at zero. The figure below is a perfect example of how to lie with data visualization by starting the axis far too high, making it seem as though a small difference in the data is actually a large difference.
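The distortion introduced by a truncated axis can be put into numbers. Tufte calls the ratio between the size of the effect shown in a graphic and the size of the effect in the data the "lie factor". The Python sketch below applies that idea to two bars drawn from a chosen baseline; the data values and the function signature are hypothetical and chosen only for illustration.

```python
def lie_factor(first, second, baseline):
    """Ratio of the change a bar chart appears to show to the change in the data.

    first, second: the two data values being compared (first must be nonzero,
    and the two must differ).
    baseline: the value at which the vertical axis starts (must differ from first).
    A result of 1.0 means the drawn bars are proportional to the data.
    """
    data_effect = (second - first) / first
    # Bars are drawn from the baseline, so each bar's height is value - baseline.
    first_height = first - baseline
    second_height = second - baseline
    drawn_effect = (second_height - first_height) / first_height
    return drawn_effect / data_effect

# Hypothetical values: a 10% increase from 50 to 55.
print(lie_factor(50, 55, baseline=0))   # axis starts at zero
print(lie_factor(50, 55, baseline=45))  # axis starts at 45
```

With the axis starting at zero the factor is 1, meaning the bars are honest. Starting the axis at 45 makes the second bar twice as tall as the first, so a 10% difference in the data is drawn as a 100% difference, a tenfold exaggeration.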
There is an art to perfecting a visualization. No formula will tell you what to do in every situation, but by following these steps (1. Pick your data, 2. Pick your visualization type, 3. Pick your graphic variables, 4. Follow basic design principles), the visualizations you create will be effective and informative. Combining this with the lessons elsewhere in this book regarding text, network, and other analyses should form the groundwork for producing effective digital history projects.