An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Visual Encoding

Once a visualization type has been chosen, the details may seem either self-evident or negligible. Does it really matter what color or shape the points are? In short, yes: it matters just as much as the choice of visualization type. And when you know how to use the various types of visual encoding effectively, you can design new forms of visualization that suit your needs perfectly. The art of visual encoding lies in matching data variables to graphic variables appropriately. Graphic variables include the color, shape, or position of objects in the visualization, whereas data variables are the things being visualized (e.g. temperature, height, age, country name).

Scales of Measure

The most important aspect of choosing an appropriate graphic variable is knowing the nature of your data variables. Although the form data takes will differ from project to project, it will likely conform to one of five varieties: nominal, relational, ordinal, interval, or ratio.

Nominal data, also called categorical data, is a completely qualitative measurement. It represents different categories, labels, or classes. Countries, people’s names, and different departments in a university are all nominal variables. They have no intrinsic order, and their only meaning lies in how they differ from one another. We can put country names in alphabetical order, but that order does not say anything meaningful about their relationships to one another.

Relational data is data on how nominal data relate to one another. It is not necessarily quantitative, although it can be. Relational data requires some sort of nominal data to anchor it, and can include friendships between people, the existence of roads between cities, and the relationship between a musician and the instrument she plays. This type of data is usually, but not always, visualized in trees or networks. A quantitative aspect of relational data may be the length of a phone call between people or the distance between two cities.
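
As a minimal sketch of how relational data might be stored before it is drawn as a network, the Python below keeps each relationship as a pair of nominal anchors plus an optional quantitative value. The city names, distances, and the `build_adjacency` helper are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical relational data: (city, city, road distance in km).
# The city names are the nominal anchors; the distance is the
# optional quantitative aspect of the relationship.
roads = [
    ("Avon", "Brill", 12.0),
    ("Brill", "Cove", 30.5),
]


def build_adjacency(edges):
    """Turn a list of (a, b, weight) relationships into an adjacency map."""
    adjacency = defaultdict(dict)
    for a, b, weight in edges:
        # Roads run both ways, so record the relationship in both directions.
        adjacency[a][b] = weight
        adjacency[b][a] = weight
    return adjacency


adjacency = build_adjacency(roads)
for city, neighbours in adjacency.items():
    print(city, "->", neighbours)
```

An adjacency map of this kind is one common starting point for tree and network layouts; a purely qualitative relationship could simply omit the weight.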

Ordinal data has an inherent order, but no inherent degree of difference between the things being ordered. The first, second, and third place winners in a race are on an ordinal scale, because we do not know how much faster first place was than second; only that one was faster than the other. Likert scales, commonly used in surveys (e.g. strongly disagree / disagree / neither agree nor disagree / agree / strongly agree), are a familiar example of ordinal data. Although order is meaningful for this variable, the fact that it lacks any inherent magnitude makes ordinal data a qualitative category.
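
The race example can be sketched in a few lines of Python. The runners and finishing times below are invented for illustration; the point is only that reducing times to places keeps the order and discards the size of the gaps.

```python
# Hypothetical finishing times in seconds (illustrative values only).
finish_times = {"Ana": 61.2, "Ben": 61.3, "Cy": 75.0}

# Reduce the times to an ordinal scale: first, second, third place.
ranked = sorted(finish_times, key=finish_times.get)
places = {name: position for position, name in enumerate(ranked, start=1)}

for name in ranked:
    print(places[name], name)
```

Each place is one step from the next, yet in this made-up data the first two runners finished a tenth of a second apart while the third trailed by more than thirteen seconds. Nothing in the places alone reveals that.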

Interval data is data that exists on a scale with meaningful quantitative magnitudes between values. It is like ordinal data in that the order matters, and additionally the difference between adjacent points on the scale is constant: the gap between the first and second values is the same as the gap between the second and third. Longitude, temperature in Celsius, and dates all exist on an interval scale.

Ratio data is data which, like interval data, has a meaningful order and a constant scale between ordered values, but additionally has a meaningful zero value. Weight, age, and quantity are examples; having no weight is physically meaningful, and differs both in quantity and in kind from having some weight above zero.

Having a meaningful zero value allows us to perform calculations with ratio data that we could not perform on interval data. For example, if one box weighs 50 lbs and another 100 lbs, we can say the second box weighs twice as much as the first. However, we cannot say a day that is 100°F is twice as hot as a day that is 50°F, because 0°F is not an inherently meaningful zero value.
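
The arithmetic behind this example can be sketched in Python. The conversion to Kelvin is an addition for illustration (Kelvin is a temperature scale whose zero is physically meaningful), and `fahrenheit_to_kelvin` is a helper name introduced here.

```python
def fahrenheit_to_kelvin(temp_f: float) -> float:
    """Convert Fahrenheit (interval scale) to Kelvin (ratio scale)."""
    return (temp_f - 32.0) * 5.0 / 9.0 + 273.15


# Ratio data: weight has a true zero, so this ratio is meaningful.
weight_ratio = 100.0 / 50.0

# Interval data: dividing Fahrenheit readings gives the same number,
# but it does not mean "twice as hot" because 0°F is an arbitrary zero.
naive_temp_ratio = 100.0 / 50.0

# Measured from a physically meaningful zero, the two days are much closer.
kelvin_ratio = fahrenheit_to_kelvin(100.0) / fahrenheit_to_kelvin(50.0)

print(weight_ratio, naive_temp_ratio, round(kelvin_ratio, 3))
```

The Kelvin ratio comes out at roughly 1.1 rather than 2, which is why the "twice as hot" reading of the Fahrenheit figures does not hold.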

The nature of each of these data types will dictate which graphic variables may be used to visually represent them. The following section discusses several possible graphic variables, and how they relate to the various scales of measure.

Graphic Variable Types

Graphic variables are the visual elements used to systematically represent information in a visualization. They are building blocks. Length is a graphic variable; in bar charts, longer bars are used to represent larger values. Position is a graphic variable; in a scatterplot, a dot’s vertical and horizontal placement are used to represent its x and y values, whatever they may be. Color is a graphic variable; in a choropleth map of United States voting results, red is often used to show states that voted Republican, and blue for states that voted Democratic.

Unsurprisingly, some graphic variable types are better than others at representing different data types. Position in a 2D grid is great for representing quantitative data, whether it be interval or ratio. Area or length is particularly good for showing ratio data, as size also has a meaningful zero point. These have the added advantage of having a virtually unlimited number of discernible points, so they can be used just as easily for a dataset of 2 or 2 million. Compare this with angle. You can conceivably create a visualization that uses angle to represent quantitative values, as in figure 5.27. This is fine if you have very few, incredibly varied data points, but you will eventually reach a limit beyond which minute differences in angle are barely discernible. Some graphic variable types are fairly limited in the number of potential variations, whereas others have a much wider range.
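
As a rough sketch of why length suits ratio data, the Python below maps values to bar lengths with a linear scale anchored at zero, so that a value twice as large is drawn twice as long. The `linear_scale` and `bar` helpers and the 40-character width are assumptions made for this example, not part of any particular plotting library.

```python
def linear_scale(value: float, domain_max: float, range_max: float) -> float:
    """Map a ratio value in [0, domain_max] to a length in [0, range_max].

    Zero in the data maps to zero length, so the visual zero point
    matches the meaningful zero of the ratio scale.
    """
    if domain_max <= 0:
        raise ValueError("domain_max must be positive")
    return value / domain_max * range_max


def bar(value: float, domain_max: float, width: int = 40) -> str:
    """Draw a text bar whose length is proportional to the value."""
    return "#" * round(linear_scale(value, domain_max, width))


# The two boxes from the earlier example, weighing 50 lbs and 100 lbs.
for weight in (50, 100):
    print(f"{weight:>4} lbs {bar(weight, domain_max=100)}")
```

Because the scale starts at zero, the second bar is exactly twice the length of the first, mirroring the statement that the second box weighs twice as much.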

[Figure 5.27]

Most graphic variables that are good for fully quantitative data will work fine for ordinal data, although in those cases it is important to include a legend that makes clear to the reader that a constant change in a graphic variable does not indicate a constant change in the underlying data. Changes in color intensity are particularly good for ordinal data, as we cannot easily tell the magnitude of difference between pairs of color intensity.

Color is a particularly tricky concept in information visualization. Three variables can be used to describe color: hue, value, and saturation (figure 5.28).

[Figure 5.28]

These three variables should be used to represent different variable types. Except in one circumstance, discussed below, hue should only ever be used to represent nominal, qualitative data. People are not well-equipped to understand the quantitative difference between, say, red and green. In a bar chart showing the average salary of faculty from different departments, hue can be used to differentiate the departments. Saturation and value, on the other hand, can be used to represent quantitative data. On a map, saturation might represent population density; in a scatterplot, saturation of the individual data points might represent somebody’s age or wealth. The one time hue may be used to represent quantitative values is when you have binary diverging data. For example, a map may show increasingly saturated blues for states that lean more Democratic, and increasingly saturated reds for states that lean more Republican. Besides this special case of two opposing colors, it is best to avoid using hue to represent quantitative data.
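
These guidelines can be sketched with Python's standard `colorsys` module, which converts hue, saturation, and value into red, green, and blue components between 0 and 1. The three helper functions, the specific hue numbers, and the convention that a positive lean means Republican are all assumptions made for illustration.

```python
import colorsys


def nominal_colors(categories):
    """Give each category its own hue, evenly spaced around the color wheel."""
    count = len(categories)
    return {
        category: colorsys.hsv_to_rgb(index / count, 0.8, 0.9)
        for index, category in enumerate(categories)
    }


def sequential_color(fraction, hue=0.6):
    """Hold hue fixed and let saturation encode a quantity between 0 and 1."""
    fraction = min(max(fraction, 0.0), 1.0)
    return colorsys.hsv_to_rgb(hue, fraction, 1.0)


def diverging_color(lean):
    """Map a lean in [-1, 1] to one of two opposing hues.

    Assumed convention: positive leans are drawn in red, negative in blue,
    and the strength of the lean sets the saturation.
    """
    hue = 0.0 if lean > 0 else 0.66  # 0.0 is red, 0.66 is close to blue
    return colorsys.hsv_to_rgb(hue, min(abs(lean), 1.0), 1.0)


print(nominal_colors(["History", "Physics", "Music"]))
print(sequential_color(0.25), sequential_color(0.9))
print(diverging_color(-0.8), diverging_color(0.3))
```

In this sketch hue only ever separates categories or the two sides of a divide, while the quantity itself is carried by saturation, so a value of zero fades to white in both the sequential and the diverging case.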

Shape is good for nominal data, although only if there are fewer than half a dozen categories. You will see shape used on scatterplots to differentiate between a few categories of data, but shapes run out quickly after triangle, square, and circle. Patterns and textures can also be used to distinguish categorical data; these are especially useful if you need to distinguish between categories on something like a bar chart, but the visualization must be printed in black and white.

Relational data is among the most difficult to represent. Distance is the simplest graphic variable for representing relationships (closer objects are more closely related), but that variable can get cumbersome quickly for large datasets. Two other graphic variables to use are enclosure (surrounding related items with an enclosing line) and line connections (connecting related items directly via a solid line). Each has its strengths and weaknesses, and a lot of the art of information visualization comes in learning when to use which variable.

Source: http://www.themacroscope.org/?page_id=878