Common Visualization Mistakes
Last updated
Chapter 18
By Kathy Chang, Kate Eyler-Werve, and Alberto Cairo
Welcome to the last (but certainly not least) chapter of the book! We hope you’ve learned enough to appreciate how much good data and good design can help you communicate your message. Professional data visualizers get excited by the stories they want to tell as well, but sometimes they forget to follow some best practices while doing so. It happens to the best of us. So in this chapter, we’re going to cover what those best practices are.
One of the ways a graph can be distorted is by truncating an axis. This happens when an axis is shortened because one or both of its ends get cut off.
Sometimes a distortion like this is really obvious. For example, let’s say there are two allergy medicines called Happajoy and Pollaway. The bar graph below compares how effective these two medicines are at reducing the tearful, congested misery known as allergy symptoms. If you quickly glance at the bars, you may think that Happajoy is twice as effective as Pollaway because its bar is twice as tall. But if you examine the graph more closely, you’ll see that the y-axis is truncated, starting at 30.2 percent and going up to only 30.7 percent. The truncated y-axis makes the difference between the two bars look artificially large. In reality, Happajoy’s effectiveness is only 0.2 percentage points higher than Pollaway’s, which is not as impressive as the results implied by the bar graph.
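One way to check how much a truncated axis exaggerates a difference is to compare the ratio of the bar heights as drawn with the ratio you would get from a zero baseline. The sketch below is only an illustration: the two effectiveness values (30.4 and 30.6 percent) are assumed, chosen to be consistent with the description above, and `drawn_height` is a hypothetical helper name.

```python
def drawn_height(value, axis_min):
    """Height of a bar as drawn when the axis starts at axis_min."""
    return value - axis_min


# Assumed values, chosen to match the description in the text (not real data).
pollaway = 30.4   # percent effective
happajoy = 30.6   # percent effective
axis_min = 30.2   # where the truncated y-axis starts

# Ratio of the bar heights as they appear on the truncated axis.
apparent_ratio = drawn_height(happajoy, axis_min) / drawn_height(pollaway, axis_min)

# Ratio of the bar heights when the axis starts at zero.
true_ratio = drawn_height(happajoy, 0) / drawn_height(pollaway, 0)

print(f"Apparent ratio on the truncated axis: {apparent_ratio:.2f}")
print(f"Ratio with a zero baseline:           {true_ratio:.3f}")
```

With these assumed values, the truncated axis draws one bar about twice as tall as the other, while the values themselves differ by well under one percent.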
Sometimes a truncated axis and the resulting distortion can be more subtle. For example, the next graph shows the quantity of Happajoy sold from January through April 2014.
At first glance, there doesn’t appear to be a truncation issue here. The y-axis starts at zero, so that’s not a problem. The critical thing to understand is that it’s the x-axis that’s been truncated this time: we’re seeing sales from less than half the year. Truncating a time period like this can give the wrong impression, especially for things that go through cycles. And—you guessed it—the sale of allergy medicine goes through a seasonal cycle since allergy symptoms are typically higher in the spring and lower in the winter.
What would be a better way to show sales of Happajoy? Something like the graph below:
This graph shows the same dataset of Happajoy sales, except this time the y-axis is proportional and the x-axis covers two full years instead of just a few months. We can clearly see that sales of Happajoy went down in the winter and up in the spring, but that the rate of sales didn’t change much from year to year. In fact, sales were a little lower in 2014 than in 2013.
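For seasonal data, a rough numerical check is to compare the same months across years instead of looking only at growth inside a short window. The sketch below uses monthly figures that are entirely made up for illustration (they are not Happajoy’s actual sales), and `pct_change` is a hypothetical helper.

```python
# Hypothetical monthly unit sales (thousands), January through December.
# These numbers are invented to illustrate a seasonal product; they are not from the book.
sales_2013 = [40, 55, 80, 110, 120, 100, 70, 60, 65, 50, 35, 30]
sales_2014 = [38, 52, 76, 105, 114, 95, 67, 57, 62, 48, 33, 29]


def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


# Growth inside the truncated window: January to April 2014.
within_window = pct_change(sales_2014[0], sales_2014[3])

# The same four months compared year over year.
same_months_yoy = pct_change(sum(sales_2013[:4]), sum(sales_2014[:4]))

# The full year compared year over year.
full_year_yoy = pct_change(sum(sales_2013), sum(sales_2014))

print(f"Jan to Apr 2014:          {within_window:+.1f}%")
print(f"Jan-Apr 2014 vs 2013:     {same_months_yoy:+.1f}%")
print(f"Full year 2014 vs 2013:   {full_year_yoy:+.1f}%")
```

In this made-up series, sales more than double between January and April, yet both year-over-year comparisons are slightly negative. That is the pattern the short time window hides.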
When you compare the last two graphs of Happajoy sales, do you see how different their stories are? If you were an investor in Happajoy’s company and you saw the graph with truncated axes, you might dance happily through the streets because it seems like the company is doing really well. On the other hand, if you saw the graph with proportional axes, you might reach for some aspirin instead of Happajoy because of the headache you’d get worrying about the overall decrease in sales.
So watch out for truncated axes. Sometimes these distortions are done on purpose to mislead readers, but other times they’re the unintentional consequence of not knowing how truncated axes can skew data.
The first principle is that you must not fool yourself — and you are the easiest person to fool.
- Richard Feynman, 1974 Caltech Graduation Address
You know how famous people are sometimes criticized for something they said, and they often reply that they were quoted out of context? Context is important, especially when it comes to data. It’s very easy to fool yourself by leaving out variables that could affect how you interpret the data. So whenever you’re examining a variable and its relationships, carefully consider the ecosystem in which that variable exists and deliberately seek out other variables that could affect the one you’re studying.
This is easier said than done. For example, the map below shows each state’s market leader in allergy medicine: Happajoy is the leader in dark blue states, while Pollaway is the leader in light blue states. On the surface, it might seem like Happajoy is the market leader nationally, ahead of Pollaway. But to get the complete picture you have to pay attention to other variables.
For example, the bar graph below shows a breakdown of market share in each state. (We’re only going to look at the western continental states for now.) The margins by which Happajoy leads are significantly less than the margins by which Pollaway leads.
Combine the information from the bar graph with the table below. The total sales in states where Happajoy is the leader are also significantly lower than the total sales in states where Pollaway is the leader. When you add up the numbers, Pollaway’s total sales are more than twice Happajoy’s. Assuming that a similar pattern holds for the rest of the country, would it still be accurate to say that Happajoy is the national market leader in allergy medicine?
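The arithmetic behind this comparison can be sketched in a few lines of Python. The sales figures are copied from this chapter’s table. The `lead_margin` and `mean_margin` helpers are hypothetical, and they assume the market consists of just these two medicines, which the text does not state.

```python
# Sales figures copied from the table in this chapter: (Happajoy, Pollaway).
sales = {
    "WY": (299_734, 219_037),
    "ND": (349_814, 279_851),
    "SD": (408_343, 341_675),
    "MT": (482_400, 422_100),
    "ID": (782_040, 654_360),
    "NE": (872_320, 798_080),
    "NM": (1_043_000, 834_400),
    "NV": (1_489_860, 993_240),
    "UT": (1_313_300, 1_256_200),
    "KS": (1_414_140, 1_183_260),
    "OK": (1_907_500, 1_526_000),
    "OR": (2_027_480, 1_481_620),
    "AZ": (3_014_380, 2_883_320),
    "CO": (1_297_000, 3_372_200),
    "WA": (2_069_100, 4_138_200),
    "TX": (5_733_200, 17_720_800),
    "CA": (7_608_000, 26_628_000),
}


def lead_margin(happajoy, pollaway):
    """Leader's share minus runner-up's share, in percentage points.

    Assumes these two products make up the whole market (an assumption).
    """
    return abs(happajoy - pollaway) / (happajoy + pollaway) * 100


def mean_margin(states):
    """Average lead margin across the given states."""
    return sum(lead_margin(*sales[s]) for s in states) / len(states)


# What the map shows: which medicine leads in each state.
happajoy_led = [s for s, (h, p) in sales.items() if h > p]
pollaway_led = [s for s, (h, p) in sales.items() if p > h]

# What the map hides: how much each medicine sells in total.
happajoy_total = sum(h for h, _ in sales.values())
pollaway_total = sum(p for _, p in sales.values())

print(f"States led: Happajoy {len(happajoy_led)}, Pollaway {len(pollaway_led)}")
print(f"Average lead margin: Happajoy {mean_margin(happajoy_led):.1f} pts, "
      f"Pollaway {mean_margin(pollaway_led):.1f} pts")
print(f"Total sales: Happajoy {happajoy_total:,}, Pollaway {pollaway_total:,}")
print(f"Pollaway / Happajoy: {pollaway_total / happajoy_total:.2f}")
```

Counting states, Happajoy leads in 13 of the 17. Counting sales, Pollaway sells about twice as much.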
The lesson here is that if you want to provide a fair picture of what’s going on, you have to expand your scope to show the variables that put things in their proper context. That way, you can provide your readers with a more complete and nuanced picture of the story you’re trying to tell.
Life is complicated, right? Data can be complicated, too. Complicated is hard to communicate, so it’s only natural to want to simplify what your data are saying. But there is such a thing as simplifying too much. Oversimplifying is related to the previous point about not expanding the scope enough to provide a clear picture of what’s going on.
For example, let’s say you’re an investor in RediMedico, the maker of Happajoy, and you attend the annual sales meeting. The CEO of RediMedico starts off the presentation with the following graphic:
Now the investor in you might look at that graphic and start daydreaming about all the wonderful things you’ll do with such a great return on your investment. But then the data pro in you kicks in, and you start wondering about what that 18% increase means. You ask yourself:
Compared to what?
Compared to when?
Compared to whom?
These are all worthwhile questions to answer with a visualization! Thankfully, the CEO of RediMedico agrees and presents the next graphic, which compares the revenues from the five top-selling medicines RediMedico makes:
If we do some number-crunching, we see that the average increase in revenue between 2013 and 2014 is indeed 18%. However, we also see that this increase is primarily due to a whopping 225% increase in revenue from a single medicine, Exoalgio. Revenue from 3 out of 5 medicines actually dropped. So RediMedico’s first graphic tells part of the truth, while the second graphic tells the whole truth by presenting the details behind the single number.
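To make the number-crunching concrete, here is a small Python sketch. The revenue figures are invented for illustration, since the original graphic is not reproduced here. They were chosen only so that three facts from the text hold: an 18% overall rise (read here as the change in total revenue, which is an assumption), a 225% rise for Exoalgio, and a drop for three of the five medicines. The names "Medicine A" through "Medicine D" are placeholders.

```python
# Hypothetical revenues in arbitrary units: (2013, 2014).
# Every figure is invented; only the 18% total rise, Exoalgio's 225% rise,
# and "three of five medicines dropped" come from the text.
revenue = {
    "Exoalgio":   (20, 65),
    "Medicine A": (60, 55),
    "Medicine B": (50, 46),
    "Medicine C": (40, 37),
    "Medicine D": (30, 33),
}


def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


total_2013 = sum(r13 for r13, _ in revenue.values())
total_2014 = sum(r14 for _, r14 in revenue.values())

# The headline number: one figure for the whole company.
overall = pct_change(total_2013, total_2014)

# The breakdown: one figure per medicine.
by_medicine = {name: pct_change(r13, r14) for name, (r13, r14) in revenue.items()}
declined = [name for name, change in by_medicine.items() if change < 0]

print(f"Overall change: {overall:+.0f}%")
for name, change in by_medicine.items():
    print(f"  {name:<11} {change:+.0f}%")
print(f"Medicines with falling revenue: {len(declined)} of {len(revenue)}")
```

The single headline figure and the per-medicine breakdown come from the same numbers, but only the breakdown reveals that one product is carrying the total.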
Using a graphic with a single number and no breakdowns is like writing a news headline without the news story. Keep the headline—that revenue improved by 18%—and then provide the context and the background to flesh out the full story. Try to be true to the underlying complexity by digging deeper into the data and providing readers with a better understanding of the numbers you’re presenting.
Creating a data visualization is a balancing act between form and function. When choosing a graphic format for your data, you’ll have to figure out how to effectively communicate to your audience in an aesthetically pleasing way. This may seem like a daunting task, but fear not! There’s actually a lot of research that can help us with this. In fact, you’ve already been introduced to some of this research in the previous chapter: Cleveland and McGill’s “Graphical Perception” paper, in which they rank how well different graphic forms help people make accurate estimates.
You can use this scale to help you choose the best graphic representation of your data. Area and shading are good at giving readers a general overview that helps them understand the big picture. Graphs with a common baseline (such as bar graphs) are good at helping readers make accurate comparisons.
Since we’ve already looked at examples of bar graphs and line graphs in this chapter, let’s take a look at a couple of graphics that use area and shading.
The bubble graphic uses area to display the units sold of the top selling allergy medicine in some states. Based on bubble size, you can generally tell that more Happajoy was sold in Arizona than in New Mexico. But can you tell by how much? Is the Arizona bubble three times bigger than the New Mexico bubble? Four times? It’s hard to tell. It’s even harder to tell when the bubble sizes are closer together: Did Utah or Kansas sell more Happajoy?
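Part of the difficulty is geometric. If a bubble’s area is proportional to the value it represents (an assumption about how the graphic was drawn), then its radius grows only with the square root of that value. The sketch below uses the Happajoy figures for four states from this chapter’s table; `bubble_radius` is a hypothetical helper.

```python
import math

# Happajoy sales figures copied from the table in this chapter.
happajoy_sales = {
    "AZ": 3_014_380,
    "NM": 1_043_000,
    "UT": 1_313_300,
    "KS": 1_414_140,
}


def bubble_radius(value):
    """Radius of a circle whose area equals value (area-proportional encoding)."""
    return math.sqrt(value / math.pi)


def compare(a, b):
    """Return (ratio of values, ratio of bubble radii) for states a and b."""
    value_ratio = happajoy_sales[a] / happajoy_sales[b]
    radius_ratio = bubble_radius(happajoy_sales[a]) / bubble_radius(happajoy_sales[b])
    return value_ratio, radius_ratio


az_nm = compare("AZ", "NM")
ks_ut = compare("KS", "UT")

print(f"AZ vs NM: values x{az_nm[0]:.2f}, radii x{az_nm[1]:.2f}")
print(f"KS vs UT: values x{ks_ut[0]:.2f}, radii x{ks_ut[1]:.2f}")
```

Arizona’s figure is about 2.9 times New Mexico’s, but its bubble would be only about 1.7 times as wide. Kansas and Utah differ by less than 4% in bubble width, which is why that comparison is so hard to make by eye.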
We run into the same problem with the next graphic, which uses shading to represent Happajoy sales: California is definitely darker than Texas, but how much darker? Two times? Three times? Who knows? This is why area and shading are better for giving an overall picture instead of making precise comparisons.
In addition to area and shading, angles also give us a tough time when it comes to making accurate estimates. This is why it’s so hard to compare two pie charts, as in the example below.
It’s already hard to precisely compare the slices within the same pie chart. It’s even harder to compare slices across different pie charts. If the goal of this graphic is to help readers compare revenues from one year to the next, then something like a bar chart would have been a better choice.
That’s the key thing: think about which graphic forms will best facilitate the tasks you want your readers to do.
We just covered how different graphic forms are good at doing different things. So what do you do when you have a lot of data and you want to show different aspects of those data, but you also don’t want to overwhelm your audience with an overly complicated graphic? One way to deal with this challenge is to present your data in multiple ways. You can individually show multiple graphics, each one showing a different aspect of the data, so that taken together your audience gets a more accurate picture of the data as a whole.
For example, let’s say the CEO of RediMedico wants to show investors how Happajoy has been selling in the United States since it was first introduced to the market ten years ago. The available data consists of Happajoy sales in every state for every year since 2004. You’re the lucky data pro who gets to figure out how to present both the big picture and the small picture inside this data.
Let’s start with the big picture. Remember how graphic forms that use area or shading are best at giving a general overview? For every year that Happajoy has been on the market, a map that uses shading to represent sales can give people a general sense of how sales have changed across time and location:
Now let’s move on to the small picture. Let’s say RediMedico started to advertise heavily in California and New York a few years ago, and the investors are wondering how sales in those states are doing. Using the same dataset, you can give a more detailed picture of the sales in one state:
Or you can compare the sales between different states:
See? Same data, different presentations. So if you want to show your readers different sides of the same story, give them multiple graphic forms.
They say that a picture is worth a thousand words, but that doesn’t mean you should forget about words entirely! Even your most beautiful and elegant visualizations can use text to highlight or explain things to your audience. This is especially useful when you’re presenting multiple layers of information because your annotations can help readers connect the various pieces into an understandable whole. And obviously, you’re not reading this book in order to make boring visualizations, right? Good visualizations engage an audience, so adding text is a great way to address questions that may come up as curious readers examine your graphic.
For example, let’s go back to Happajoy sales. If you see a graphic like the following:
Then you might be wondering what happened between 2009 and 2010. Why was there such a sharp drop in revenue? In this case, it would be helpful to add some text:
So whenever you create a visualization, think about the “So what?” of your graphic: Why should people care about the information you’re presenting? Add annotations to help them understand why they should care. Write a good headline that catches their attention, a good introduction that highlights interesting data points, and a good narrative that structures your visualization logically. Good writing is an important part of good visualizations.
To close out this chapter, let’s take a look at all of these pro tips in action by going through a visualization made by real data pros: an infographic by The New York Times about breast cancer. The designers organized the information as a narrative with a step-by-step structure. This is an interactive graphic, so it’s best if you click through the link to get the full experience.
On the first screen, you see a bubble graphic that gives you a general sense of which countries have the most new cases of breast cancer. After clicking “Begin”, you see a scatterplot with proportional axes. The scatterplot shows that there is an inverse correlation between breast cancer detection and mortality: as more women are diagnosed with breast cancer, fewer women die from it. A scatterplot is a good way to show correlations between variables, and a bubble graphic is a good way to show a general overview, so the designers chose graphic forms that matched well with what they wanted to show.
Notice how the designers use text to write a good headline that grabs the reader’s attention (“Where Does Breast Cancer Kill?”) and to highlight another aspect of this scatterplot—that highly developed countries have higher diagnosis rates (and lower mortality rates) while the opposite is true for the developing world. As you keep clicking “Next”, the designers guide you deeper into the scatterplot by highlighting a cluster of countries and providing an annotation that gives you further insight into that cluster. Without these notes, we would be left with a relatively straightforward scatterplot that leaves little room for us to explore the data further. By providing useful and well-placed annotations, the designers help us see relationships that we otherwise may have missed.
The designers also present the data in multiple ways. They use color to add another layer of detail: the development status of various countries. In addition, if you’re curious about the statistics for a specific country, you can mouse over that country’s dot to get those numbers.
Finally, by adding useful annotations and showing the data in multiple ways, the designers present the data within a context that doesn’t leave out important variables or oversimplify. By following through on some good data visualization practices, the designers created a clear, balanced, and engaging infographic.
| State | Happajoy | Pollaway |
|---|---|---|
| Wyoming (WY) | 299,734 | 219,037 |
| North Dakota (ND) | 349,814 | 279,851 |
| South Dakota (SD) | 408,343 | 341,675 |
| Montana (MT) | 482,400 | 422,100 |
| Idaho (ID) | 782,040 | 654,360 |
| Nebraska (NE) | 872,320 | 798,080 |
| New Mexico (NM) | 1,043,000 | 834,400 |
| Nevada (NV) | 1,489,860 | 993,240 |
| Utah (UT) | 1,313,300 | 1,256,200 |
| Kansas (KS) | 1,414,140 | 1,183,260 |
| Oklahoma (OK) | 1,907,500 | 1,526,000 |
| Oregon (OR) | 2,027,480 | 1,481,620 |
| Arizona (AZ) | 3,014,380 | 2,883,320 |
| Colorado (CO) | 1,297,000 | 3,372,200 |
| Washington (WA) | 2,069,100 | 4,138,200 |
| Texas (TX) | 5,733,200 | 17,720,800 |
| California (CA) | 7,608,000 | 26,628,000 |
| Sales Totals | 32,111,611 | 64,732,343 |