Less ink, more think

Occasionally I am asked to give a lecture on how to draw good graphs. While I am always tempted to drone on interminably about abstract principles such as minimalism, balance, and consistency, I have discovered that it is much more fun to criticize bad graphs, and to show how they can be improved. But how to acquire a truly bad graph? Easy! Just use the defaults in Microsoft Excel!

Here is an actual example of a graph drawn with those defaults. The data are fictitious, but the ugliness is breathtakingly real. It is sometimes said (unfairly) that engineers lack all sense of graphical design, but I think they must have hired specialists to create something so painfully wrong.

vss2011workshop.010

But what specifically is wrong, and how can we make it better? Michelangelo once said “I saw the angel in the marble and carved until I set him free.” So here too we will chip away at the obscuring excess, to reveal the beauty that Microsoft tried to hide.

First of all, what purpose is served by the heavy black rectangle that surrounds the graph? It serves two purposes: 1) to obscure useful information, and 2) to waste ink. Let’s remove it.

vss2011workshop.011

Better, but still bad. Next we note that the quantity being plotted is identified in three separate places: the vertical axis label, a title above the plot, and a key to the left. Is this really necessary? I think not. Let’s get rid of two of them. Of course a key can be useful when several quantities are plotted together, but not when there is only one. Likewise labels above a plot have their uses, but should be avoided when they are redundant with other information, such as the axis label. We remove the key and the title. Apart from reducing clutter, this substantially increases the area available for the useful parts of the graph.

vss2011workshop.012

Now we ask the question: what purpose is served by the gray background? It serves two purposes: 1) to reduce the contrast and thus visibility of the data points, and 2) to waste ink. Get rid of it!

vss2011workshop.014

Aaah…so much more cheerful and relaxing to look at! But a few troubling questions remain. For example, what purpose is served by those shadows behind each data point? Do they indicate some exciting three dimensional aspect to the data? Of course not. But they do serve two purposes: 1) to render ambiguous the actual locations of the data points, and 2) to waste ink! Please people, can we all just agree to never, never, use little shadows to suggest that our data are floating above the page? Thank you. The corrected graph is below. We have removed the shadows and also changed the diamonds to discs for the very important reasons that 1) they are simpler, and 2) I like them better.

vss2011workshop.015

Next we note that graphs are usually employed to show a pattern or trend. This pattern is not communicated well by a set of individual points floating out there, each an island, entire of itself. Only connect! A line drawn between the points aids enormously in conveying the visual sense of the data.

vss2011workshop.016

Next we correct an obvious (except to the Microsoft designers) flaw: the axis number labels running through the middle of the graph. We move them where they belong: to the axis, outside the graph.

vss2011workshop.017

Now we are getting somewhere. It almost looks ok. But we can do better. Gridlines can serve a purpose – for example, to let the reader easily judge approximate values – but there is never a reason for them to be dark and heavy, and to mask the useful information in the figure. Lighten up! In fact, the gridlines should generally be as light as possible, and still be visible. In this example, we make one gridline a bit darker than the others, to identify the y = 0 line.

vss2011workshop.018

Now we see that the data really stand out. But we can do better still. What remains to distract the eye from the data? Well, we could try removing the gridlines altogether, and then there is no need for the top and right borders of the frame.

vss2011workshop.019

Next we ask: what is the purpose of the bold font on the axis labels? Of course, it is to waste ink. Using a bold font for your labels is like writing your emails in all upper case. It is the digital equivalent of shouting. Don’t do it. Use your indoor voice.

vss2011workshop.020

And finally (yes, finally) we can reduce the line weight of the remaining axes. All we really need is enough weight to see them, and note their positions.

vss2011workshop.021

Thus we arrive at our final graph. It is not particularly exciting, but the data are clear, the trends are evident, and there is little to distract the eye from the essential information. Clearly, not all graphs are this simple, and there are often reasonable justifications for more elaborate presentations. But it is often a good idea to start with the simplest possible presentation, and elaborate from there. 
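For readers who want to try the same sequence of fixes outside Excel, here is a minimal sketch in Python using matplotlib. The choice of matplotlib is an assumption (the graphs in this post were not necessarily made with it), and the function name, data values, and axis labels are made up purely for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, so no display is required
import matplotlib.pyplot as plt


def minimal_line_plot(x, y, xlabel, ylabel):
    """Hypothetical helper: points joined by a line, clutter stripped away."""
    fig, ax = plt.subplots(figsize=(4, 4))
    # Plain discs connected by a line: no shadows, no diamonds.
    ax.plot(x, y, marker="o", color="black", linewidth=1.0, markersize=5)
    # No gridlines, no key, no title, white background.
    ax.grid(False)
    ax.set_facecolor("white")
    # Remove the top and right borders of the frame.
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)
    # Thin the remaining axes: just enough weight to see them.
    for side in ("left", "bottom"):
        ax.spines[side].set_linewidth(0.5)
    ax.tick_params(width=0.5)
    # Normal-weight axis labels: the indoor voice.
    ax.set_xlabel(xlabel, fontweight="normal")
    ax.set_ylabel(ylabel, fontweight="normal")
    return fig, ax


# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [-0.4, 0.1, 0.9, 1.4, 1.2, 2.0]
fig, ax = minimal_line_plot(x, y, "Trial", "Response")
```

Calling `fig.savefig("less_ink.png")` afterwards writes the figure to disk. If you want the faint gridlines of the intermediate step instead, one option is `ax.grid(True, color="0.9", linewidth=0.5)`.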

We conclude with the motto of this presentation, and indeed of this entire blog:

“Less ink, more think.”

Friends don’t let friends use bar charts

A recent article in the New York Times drew attention to the lack of economic mobility in the United States, as compared to Canada and much of Western Europe. The article was illustrated with a graphic, which we reproduce below. It illustrates, for the US and Denmark, the percent of men raised in each fifth (quintile) of the economic range who end up in each fifth.

This graphic is not terrible: with enough scrutiny you can probably figure out the point being made. But because we enjoy picking on the Times, we will explore its various failings just for fun.

First, if I have told you once, I have told you a thousand times: no bar charts! They are almost always inferior to a comparable point and line chart. They waste ink, obscure trends, and – most relevant here – make it hard to compare two quantities. Note the extremes to which the Times artist has gone: the Denmark data are a fat light gray bar, while the US data are a superimposed thin dark shaft. This trick to display two quantities in one location violates several canons of graphology. The first is that it is not “equitable”: the two nations are not plotted with symbols of equal visual weight. The second is that it makes it hard to see trends. And the third is that, worst of all, it actually makes it hard to compare, at a glance, the data from the two countries.

Here is a roughly similar graph that uses boring old points and lines. Color is used to distinguish the two countries.

We immediately see two things. First, only the first panel shows interesting differences between the two countries. This is the graph for men raised in the bottom fifth, and it clearly illustrates the point of the article: most men raised poor stay poor, and this trend is much more severe in the US than in Denmark.

The authors might have done us a favor by pointing out that in a true “opportunity society,” in which everyone regardless of economic origins has an equal chance at success, all of the graphs should be flat at 20%. The middle three graphs approximate this ideal, but both the leftmost and the rightmost graphs are very non-flat. This shows that the poorest and the richest are the least mobile; only the middle classes approximate the ideal of equal opportunity. This is true in both countries, but more severe in the US for those raised in the poorest quintile.

It is interesting to look for a way to depict the mobility within a single country, that does not require five separate graphs. One solution is to plot the percentages as a surface or an array.

Here is an example. We represent the starting and ending quintiles by rows and columns. Each cell shows the percentage of men that started in a given quintile (row), and ended up in a given quintile (column). We scale the colors so that a percentage larger than the ideal 20% is red, less than 20% is green, and the equal opportunity ideal of 20% is a neutral color.
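The color scaling just described can be sketched in a few lines of Python. This is only one way to do it, under stated assumptions: the function name, the particular red, green, and gray values, and the `span` parameter (the deviation from 20% at which the color saturates) are all hypothetical choices, not the ones used for the figure.

```python
def diverging_color(percent, ideal=20.0, span=20.0):
    """Map a percentage to an (r, g, b) triple with components in 0..1.

    Above `ideal` the color shades toward red, below it toward green,
    and exactly `ideal` gives a neutral light gray. All color values
    and the `span` parameter are illustrative assumptions.
    """
    neutral = (0.9, 0.9, 0.9)
    red = (0.8, 0.1, 0.1)
    green = (0.1, 0.6, 0.1)
    # Signed deviation from the ideal, clipped to the range [-1, 1].
    t = max(-1.0, min(1.0, (percent - ideal) / span))
    target = red if t > 0 else green
    w = abs(t)
    # Linear blend between the neutral color and the target color.
    return tuple((1 - w) * n + w * c for n, c in zip(neutral, target))


# A made-up row of a mobility table (percent ending in each quintile).
row = [40, 25, 15, 12, 8]
colors = [diverging_color(p) for p in row]
```

Centering the scale on 20% rather than on the middle of the data range is the point of the exercise: the neutral color then marks the equal-opportunity ideal, so any tint at all signals a departure from it.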

In the US, it is evident that the two “hot spots” are the lower left and upper right corners. These correspond, respectively, to the too many folks who are born poor and stay poor, and the too many who are born rich and stay rich. The Danes suffer only from too many born rich who stay rich; those Danes born poor appear to experience nearly perfect mobility.

What are the lessons?

  1. Avoid bar charts, especially when trying to depict the covariation of two quantities.
  2. Don’t use five graphs when one (or two) will do.
  3. Something is rotten in Denmark, but two things are rotten in the US.

Sources:

New York Times

“Harder for Americans to Rise From Lower Rungs”

By JASON DePARLE

Published: January 4, 2012

http://www.nytimes.com/2012/01/05/us/harder-for-americans-to-rise-from-lower-rungs.html

Attack of the little people

Where did they come from, the little people? Like a horde of replicants they have streamed forth to cover the world of infographics. No trendy depiction of any statistic related to humans is complete without the little people. Consider today’s freshly populated example, from our favorite whipping boy, the New York Times.

The graphic is an attempt to put “into perspective” the numbers of people in poverty in the US. It does this by rounding up a bunch of little people, and penning them in various corrals that seem to have something to do with states or demographic groups. Hard to tell, since it is an expository jumble.

Let us ask a few questions of this graphic. First, the question that we ask of every such graphic: does the point leap out at you, in a flash of effortless cognition? Uh…let’s see, half the people in poverty live in New York, and half in Texas? Fail!

Some more questions. If the orange little people are women and girls, why are they all wearing men’s business suits, albeit in a saucy feminine color? And do all the impoverished women and girls live in Texas? Rick Perry, are you aware of this? The state could at least provide more appropriate apparel for those in need. If you are a woman or girl, going to a job interview in an orange men’s business suit is not advisable, especially in Texas.

There seem to be a lot of impoverished white people (31.7 million), but amazingly, none of them live in Texas or New York. And if you think that is amazing…wait for it…none of them are men, boys, women, or girls. Maybe they are little people.

Ok, but here is where it really gets crazy. There are 16.4 million aged 17 or younger in poverty. But evidently none of them are girls or boys!

What is the lesson? The little people are no substitute for clarity of expression. The artist is to be commended for attempting to make the numbers more meaningful, but the exercise is doomed from the start. First of all, there is a fundamental difficulty in trying to carve up a total population (those in poverty) into a large number of overlapping sets. To be an accurate depiction, the corrals (technically, we call these Venn diagrams) should contain the correct number of little people, but so also should the intersections between two or more corrals (e.g., Asian and male and living in Texas). Easier said than done (and it wasn’t that easy to say). Second, comparisons with state populations are problematic, since most Americans have only a dim sense of the population of any state, even their own.
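To get a feel for why the overlapping corrals are so hard to draw, it helps to count the regions. Here is a minimal Python sketch that lists every non-empty combination of a set of corrals; with n corrals there are 2^n - 1 of them. The function name and the category names are hypothetical, and the count is an upper bound, since some combinations (male and female, say) are necessarily empty.

```python
from itertools import combinations


def venn_regions(categories):
    """Return every non-empty combination of the given category names.

    Each combination stands for one region of a full Venn diagram:
    the people who belong to exactly those corrals and no others.
    """
    regions = []
    for k in range(1, len(categories) + 1):
        regions.extend(combinations(categories, k))
    return regions


# Hypothetical corral names, for illustration only.
corrals = ["female", "under 18", "white", "lives in Texas"]
n_regions = len(venn_regions(corrals))  # 2**4 - 1 = 15
```

Four corrals already give 15 regions to populate correctly; eight would give 255. The bar chart sidesteps the problem by showing only the size of each corral, not the overlaps.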

As is so often the case, traditional methods of data representation are perfectly adequate, and much clearer than the sad corrals of little people. Below is my quick draft of a bar chart of the same data. I have used different colors to group the different sorts of comparisons (gender, age, ethnicity), and as a sop to the New York Times, included horizontal lines indicating populations of a few states (source http://quickfacts.census.gov/qfd/index.html).

I hope you will agree that though my chart may be conventional, it is clear, and allows the viewer to make the comparisons that the Times felt were important.

The lesson? Beware the invasion of the little people. They look cute, and you figure they are so small they can’t do any harm. But invite them into your graphic, and they can create havoc. Advanced lesson: Venn diagrams are tricky to depict when many categories are involved.

Sources:

New York Times

The Impoverished States of America

By TOM KUNTZ and BILL MARSH

Published: September 17, 2011

http://www.nytimes.com/2011/09/18/sunday-review/the-impoverished-states-of-america.html

State populations in my chart: http://quickfacts.census.gov/qfd/index.html

The direction of time’s arrow

Once again, the target of our arrow of criticism is the estimable New York Times, and their estimable Charles M. Blow, whose op-ed contributions are always interesting but almost equally often decorated with sadly defective graphics. In this example, we have a graph that is wrong in at least five ways. Can you spot them? Here is the graph.

The subject of the graph is the change in approval rating of President Obama following the killing of Osama bin Laden, for various selected groups. It is certainly possible to extract the information for any given group from the chart, especially because the artist kindly prints all the numbers, but in this regard it is little better than a table. And a graph should be more than a table: it should use your native perception of form to make a point.

The first error is the use of space. As is often the case with Mr. Blow, the graph occupies a remarkable amount of vertical space, considering the modest data it contains. For this reason, you may have to expand the graph just to be able to read its contents. As we will show, these data can be plotted in much less space, with an increase in clarity.

The second thing that is wrong with the chart is the selection of colors. Since before and after are depicted with color, we would like a strong contrast between the two. Instead we get a weak difference in brightness and saturation of two greens. Quick, tell me whether any subgroup showed a decrease in approval! I suspect you had to scrutinize each pair of bars, carefully ensuring that the darker one was shorter.

The third thing that is wrong is that the bar depicting “after” is about twice as wide as the “before” bar. Thus the area of the “after” bar is much larger, even if there were no change in approval. This is potentially confusing, and certainly biased against the before figure.
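The arithmetic behind this bias is simple enough to write down. In the sketch below the approval value of 50 and the bar widths are made-up numbers, chosen only to show that doubling the width doubles the inked area when nothing about the data has changed.

```python
def bar_area(length, width):
    """Area of a rectangular bar: the ink the eye actually weighs."""
    return length * width


# Hypothetical case: approval is 50% both before and after.
before = bar_area(length=50, width=1.0)
after = bar_area(length=50, width=2.0)  # same value, twice the width
ratio = after / before  # 2.0
```

With equal widths the ratio of areas equals the ratio of the values, which is what a fair comparison needs.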

The fourth thing that is wrong is that the bars are overlapping. This makes it harder to see the length of the “before” bar.

The fifth thing that is wrong is that the graph fails to exploit our native sense of how to depict an increase over time. By convention, in graphs time is always shown as proceeding from left to right. And positive quantities are always shown as increasing from bottom to top. The horizontal arrangement of the chart, and the overlapping of the before and after bars, fails to observe either of these conventions.

Another way to be absolutely sure that the viewer understands the direction of time is to actually show it as an arrow. This is especially appropriate when only two points in time are involved.

Correcting all of these errors, we produce the following chart.

While this chart should require no explanation, I will make a few comments on design. First, unlike the New York Times, I do not have an army of graphologists to tweak my product to perfection. This is a first draft, created in a couple of minutes, and could doubtless be improved. But it clearly shows that every group showed an increase, and the relative size of each increase. In each bar, time goes left to right, and approval increases from bottom to top, just as we expect. The arrows reinforce each trend with a strong graphic element, while the single green bar shows the absolute values of approval, and ties each arrow to its group name. We omit the actual numbers, but provide a 50% line for guidance.

My chart makes all the essential points, and does so in a way that is immediate and transparent. Mr. Blow’s chart has a certain graphical panache to it, and that is not a bad thing. But panache should never replace clarity.

Reference:

 New York Times
The Bin Laden Bounce
By CHARLES M. BLOW
Published: May 6, 2011
http://www.nytimes.com/2011/05/07/opinion/07blow.html

Show me the correlation!

Suppose that we have two quantities that vary over time. We want to know if their variations are correlated, that is, do they dance to the same tune, or does each march to a different drummer. Of course, we often ask this question because we want to know whether one of the two causes, or at least influences, the other. The most effective way to compare the variations of two quantities is to plot them one against the other.

But sometimes the simplest solution is not good enough for the pop graphologist. (A new term I just invented.) Take Charles M. Blow, the top pop graphologist for the New York Times. (I have commented on his work previously: http://wp.me/p19RFk-a) Mr. Blow writes excellent columns, on important issues, but he decorates them with grievously flawed graphs. Consider the following graph, from the New York Times of April 16, 2011. Please forgive the gargantuan vertical extent of the graph (you will probably have to click on the graph to see the full extent); we will address this below.

The red line shows the variation over time of the top marginal tax rate in the US. The tiny green bars at the bottom show the % change in GDP from the previous year. The first question we always ask of any graph is: what pops out at you? I thought so: nothing. Now maybe that is Mr. Blow’s point – that there is no relation between the two quantities – but if so, this is hardly the way to show it. Partly because it is the wrong type of graph, and partly because the two quantities are so far apart on the graph, it is difficult to discern any relationship between the two quantities.

As we stated at the outset, the simplest way to graphically display a synchronicity, or in technical terms a correlation, between the two quantities is to plot them against each other. I have done this in the graph below. Using the same data as Mr. Blow, we plot the % change in GDP against the top marginal tax rate.

Now we can immediately discern whether there is a close relationship between the top rate and the changes in GDP. If there were, the points would cluster tightly together, forming some curve describing the relationship. Instead, the points are widely scattered, and so the main point of the article – that there is no strong relationship – is verified.

However! However! Plotting the data in this way actually reveals an additional surprise, completely obscured in Mr. Blow’s graph. There is actually a small POSITIVE relationship between the two quantities! Yes, you heard me right. In yet another death knell for supply-side economics, we see that HIGHER tax rates lead to LARGER increases in GDP. (Forgive my upper-case outburst; I got a bit excited.) Note that the two lowest marginal rates are associated with some of the largest declines in GDP, and two of the largest increases in GDP are associated with marginal rates near 90%! The red line shows the best-fitting linear relationship between the two quantities, and it climbs slightly as you go from low tax rates to high. True, the effect is weak. The slope of the line is slight; it takes a 17% rise in top marginal rate to get a 1% rise in GDP. And the degree to which the points cluster around this trend is also weak. We measure this by the correlation statistic (http://en.wikipedia.org/wiki/Correlation_and_dependence), which must lie between -1 and 1, and in this case is a meager 0.26 (0 would mean no relationship at all).
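For those who would like to compute the two numbers quoted above on their own data, here is a minimal sketch of the Pearson correlation and the least-squares line in plain Python. The function names are hypothetical and the sample values are made up; this is not the actual tax and GDP data, which are available from the sources listed at the end of the post.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def least_squares_line(xs, ys):
    """Slope and intercept of the best-fitting straight line y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


# Made-up values: x plays the role of a tax rate (%), y of a GDP change (%).
x = [28, 35, 50, 70, 91]
y = [1.5, -0.5, 3.0, 2.0, 4.5]
r = pearson_r(x, y)
slope, intercept = least_squares_line(x, y)
```

A slope of roughly 1/17, or about 0.06 points of GDP change per point of tax rate, is what the "17% for 1%" figure in the text amounts to. Note that the slope and the correlation answer different questions: the first says how steep the trend is, the second how tightly the points hug it.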

But still. Mr. Blow could have made a much stronger case if he had used the right kind of graph.

Now we are going to make a few nerdy points about graphs of correlation, and those of you who are only here for the entertainment portion of the show can go back to your other amusements.

In graphs of this sort, it is traditional to put the so-called “independent” variable on the horizontal axis, and the so-called “dependent” variable on the vertical axis. When we assign variables in this way, we are making an assumption about what causes what. That may or may not be reasonable, but it is good to adhere to this convention. In this case, the question addressed in the column is whether tax rates affect growth in GDP, so that is why we assign the quantities to the two axes as we have.

Another feature of a graph like this is the aspect ratio. One failing of Mr. Blow’s graph was the large vertical distance between the separate graphs of the two quantities. The distance was so large because Mr. Blow chose to plot them on the same axis (%). This may have seemed reasonable, since a % is a %, but in fact it is a mistake. When we are exploring the relationship between variation in two quantities we should not presume that we know the ratio between them. It is better to let the data tell us what that ratio might be. The correlation graph does this by plotting the full range of one against the full range of the other. Because there is no reason to do otherwise, we make the two ranges the same size. In other words, the graph has an aspect ratio of one.
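One way to see what "the full range of one against the full range of the other" means is to rescale each quantity to run from 0 to 1 before plotting it in a square frame. The sketch below does just that; the function name is hypothetical and the sample values are made up.

```python
def rescale_to_unit(values):
    """Linearly map values so the minimum becomes 0 and the maximum 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; the range is zero")
    return [(v - lo) / (hi - lo) for v in values]


# Made-up values on very different scales.
tax_rate = [28, 59.5, 91]
gdp_change = [-2.0, 1.0, 4.0]

# After rescaling, both quantities span exactly the same interval,
# so a square plot of one against the other has an aspect ratio of one.
u = rescale_to_unit(tax_rate)
v = rescale_to_unit(gdp_change)
```

In practice a plotting package does this for you when each axis is fitted to its own data range and the frame is square; the axis labels keep the original units.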

As I noted above, Mr. Blow’s graph has an enormous vertical extent. It is so big that in native form it will not fit on a typical laptop screen without scrolling. OK, now, for extra credit, tell me the reason for using such a large vertical expansion of the graph? Take your time…no-one is timing you…plenty of time…all the time in the world. Not quite done thinking? Take a few more minutes. Ready? And the answer is…there is no reason whatsoever! Mr. Blow blows up his graphs to ridiculous proportions (usually in the vertical dimension) because he can. He is the big graph honcho at the New York Times!

Now I hesitate to make Mr. Blow the poster child for bad graphs, since the intellectual points he makes are always good ones. But an intervention is required, for his sake, for the New York Times’s sake, and for the sake of the reading public.

Reference:

New York Times

The Pirates of Capitol Hill

By CHARLES M. BLOW

Published: April 15, 2011

Data at:

http://www.bea.gov/national/xls/gdpchg.xls
http://www.taxpolicycenter.org/taxfacts/displayafact.cfm?Docid=213

Letting it all hang out

Sometimes, when you look at a graphic, you can sense the frustration of the artist. There is, after all, rarely one best way to plot a set of multidimensional data, and sometimes the artist gives up and tries to show everything. The result is invariably a mess. Consider the graphic below, from the New York Times, in an article comparing economic growth and progress in health in various global regions. The main thesis is that there is a disconnect between the two: improvements in health may occur in the absence of significant economic progress. But the artist has chosen to present the data as a giant smörgåsbord of options, in a table in which four columns show life expectancy and GDP in four different ways. First there is a little graph showing growth in life expectancy over a decade, then the total gain is shown as a number, then (for some inexplicable reason) the growth in GDP is shown by two disks, and finally, the total GDP growth is shown as a percentage. As always, I ask the central question: “What jumps out at you?” Here, sadly, the answer is: nothing.

Consider instead the graph below. Here we have made some hard choices. We have dispensed with the first and third columns, and plotted the gain in life expectancy against the percentage gain in GDP. We show the final GDP by the area of the points. This graph makes the essential point quickly and efficiently: there is no obvious correlation between GDP and gains in life expectancy. The outliers, regions with big gains in life expectancy but little growth in GDP, such as Latin America and the Middle East, jump out from the pack, as do regions with enormous economic growth, but little change in life expectancy.
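Showing a quantity by the area of a point, as done here for final GDP, means the radius must grow with the square root of the value, not with the value itself. Here is a minimal sketch of that conversion; the function name and the `area_per_unit` scale factor are hypothetical.

```python
from math import pi, sqrt


def radius_for_value(value, area_per_unit=1.0):
    """Radius of a disc whose area equals value * area_per_unit."""
    if value < 0:
        raise ValueError("area encoding needs a non-negative value")
    return sqrt(value * area_per_unit / pi)


# A region with four times the GDP gets a disc with twice the radius,
# and therefore four times the area.
small = radius_for_value(1.0)
large = radius_for_value(4.0)
```

Scaling the radius directly with the value is a common mistake: it would make the larger region look sixteen times as big in this example, since area goes as the square of the radius.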

The lesson here is: make the hard choices; less is more. Show only the data you need to make your point.

Reference:

New York Times, “Hopeful Message About the World’s Poorest” By DAVID LEONHARDT

Published: March 22, 2011

Permalink: http://www.nytimes.com/2011/03/23/business/economy/23leonhardt.html

Ring-around-the-rosy

In my last post, I railed against the feeble graphics of a recent report on salaries in the life sciences.  In the same report, we find the following “infographic.”

Since there is no additional information, we are left on our own to deduce its meaning. It seems to be saying something about average salaries in various sectors of the economy: academic, government etc.

But why are the data arranged on a ring? Are we meant to perceive the fraction each sector takes of the total salary pie? No, because no information is given about the numbers of employees in each sector. If we are meant just to compare the numbers, displaying them as fractions of a ring is a remarkably bad way to do it. Quick: which sector has the largest salaries? I bet you cheated and looked at the numbers.

And what are those little people doing wandering around in the center of the ring? Are they matadors in the bullring? Christians in the Colosseum? Oh, I see. After getting out my magnifying glass I see that some of them are wearing skirts. Or have their pockets inside out? Evidently the numbers are being split by gender. Again, since the wedges are poor at conveying magnitude, the designer has helpfully provided numerical labels, for each gender, in what appears to be 4 point type. With the magnifying glass, the numbers can be read.

OK, Mr. Smarty-pants, how would you do it? Well, how about the completely obvious solution of plotting the salaries on the vertical axis, connecting and color coding data for each gender, and sorting by male salary? We assume that the big printed numbers on the ring are the mean, and we plot that too. We print the sectors in a light gray below the data, rather than have a separate key, which obliges you to look back and forth and back and forth and…well, you get the idea.

 

We also show the zero value on the vertical scale. This is called a “ratio scale”, but technical buzzwords aside, the concept is very simple: you can compare the size of the differences to the size of the total salary. We also add very faint horizontal gridlines. This allows you to read off numbers if you wish, but does not clutter up the main contour lines.
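The two data-preparation steps behind this plot, sorting the sectors by male salary and comparing the gap to the total, can be sketched in a few lines of Python. The function names, sector names, and salary figures are all made up for illustration; they are not the survey's numbers.

```python
def sort_by_male_salary(salaries):
    """Order (sector, (male, female)) pairs by ascending male salary."""
    return sorted(salaries.items(), key=lambda item: item[1][0])


def female_to_male_ratio(male, female):
    """The comparison a ratio scale makes visible: female pay as a fraction of male pay."""
    return female / male


# Hypothetical sectors and salaries, for illustration only.
salaries = {
    "Sector A": (90000, 80000),
    "Sector B": (70000, 63000),
    "Sector C": (110000, 88000),
}

ordered = sort_by_male_salary(salaries)
ratios = {name: female_to_male_ratio(m, f) for name, (m, f) in ordered}
```

Sorting by one of the plotted series turns that line into a steady climb, so any departure of the other line from the same shape stands out immediately.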

I hope you will agree that my plot gets right to the point: it immediately and effortlessly shows the way in which salary varies with sector, and shows the disparity between men and women. Those are the key goals of a graph: effortlessness, and immediacy.

Reference:

Life Sciences Salary Survey, 2010

http://www.the-scientist.com/fragments/salary_survey/2010/ss-charts2010.jsp
