Lies, damn lies, and (Referendum) statistics

The people I generally follow on Twitter are data analysts and statisticians, and tend to be level-headed folk,  so my Twitter feed has been mercifully free of the current flood of referendum tweets and the angry classist and ageist comments which often accompany them. But a few data-related tweets have got through, which I think deserve some critical discussion from a statistical perspective.  As analysts, we should be careful not to draw unwarranted inferences from the data, especially when the topic is an emotive one.

Here’s the first example.

The Half Truth

[Chart: YouGov referendum age breakdown]

This chart is based on YouGov data. It was reproduced in the Huffington Post, and it has been retweeted over 1,000 times.  The data shows a stark difference in preferences between older and younger voters, with a large percentage of younger voters favouring Remain, and a large percentage of older voters favouring Brexit.  Now I’m not arguing with the facts here. I’m not saying the pollsters used a biased sampling frame or an under-powered sample, or applied the wrong statistical test or that kind of thing.

The problem is that the tweeter (a journalist) prefaced the chart with this comment: “Older generation voted for a future the young don’t want.” Blaming groups is generally not a good idea anyway, but the clear implication here is that the older generation are somehow at fault, and the younger generation are blameless. Let’s examine this more closely.

There are two important dimensions to group opinion. One is the fraction of individuals in a group who hold a particular opinion or have a particular preference. But opinions and preferences can be strongly held or weakly held, so the second dimension we need to consider is strength of opinion; and the YouGov chart says nothing about that.  So how can we assess it?  An obvious way is to see whether people cared enough to vote. (It’s called voting with your feet.)  And the results are striking. Sky News report that only 36% of 18-24 year-olds voted compared to 81% in the 55-64 age group.
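To make the two dimensions concrete, here is a minimal Python sketch that multiplies turnout by the share of voters backing an option, giving the fraction of the whole age group that actually went out and voted for it. The turnout figures are the Sky News numbers quoted above. The 70% Remain share among young voters is a made-up placeholder, not a figure from the YouGov chart, and the function name is purely illustrative.

```python
def share_of_age_group(turnout, share_among_voters):
    """Fraction of the WHOLE age group that turned out and backed an option."""
    return turnout * share_among_voters

# Turnout figures as reported by Sky News (quoted in the text above).
turnout = {"18-24": 0.36, "55-64": 0.81}

# HYPOTHETICAL placeholder: share of 18-24 voters who chose Remain.
# This is NOT taken from the YouGov chart; substitute the real figure.
remain_share_among_young_voters = 0.70

backed_remain = share_of_age_group(turnout["18-24"], remain_share_among_young_voters)
stayed_home = 1 - turnout["18-24"]

print(f"18-24s who voted Remain: {backed_remain:.1%} of the age group")
print(f"18-24s who did not vote: {stayed_home:.1%} of the age group")
```

Whatever preference share you plug in, the group that stayed at home is larger than the group that voted Remain whenever turnout is below 50%, which is the point of looking at both dimensions together.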

So it looks like there were two distinct generational factors at work in the referendum result. The older generation expressed a strong preference for Leave, and the younger generation didn’t care strongly enough to vote Remain. Those who blame the older generation – and only the older generation – for the referendum result are ignoring half the truth.

The Ecological Fallacy

This one is particularly disappointing to see in the national press.

[Chart: The Telegraph, share with no education vs. Leave vote by region]

This chart was published in The Telegraph, and it shows the relationship between the proportion of those with no education in a region, and the percentage of those in the region voting Leave.

The first thing to note is the regression slope is pretty flat, indicating a rather weak relationship.  I doubt whether it’s even statistically significant. But let’s leave that aside.  The implication here – and it’s spelled out at the top of the chart – is that uneducated people tended to vote Leave.

This is a classic example of the Ecological Fallacy, the unwarranted practice of drawing correlational conclusions about individuals from correlational data about groups. The maths behind the fallacy and why it happens were worked out by Robinson in his paper, Ecological Correlations and the Behavior of Individuals, which was published in the American Sociological Review in 1950. Robinson showed that individual correlations can even be in the reverse direction to the group correlations. The Telegraph’s chart provides no evidence linking a person’s level of education to their referendum vote.
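Robinson’s reversal is easy to reproduce with a toy data set. The sketch below uses three invented regions of 100 people each (the counts are made up for illustration and have nothing to do with the real referendum data). Across regions, a higher share of uneducated residents goes with a higher Leave share; among individuals, the uneducated are the least likely to vote Leave.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# INVENTED counts per region, 100 people each:
# (uneducated & Leave, uneducated & Remain, educated & Leave, educated & Remain)
regions = {
    "A": (0, 10, 30, 60),
    "B": (0, 20, 50, 30),
    "C": (0, 30, 70, 0),
}

# Group level: one point per region, as in a regional scatter plot.
pct_uneducated = []
pct_leave = []
# Individual level: one (uneducated, voted_leave) pair per person.
uneducated = []
voted_leave = []

for ul, ur, el, er in regions.values():
    total = ul + ur + el + er
    pct_uneducated.append((ul + ur) / total)
    pct_leave.append((ul + el) / total)
    uneducated += [1] * (ul + ur) + [0] * (el + er)
    voted_leave += [1] * ul + [0] * ur + [1] * el + [0] * er

regional_r = pearson(pct_uneducated, pct_leave)
individual_r = pearson(uneducated, voted_leave)
print(f"regional r = {regional_r:.2f}, individual r = {individual_r:.2f}")
```

In this contrived example the regional correlation is strongly positive while the individual correlation is negative, so a chart of the regional points alone would point in exactly the wrong direction.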

For those who don’t have the inclination to work through Robinson’s algebra, here’s one plausible mechanism which would generate the observed regional data: a) Uneducated people tend to be over-represented in regions where there is the greatest pressure on public services. b) People who live in regions where public services are under pressure tend to vote Leave, whatever their level of education.
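The mechanism in (a) and (b) can be sketched in a few lines of Python. Everything here is hypothetical: the three regions, their populations, and the Leave rates are invented, and the Leave rate is deliberately the same for everyone in a region, so education has no effect at all on how any individual votes.

```python
# HYPOTHETICAL regions, invented for illustration only.
# uneducated: share of residents with no education
# leave_rate: chance that ANY resident votes Leave, identical for educated and
#             uneducated residents (a stand-in for pressure on local services)
regions = [
    {"name": "A", "population": 1000, "uneducated": 0.10, "leave_rate": 0.40},
    {"name": "B", "population": 1000, "uneducated": 0.20, "leave_rate": 0.50},
    {"name": "C", "population": 1000, "uneducated": 0.30, "leave_rate": 0.60},
]

def pooled_leave_rate(regions, group):
    """Leave rate over all regions for 'uneducated' or 'educated' residents."""
    members_total = 0.0
    leavers_total = 0.0
    for region in regions:
        if group == "uneducated":
            share = region["uneducated"]
        else:
            share = 1 - region["uneducated"]
        members = region["population"] * share
        members_total += members
        # Education plays no role here: only the region's rate matters.
        leavers_total += members * region["leave_rate"]
    return leavers_total / members_total

uneducated_rate = pooled_leave_rate(regions, "uneducated")
educated_rate = pooled_leave_rate(regions, "educated")
print(f"Leave rate, uneducated: {uneducated_rate:.1%}")
print(f"Leave rate, educated:   {educated_rate:.1%}")
```

Plotted region by region, these numbers give a rising line just like the one in the chart, and the pooled Leave rate comes out higher for the uneducated group too. Yet by construction no individual’s vote depends on their education; the pattern comes entirely from where people live.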

Any statistician or data scientist who liked the Telegraph’s chart should commit hara-kiri –  or at least read through Robinson’s paper!