Correlation does not imply causation!

Some football analysts seem to think this is the apotheosis of wisdom, but of course it’s a truth that every statistician learns at their mother’s knee. In football analytics, however, it is a truth remarkably easy to forget. It’s not just amateur statisticians and self-accredited data scientists who fall into the trap of wrongly inferring causation from correlation; even experts and academics can be bamboozled by the data into attributing causality where none exists, or even getting the causal arrows the wrong way round.

## Statistical and Causal Predictions

If analytics is to be of practical use in clubs and gain acceptance by coaches, it must be able to make causal predictions. Causal predictions are not the same as statistical predictions. For example, if you tell me a team’s wage bill, I can tell you quite accurately where they will finish in the league, because I know the correlation between wages and league position is about 0.8. This is a *statistical* prediction, based on a proven statistical association between wages and league position which says that teams with the highest wage bills consistently finish near the top. However, I cannot assure you that your league position will improve if you raise your players’ wages by 10%; I cannot make a *causal* prediction. And it is only causal predictions that can answer a *what if…* question.

In this post I want to illustrate how easy it is to infer false causality from football data. False causalities are dangerous because they can fool analysts into making recommendations that have no impact on performance - or even a negative impact.

## Establishing Causality

The classical way of establishing causality in science is the randomized controlled trial. In this procedure, the subjects of an experiment (footballers, rats, bits of metal) are randomly assigned to a treatment group and a control group. The treatment group is subjected to some condition (C) expected to produce an effect (E), while the control group is subjected to a null or to a different treatment. The advantage of this procedure is that it allows us to draw causal conclusions without knowing what the possible confounding factors are. If the assignment of subjects to groups is truly random, any confounding factors are, on average, distributed equally in both groups. If a difference in the probability of E then appears between the treatment and control groups, we can infer that C and E are probabilistically dependent however the confounding factors are arranged, and hence that C causes E, because there is no alternative explanation. This is the kind of reasoning used in sports medicine and sports science.
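The logic of randomization can be illustrated with a toy simulation. This is only a sketch on made-up numbers, not an analysis of real data: the true effect of the treatment (1.0), the strength of the hidden confounder (2.0) and the function name `estimate_effect` are all assumptions chosen for illustration.

```python
import random
from statistics import mean


def estimate_effect(randomised, n=10_000, seed=1):
    """Difference in mean outcome between treated and control subjects."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        u = rng.gauss(0, 1)  # hidden confounder (assumed, never observed)
        if randomised:
            # Coin-flip assignment: independent of the confounder.
            t = rng.random() < 0.5
        else:
            # Observational setting: the confounder influences who is treated.
            t = u + rng.gauss(0, 1) > 0
        # Assumed true model: the treatment adds 1.0, the confounder adds 2.0 * u.
        y = 1.0 * t + 2.0 * u + rng.gauss(0, 1)
        (treated if t else control).append(y)
    return mean(treated) - mean(control)


observational_estimate = estimate_effect(randomised=False)
randomised_estimate = estimate_effect(randomised=True)
print(f"observational: {observational_estimate:.2f}")
print(f"randomised:    {randomised_estimate:.2f}")
```

Under these assumptions the randomised comparison should land close to the true effect of 1.0, while the observational comparison is inflated because the confounder is unevenly split between the two groups.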

In football analytics however - and in economics and the social sciences generally - it isn’t possible to conduct randomized controlled trials. These sciences are observational, and causality must be inferred from probabilistic relationships (correlations) observed in natural settings. Somewhat surprisingly however, it turns out that in some circumstances it *is* legitimate to infer causality from a network of correlations. The rules under which causality can be inferred from statistical associations were discovered by the brilliant Judea Pearl. In his landmark book on causal reasoning and inference first published in 2000, Pearl (also the inventor of Bayesian networks) overturned the long-held idea that observational studies cannot be used to establish causality.

Pearl’s causal calculus is too intricate to discuss in this post. But researchers pre-Pearl have long applied some basic rules to avoid drawing false causal conclusions. For instance, the rule of temporal precedence states that cause must precede effect. If C does not precede E, C cannot be a cause of E. Furthermore, causality should not be inferred unless there is a plausible mechanism linking the occurrence of C with the occurrence of E. Annual cheese consumption 2000-2009 correlates with the number of people who die by becoming entangled in their bed-sheets, but since there is no conceivable link between the two, we may not assume causality. (For this and other amusing correlations, see here).

## Causal Illusion in Football Analytics

Sometimes it is easy to identify causal directions in football. We know for example there is a robust association between playing at home and scoring more goals. We also know the choice of venue is fixed in advance of the season, and there is no mechanism whereby scoring goals can influence it. In this case we can be sure that the causal arrow runs from venue to scoring, and not the other way round because there is no other explanation. But the way in which venue exerts its causal influence has been subject to much research; the effects of crowd support, travel, and decision making by match officials have all been found to play a part in the causal chain.

In most cases however the causal links are not so obvious, and it is not so easy to pin down the direction of the causal arrow. Take the case of crossing.

I recently came across a paper by Jan Vecer, a professor at Charles University in Prague. The paper was called *Crossing in Soccer has a Strong Negative Impact on Scoring: Evidence from the English Premier League the German Bundesliga and the World Cup 2014.*

A bold causal claim is explicit in the title; crosses cause a decrease in scoring. The paper is very well written; it is clear and coherent, the evidence is assembled logically, and the author demonstrates the assured grasp of statistics you would expect from a professor of mathematical finance. In the key analysis, goals per match is regressed on the number of open play crosses, assuming a Poisson distribution for goals. Random effects for the team intercept and slope are included in the regression equation.

The result was clear; the regression coefficient for crosses is negative and highly significant, and I was able to replicate it with my own data. I even tried to change the direction of the association by introducing new variables into the regression equation, but whatever I tried the relationship remained stubbornly negative. There was nothing wrong with Vecer’s analysis: there is a robust statistical association between crosses and goals showing the more crosses a team makes in a match, the fewer goals they score. Vecer suggested that the reason for the negative relationship is that crosses are an inefficient way to score, and crossing displaces other more efficient ways of scoring that a team might otherwise use. Vecer concluded:

*“…. the relationship between open crosses and goals is directly causal, meaning less open crosses will lead to more goals. … Thus an average team in the EPL could realistically add 0.393 goals per game and team if they reduced crossing, which translates to about additional 300 goals per season.”*
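As a rough check on the scale of that claim, assuming a 20-team league in which each team plays 38 games:

$$
0.393 \times 20 \times 38 \approx 299
$$

which is consistent with the quoted figure of about 300 additional goals per season across the league.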

Other researchers have also concluded crossing has a negative impact. In a study of the 2014 World Cup, Liu and his co-authors claimed that crossing is one of four factors to have a negative impact on winning matches, along with red cards, blocked shots and dribbles.

Thus the requirements for causality appear to be met. We have a statistical relationship between a proposed cause and its proposed effect; a cross occurs before a goal, so temporal precedence seems to be established; and we have a mechanism explaining how cause and effect are linked. Surely we are justified in drawing a causal arrow from crosses to goals, and recommending that teams cross the ball less.

But is this conclusion safe? Or is it an illusion?

## Identifying Causality

To see the causal connections more clearly, we need to delve into the match dynamics, and explicitly model the temporal relationships between goals and crosses. Let’s start by dividing a match into segments, which begin and end whenever a goal is scored. Each segment corresponds to a ‘game state’ which is simply the difference in scores between the two teams. For example, when the scoreline is 1-1 or 0-0, both teams are in a game state of 0; if Team A leads by 1 goal, Team A’s game state is +1, and the opponent’s game state is -1 and so on.
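A minimal sketch of this segmentation is given here, assuming goals arrive as hypothetical `(minute, team)` pairs with teams labelled `"A"` and `"B"`. The `Segment` class, the function name and the example match are illustrative and are not taken from the data used in this post.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float     # minute at which the segment begins
    end: float       # minute at which it ends (a goal, or full time)
    game_state: int  # Team A's goal difference during the segment


def split_into_segments(goals, match_end=90.0):
    """Split a match into segments that begin and end whenever a goal is scored."""
    segments = []
    start, state = 0.0, 0
    for minute, scorer in sorted(goals):
        segments.append(Segment(start, minute, state))
        state += 1 if scorer == "A" else -1  # the opponent's state is -state
        start = minute
    segments.append(Segment(start, match_end, state))
    return segments


# Hypothetical match: A scores on 23 minutes, B equalises on 51, A scores again on 78.
segments = split_into_segments([(23, "A"), (51, "B"), (78, "A")])
for s in segments:
    print(s)
```

Team A’s game states for the four segments are 0, +1, 0 and +1; the opponent’s are the same values with the sign flipped.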

Now imagine two consecutive segments, *t* and *t+1*, as shown below, and consider the change in Team A’s crossing rate (if any) that occurs when they score or concede. Here, I am pinning down the temporal precedence.

By looking at the rate of crossing in segment *t+1*, we can directly compare the effects of scoring and conceding a goal. I used three seasons of EPL data to investigate the effects. The Poisson regression model I used is specified in the appendix. Briefly, I examined the number of crosses that Team A made in segment *t+1* depending on whether they scored or conceded at the end of segment *t*. The control variables in the regression were the team and opposition (encoded as random effects); the match venue; the game state of segment *t*; Team A’s crossing rate (crosses per 90) in segment *t*; and the elapsed time of the match. The coefficient for scoring vs. conceding was negative and highly significant (-0.635, *p* < .001), showing that the number of crosses in segment *t+1* is much lower when segment *t* is terminated by scoring than by conceding. The ratio of crosses made after scoring to crosses made after conceding works out to be 0.53. This indicates a substantial reduction in crossing attempts after a goal is scored.
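The step from the coefficient to the ratio is an exponentiation, because a Poisson regression works on the log scale. A short sketch, assuming the reported coefficient is the log of the scored-versus-conceded rate ratio (the variable names are illustrative):

```python
import math

coef_score_vs_concede = -0.635  # coefficient reported above (log scale)

# Exponentiating a log-scale coefficient gives a multiplicative rate ratio.
rate_ratio = math.exp(coef_score_vs_concede)
reduction_pct = (1 - rate_ratio) * 100

print(f"rate ratio: {rate_ratio:.2f}")                # about 0.53
print(f"reduction in crossing: {reduction_pct:.0f}%")  # about 47%
```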

This makes sense. 80% of game states are -1, 0, and +1, so the team that scores is usually the team that goes ahead, draws level from behind, or increases its lead, and in each case has a reduced incentive to attack. Conversely, teams that concede generally need to attack more to get points out of the game.

We can conclude that teams do not score because they make fewer crosses; they make fewer crosses *because they score*. This is a completely different interpretation of the same statistical relationship.

But what about the effect of crosses on goals? The same segmentation framework can be used to measure it. In this case, the direction of causality is from crosses in segment *t* to the outcome at the end of segment *t*. Is a higher rate of crossing in segment *t* more likely to produce a goal for or a goal against? To model this I used a binomial regression in which the outcome (goal scored or conceded) was regressed on the rate of crossing in segment *t*. The control variables were the team and opposition (encoded as random effects); the match venue; the game state of segment *t*; Team A’s crossing rate in segment *t*; and the elapsed time of the match. (The model equation is shown in the appendix.) Only venue and crossing rate were statistically significant. Contrary to what others have claimed, crossing has a positive impact on scoring. The coefficient for the crossing rate was positive (0.025, *p* < .001), which means that increasing crossing by 1 cross per 90 minutes increases the odds of scoring a goal by about 2.5%.
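The 2.5% figure follows from exponentiating the coefficient, assuming it is a log-odds coefficient as the reference to odds suggests. The sketch below also shows how the effect compounds over several extra crosses; the choice of five extra crosses is purely illustrative.

```python
import math

coef_crossing_rate = 0.025  # coefficient reported above (assumed log-odds scale)

# Odds multiplier for one extra cross per 90 minutes.
odds_ratio_one = math.exp(coef_crossing_rate)

# Effects multiply on the odds scale, so five extra crosses give exp(5 * coef).
odds_ratio_five = math.exp(5 * coef_crossing_rate)

print(f"+1 cross per 90:   odds x {odds_ratio_one:.3f}")
print(f"+5 crosses per 90: odds x {odds_ratio_five:.3f}")
```

One extra cross per 90 multiplies the odds by roughly 1.025, and five extra crosses by roughly 1.13, holding the other variables in the model fixed.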

We can be reasonably confident that causality runs in both directions. There is a positive causal path from crosses to goals and a negative causal path from goals to crosses.

Of course, this is still a simplification of a more complex causal system. Crosses do not cause goals directly; they generate shots, which are the primary cause of goals. Likewise, goals do not directly reduce crosses; they engender a tactical mindset that dials down attacking, which in turn reduces crossing and depresses associated actions like shooting. However, the causal picture presented here seems far more plausible than one in which crosses negatively impact scoring, and which would imply that reducing the volume of crossing would lead to more goals.
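A toy simulation shows how these two paths can combine to produce a negative match-level association even when every cross helps. It is a sketch under invented assumptions, not a model fitted to EPL data: the per-minute probabilities, the 2,000 simulated matches and the function names are all made up for illustration.

```python
import random


def simulate_match(rng, minutes=90):
    """Return (crosses, goals) for one team in one simulated match."""
    crosses = goals = lead = 0
    for _ in range(minutes):
        # Goals -> crosses (negative path): a team that is ahead crosses less.
        p_cross = 0.10 if lead > 0 else 0.30
        crossed = rng.random() < p_cross
        # Crosses -> goals (positive path): a cross doubles the chance of scoring.
        p_goal = 0.030 if crossed else 0.015
        scored = rng.random() < p_goal
        crosses += int(crossed)
        goals += int(scored)
        if scored:
            lead += 1
        elif rng.random() < 0.015:  # the opponent scores
            lead -= 1
    return crosses, goals


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
crosses, goals = zip(*(simulate_match(rng) for _ in range(2000)))
match_level_corr = pearson(crosses, goals)
print(f"match-level correlation between crosses and goals: {match_level_corr:.2f}")
```

By construction, a cross always raises the chance of scoring in this setup, yet the match-level correlation should come out negative, because teams that score spend more of the match ahead and therefore cross less. A regression on match totals would see only the negative association.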

## The Bottom Line

Causal chains in football can be subtle and are not always what they appear to be on the surface. Analysts should take considerable care in attributing causal relationships correctly because failing to do so can lead to biased conclusions, and even worse, flawed recommendations. If we are to advise and influence decision-makers in clubs and gain their trust, we must be sure our causal arrows are at least pointing in the right direction.

### Appendix: Regression Models

The models shown here are for information only. They are not needed to understand the post.

Effect of Goals on Crossing

This was a Poisson regression model. For a match between team *i* and team *j* we can write:

$$
\log \mathrm{E}\left[N_{i,t+1}\right] = \log\left(\mathrm{duration}_{t+1}\right) + \alpha_i + \beta_j + b_1\,\mathrm{ScoreOrConcede}_{i,t} + b_2\,\mathrm{Venue}_i + b_3\,\mathrm{elapsed}_{t+1} + b_4\,\mathrm{CrossingRate}_{i,t} + \sum_{k=1}^{K} \gamma_k\,\mathrm{GameState}_{k,i,t} + e_{ij}
$$

where $b_1, \dots, b_4$ and $\gamma_k$ are the fixed-effect coefficients.

This model was probably more complex than it needed to be, but I wanted to take account of as many potential influences on crossing as I could. The terms are:

- $N_{i,t+1}$ is the number of crosses attempted by team *i* in segment *t+1*.
- $\alpha_i$ is a random intercept representing the base crossing rate for team *i*, and $\beta_j$ is a random intercept representing the base crossing rate for team *j*.
- $\mathrm{ScoreOrConcede}_{i,t}$ is 1 if team *i* scores or -1 if team *i* concedes in segment *t*.
- $\mathrm{Venue}_i$ is 0 if team *i* is playing at home or 1 if playing away.
- $\mathrm{duration}_{t+1}$ is the length of segment *t+1* (this is a rate offset which has a fixed coefficient of 1).
- $\mathrm{elapsed}_{t+1}$ is the time between the start of the match and the mid-point of segment *t+1*.
- $\mathrm{CrossingRate}_{i,t}$ is team *i*’s rate of crossing during segment *t*.
- $e_{ij}$ is the error term.
- $\mathrm{GameState}_{k,i,t}$ represents the set of *K* categorical game states, *k* = 1..K. The game states (K=7) were -3, -2, -1, 0, 1, 2, 3, and game states below -3 or above 3 were assigned to -3 or 3 as appropriate.
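The role of the rate offset can be seen in a small sketch. Because segment duration enters on the log scale with a fixed coefficient of 1, the expected number of crosses is the duration multiplied by a per-minute rate. The function name `expected_crosses` and the numbers below are illustrative assumptions, not values from the fitted model.

```python
import math


def expected_crosses(duration_minutes, linear_predictor):
    """Expected count = exposure * exp(linear predictor).

    Equivalent to exp(log(duration) + linear_predictor), i.e. an offset
    whose coefficient is fixed at 1.
    """
    return duration_minutes * math.exp(linear_predictor)


# Illustrative linear predictor: a rate of 20 crosses per 90 minutes.
eta = math.log(20 / 90)

full_match = expected_crosses(90, eta)  # whole match at that rate
half_match = expected_crosses(45, eta)  # half the exposure, half the count
print(full_match, half_match)
```

Halving the duration halves the expected count while leaving the underlying rate unchanged, which is why the remaining coefficients can be read as effects on the crossing rate rather than on the raw count.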

Effect of Crossing on Goals

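This was a binomial regression model. For a match between team *i* and team *j* we can write:

$$
\operatorname{logit}\Pr\left(\mathrm{ScoreConcede}_{i,t} = 1\right) = \alpha_i + \beta_j + b_1\,\mathrm{CrossingRate}_{i,t} + b_2\,\mathrm{Venue}_i + b_3\,\mathrm{elapsed}_t + b_4\,\mathrm{duration}_t + \sum_{k=1}^{K} \gamma_k\,\mathrm{GameState}_{k,i,t}
$$

where $\mathrm{ScoreConcede}_{i,t}$ is a binary variable indicating whether team *i* scored or conceded at the end of segment *t*. $\alpha_i$ and $\beta_j$ are random intercepts for teams *i* and *j* respectively, and venue, elapsed and duration have the same meaning as in the previous equation.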