Expected Goals (*xG*) has been the subject of considerable interest in the football analytics community. For those unfamiliar with the concept, Danny Altman’s video provides a quick introduction, and Michael Bertin has offered a critique which is worth reading and contains a number of useful links. In this post my aim is more technical. I will describe a three-stage model for Expected Goals, and discuss the advantages of this type of modelling framework.

# A Three-stage Model for Expected Goals

Let’s begin with some notation. We define $xg_i$ (note the lowercase g) as the probability of scoring from the $i$th attempt on goal. Then *xG* is simply the sum of the $xg_i$ taken over our domain of interest. For example, the expected goals for a player in a season is the sum of all his $xg_i$’s for that season.

To predict *xG* we need a model to predict the $xg_i$’s, based on the characteristics of the attempt. We know, for instance, that shots taken closer to the goal line are more likely to score than shots from further away, so shot location will be a predictor in the model. All the *xG* models I’ve read about include shot location as a predictor; other common predictors are the shooting context (e.g. regular play, set-piece free kick etc.) and sometimes body part (e.g. header or kicked). So-called pre-shot models include only factors prior to the strike, while post-shot models also include factors determined by the striker, such as shot direction, speed and so forth.
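In symbols, the definition of *xG* as a sum of per-attempt scoring probabilities can be written as follows, where $S$ is my own label (not part of the notation above) for the set of attempts in the domain of interest, e.g. one player's shots in one season:

$$xG = \sum_{i \in S} xg_i$$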

The simplest type of *xG* model is a single-stage model where the $xg_i$’s are estimated from a single regression equation. But an alternative is a multi-stage model of the kind I describe here. The multi-stage framework recognises that for a shot to become a goal, the ball must negotiate three stages.

- It must not be blocked, *and*
- It must be on target, *and*
- It must cross the goal-line

Corresponding to this, we can build an expected goals model by combining three sub-models, M1, M2 and M3 as shown below.

###### Structure of the Three-Stage Expected Goals Model

The first-stage sub-model is applied to every shot, and estimates its probability ($p_1$) of not being blocked. The second-stage sub-model is applied to every non-blocked shot, and estimates its probability ($p_2$) of being on target, and the third sub-model is applied to every on-target shot and estimates its probability ($p_3$) of crossing the goal-line.

Next we aggregate to our desired level (e.g. player or playerXseason) and compute the means of $p_1$, $p_2$ and $p_3$ at each aggregated datapoint. The multiplication rule of probability says the total probability of scoring for a datapoint ($\bar{p}$) is obtained by multiplying:

$$\bar{p} = \bar{p}_1 \times \bar{p}_2 \times \bar{p}_3$$

where the bars indicate means. Finally, $\bar{p}$ multiplied by the number of shots gives the expected goals for that datapoint.
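The aggregation step can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the code used for this study: the record layout and the example numbers are hypothetical, $p_1$ is taken as the probability that a shot is *not* blocked (so that the three stage probabilities multiply), and each stage mean is taken over only those shots that reached that stage.

```python
from collections import defaultdict
from statistics import mean


def expected_goals(shots):
    """Return {datapoint: (n_shots, p_bar, xG)} from per-shot stage probabilities.

    Each shot is a tuple (datapoint, p1, p2, p3). p2 is None for blocked
    shots and p3 is None for shots that were blocked or off target, because
    each sub-model is only applied to shots that reached its stage.
    """
    grouped = defaultdict(list)
    for key, p1, p2, p3 in shots:
        grouped[key].append((p1, p2, p3))

    result = {}
    for key, rows in grouped.items():
        n_shots = len(rows)
        p1_bar = mean(r[0] for r in rows)
        p2_vals = [r[1] for r in rows if r[1] is not None]
        p3_vals = [r[2] for r in rows if r[2] is not None]
        # Assumption for this sketch: if no shot reached a stage, use 0.
        p2_bar = mean(p2_vals) if p2_vals else 0.0
        p3_bar = mean(p3_vals) if p3_vals else 0.0
        p_bar = p1_bar * p2_bar * p3_bar          # product of the stage means
        result[key] = (n_shots, p_bar, n_shots * p_bar)
    return result


# Hypothetical example: three shots by one player in one season.
shots = [
    ("Player A: 2011", 0.8, 0.5, 0.4),     # reached all three stages
    ("Player A: 2011", 0.6, None, None),   # blocked
    ("Player A: 2011", 0.7, 0.3, None),    # unblocked but off target
]
summary = expected_goals(shots)
print(summary["Player A: 2011"])
```

Keying the dictionary on the aggregation level means the same function works for player, team or playerXseason totals.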

# A Practical Example

So much for the theory, but does it work in practice? I tested the method using Opta EPL data for the four seasons 2010-2013. I used a ‘train-test’ paradigm to assess model performance. Here the model is built using the training data, and fit is assessed on the test data. This avoids the over-fitting that can occur when we test a model on the same data used to build it.

### Model Fit

I used seasons 2010 and 2012 as training data, and seasons 2011 and 2013 as test data, and I included both pre- and post- shot predictors in the model.
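For readers who want to reproduce this kind of split, here is a minimal sketch. The season assignments are the ones stated above; the shot records and their field names (`season`, `player`) are hypothetical stand-ins, since the Opta data format is not described here.

```python
TRAIN_SEASONS = {2010, 2012}
TEST_SEASONS = {2011, 2013}


def split_by_season(shots):
    """Split shot records into (train, test) lists by their 'season' field."""
    train = [s for s in shots if s["season"] in TRAIN_SEASONS]
    test = [s for s in shots if s["season"] in TEST_SEASONS]
    return train, test


# Hypothetical records, one per shot.
shots = [
    {"season": 2010, "player": "A"},
    {"season": 2011, "player": "A"},
    {"season": 2012, "player": "B"},
    {"season": 2013, "player": "B"},
]
train, test = split_by_season(shots)
print(len(train), len(test))
```

Splitting on whole seasons, rather than on randomly chosen shots, keeps every test playerXseason datapoint entirely out of the training data.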

To assess sub-model fit, I used ROC curves. In the diagrams below, the diagonal line represents random guessing and the curved ROC line represents the model. The “area under the curve” (AUC) is an indicator of model performance, and for any useful model ranges from 0.5 (pure guesswork) to 1.0 (perfect prediction); a value below 0.5 would mean the model ranks outcomes worse than chance. The larger the area the better the model. We can see that M3 performs well, and M2 rather less well, with M1 in between.
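To make the AUC concrete: it equals the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case, with ties counting one half. The sketch below computes it that way in plain Python. The labels and scores are made up for illustration, and this is not necessarily how the curves in this post were produced.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive correctly ranked above negative
            elif p == n:
                wins += 0.5      # ties count half
    return wins / (len(pos) * len(neg))


# Hypothetical example: 1 = outcome occurred (e.g. shot on target), 0 = it did not.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(auc(labels, scores))  # 0.75
```

The double loop is quadratic in the number of shots, which is fine for a demonstration; a rank-based calculation gives the same number more efficiently on large datasets.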

###### ROC Curves for the Three Sub-models

But how well does the overall xG model perform? My aggregation level for this study was playerXseason. There were 1,795 playerXseason datapoints in the test dataset, but many consisted of only a few shots. Eliminating datapoints with fewer than ten shots left 573 playerXseason datapoints, with an average shot count of 34. The six datapoints with the highest *xG* values are shown below. We can see the model predictions are quite good.

###### Predictions for Top 6 Goalscorers

Player X Season | No. of shots | Expected Goals (xG) | Total Goals |
---|---|---|---|
Van Persie: 2011 | 171 | 26.7 | 28 |
Suarez: 2013 | 181 | 20.2 | 30 |
Aguero: 2011 | 127 | 17.7 | 20 |
Giroud: 2013 | 111 | 16.9 | 15 |
Rooney: 2011 | 149 | 16.3 | 21 |
Adebayor: 2011 | 96 | 15.3 | 14 |

For comparison purposes, the next table shows two fit statistics for three models: a model based on shooting distance alone, a three-stage model with only pre-shot variables, and the full three-stage model. The full model explains more variance than the other two and has the lowest mean absolute error, so on both measures it is the best-fitting of the three.

###### Model Comparisons

Model | R² (variance explained) | MAE (mean absolute error) |
---|---|---|
Distance only | 70.5% | 1.5 |
Pre-shot | 72.6% | 1.4 |
Full model | 81.1% | 1.2 |
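As a rough illustration of how these two statistics are computed, the sketch below applies them to the six playerXseason rows from the top-goalscorers table. Two caveats: the exact definition of R² used for the table is not spelled out, so the `1 - SS_res / SS_tot` form here is an assumption, and six rows picked for having the highest *xG* are not the full test set, so the results will not match the table.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 - SS_res / SS_tot (one common definition of variance explained)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Rows from the top-6 table: expected goals and total goals.
xg = [26.7, 20.2, 17.7, 16.9, 16.3, 15.3]
goals = [28, 30, 20, 15, 21, 14]

print(round(mean_absolute_error(goals, xg), 2))
print(round(r_squared(goals, xg), 2))
```

On such a small sample a single large miss (Suarez: 2013) dominates both statistics, which is one reason to judge fit on the full set of datapoints rather than on the top of the table.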

Clearly the multi-stage technique delivers credible results and predicts new data well. I did not spend a lot of time optimizing the model, so the fit could possibly be improved.

# Why Bother?

A three-stage model is more complex than a single-stage model, so why bother when there are some decent single‑stage models around?

The key advantage is that the three-stage model reflects reality better than a single-stage model.

Consider swerve. Swerve is a significant predictor in both the M2 and M3 sub-models, but with opposite signs. It has a negative effect in the M2 sub-model, and a positive effect in the M3 sub-model. This fits with intuition. A swerving shot is less likely to be on target, but if it is on target it is more likely to score. In a single-stage model these two effects tend to cancel out, so the model understates the importance of swerve; the three-stage model properly reflects the dual role of swerve within the scoring process.
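A toy calculation shows how the cancellation can happen. The probabilities below are invented purely for illustration and are not estimates from the model: the swerving shot is less likely to be on target but more likely to score once on target, and the two effects offset exactly.

```python
# Hypothetical stage probabilities, chosen only to illustrate the cancellation.
straight = {"p_on_target": 0.50, "p_goal_given_on_target": 0.30}
swerving = {"p_on_target": 0.30, "p_goal_given_on_target": 0.50}


def p_goal_unblocked(shot):
    """Probability that an unblocked shot scores: the M2 stage times the M3 stage."""
    return shot["p_on_target"] * shot["p_goal_given_on_target"]


# Swerve changes both stage probabilities, yet the overall chance is unchanged.
same_overall = p_goal_unblocked(straight) == p_goal_unblocked(swerving)
print(same_overall)
```

A model that only sees the final outcome would find no swerve effect in this example, even though swerve matters at both stages.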

In the same way, predictors which only have meaning at specific stages of the scoring process can be modelled in a natural way. For example, where the ball crosses the goal-line is an important predictor of goals, but is undefined for blocked shots and off-target shots. In a three-stage model, this factor could be included in the M3 sub-model without affecting the specification of the other sub-models.

# The Bottom Line

I believe that a multi-stage modelling approach offers potential advantages both in football and in other sports. It would be interesting to see if the same approach works for expected goals in hockey. Staged processes like serving in tennis, where the ball must pass the net, land in court, and beat the receiver, could also be modelled in a similar way.