The football analytics world is awash with Expected Goals (xG) models. An xG model estimates the probability of scoring a goal given the features of a shot such as its location, whether it was preceded by an assist and so on. Those new to this concept can find an explanation in Paul McInnes’s article in the Guardian.
xG models vary widely. Some like this one from Paul Riley use a small number of shot features to estimate the probability of scoring; others like this one from Michael Caley use complex feature transformations and systems of stacked equations; this one by Martin Eastwood uses a non-linear machine learning algorithm.
Not all models are created equal. All are incomplete and imperfect; simplifications of the real world that help us understand it. So which xG model should we prefer? Is Paul Riley’s model too austere? Is Michael Caley’s model over-complicated? This is the first of a two-part post about assessing how well xG models describe the conversion of shots into goals. In this part I look at modelling individual shots. In Part 2 I will look at the more complex issue of modelling match scores.
While some great creativity has gone into model building, model assessment has remained remarkably unsophisticated. We simply don’t know how good or bad our xG models are at predicting goals – although we think we do. As Nils McKay has pointed out, most people have assessed their models by aggregating the predictions over a season and regressing the actual team totals on the predicted team totals. The quality of the model is then assessed by computing the “variance explained” of the aggregated data, i.e. how much of the variance in the data can be explained by the model. Models with high explanatory power generate predictions that sit close to the regression line, while models with low explanatory power generate outlying data points. Variance explained is denoted R2 and the formula is below:
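For reference, the usual definition of R2 can be written as follows. The symbols are labels chosen here for illustration: y_i is the actual season total for team i, \hat{y}_i is the value fitted by the regression line, and \bar{y} is the mean of the actual totals.

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$

The numerator is the variation left unexplained by the regression, and the denominator is the total variation in the team totals.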
R2 is widely used in OLS regression, where the outcome variable is continuous and linearly related to the predictors, and it has some nice properties. It always lies between 0 (representing a completely uninformative model) and 1 (representing a perfect model). It is a pretty good indicator of model fit for OLS regression, but in our case it is far from ideal. The problems lie in the aggregation. First, the R2 you get depends on how many shots are grouped at each datapoint. So for example a model would score worse if you aggregated over a Bundesliga season than over an EPL season, because there are fewer matches and fewer shots at each aggregate datapoint. Similarly, predictions will be more variable for teams with fewer shots. Second, even a completely non-informative model that assigns the same probability of scoring to EVERY shot does a pretty good job of predicting team totals over a season, and produces a respectable R2 value at this level of aggregation, simply because teams that take more shots tend to score more goals.
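The point about the non-informative model can be illustrated with a toy simulation. This is only a sketch under invented assumptions: the number of teams, the range of shots per team, and the three shot-quality probabilities are all hypothetical and are not taken from the data used in this post.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation, i.e. R2 of a simple linear regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def simulate_season(n_teams=20, seed=1):
    """R2 of season totals for a model that gives every shot the same probability."""
    rng = random.Random(seed)
    shots_per_team, goals_per_team = [], []
    for _ in range(n_teams):
        n_shots = rng.randint(350, 650)  # hypothetical season shot counts
        # Each shot has its own true scoring probability (hypothetical values).
        goals = sum(
            rng.random() < rng.choice([0.03, 0.08, 0.30]) for _ in range(n_shots)
        )
        shots_per_team.append(n_shots)
        goals_per_team.append(goals)
    # The non-informative model: one conversion rate applied to every shot.
    base_rate = sum(goals_per_team) / sum(shots_per_team)
    predicted = [base_rate * n for n in shots_per_team]
    return r_squared(predicted, goals_per_team)


print(simulate_season())
```

Any R2 above zero here comes purely from differences in shot volume between teams, since the model knows nothing about individual shots.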
Finally, the results depend on the way the data is aggregated. For example I and others (Martin Eastwood, Will Gurnipar Morgan) have found that our xG models predict aggregate goals scored better than they predict aggregate goals conceded; same data, same models, but different aggregations. There may be a straightforward explanation for this. xG models don’t usually incorporate defensive information, so the estimated xG values reflect the probability of scoring against an average defence. This works when you aggregate shots taken by a team, but the shots conceded by a team involve a specific defence which may be stronger or weaker than average. Accordingly, xG will over-estimate the goals conceded for a strong defensive team, and under-estimate it for a weak defensive team.
There is nothing wrong with aggregating the data, for example to see whether particular teams or players under-perform or over-perform compared to expectations, but using aggregate data to assess the fit of an xG model is a sub-optimal strategy, at least when predictions for individual shots are available.
Fortunately, there is no need to aggregate at all. An xG model is a binary classifier; for every event it estimates the probability of belonging to the target class (in our case a goal). Assessing the model fit and predictive ability of binary classifiers is something that statisticians and data scientists have long studied; many useful fit measures have been developed, and their strengths and weaknesses have been established. In this post I will look at three of these measures, and apply them to several different xG models to highlight the differences between them.
1. Root Mean Square Error
The RMSE is simple to calculate. For each event, subtract the predicted value from the actual value and square it. Sum the squared values and take the average. Then take the square root of that. In the case of an xG model, the actual value will be 1 if the event was a goal and 0 otherwise, and the predicted value will be the model-predicted probability of scoring.
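The steps above can be sketched in a few lines of Python. The function name and the toy data are hypothetical, chosen only for illustration.

```python
import math


def rmse(outcomes, predictions):
    """Root mean square error between 0/1 outcomes and predicted probabilities."""
    if len(outcomes) != len(predictions) or not outcomes:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(y - p) ** 2 for y, p in zip(outcomes, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical shots: 1 = goal, 0 = no goal, with made-up xG values.
goals = [1, 0, 0, 0, 1]
xg = [0.7, 0.1, 0.05, 0.3, 0.2]
print(rmse(goals, xg))
```

One property worth keeping in mind: a model that predicts the same probability p for every shot, evaluated on data whose conversion rate is also p, has an RMSE of sqrt(p(1 - p)), which is about 0.30 when p is around 0.1. A low-looking RMSE is therefore not, on its own, evidence of a useful model.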
2. McFadden’s pseudo-R2
A pseudo-R2 is an analogue of the regular OLS R2 for binary classifiers. There are many pseudo-R2 measures, but McFadden’s is the most popular.
McFadden’s R2 is based on comparing model likelihoods. The likelihood is defined as the probability of the data given the model, with higher likelihoods representing more likely or better fitting models. For practical reasons we usually deal in log-likelihoods (denoted LL). Likelihoods, being probabilities, fall between 0 and 1, so LLs are negative or zero. Imagine model M with a likelihood of .9 and a comparison model C with a likelihood of .1. Using natural logs, the LLs are respectively about -0.105 and -2.303, and their ratio is about 0.05. Thus, when the LLM / LLC ratio is small, M is a better fit than the comparison. In McFadden’s measure, the comparison model is a ‘null’ model (a model without any predictors) and the R2 is calculated like this:
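Written out, the standard form of McFadden’s measure is as follows, where LL_M is the log-likelihood of the model being assessed and LL_null is the log-likelihood of the null model (the subscript "null" is a label chosen here):

$$
R^2_{\text{McFadden}} = 1 - \frac{LL_M}{LL_{\text{null}}}
$$

If the comparison model in the example above were the null model, the ratio of about 0.05 would give an R2 of about 0.95.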
Like the regular R2, McFadden’s R2 varies between 0 (a model that fits no better than the null model) and 1 (a model that fits the data perfectly).
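As a minimal sketch, McFadden’s R2 can be computed from 0/1 outcomes and predicted probabilities as shown here. Two assumptions are made that are not stated in this post: the null model is taken to be the conversion rate of the sample being evaluated, and predictions are clipped away from exactly 0 and 1 so that the logarithm is always defined. The function names are hypothetical.

```python
import math


def log_likelihood(outcomes, predictions, eps=1e-15):
    """Bernoulli log-likelihood of 0/1 outcomes given predicted probabilities."""
    total = 0.0
    for y, p in zip(outcomes, predictions):
        p = min(max(p, eps), 1 - eps)  # clip so log() never sees 0 (assumption)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def mcfadden_r2(outcomes, predictions):
    """1 - LL_model / LL_null, with the null model set to the sample base rate."""
    base_rate = sum(outcomes) / len(outcomes)
    ll_model = log_likelihood(outcomes, predictions)
    ll_null = log_likelihood(outcomes, [base_rate] * len(outcomes))
    return 1 - ll_model / ll_null


# Hypothetical shots: 1 = goal, 0 = no goal, with made-up xG values.
goals = [1, 0, 0, 0, 1]
xg = [0.7, 0.1, 0.05, 0.3, 0.2]
print(mcfadden_r2(goals, xg))
```

Note that when a model is scored on held-out data, nothing in this calculation stops the value from dropping below 0 if the model does worse than the null model.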
3. ROC Curves and the AUC
Receiver operating characteristic (ROC) curves were developed in the 1950s for visualizing the performance of classifiers. They are commonly used today in medical decision making, as well as in machine learning and data mining. The ROC curve is a plot of the True Positive Rate (the percentage of actual goals that the model classifies as goals) against the False Positive Rate (the percentage of misses that the model classifies as goals), traced out as the classification threshold is varied. It would take too long to explain ROC curves in this post, but if you want to know more you can watch an excellent video tutorial on ROC curves here. AUC stands for Area Under the Curve, and has an appealing intuitive interpretation; given a randomly chosen goal and a randomly chosen miss, the AUC is the probability that the model assigns the higher scoring probability to the goal. For any model that does at least as well as chance, AUC therefore varies between 50% and 100%. Figure 1 shows idealised ROC curves for four different imaginary models.
The blue curve is the ROC for a completely non-informative model, and represents the lower bound of performance. This model performs no better than chance and has an AUC of 50%. The black line is the ROC curve for a perfect model, which always selects the correct class, and the AUC is 100%. The yellow and red curves are indicative of typical models having AUCs between 50% and 100%.
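The pairwise interpretation of the AUC translates directly into code. This sketch compares every goal with every miss and counts how often the goal received the higher predicted probability, with ties counted as half. It is written for clarity rather than speed, and the function name and toy data are hypothetical.

```python
def auc(outcomes, predictions):
    """AUC as the share of (goal, miss) pairs in which the goal is ranked higher."""
    goal_preds = [p for y, p in zip(outcomes, predictions) if y == 1]
    miss_preds = [p for y, p in zip(outcomes, predictions) if y == 0]
    if not goal_preds or not miss_preds:
        raise ValueError("need at least one goal and one miss")
    wins = 0.0
    for g in goal_preds:
        for m in miss_preds:
            if g > m:
                wins += 1.0
            elif g == m:
                wins += 0.5  # ties count as half a win
    return wins / (len(goal_preds) * len(miss_preds))


# Hypothetical shots: 1 = goal, 0 = no goal, with made-up xG values.
goals = [1, 0, 0, 0, 1]
xg = [0.7, 0.1, 0.05, 0.3, 0.2]
print(auc(goals, xg))
```

A model that gives every shot the same probability ties on every pair and so scores exactly 0.5, which matches the blue curve in Figure 1.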
With these fit measures in mind, let’s look at how some models perform. The data consisted of 39,627 shots from four EPL seasons. Now it is always possible to improve the fit of a model by including more and more predictors; but doing so risks modelling the random error in your dataset instead of the underlying relationship, a process called “overfitting”. To avoid this it is preferable to test a model on previously unseen data. In the calculations below, the training set (used to build the models) was a 70% random sample of the data, and the test set (used to compute the model fit measures) was the remaining 30%. Penalties were included, and own goals were excluded. The fit measures discussed are summarized in a table at the end.
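A 70/30 random split of this kind can be sketched as follows. The function name, the fixed seed and the rounding rule are assumptions made for illustration; they are not the procedure actually used to produce the results below.

```python
import random


def train_test_split(shots, train_fraction=0.7, seed=0):
    """Shuffle the shots and split them into a training set and a test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(shots)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical shot identifiers standing in for real shot records.
shot_ids = list(range(100))
train, test = train_test_split(shot_ids)
print(len(train), len(test))
```

The models would then be fitted on `train` only, and the fit measures computed on `test`.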
[Update] After publishing the first version of this post, I came across Jan Mullenberg’s post on Opta Big Chances and David Sumpter’s analysis showing that the Opta Big Chance feature on its own predicted goals almost as well as a model with 18 features in it. So I added a Big Chances Only model, and updated the tables and charts accordingly.
The models I checked out were:
Deadspin Model. This is the model made famous in a magnificent article by Michael Bertin. It simply assigns the same probability of scoring to every shot.
Conversion Rate Model. This model just assumes the probability of scoring is determined by one of the seven shooting contexts defined by Opta (Penalty, Regular play, Fast break, Set piece, From corner, Free kick, Throw-in set piece).
Standard Model. This is a typical basic xG model which has three location features (x, y, angle) and three type-of-play features (assisted/non-assisted; header/non-header; shooting context). Shooting context is one of the seven types defined by Opta: Penalty, Regular play, Fast break, Set piece, From corner, Free kick, Throw-in set piece.
Stacked Equations. Same predictors as the Standard Model, but with separate regression equations for each shooting context.
Big Chance Model. One predictor only: the Opta Big Chance feature.
Standard+Big. Same predictors as Standard Model, but with the Opta Big Chance feature added.
(I also made a table entry for Martin Eastwood’s Support Vector Machine model, because the RMSE he reports is comparable.)
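The two simplest models in the list above need no regression at all, and a sketch may help make them concrete. The data layout (pairs of shooting context and a 0/1 goal flag), the function names and the toy numbers are all hypothetical.

```python
from collections import defaultdict


def fit_constant_rate(shots):
    """Deadspin-style model: one scoring probability for every shot."""
    return sum(is_goal for _, is_goal in shots) / len(shots)


def fit_conversion_rates(shots):
    """Conversion Rate model: one scoring probability per shooting context."""
    counts = defaultdict(lambda: [0, 0])  # context -> [goals, shots]
    for context, is_goal in shots:
        counts[context][0] += is_goal
        counts[context][1] += 1
    return {context: g / n for context, (g, n) in counts.items()}


# Made-up training data: (shooting context, 1 if goal else 0).
training_shots = (
    [("Penalty", 1)] * 3 + [("Penalty", 0)]
    + [("Regular play", 1)] + [("Regular play", 0)] * 9
)

overall_rate = fit_constant_rate(training_shots)
context_rates = fit_conversion_rates(training_shots)
print(overall_rate, context_rates)
```

To score a new shot, the first model returns `overall_rate` regardless of the shot, and the second looks up the shot's context in `context_rates`.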
Table 1. Comparative Fit Measures
|Model|RMSE|McFadden R2|AUC|
|---|---|---|---|
|Big Chance Only|0.271|0.18|0.725|
|Standard + Big Chance|0.264|0.22|0.807|
|Martin Eastwood SVM|0.269|?|?|
Looking at the fit measures table, we see the Deadspin model has an RMSE of .301. Actually that doesn’t look too bad compared to the other models, but the real failure of this model is revealed by the zero McFadden R2 and the 50% AUC, both of which show the model is completely useless as a classifier. The Conversion Rate model does only slightly better, with an RMSE of .291, a McFadden R2 of .04 and an AUC of 56%.
The Standard and Stacked Equations models, which include location information, show a dramatic improvement in quality, with a McFadden R2 of .17 and an AUC of 78.7%. But the surprising thing is how well the Big Chances Only model works. It actually outperforms the Standard model on the RMSE and McFadden measures, but not on the AUC.
Finally, adding the Big Chance feature to the Standard model has a substantial effect on performance; adding this one feature produces noticeable improvements in all three fit measures. Figure 2 shows what happens to the ROC curve when we add Big Chances to the Standard Model.
The increased area under the red curve indicates the predictive ability of the model is elevated when the Big Chance variable is included, and we can see that the area of improvement is in the low False Positive region of the curve. (Jan Mullenberg has found precisely the same effect in the Dutch Eredivisie.)
The take-away here is that the standard measures of model fit do a good job, and ought to be used more widely. Measures like the AUC or McFadden’s R2 are particularly useful because they are bench-marked at both ends. The lower end represents a model that performs no better than chance while the top end represents a model that delivers completely accurate predictions. Positioning a model on this scale gives us a good intuitive sense of its ability to describe and predict the data. For example the better models presented here seem to have a good ability to discriminate between shots that score and those that don’t, with AUCs between 75% and 80%. Although I don’t have other data outside of football to compare it with, that sounds quite good to me.
A secondary issue is the role of the Big Chance feature in predicting goals. This is a tag that Opta encoders add to shots that they deem to be big chances, but it is a subjective judgement with no formal definition, and there has been some speculation that its massive predictive power is due to outcome bias on the part of the encoders. The case for outcome bias has not yet been made to my satisfaction, and it is equally plausible that the Big Chance feature is a proxy for distance from the goal, a thin defensive formation in front of goal and so on, something I hope to explore in a later post.
APPENDIX: Conversation with Nils McKay
Whilst writing this post I had an interesting email exchange with Nils McKay. To assess the quality of some of the most well-known xG models around, Nils tested them against a standard dataset (see here and here). Because it was difficult to get predictions for individual shots, the data was aggregated by Team*Match and the data to be predicted were the goals scored by each team. The performance criterion was an RMSE measure. Nils has done an excellent and valuable job of ranking the models; but the RMSE doesn’t tell us much about absolute performance.
To address this, Nils attempted to develop a best possible benchmark (a ‘perfect’ model) by computing the fit of his own model against 200 simulations of its predictions. In my view this is a bridge too far. Essentially it just evaluates the predictive ability of a particular model on the data used to generate it; so naturally it will perform quite well. But I can’t see how it could function as a best possible benchmark for other models. Indeed in the latest iteration of his tests, some of the models tested outperformed the ‘perfect’ model.
So what is the alternative? I will attempt to address that in Part 2 of this post.