We intuitively know that some passes are more difficult than others. A short pass is easier to complete than a long pass, isn't it? And unless this is taken into account, we can't really measure passing ability. For this reason, some analysts have used event data to develop passing models. These models use characteristics such as the pass start and end locations to predict the probability of a pass being completed. For example, Will Gurpinar-Morgan used the random forest algorithm to develop a passing model that achieved a very respectable AUC of 0.87. (The AUC measures the performance of a classification model, and varies between 0 and 1, with 1 representing perfect performance.) StatsBomb have also developed a passing model using 20 million passes, but they have not published any details about its performance. Once we know the probability of any given pass being completed, we can calculate measures of passing performance that are adjusted for pass difficulty.
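To make the AUC concrete, here is a toy pairwise implementation: the AUC is the probability that a randomly chosen completed pass gets a higher model score than a randomly chosen failed pass, with ties counting half. This is just a sketch for illustration; real work would use a library routine such as scikit-learn's roc_auc_score.

```python
def auc(y_true, y_score):
    """Pairwise (Mann-Whitney) view of the AUC: the probability that a
    randomly chosen positive case is scored above a randomly chosen
    negative case, counting ties as half a win."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.2]))  # 1.0: perfect ranking
print(auc([1, 0, 1, 0], [0.9, 0.9, 0.1, 0.1]))  # 0.5: no discrimination
```

On this scale, 0.5 is a model that ranks passes no better than chance, so 0.87 represents a strong ability to separate completed from failed passes.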

I was interested to see whether I could use deep learning to develop a high-performing passing model. But why bother? After all, predicting the outcome of a pass from event data is a straightforward classification problem, and not the kind of high-dimensional problem that deep learning excels at. We only have a limited number of features to care about: the pass coordinates, perhaps some contextual details like the previous event, and maybe some game state variables if we want to include them. But developing a passing model does involve examining a very large number of passes, and deep learning really comes into its own when we have large datasets to train on.

But before getting to the modelling itself, I want to talk about the data exploration phase. It turned up something curious which I thought I should write about.

(NB This is the second version of this post, following some helpful input from Thom Lawrence)

The Data

I used passing data from the Big 5 European domestic leagues from 2015 to 2017, a total of 15 seasons. Corners, throw-ins, set pieces, headed passes and so on were excluded, giving about 4.2 million open-play passes to work with. I first made a couple of adjustments to deal with non-standard passes.

Non-Standard Passes

One of the strongest predictors of pass completion is pitch location. When I examined this feature I found two kinds of non-standard pass in the data.

In the chart below, the blue curve shows that the probability of a pass being completed depends on where it originates on the pitch. But instead of a completely smooth curve, there are two discontinuities. First, there is a noticeable peak in the middle of the plot, indicating an unusually high pass completion rate. I suspected this was due to kickoffs, which are taken from the centre of the pitch and are virtually always successful. Next, there is a noticeable dip around the 16 metre mark, which I suspected was due to goal kicks.

Fig 1. Eliminating non-standard passes

I opted to keep kickoffs in the analysis and tag them with a kickoff indicator. I also tagged the pass following the kickoff. As I was mostly interested in outfield play, I decided to drop all goalkeeper passes.
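The tagging logic is simple enough to sketch in a few lines. The field names below are illustrative, not the data provider's actual schema:

```python
# Hypothetical event list, one dict per pass, in match order.
events = [
    {"type": "kickoff", "goalkeeper": False},
    {"type": "pass",    "goalkeeper": False},
    {"type": "pass",    "goalkeeper": True},
    {"type": "pass",    "goalkeeper": False},
]

# Tag kickoffs, and tag the pass immediately following each kickoff.
for i, ev in enumerate(events):
    ev["is_kickoff"] = ev["type"] == "kickoff"
    ev["follows_kickoff"] = i > 0 and events[i - 1]["type"] == "kickoff"

# Drop goalkeeper passes, since the analysis focuses on outfield play.
events = [ev for ev in events if not ev["goalkeeper"]]
```

Keeping the kickoff indicator lets the eventual model learn the near-certain completion of kickoffs rather than having that distort the location features.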

The red curve shows the data when kickoffs and goalkeeper passes are removed, and we can see that the previous discontinuities have almost completely disappeared.

So far so good, but it was the next problem that literally gave me sleepless nights.

Short passes, blocks, interceptions and all that

Another key determinant of pass success is pass length. We all know that short passes are easier to complete than long ones; not exactly rocket science. On the left-hand side of the figure below, the probability of pass completion is plotted against pass length (to clarify the effect, kickoffs are excluded).
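The curve in that plot is just a binned completion rate. A minimal sketch of the computation (bin width and function name are my own choices, not from the original analysis):

```python
def completion_by_length(lengths, completed, bin_width=5.0):
    """Group passes into length bins and return, for each bin's lower
    edge, the fraction of passes in that bin that were completed."""
    groups = {}
    for length, done in zip(lengths, completed):
        lower_edge = (length // bin_width) * bin_width
        groups.setdefault(lower_edge, []).append(done)
    return {edge: sum(v) / len(v) for edge, v in groups.items()}

# Four toy passes: lengths in metres, 1 = completed, 0 = failed.
rates = completion_by_length([2, 3, 7, 8], [1, 0, 1, 1])
print(rates)  # {0.0: 0.5, 5.0: 1.0}
```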

But …. WAIT a moment!

We can see that passes shorter than about 5 metres have a disturbingly low probability of completion.  What’s going on here?

I'm not the first person to notice this; Neil Charles and Will Gurpinar-Morgan both found something similar. Will suggested that many of these short passes are intended long passes which are blocked or intercepted. Such passes would have an end location at the point of interruption, giving what looks like a failed short pass. This sounds like a reasonable interpretation of the data, and accordingly Will removed passes shorter than 5 m from his analysis.

However, Will didn't have any empirical data to support his conjecture, and I struggled at first to find any support for it. For one thing, the total number of blocks and interceptions in event data is generally quite low compared to the total number of passes. In the present dataset, only 5% of passes are either blocked or intercepted. For passes shorter than 5 m, however, the interruption rate rises to about 31%. I removed the blocked and intercepted passes and re-plotted the data to see what would happen. At first it looked like nothing much had changed. However, Thom Lawrence suggested that other events like ball touches could also be considered as pass interruptions. I added ball touches as a further interruption, deeming an interruption to occur if the touch followed an unsuccessful pass within 1 second and within 5 m of the point where the pass was made.
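The ball-touch rule can be written down directly. The thresholds (1 s, 5 m) are the ones described above; the dictionary field names are assumptions for the sketch:

```python
import math

def touch_interrupts(pass_event, touch_event, max_dt=1.0, max_dist=5.0):
    """Treat a ball touch as interrupting a pass if it follows an
    unsuccessful pass within max_dt seconds and within max_dist metres
    of the point where the pass was made."""
    if pass_event["completed"]:
        return False
    dt = touch_event["t"] - pass_event["t"]
    dist = math.hypot(touch_event["x"] - pass_event["x"],
                      touch_event["y"] - pass_event["y"])
    return 0.0 <= dt <= max_dt and dist <= max_dist

failed = {"completed": False, "t": 10.0, "x": 50.0, "y": 30.0}
print(touch_interrupts(failed, {"t": 10.5, "x": 52.0, "y": 31.0}))  # True
print(touch_interrupts(failed, {"t": 12.0, "x": 52.0, "y": 31.0}))  # False: too late
```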

The results are shown in the right hand plot in the figure below. This looks somewhat more encouraging, although short passes do still have unexpectedly low completion rates.

So why are short passes difficult to complete? One potential factor seems to be pitch location. It turns out that short passes are disproportionately likely to be attempted high up the pitch: 35% of short passes are made within 30 m of the opposition goal line, as opposed to only 14% of longer passes. The figure below shows how pass completion varies with pitch location for three categories of pass length. We can see there is a disproportionately big drop-off in the completion rate for short passes high up the pitch. Perhaps many of these passes are made at pace during the attacking phase, and the receiver doesn't have time to react. But this doesn't really explain why short passes are more difficult than medium-length passes at all pitch locations. We probably need off-the-ball or tracking data to really understand what's going on.

Fig 3. Effects of Pitch location for short, medium and long passes.
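The 35% vs 14% comparison above amounts to asking, for each length category, what fraction of passes originate near the opposition goal line. A minimal sketch, assuming a 105 m pitch with x running towards the opposition goal (names and conventions are mine, not the data provider's):

```python
def share_near_goal(x_start, is_short, pitch_length=105.0, zone=30.0):
    """Fraction of passes originating within `zone` metres of the
    opposition goal line, computed separately for short and longer
    passes. Assumes x runs from 0 to pitch_length towards the goal."""
    short = [x for x, s in zip(x_start, is_short) if s]
    longer = [x for x, s in zip(x_start, is_short) if not s]
    frac = lambda xs: sum(x >= pitch_length - zone for x in xs) / len(xs)
    return frac(short), frac(longer)

# Toy data: origins in metres, flags marking short passes.
print(share_near_goal([100, 10, 90, 20], [True, True, False, False]))
```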

At this point I decided to park the issue. I'm sleeping again now, thanks to Thom Lawrence, but I think this issue could benefit from some further detailed analysis. Anyway, on to the problem of interrupted passes.

Dealing with Interrupted Passes

For the purposes of this post, 'interrupted' passes are the 5% of passes which don't reach their intended destination because they are blocked or intercepted by an opponent, or prematurely terminated by a ball touch. Interrupted passes pose a tricky problem for models of pass completion which use the intended end point of a pass as a predictor, because that end point is not known. One option would be to exclude such passes from the analysis altogether, but interrupted passes make up about 26% of all unsuccessful passes, and leaving them out would result in a considerable overestimate of completion rates. Predicting whether a pass would be interrupted without access to off-the-ball data didn't seem like a viable solution either.
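A quick back-of-the-envelope calculation shows why dropping interrupted passes inflates the completion rate. The numbers here are made up for illustration, not the dataset's actual figures:

```python
# Toy numbers: 100 passes, 80 completed, and 5 of the 20 failures
# are interruptions. Dropping the interrupted passes removes failures
# only, so the apparent completion rate rises.
completed, total, interrupted = 80, 100, 5
naive_rate = completed / total                    # 0.80
dropped_rate = completed / (total - interrupted)  # 80 / 95 ≈ 0.842
print(round(naive_rate, 3), round(dropped_rate, 3))
```

Since interruptions are concentrated among failures, excluding them biases every downstream completion estimate upwards.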

However, we do know the origin and direction (i.e. angle) of an interrupted pass, so it is possible to estimate the intended end point, or at least a range of plausible intended end points. I decided to use hot deck imputation to do this. Hot deck imputation is a method for handling missing data in which missing values are replaced by observed values from similar cases. An interrupted pass then becomes an unsuccessful pass like any other, except that the intended end point is imputed rather than observed. (A nice discussion of missingness and imputation methods can be found here.)

Some experimentation was needed to arrive at an imputed distribution of end-points that mirrored the observed distribution for passes originating at every point on the pitch. I found that dividing the data into about 90 cells defined by the xy coordinates of the pass origin and the pass direction, and hot-decking within each cell, gave quite good results.
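The cell-based hot deck can be sketched as follows. The binning here (a 5x5 origin grid crossed with 4 direction sectors, giving 100 cells, roughly comparable to the ~90 used above) and the field layout are assumptions for illustration:

```python
import math
import random

rng = random.Random(42)  # seeded for reproducibility

def cell_key(x, y, angle, n_xy=5, n_dir=4, pitch=(105.0, 68.0)):
    """Discretise a pass origin and direction into a coarse cell:
    a 5x5 grid over the pitch crossed with 4 angular sectors."""
    ix = min(int(x / pitch[0] * n_xy), n_xy - 1)
    iy = min(int(y / pitch[1] * n_xy), n_xy - 1)
    ia = int((angle % (2 * math.pi)) / (2 * math.pi) * n_dir) % n_dir
    return ix, iy, ia

def hot_deck(interrupted, observed_ends):
    """For each interrupted pass (x, y, angle), borrow an end point at
    random from the donor pool of observed end points in its cell."""
    imputed = []
    for x, y, angle in interrupted:
        donors = observed_ends[cell_key(x, y, angle)]
        imputed.append(rng.choice(donors))
    return imputed

# One interrupted pass whose cell has a single donor end point.
donors = {(0, 0, 0): [(30.0, 12.0)]}
print(hot_deck([(10.0, 10.0, 0.0)], donors))  # [(30.0, 12.0)]
```

Because donors come from the same origin/direction cell, the imputed end points inherit the empirical length and angle distribution of comparable passes, which is what the next figure checks.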

Figure 4 shows that the distributions of the observed and imputed data were quite similar.

Fig 4. Comparison of observed and imputed data.

It looks like the imputed passes are somewhat shorter than the observed passes, but I think this introduces less bias than leaving them out altogether.

In the next post, I will describe how I used this data to train a Deep Learning model of pass completion. Stay tuned!