The fluidity of football is one of the main obstacles to analysis. Unlike baseball or cricket, which are divided into discrete one-v-one plays, football consists of passing and dribbling sequences which may be long or short, cover the whole pitch or some small area of it, and involve varying numbers of players. It would be useful to have a way of classifying these sequences, and that is what I will discuss in this post. The aim is to identify a manageable number of distinct sequence types.
The unsupervised statistical procedure for assigning data examples to distinct types is known as cluster analysis. I’ll first describe a clustering experiment, employing a type of neural network called a convolutional autoencoder. (Spoiler alert: it didn’t work.)
Clustering Sequences using a Convolutional Autoencoder
An autoencoder is a type of neural network model. Its purpose is to produce a low-dimensional representation of its inputs, in much the same way as principal components analysis converts a large number of variables into a small number of components that preserve most of the original information. An autoencoder consists of two parts, an encoder and a decoder. The encoder compresses the input into a low-dimensional space, and the decoder then expands it again. The model is trained to make the output of the decoder resemble the original input as closely as possible. When this has been achieved, we know that the compressed data produced by the encoder is a fair representation of the input. This compressed version may then be used downstream in other analyses. A basic explanation of how autoencoders work can be found here.
My inputs to the autoencoder were RGB images of 20,000 pass sequences selected randomly from the 2015-2017 seasons of the Big 5 European leagues. The sequences were drawn as 420*272 pixel images, and then downsized to 104*68 pixels. The start of a sequence was represented by a green blob, and the end of a sequence by a red blob (see Figure 1 below). The geometry of the sequence was traced out by a white line, with thick strokes representing a pass and thinner strokes representing a dribble. The number of features in the input was 21,216 (104*68*3), far more than most statistical procedures can handle comfortably.
I built the encoder using convolutional layers, which are typically used in image-processing applications. Convolutional neural networks (CNNs) learn image features rather than image patterns; a regular network can be trained to recognise, say, the figure “2” at a particular location within an image, but would not recognise the same figure at a different location. A convolutional network, on the other hand, can be trained to recognise a figure “2” located anywhere in the image. For an introduction to CNNs see here.
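To make the "anywhere in the image" point concrete, here is a minimal pure-Python sketch. It is not a trained network: the 2x2 `kernel` is a hand-written, hypothetical filter, and the `pattern` is a tiny diagonal stroke standing in for a feature. Sliding the same filter over the image gives the same peak response wherever the stroke sits; only the position of the peak changes.

```python
# Illustrative only: a hand-written 2x2 filter, not a trained CNN layer.

def cross_correlate(image, kernel):
    """Slide `kernel` over `image` (no padding) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def blank(height, width):
    """Return an all-zero image as a list of rows."""
    return [[0] * width for _ in range(height)]


def stamp(image, pattern, top, left):
    """Copy `pattern` into `image` with its top-left corner at (top, left)."""
    for i, row in enumerate(pattern):
        for j, value in enumerate(row):
            image[top + i][left + j] = value
    return image


def peak(response):
    """Return (value, row, col) of the strongest response."""
    return max((v, r, c)
               for r, row in enumerate(response)
               for c, v in enumerate(row))


pattern = [[1, 0], [0, 1]]      # a tiny diagonal stroke
kernel = [[1, -1], [-1, 1]]     # hypothetical filter tuned to that stroke

image_a = stamp(blank(6, 8), pattern, 1, 1)   # stroke near the top left
image_b = stamp(blank(6, 8), pattern, 3, 5)   # same stroke, moved elsewhere

peak_a = peak(cross_correlate(image_a, kernel))
peak_b = peak(cross_correlate(image_b, kernel))
print(peak_a, peak_b)
```

The peak value is identical in both images, which is the property that lets a convolutional layer respond to a feature regardless of where it appears.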
I experimented with various configurations of autoencoder. For the technically inclined, the encoder producing the encoded image in Figure 1a was shallow, consisting of a single convolutional layer with four filters and ReLU activation. The compressed layer had 1664 units, which was a substantial reduction in the input size, but as we can see the decoded output is rather rough. The encoder producing the image in Figure 1b was deeper, having two convolutional layers with 32 and four filters respectively, and 3328 units in its compressed layer. Clearly, this model captured more information at the expense of lower compression.
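The trade-off between the two encoders can be put in numbers with a few lines of arithmetic. This sketch uses only the figures quoted above (the input dimensions and the two compressed-layer sizes); the dictionary labels are my own.

```python
# Figures quoted in the post: input image size and compressed-layer sizes.
WIDTH, HEIGHT, CHANNELS = 104, 68, 3
input_features = WIDTH * HEIGHT * CHANNELS

compressed_units = {"shallow (Figure 1a)": 1664, "deeper (Figure 1b)": 3328}

# Compression ratio = number of input features per compressed unit.
ratios = {name: input_features / units
          for name, units in compressed_units.items()}

for name, ratio in ratios.items():
    print(f"{name}: {input_features} -> {compressed_units[name]} units "
          f"({ratio:.2f}x smaller)")
```

So the shallow encoder compresses the input by a factor of 12.75, and the deeper one by half that.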
The next step was to cluster the compressed data. I tried both K-means clustering and Gaussian mixture modelling on both types of encoded data. What we hope to see is some evidence for a preferred number of clusters, and we typically look for this by examining how a fit measure such as the Bayesian Information Criterion (BIC) evolves as a function of the number of clusters. The preferred number of clusters is the point at which the fit measure reaches a minimum, or at least ceases to improve as the number of clusters increases, or at which there is an “elbow” in the curve. However, under various fit measures, there was no clear clustering solution; model fit just kept improving smoothly as the number of clusters increased. Figure 2 shows a typical result.
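As a rough sketch of the selection rule, the snippet below computes BIC from a model's log-likelihood using the standard definition (number of parameters times the log of the sample size, minus twice the log-likelihood; lower is better) and then looks for a minimum in the curve. The function names and both example curves are made up for illustration and are not my actual results. Only the "minimum" criterion is implemented; detecting a plateau or an elbow would need an extra tolerance that I have left out.

```python
import math


def bic(log_likelihood, n_params, n_samples):
    """Standard BIC: lower values indicate a better fit/complexity balance."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


def preferred_k(scores):
    """Pick the cluster count with the lowest BIC.

    `scores` maps number of clusters -> BIC. Returns None when the lowest
    value sits at the largest k tried, i.e. the fit was still improving
    and the curve offers no evidence for a preferred number of clusters.
    """
    ks = sorted(scores)
    best = min(ks, key=lambda k: scores[k])
    return None if best == ks[-1] else best


# Made-up curves for illustration only.
u_shaped = {10: 520.0, 20: 480.0, 30: 465.0, 40: 470.0, 50: 490.0}
still_falling = {10: 520.0, 20: 480.0, 30: 465.0, 40: 455.0, 50: 450.0}

print(preferred_k(u_shaped))       # a clear minimum
print(preferred_k(still_falling))  # no preferred number of clusters
```

The second curve is the situation I ran into with the encoded data: the measure keeps falling, so any choice of cluster count would be arbitrary.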
To continue down this path, I would have needed to choose the number of clusters more or less arbitrarily, introducing some decision-making bias into the process which I wanted to avoid. So I decided to abandon this method, and try a more traditional approach.
Clustering Sequences using Sequence Attributes
In this approach, instead of using abstract encoded features, I attempted to cluster the sequences on observable attributes. The attributes I used were:
Start location: Pitch x,y co-ordinates
End location: Pitch x,y co-ordinates
Box width: Width of bounding box
Box length: Maximum x co-ordinate - minimum x co-ordinate
Verticality: End x co-ordinate - start x co-ordinate
Number of passes
Total pass length: Sum of individual pass lengths
Total dribble length: Sum of all dribble lengths
Number of different players involved
Average x,y co-ordinate
Zigzag: Sum of changes in pass angles
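To show how attributes like these could be derived from event data, here is a minimal sketch. The input layout (a list of `(kind, player, x0, y0, x1, y1)` tuples) is hypothetical, and several definitions are my assumptions for the purpose of the example: box width is taken as the y-extent of the bounding box, the average co-ordinate is the mean over all action start and end points, zigzag sums the absolute change in direction between consecutive passes, and only players who perform an action are counted.

```python
import math


def sequence_attributes(actions):
    """Compute descriptive attributes for one sequence.

    `actions` is a list of (kind, player, x0, y0, x1, y1) tuples in playing
    order, where kind is "pass" or "dribble". This layout is hypothetical.
    """
    starts = [(a[2], a[3]) for a in actions]
    ends = [(a[4], a[5]) for a in actions]
    points = starts + ends
    xs = [x for x, _ in points]
    ys = [y for _, y in points]

    def length(a):
        return math.hypot(a[4] - a[2], a[5] - a[3])

    passes = [a for a in actions if a[0] == "pass"]
    dribbles = [a for a in actions if a[0] == "dribble"]

    # Direction of each pass in radians. Each turn between consecutive
    # passes is wrapped into [-pi, pi) so that it never exceeds half a turn.
    angles = [math.atan2(a[5] - a[3], a[4] - a[2]) for a in passes]
    turns = [(b - a + math.pi) % (2 * math.pi) - math.pi
             for a, b in zip(angles, angles[1:])]

    return {
        "start": starts[0],
        "end": ends[-1],
        "box_length": max(xs) - min(xs),        # extent along the pitch (x)
        "box_width": max(ys) - min(ys),         # extent across the pitch (y)
        "verticality": ends[-1][0] - starts[0][0],
        "n_passes": len(passes),
        "total_pass_length": sum(length(a) for a in passes),
        "total_dribble_length": sum(length(a) for a in dribbles),
        "n_players": len({a[1] for a in actions}),
        "mean_xy": (sum(xs) / len(xs), sum(ys) / len(ys)),
        "zigzag": sum(abs(t) for t in turns),
    }


# A made-up three-action sequence: pass forward, dribble wide, pass wide.
example = [
    ("pass", "A", 10, 30, 30, 30),
    ("dribble", "B", 30, 30, 30, 40),
    ("pass", "B", 30, 40, 30, 60),
]
attrs = sequence_attributes(example)
print(attrs)
```

Each sequence then becomes one row of numbers, which is the kind of input K-means or a Gaussian mixture model expects.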
The pass sequences were clustered using both K-means and a Gaussian mixture model. No preferred number of clusters could be identified using K-means, but as shown in Figure 3, the BIC fit curve for a Gaussian mixture model suggested 50 clusters would be a reasonable choice.
The 50 clusters are illustrated in the carousel below. Each image shows the three most typical pass sequences in the cluster (i.e. those identified by the clustering algorithm as having the highest membership score for their cluster). The images can be expanded for viewing.
Table 1 lists the characteristics of each cluster.
|Cluster||Percent of Sequences||Percent Shots||No. Passes||No. Players||Start x-Location||End x-Location||Start y-Location||End y-Location||Box Length||Box Width||Verticality||Pass Length||Dribble Length||Zigzag|
Most of the columns in Table 1 are self-explanatory, but the Percent Shots column is the percentage of sequences that are followed by a shot within 8 seconds. So for example, we can see that 2.4% of sequences belong to Cluster 0; sequences in this cluster begin with the goalkeeper, or low down the pitch, and have moderate verticality; only 1% of sequences in this category are followed by a shot. On the other hand, the average sequence in Cluster 5 also originates low down the pitch but has high verticality and is followed by a shot 23.6% of the time.
Many of the clusters illustrate recognisable trajectories: Cluster 1 is a switch of play in the final third, Cluster 5 is an end-to-end attack, Cluster 6 is corners, Cluster 8 is a movement down the flank. Clusters like 33 and 39 represent sustained periods of possession covering large areas of the pitch, while others like Cluster 13 contain compact sequences of a few passes quickly terminated by the opposition.
Validating the Clusters
The next step is to validate the clusters, i.e. determine whether they have any utility. We might expect that different teams display different mixes of clusters. To illustrate this, Table 2 below shows the frequency of Cluster 5 sequences for six selected teams. The differences in percentage are highly significant, and it certainly seems that this cluster at least carries some meaning.
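One common way to check whether a cluster's share differs between teams is a chi-square test of homogeneity on a teams-by-(in cluster, not in cluster) table. The sketch below is an assumption about how such a check could be run, not a record of the exact test behind the statement above, and the team names and counts are invented. It returns the test statistic and degrees of freedom; the p-value would come from a chi-square table or a statistics library.

```python
def chi_square_homogeneity(counts):
    """Chi-square statistic for equal cluster share across teams.

    `counts` maps team -> (sequences in the cluster, total sequences).
    Returns (statistic, degrees_of_freedom). Assumes the pooled share is
    strictly between 0 and 1, otherwise an expected count would be zero.
    """
    total_hits = sum(hits for hits, _ in counts.values())
    total_all = sum(n for _, n in counts.values())
    pooled = total_hits / total_all   # share if all teams behaved alike

    statistic = 0.0
    for hits, n in counts.values():
        cells = (
            (hits, n * pooled),             # in the cluster
            (n - hits, n * (1 - pooled)),   # not in the cluster
        )
        for observed, expected in cells:
            statistic += (observed - expected) ** 2 / expected
    return statistic, len(counts) - 1


# Invented counts for two hypothetical teams.
made_up = {"Team A": (30, 1000), "Team B": (60, 1000)}
statistic, dof = chi_square_homogeneity(made_up)
print(round(statistic, 3), dof)
```

A large statistic relative to the degrees of freedom indicates that the teams' shares of the cluster differ by more than chance would suggest.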
|Team||No. of Cluster 5 Sequences||Total No. of Sequences||Percentage|
To validate the clusters more comprehensively, I conducted a multidimensional scaling (MDS) analysis on the cluster percentages. Essentially, this encodes each team's cluster usage in two dimensions. (I used percentages because using the raw numbers would simply group teams together by amount of possession.) I mapped the EPL teams in Figure 4 below. Teams that are close together have similar scores on both dimensions, and hence similar patterns of cluster usage.
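The point about percentages versus raw counts is easy to demonstrate. The sketch below converts made-up cluster counts for three hypothetical teams into percentage profiles and computes the pairwise Euclidean distances between them, which is the kind of dissimilarity matrix an MDS routine takes as input. It stops short of the MDS step itself, and Euclidean distance is my assumption here.

```python
import math


def usage_percentages(cluster_counts):
    """Convert raw per-cluster sequence counts into percentages."""
    total = sum(cluster_counts)
    return [100.0 * count / total for count in cluster_counts]


def distance_matrix(profiles):
    """Pairwise Euclidean distances between team profiles.

    `profiles` maps team -> list of cluster percentages. Returns the team
    names in sorted order and the matching square matrix of distances.
    """
    names = sorted(profiles)
    matrix = [[math.dist(profiles[a], profiles[b]) for b in names]
              for a in names]
    return names, matrix


# Invented counts over three clusters. Team B has twice Team A's
# possession but the same mix; Team C has a different mix.
raw_counts = {
    "Team A": [50, 30, 20],
    "Team B": [100, 60, 40],
    "Team C": [20, 30, 50],
}
profiles = {team: usage_percentages(c) for team, c in raw_counts.items()}
names, matrix = distance_matrix(profiles)

for name, row in zip(names, matrix):
    print(name, [round(d, 2) for d in row])
```

On percentages, Team A and Team B sit at distance zero despite the difference in possession, whereas on raw counts they would be pulled apart.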
Figure 4. MDS of Cluster usage mapped against Performance
We can see that some teams preserve their location across seasons - others move around but generally stay in the same region of the map for at least two seasons.
More persuasively perhaps, the cluster mapping is congruent with performance. The coloured background shows how goals/match varies with the dimension scores, and we see that the elite Premier League teams are located in the high-performing region of the map. This indicates a relationship between the dimensions (and therefore the pattern of cluster usage) and performance. In fact, goals scored per season increase strongly with increasing percentage of certain clusters (e.g. the percentages for clusters 4, 5, 27 and 43 all correlate at 0.63 to 0.69 with goals scored), and a regression analysis shows that the two dimensions explain 60.0% of the variance.
The Bottom Line
Clustering sequences can help organize the passing patterns of teams into useful categories. However, it will be interesting to try other analytical approaches, which I will look at in a future post.