# Archive: May 2016

In this post I want to look at shot selection.  The aim is to develop a quantitative description of the locations that players shoot from, and to use this description to classify players and compare them with each other.

##### An Identikit for Shooting

There are many ways to divide up a football pitch.  For example, one can simply draw a number of rectangles, and this works reasonably in many cases. But it ignores what happens on the pitch. It’s rather like dividing a face up into squares without caring that it actually consists of features like eyes, nose, mouth and so on. So another method would be to decompose the face into its component features, and use the features as building blocks. That’s what the police identikit system does.

What I attempt to do in this post is to develop a “shooting identikit”, in which each feature is an area of the pitch from which shots are taken. We could then represent players by adding together the features in the appropriate proportions. The problem, of course, is that unlike the facial identikit, we don’t know in advance what the features of the shooting identikit should be. To solve this problem I use non-negative matrix factorization (NMF), a technique for finding and extracting the parts of objects. For example, when presented with images of faces, NMF would learn that faces are made up of a nose, a mouth, two eyes and so on. In our case we want to see if player shooting locations can be similarly represented by a manageable number of pitch areas.
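To make the idea concrete, here is a minimal sketch of NMF using the classic multiplicative-update rules. It is not the code used for this post: the function name `nmf`, the toy data and the iteration count are all assumptions made for illustration. The matrix `V` stands for players (rows) by pitch cells (columns), and the factorisation `V ≈ W @ H` gives the features in `H` and each player's mix of features in `W`.

```python
import numpy as np

def nmf(V, k, n_iter=1000, seed=0, eps=1e-9):
    """Factorise a non-negative matrix V as W @ H with k features.

    W (rows of V x k) holds the weights; H (k x columns of V) holds the features.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + eps
    H = rng.random((k, V.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative updates never change the sign of an entry,
        # so W and H stay non-negative throughout.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy data (hypothetical): 6 players built from 2 hidden "zones" over 8 cells.
true_H = np.array([[1, 1, 1, 0, 0, 0, 0, 0],
                   [0, 0, 0, 0, 1, 1, 1, 1]], dtype=float)
true_W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5],
                   [0.8, 0.2], [0.2, 0.8], [0.6, 0.4]])
V = true_W @ true_H

W, H = nmf(V, k=2)
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {rel_error:.4f}")
```

Because every entry of `W` and `H` is non-negative, features can only be added together, never subtracted, which is what makes the result read like a set of parts.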

##### The Data

I used shot location data from three EPL seasons. Players who had taken fewer than 50 shots were excluded, and so were penalties, free-kicks and shots taken from more than 50 m. out.  This left me with a dataset containing 18,865 shots from 174 players.

Figure 1 shows the locations of these shots, plotted using the OPTA co-ordinates and scaled up to metres using a standard pitch size of 68 x 103 m.
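The scaling step can be sketched as follows. This assumes the OPTA co-ordinates run from 0 to 100 along each axis (a percentage of pitch length and width); the function name `to_metres` is hypothetical, and the pitch dimensions are the ones quoted above.

```python
PITCH_LENGTH_M = 103.0  # pitch length used in this post
PITCH_WIDTH_M = 68.0    # pitch width used in this post

def to_metres(x_pct, y_pct):
    """Convert 0-100 co-ordinates (assumed) to metres along length and width."""
    return x_pct / 100.0 * PITCH_LENGTH_M, y_pct / 100.0 * PITCH_WIDTH_M

# A point on the far goal line, halfway across the pitch.
goal_centre = to_metres(100.0, 50.0)
```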

There isn’t much of a pattern, although if you squint a bit, you can see outlines of the six-yard box, the edge of the penalty area and the D, and maybe the penalty spot. I’m not sure whether these patterns are real, or an artifact of the OPTA coordinate system. (I suspect they are real, because when I plotted penalty kicks using the same method, the locations mapped within a radius of around half a metre.) But it doesn’t matter much, because for the main analysis I use a gridded area rather than actual locations.

##### The Analysis

There were several steps in the analysis.  I started by overlaying the playing area with a 2 m. square grid, and counted the shots in each square.  To illustrate, Figure 2 shows the gridded data for a representative player (Charles N’Zogbia); it consists of 63 shots with between 0 and 3 shots in each grid square.
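A minimal sketch of the gridding step, using `numpy.histogram2d`. The function name `grid_counts`, the 50 m by 68 m extent (chosen to match the 50 m cut-off mentioned earlier) and the four example shots are assumptions for illustration, not the actual data.

```python
import numpy as np

def grid_counts(x_m, y_m, cell=2.0, length=50.0, width=68.0):
    """Count shots per grid square.

    x_m: distance from the goal line in metres; y_m: position across the pitch.
    """
    x_edges = np.arange(0.0, length + cell, cell)
    y_edges = np.arange(0.0, width + cell, cell)
    counts, _, _ = np.histogram2d(x_m, y_m, bins=[x_edges, y_edges])
    return counts

# Four hypothetical shots; the first two fall in the same 2 m square.
x = np.array([5.0, 5.5, 11.0, 30.0])
y = np.array([34.0, 34.5, 30.0, 10.0])
counts = grid_counts(x, y)
```

Passing `cell=1.0` gives the finer 1 m grid with the same code.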

The next step was to smooth the data, which produced a pattern like the one shown below.
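The post does not depend on any particular smoothing method, so the following is only one plausible option: a separable Gaussian blur written with plain NumPy. The kernel width (`sigma`, in grid cells), the `radius` and the function names are assumptions.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """1-D Gaussian weights that sum to one."""
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    return kernel / kernel.sum()

def smooth(grid, sigma=1.0, radius=3):
    """Blur a 2-D grid of shot counts along each axis in turn."""
    kernel = gaussian_kernel(sigma, radius)

    def blur(values):
        return np.convolve(values, kernel, mode="same")

    blurred = np.apply_along_axis(blur, 0, grid)
    return np.apply_along_axis(blur, 1, blurred)

# A single square containing 3 shots, well away from the grid edges.
grid = np.zeros((25, 34))
grid[10, 10] = 3.0
smoothed = smooth(grid)
```

Smoothing spreads each shot over neighbouring squares, so players with only 50 or so shots still produce a continuous surface rather than isolated spikes.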

I then ran the NMF procedure on the smoothed data to determine the optimum number of features (called basis vectors in NMF terminology) for modelling the data. The result was clear: five features (the Fab Five) emerged. I call them Basis Zones in this post. In other words, player shot selection can be described, and players can be compared with each other, in terms of how often they shoot from each of five pitch areas. (Incidentally, I also got the same five basis zones when I used a 1 m. square grid instead of the 2 m. one, so they seem quite robust.)
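One common way to choose the number of features is to fit NMF for a range of values and look for the point where the reconstruction error stops improving. The sketch below illustrates that idea on toy data; it is an assumption about how such a choice could be made, not a description of the procedure actually used here, and the helper `nmf_error` is hypothetical.

```python
import numpy as np

def nmf_error(V, k, n_iter=1000, seed=0, eps=1e-9):
    """Relative reconstruction error of a k-feature NMF (multiplicative updates)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + eps
    H = rng.random((k, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return np.linalg.norm(V - W @ H) / np.linalg.norm(V)

# Toy data (hypothetical): 20 players mixing 3 hidden zones over 12 grid cells.
rng = np.random.default_rng(1)
true_H = np.kron(np.eye(3), np.ones((1, 4)))  # three non-overlapping zones
true_W = rng.random((20, 3))
V = true_W @ true_H

errors = {k: nmf_error(V, k) for k in range(1, 6)}
for k, err in errors.items():
    print(f"k={k}: relative error {err:.3f}")
```

On data like this the error should fall sharply up to the true number of zones and then flatten; on real shot data the elbow is usually less clean and needs some judgement.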

The next set of figures depicts each basis zone as a heat map.

##### Heat maps of the Fab Five Basis Zones

Basis Zone 1 is the area directly in front of goal on the six-yard line. Basis Zone 2 is further back, straddling the 18-yard line. Basis Zones 3 and 4 straddle the 18-yard line to the left and right of goal respectively, and Basis Zone 5 is just outside the penalty area. This way of classifying shots on goal seems to make footballing sense.

The Fab Five zones vary somewhat in size; Zone 1 accounts for the most shots, and  Zones 3 and 4 account for somewhat fewer.

| Basis Zone | % of Shots |
|---|---|
| 1 | 24.5% |
| 2 | 22.9% |
| 3 | 17.3% |
| 4 | 15.4% |
| 5 | 20.0% |

The specific types of shots and outcomes associated with each zone can also be calculated. The table shows some examples.

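| Shot Attribute | Zone 1 | Zone 2 | Zone 3 | Zone 4 | Zone 5 |
|---|---|---|---|---|---|
| Scored | 20.2% | 9.4% | 7.5% | 5.5% | 3.6% |
| On-target | 60.4% | 63.9% | 61.1% | 60.3% | 57.0% |
| Blocked | 17.9% | 30.5% | 28.6% | 29.1% | 31.3% |
| Fast Break | 3.4% | 5.1% | 5.2% | 4.9% | 2.9% |
| Regular Play | 75.1% | 86.4% | 86.1% | 87.6% | 91.5% |
| From Corner | 21.6% | 8.5% | 8.7% | 7.4% | 5.7% |
| Swerve | 3.5% | 12.1% | 14.3% | 19.6% | 21.9% |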

Shots from Zone 1 are the most likely to be converted (20.2%) and shots from Zone 5 the least likely (3.6%). A fairly consistent proportion of shots, around 60%, are on target from each zone. Zone 1 shots are less likely to be blocked, less likely to be produced from regular play, and more likely to occur from corners than shots from other zones. Shots from Zones 4 and 5 are the most likely to swerve. These attributes seem plausible, and give us some confidence that the features we identified reflect the real world.

##### Player Preferences

Next we can look at player preferences for each of the Fab Five zones. The table below shows  the results for a few well-known players, which I’ve selected to highlight their differences. The numbers are weights reflecting the propensity of a player to shoot from each zone, and his preferred zones are shown in blue.

Player Shooting Preferences

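| Player | Basis Zone 1 | Basis Zone 2 | Basis Zone 3 | Basis Zone 4 | Basis Zone 5 |
|---|---|---|---|---|---|
| Robin van Persie | 0.42 | 0.21 | 0.24 | 0.14 | 0.00 |
| Andy Carroll | 0.65 | 0.19 | 0.00 | 0.05 | 0.11 |
| Antonio Valencia | 0.09 | 0.00 | 0.91 | 0.00 | 0.00 |
| Luis Suarez | 0.23 | 0.25 | 0.29 | 0.23 | 0.00 |
| Jermaine Defoe | 0.16 | 0.31 | 0.16 | 0.18 | 0.19 |
| Michael Carrick | 0.09 | 0.14 | 0.23 | 0.02 | 0.52 |
| Charles N'Zogbia | 0.03 | 0.13 | 0.42 | 0.43 | 0.00 |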

The row weights sum to one (up to rounding), and so can be interpreted as percentages. For example, we see that 42% of van Persie’s shots are taken from Zone 1, 21% from Zone 2, 24% from Zone 3 and 14% from Zone 4. Carroll, on the other hand, takes 84% of his shots from Zones 1 and 2, that is, in a central position just beyond the six-yard line or just inside the penalty area. Valencia is pretty much restricted to shots from the right; 91% of his shots come from Zone 3. Suarez displays much more variety, with shots evenly distributed across Zones 1 to 4, while Defoe is even more versatile and uses all five basis zones to some degree, although he does show a preference for Zone 2. Carrick, a midfielder, takes more than half his shots from Zone 5, which lies outside the penalty area. Finally, N’Zogbia, the representative player discussed earlier, shoots mainly from Zones 3 and 4, splitting his shots almost equally between them. These results seem to fit with our intuitive knowledge about these well-known players, but they provide a more precise assessment.
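The arithmetic behind these readings is easy to check. The sketch below copies the weights from the table; the helper names `central_share` and `preferred_zone` are hypothetical.

```python
# Zone weights copied from the Player Shooting Preferences table (Zones 1 to 5).
weights = {
    "Robin van Persie": [0.42, 0.21, 0.24, 0.14, 0.00],
    "Andy Carroll":     [0.65, 0.19, 0.00, 0.05, 0.11],
    "Antonio Valencia": [0.09, 0.00, 0.91, 0.00, 0.00],
    "Luis Suarez":      [0.23, 0.25, 0.29, 0.23, 0.00],
    "Jermaine Defoe":   [0.16, 0.31, 0.16, 0.18, 0.19],
    "Michael Carrick":  [0.09, 0.14, 0.23, 0.02, 0.52],
    "Charles N'Zogbia": [0.03, 0.13, 0.42, 0.43, 0.00],
}

def central_share(name):
    """Share of a player's shots coming from the two central zones (1 and 2)."""
    return weights[name][0] + weights[name][1]

def preferred_zone(name):
    """Zone number (1 to 5) with the largest weight for this player."""
    row = weights[name]
    return 1 + max(range(len(row)), key=lambda z: row[z])

for name, row in weights.items():
    print(f"{name}: sum={sum(row):.2f}, preferred zone={preferred_zone(name)}")
```

The row sums come out at 1.00 or 1.01, the small excess being rounding in the published two-decimal weights.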

To visualize the overall zonal profile for a player, we simply add the basis zones together, weighting them by the preference weights from the Player Shooting Preferences table. Figure 4 shows the zonal profile for our representative player, overlaid with his shot location data from Figure 2. We can see how a player’s zonal profile reveals the structure underlying his shot selection, which otherwise can look like a fairly random pattern of dots.
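In code, the zonal profile is just a weighted sum of the basis-zone maps. The sketch below uses N’Zogbia’s published weights, but the basis zones are placeholder one-cell maps (the real ones are the heat maps above), and the function name `zonal_profile` is hypothetical.

```python
import numpy as np

def zonal_profile(weights, basis_zones):
    """Weighted sum of basis-zone maps.

    weights: shape (k,); basis_zones: shape (k, rows, cols).
    Returns one map of shape (rows, cols).
    """
    return np.tensordot(weights, basis_zones, axes=1)

# Placeholder basis zones: zone z is a 1 x 5 map with a single hot cell.
basis_zones = np.zeros((5, 1, 5))
for z in range(5):
    basis_zones[z, 0, z] = 1.0

nzogbia = np.array([0.03, 0.13, 0.42, 0.43, 0.00])
profile = zonal_profile(nzogbia, basis_zones)
```

With real basis zones in place of the placeholders, `profile` is the surface that would be drawn as the player’s heat map.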

The identikit analysis described here seems quite a useful one.

Obviously, the Fab Five framework uses a fairly broad brush to describe shot selection, and hides some of the detail. But the advantage is that we can see the wood for the trees.  We can describe shot selection, and picture the similarities and differences between players, in terms of a small number of dimensions. This could be useful in scouting and opposition analysis.  We could even model a hypothetical strike force, by combining the zonal profiles of selected players.
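As a rough illustration of those last two ideas, the sketch below compares players with cosine similarity and blends two profiles into a hypothetical strike force. The weights come from the table above; the equal shot volumes, the choice of cosine similarity and the function names are assumptions.

```python
import math

carroll = [0.65, 0.19, 0.00, 0.05, 0.11]
suarez = [0.23, 0.25, 0.29, 0.23, 0.00]
valencia = [0.09, 0.00, 0.91, 0.00, 0.00]

def cosine(a, b):
    """Cosine similarity between two zone profiles (1 means identical shape)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def blend(profiles, shots):
    """Combine profiles, weighting each player by his (assumed) shot volume."""
    total = sum(shots)
    return [sum(p[z] * s for p, s in zip(profiles, shots)) / total
            for z in range(len(profiles[0]))]

# Hypothetical pairing with equal shot volumes.
strike_force = blend([carroll, suarez], shots=[1, 1])
print("Carroll vs Valencia similarity:", round(cosine(carroll, valencia), 3))
print("Carroll + Suarez profile:", [round(w, 3) for w in strike_force])
```

A low similarity flags players who shoot from different areas, which is the sort of contrast a scout might want when looking for complementary forwards.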