Wednesday, September 2, 2009

How Well Does Goal Differential Explain Points?

While championships, continental-cup qualification, promotion, relegation and so on are based on league points, much of my analysis is based on goal differential. The Poisson model effectively uses goal differential, since its inputs are goals scored and goals conceded. Any player analysis, whether based on statistics or watching the game, will overwhelmingly be related to how much a player contributes to his team scoring or preventing goals. An important question is how closely goal differential and league points are related. More specifically, how much are points determined by goal differential, and how much are they a result of other factors like luck, tactics and the ability of some players to play particularly well in situations where it matters more?

How Related Are Goal Differential and League Points?

In short: very.

To study this question, I took data from the Spanish Primera Division, English Premier League and Italian Serie A. Because I wanted to look only at seasons in which the given league had 20 teams, the number of years I could use varies by league. For Spain, I looked at all seasons between 1987-1988 and 1994-1995 as well as those from 1997-1998 to last season (2008-2009). From England I included all seasons from 1996-1997 to 2008-2009. Serie A had 18 teams for some time, so I only looked at 2004-2005 to last season. For the first 8 La Liga seasons, only 2 points were given for a win. To keep things consistent, I converted the number of points for each team to what they would have earned under the current system, in which 3 points are awarded for a win and 1 for a draw. I considered removing these seasons from the sample, as well as the 2005-2006 Serie A season due to the match-fixing scandal, but they didn't appear to be different so I left them in.
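The points conversion can be sketched in a few lines of Python. This is only an illustration, assuming each team's wins and draws are known; the function names and the example record are hypothetical and are not taken from the data set.

```python
def points_two_for_win(wins, draws):
    """League points under the old system: 2 per win, 1 per draw."""
    return 2 * wins + draws


def points_three_for_win(wins, draws):
    """League points under the current system: 3 per win, 1 per draw."""
    return 3 * wins + draws


def convert_old_points(old_points, wins):
    """Convert an old-system total: each win is now worth one extra point."""
    return old_points + wins


# Hypothetical record, made up for illustration only.
wins, draws = 20, 10
old_total = points_two_for_win(wins, draws)
new_total = convert_old_points(old_total, wins)

# Both routes to the three-points-for-a-win total must agree.
assert new_total == points_three_for_win(wins, draws)
print(old_total, new_total)
```

Since draws and losses are worth the same under both systems, the conversion only requires adding one point per win.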

Here is a scatter plot with the combined results from all three leagues. The horizontal axis represents the team's goal differential while the vertical axis gives the number of league points they attained.

As you can see, there is a clear relationship between points and goal differential. In fact, the correlation coefficient is an incredible 0.955. The line in the graph is the result of what is called ordinary least squares (OLS) regression. OLS takes the data and comes up with the equation for the line that minimizes the sum of the squared vertical distances between each point and the line. In other words, it derives an equation for the line that best fits the data. In this case, the equation is Points = 51.863 + 0.654*Goal Differential. So a team that scores the same number of goals as it concedes would be estimated to finish the league with about 52 points. A team that scores 10 more goals than it allows would finish with around 58 points. On the other hand, a team that scores 20 fewer goals than it concedes would be expected to finish the season with about 39 points.
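To make the arithmetic concrete, here is a small Python sketch that plugs goal differentials into the fitted line. The intercept and slope are the ones quoted above; the function name is just a label chosen for this example.

```python
# Coefficients from the fitted line: Points = 51.863 + 0.654 * Goal Differential
INTERCEPT = 51.863
SLOPE = 0.654


def expected_points(goal_differential):
    """Estimated season points for a given goal differential."""
    return INTERCEPT + SLOPE * goal_differential


# The three cases discussed in the text: level, +10 and -20.
for gd in (0, 10, -20):
    print(f"goal differential {gd:+d}: {expected_points(gd):.1f} points")
```

Each extra goal of differential is worth roughly two-thirds of a point over a season, so a swing of about 3 goals corresponds to about 2 points.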

For anyone who has worked with data before, something particularly impressive about this OLS regression is the R-squared (hereafter R2). R2, also called the coefficient of determination, measures how much of the variation in the response variable is accounted for by variation in the explanatory variable. Applied to goal differential and points, it tells us how much of the difference in points between teams is accounted for by differences in goal differential. In simpler terms, it says how well points are explained by goal differential alone. For the data in the graph above, the R2 is 0.9117, which is simply the square of the correlation coefficient because there is only one explanatory variable. This means that about 91% of the variation in teams' end-of-season points is explained by their goal differential. Less than 9% is left to be explained by anything else!
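For readers who want to see where these numbers come from, the following is a minimal pure-Python sketch of a one-variable OLS fit and its R2. The goal differentials and points in it are made-up toy values, not the league data used in this article, and the function names are chosen only for this example.

```python
from math import sqrt


def ols_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(xs, ys):
    """Share of the variation in ys accounted for by the fitted line."""
    intercept, slope = ols_fit(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def correlation(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Toy data, invented for illustration: goal differential and points.
goal_diff = [-30, -15, -5, 0, 8, 20, 45]
points = [33, 40, 50, 51, 58, 66, 82]

r = correlation(goal_diff, points)
# With a single explanatory variable, R2 equals the squared correlation.
assert abs(r_squared(goal_diff, points) - r ** 2) < 1e-9
print(ols_fit(goal_diff, points), r_squared(goal_diff, points))
```

The same identity links the two figures reported for the real data: 0.955 squared is about 0.912.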

What is going on here?

The incredibly high correlation and good fit of the linear equation indicate that the timing of goals doesn't have much impact on how many points a team gets in a season. This seems odd, since the (marginal) value of a goal depends heavily on the score. If a team is ahead by 2 or more goals in the dying moments then scoring or conceding a goal makes no difference - they'll get 3 points either way. If they are tied in that situation then a goal for them is worth 2 points and letting one in would cost them a point. If they are down a goal then scoring one is worth a point and conceding one doesn't negatively affect them. On the other hand, no matter what the score is, scoring a goal increases the team's goal differential by 1 and letting one in decreases it by the same amount. Despite the value of a goal changing a lot depending on the situation, the number of points a team gets in a season is almost completely explained by a stat that assumes all goals are equally important.
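The late-game cases above can be written out as a short Python sketch. It assumes the goal in question is the last one of the match, so the margin after it is the final result; the function names are chosen only for this illustration.

```python
def match_points(margin):
    """Points earned for a final goal margin (own goals minus opponent's)."""
    if margin > 0:
        return 3
    if margin == 0:
        return 1
    return 0


def value_of_scoring(margin):
    """Change in points from scoring one more goal at the current margin."""
    return match_points(margin + 1) - match_points(margin)


def value_of_conceding(margin):
    """Change in points from conceding one more goal at the current margin."""
    return match_points(margin - 1) - match_points(margin)


# Ahead by 2, tied, and down by 1, as in the text.
for margin in (2, 0, -1):
    print(margin, value_of_scoring(margin), value_of_conceding(margin))
```

In every one of these cases the goal still moves goal differential by exactly 1, which is why the strength of the season-long relationship is surprising.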

It must be the case that, for the most part, things tend to even out over the course of a season. One game your team may win 3-0, but later in the season they'll lose a couple matches by 2 goals. There is apparently little variation across teams when it comes to how many superfluous goals they score or concede compared to important goals.

Implications for future work

This result is very good news for analyzing the game. For teams, it implies that tools like the Poisson model, which are based on either goal differential or on goals scored and goals conceded, can be very good at estimating how many points a team will earn. For player analysis, it is helpful because it means that if we can somehow sort out a player's contribution to his team scoring and being scored on, then that would pretty directly determine his value. Overall it will simplify things a lot.

What about the other 9%?

This I will discuss in a future article.

The question is essentially whether anything but luck determines how well a team does compared to their goal differential. Is there a skill that causes some players or managers to perform disproportionally better or worse in close games? More succinctly, does clutch exist in football?


  1. Ah right man how ya getting on?

    I've been reading a lot of your stuff lately, it's quite brilliant. I'm currently at university and have to do a stats project. I was thinking of doing a "wins per season" based scenario, with regressors like goals scored/ conceded, cards(red + yellow separately), possibly experience at that level, no years manager at club and my main variable, no players U23.

    Basically I'd like to evaluate Hansen's "You don't win anything with kids" speech!

    I'm not really that familiar with the Probit and logit models, would you have any idea how I might use them to model my data?

    Much appreciated, keep up the writing!

  2. Actually, never mind, just found out I have to do a regression with a binary dependent variable :(

    cheers tho

  3. Thanks for the comment.

    If you need a binary dependent variable then you could maybe do something like what you were suggesting but look at the knockout round of the World Cup or the Euro. One team has to advance so there are only two possibilities. Maybe take some of those stats from the matches and then use logit, probit and just OLS. Something like:
    - being the host
    - shots on target
    - time of possession
    - corners
    - fouls
    - bookings and sendings off

    You then get some coefficients which tell you how important each of these are in winning and you can compare the different types of regressions and what they imply.

    Just an idea. Thanks for reading.

  4. Hey Jared.

    Man, that's actually a brilliant idea. I'd been asking people all evening for ideas!
    That's good, ok I might do this then:

    Progress? = bo +b1host +b2shots on target + b3 possession + b4corners(if I use this I'll probably reference your work on corners at some point) + b5Cards + b6No wins in group stages + ui.

    cool this could actually be pretty sweet!

    Would it make much difference if possession was in minutes or %?

    When I gather all this data I presume I'll have to do it like this : Italy06L16, Italy06QF, Italy06SF, Spain02L16, Spain02QF etc. That's really the only way to do it, right?

    Thanks a million for your help, even so far!

  5. Shouldn't matter how possession is done, it'll just multiply the coefficient by 100/90 or 90/100 depending on the way you go.

    I don't understand your last question. Each side of each match will be its own row with the variables making up the columns. So if Spain played Italy then each would have their own row, whichever one advanced would get a 1 in the progress column and then the other columns would be the relevant country's stat for the game or from the group stage or whatever.

  6. Why the 100/90 or 90/100? I've just put in the values as 47 for a 47% possession figure.

  7. Right that's fine. My point is that the effect will be the same, but the coefficient will change. Say a one-point increase in the possession percentage leads to a 1% increase in the probability of advancing. Then doing as you are, the coefficient will be 1. If you used number of minutes instead then the coefficient would come out slightly different because 1% of a 90-minute match is 0.9 minutes, which is less than a minute. It would have to be 10/9 or about 1.11 instead of 1.

    My point is that even though the coefficient changes, the actual effect is the same it just has a different interpretation.

  8. ooohhhhh right ok, I get ya now. I guess I should take the log of possession then. Thanks.