Sunday, September 13, 2009

The Poisson Model - a Deeper Look, Part 1

I've been using the Poisson model to make my World Cup qualification predictions. Let's take a closer look at it. I'm not sure how many of you will be interested in this sort of article, since it's a behind-the-scenes kind of thing. I will occasionally make these kinds of posts for a couple of reasons. For one, I think it's necessary to give justification for my predictions. If I were making picks based on my subjective opinion from watching matches I would give some reasoning. The same applies to using a model to make predictions, though in that case it would be justification for the model more than the individual picks. The second reason is that I'll be working to improve the models and writing these posts will help me with that.

Poisson Distribution Overview

The model is based on the Poisson distribution. Roughly speaking, the Poisson distribution describes how many occurrences of something you will have when each occurrence is unlikely but there are a lot of chances for it to happen. This seems applicable to football because there are a lot of minutes (or seconds if you like) in a match and the probability of a goal in any given minute is low.
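To make the "lots of chances, each unlikely" idea concrete, here is a minimal sketch that treats each of 90 minutes as a separate chance to score and compares the resulting binomial probabilities with the Poisson ones. The average of 1.5 goals per match is a made-up number for illustration, not a figure from my data.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k events under a Poisson distribution."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

MINUTES = 90
MEAN_GOALS = 1.5  # assumed average goals per match, illustration only
p_per_minute = MEAN_GOALS / MINUTES  # small chance of a goal in any one minute

for goals in range(5):
    b = binomial_pmf(goals, MINUTES, p_per_minute)
    p = poisson_pmf(goals, MEAN_GOALS)
    print(f"{goals} goals: binomial {b:.4f}, Poisson {p:.4f}")
```

The two columns come out close to each other, and they get closer if you slice the match into seconds instead of minutes.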

The major assumption that the Poisson distribution makes, when applied to football, is that the probability of a goal being scored by a given team is the same every minute. In other words, it doesn't matter what the score is, when the last goal was scored or how much time is left. This assumption greatly simplifies things, but I can't imagine any fan of the game thinking it is correct. When a goal is scored things change for both teams. I suppose it's possible that things perfectly balance to cancel each other out - A attacks more because they are behind but B is more defensive as well, so each has the same chance of scoring as when they were tied. That does not seem plausible though. I'll write more in the future on how the scoreline affects each team's chances of scoring. A preliminary look at a single season of Premier League data suggests that there are significant differences in how frequently goals come for either team with different scorelines.

There are four inputs for the Poisson model: goals scored by each team, goals conceded by each team, a list of which teams played at home and away, and how many goals in total have been scored at home. Note that it does not use results from individual matches. Other than a list of fixtures, all inputs for the model are aggregate. This is possible precisely because of the above assumption. If the chance of a goal does not depend on the scoreline then you don't need to worry about how many goals were scored when the team was ahead or behind or how much time each team spent with the score a certain way.
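For a rough idea of how aggregate totals can be turned into an expected number of goals for one side in one match, here is a sketch of one common approach. It is not necessarily the exact formula my model uses. The team names, season totals, league average and home share below are all invented for illustration.

```python
# Hypothetical season totals, invented for illustration only.
teams = {
    "Home FC": {"played": 38, "scored": 76, "conceded": 38},
    "Away United": {"played": 38, "scored": 38, "conceded": 57},
}
LEAGUE_AVG = 1.35  # assumed average goals per team per match
HOME_SHARE = 0.58  # assumed share of all league goals scored by home sides

def expected_goals(attacker, defender, at_home):
    """Expected goals for `attacker` against `defender` (one possible formula)."""
    scored_rate = teams[attacker]["scored"] / teams[attacker]["played"]
    conceded_rate = teams[defender]["conceded"] / teams[defender]["played"]
    # Scale up for the home side and down for the away side.
    venue = 2 * HOME_SHARE if at_home else 2 * (1 - HOME_SHARE)
    return scored_rate * conceded_rate / LEAGUE_AVG * venue

home_xg = expected_goals("Home FC", "Away United", at_home=True)
away_xg = expected_goals("Away United", "Home FC", at_home=False)
print(f"expected goals: home {home_xg:.2f}, away {away_xg:.2f}")
```

Notice that nothing in the calculation needs the result of any individual match, only season totals and who is at home.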

How Well Does It Describe Football Scorelines?

If you stopped reading there, you'd probably think that the Poisson model is useless because it is based on flawed assumptions. Before you jump to that conclusion, let's look at how well the model explains the game. In future articles I will discuss how well it predicts the future. That is a lot more difficult, in part because the sample sizes are smaller. Here I am discussing how well the model does with a full season of data. In other words, for every season in my sample I use the end-of-the-season information to run the model and then look at how well the model fits the data.

The data I used was the previous 9 seasons in the Spanish Primera Division. To test how well the model fits, I compared the frequency the model assigns to each result (home win, draw, away win) and to each number of goals for the home and away teams with how often each of those actually happened.

Match Outcomes

The first thing I looked at is how often the model indicated the match would result in a home win, draw and away win compared to how often those actually happened. Here is a graph of the predictions. I went with a line graph instead of the more typical bar graph so that you can more easily see how the differences vary by result. The first node is for the home win, the second for a draw and the third the away win. The vertical axis is the percentage of the time they actually occurred or the model claimed they would. The blue line is how often they actually happened, the red line represents the estimates of the model.
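As a sketch of where the model's result percentages can come from: given an expected number of goals for each side, and assuming the two sides' goal counts are independent Poisson variables, you can add up the probability of every scoreline. The expected goals of 1.6 and 1.1 are placeholders, not estimates from my data, and `outcome_probabilities` is just a name I made up for this example.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k goals under a Poisson distribution."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def outcome_probabilities(home_mean, away_mean, max_goals=15):
    """Sum scoreline probabilities into home win, draw and away win.

    Assumes home and away goals are independent. Scorelines with more
    than max_goals for either side are ignored as negligible.
    """
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_mean) * poisson_pmf(a, away_mean)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win

home_win, draw, away_win = outcome_probabilities(1.6, 1.1)  # placeholder means
print(f"home {home_win:.3f}, draw {draw:.3f}, away {away_win:.3f}")
```

Averaging these probabilities over every fixture in a season gives the model's overall percentage for each result, which is the kind of number the red line shows.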

As you can see, the model underestimates the percentage of home wins and draws and overestimates the likelihood of an away win. Because the sample size is pretty large, these differences are statistically significant. The p-value for the chi-squared goodness-of-fit test is less than .02. In other words, looking at the actual results compared to the estimates from the model, we can conclude that the predictions of the Poisson model are off. Despite that, the estimates are pretty good.
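For anyone who wants to see the mechanics of the test, here is a minimal sketch of a chi-squared goodness-of-fit calculation for the three results. The observed counts and model percentages are invented, not my actual figures. With three categories I am assuming 2 degrees of freedom, in which case the p-value has the simple closed form exp(-statistic / 2). A stricter treatment might adjust the degrees of freedom because the model is fitted to the same data.

```python
import math

# Invented numbers for illustration; not the real Primera Division figures.
observed = {"home win": 1680, "draw": 870, "away win": 870}
model_share = {"home win": 0.47, "draw": 0.25, "away win": 0.28}

def chi_squared_statistic(observed_counts, expected_share):
    """Sum of (observed - expected)^2 / expected over all categories."""
    total = sum(observed_counts.values())
    stat = 0.0
    for outcome, count in observed_counts.items():
        expected = expected_share[outcome] * total
        stat += (count - expected) ** 2 / expected
    return stat

def p_value_two_df(stat):
    """Upper-tail probability of a chi-squared variable with 2 degrees of freedom."""
    return math.exp(-stat / 2)

stat = chi_squared_statistic(observed, model_share)
print(f"chi-squared = {stat:.2f}, p-value = {p_value_two_df(stat):.4f}")
```

The larger the sample, the smaller the gap between model and reality needs to be before the test flags it, which is why a model can fail the test and still give pretty good estimates.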


Let's now look at how often the model estimates each number of goals will occur in a match. The first panel is goals for the home team, the second the away team and the last all goals. Again, the blue lines represent the actual frequency and the red lines those predicted by the model.

Looking at the charts, the model overestimates the frequency of 0 goals and underestimates how often teams score 1 goal for both home and away teams. These differences are all statistically significant at the standard 5% level of significance. It also appears that the model underestimates how often the home side scores 2 goals and 3 goals, but more data would be needed to confirm that. Similarly, the model may overestimate how often a team scores 4 or more goals for both home and away teams. The chi-squared p-value is less than .01, so again we can conclude that the model does not fully fit the data.

You can see why the Poisson model overestimates how often the away team wins. The model is further off for the home team than for the away team in the frequency of going goalless. It also seems to underestimate how often the home team scores two goals, which it does not do for the away side. In other words, the model is a bit off whether you look at the home team or the away team, but it is a bit harsher on the home side. I should say that the differences between the home and away errors aren't statistically significant, but things appear that way and statistical significance would require an insanely large sample size.


Looking at the figures, the Poisson model does a pretty good job but is not perfect. Based on the difference between home and away goals, it appears that it is off in a systematic way - it underestimates scoring by a larger margin if the expected number of goals is higher. As a result, it will overrate teams that score few goals, concede a lot and/or are playing away and underrate those that score a lot, are stingy defensively and/or are playing at home. This is good and bad news. On one hand the model being wrong in a systematic way means that certain teams will consistently be rated incorrectly and their chances over or underestimated. On the other hand it means that adjustments can likely be made to improve the model.

In the next article in this series I will tweak the model to hopefully deal with these issues. I will then compare the mid-season predictions made by the vanilla model and the alternative version(s) to see how much we can improve the Poisson model. I think and hope that improvements can be made.
