Tuesday, September 22, 2009

A Look at Predictions

I've been making a lot of predictions using the Poisson model, so now I'll look at how good those predictions are.

Determining how good the predictions are is difficult because the predictions are probabilistic (e.g. 40% chance of a home win, 30% of an away win, 30% of a draw) but the final result is deterministic (the home team won). I certainly welcome suggestions for other good ideas, but here is how I think it is best to analyze how good a given model is at making predictions. The first step is to get a large data set and make predictions using only data available at the time. After that, I put the predictions into bins. All matches in which the model gave the home side less than a 5% chance of winning are put together, all those where the predicted chance of a home win is between 5% and 10% are together, and so on up to those between 95% and 100%. For the matches in each of these bins, Excel determines the average predicted home-win percentage and how often the home team actually won. For each bin I record how far the average prediction was from the actual rate, and the absolute values of these differences are added up. The smaller the total, the better.
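
Here is a minimal Python sketch of the binning procedure described above. It is only an illustration: the analysis in this post was done in Excel, and the function name `calibration_score`, its arguments and the 20-bin default are my own assumptions.

```python
from collections import defaultdict


def calibration_score(predictions, outcomes, n_bins=20):
    """Bin predicted home-win probabilities and compare them with results.

    predictions: predicted home-win probabilities, each between 0 and 1
    outcomes:    1 if the home team actually won that match, otherwise 0
    Returns (rows, total_error); each row is
    (bin lower edge, matches in bin, average prediction, actual win rate).
    """
    bins = defaultdict(list)
    for p, won in zip(predictions, outcomes):
        # Clamp so that a prediction of exactly 1.0 lands in the top bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, won))

    rows = []
    total_error = 0.0
    for index in sorted(bins):
        matches = bins[index]
        mean_predicted = sum(p for p, _ in matches) / len(matches)
        actual_rate = sum(won for _, won in matches) / len(matches)
        # Sum of absolute gaps between prediction and reality: smaller is better.
        total_error += abs(mean_predicted - actual_rate)
        rows.append((index / n_bins, len(matches), mean_predicted, actual_rate))
    return rows, total_error
```

Empty bins are simply skipped, so a bin with no matches contributes nothing to the total.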

If that explanation didn't make sense, I'll show it in graphical form here for the Poisson model. My data set is every season of the Spanish Primera Division from 1999-2000 through last season. I started the predictions early, after the third matchday. Because the predictions don't really work if a team has not yet scored a goal or has not yet conceded one, I did have to throw out some data from the early weeks.

The horizontal axis represents the predicted probability of a home win, while the vertical axis shows how often the home team actually won in the matches with that predicted probability. For convenience, the red line has slope 1 and represents perfect predictions. Here is the graph for the Poisson model:



This is in line with what I have written before: the model does a good job with the predictions when the teams are relatively even, but it overpredicts wins for underdogs and underpredicts them for heavy favorites.

An Improvement: the Poisson-Logit Model

In an effort to improve the predictions, I tried a new technique. The strategy was to take the expected goals for each team that the Poisson model spits out (for the home team, their scoring factor times the away team's defending factor times the home-field factor) and then run a multinomial logit regression on them to get the result probabilities, instead of using the Poisson distribution to estimate them.
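
To make the two routes concrete, here is a rough Python sketch. The expected-goals formula for the home team follows the description above; everything else is an assumption on my part for illustration: the away team's formula (attack times the home defence, with no home-field factor), independent Poisson scorelines, and a logit that uses only the two expected-goals figures with the draw as the baseline outcome. The coefficients would have to be fitted on past results, so they are left as arguments and no values are suggested.

```python
import math


def expected_goals(home_attack, away_defence, home_advantage,
                   away_attack, home_defence):
    """Expected goals for each side from the team factors."""
    home_xg = home_attack * away_defence * home_advantage
    away_xg = away_attack * home_defence  # assumption: no home factor here
    return home_xg, away_xg


def poisson_pmf(k, mean):
    """Probability of exactly k goals under a Poisson distribution."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def poisson_result_probs(home_xg, away_xg, max_goals=15):
    """Plain Poisson route: add up scoreline probabilities by result."""
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_xg) * poisson_pmf(a, away_xg)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win


def logit_result_probs(home_xg, away_xg, home_coefs, away_coefs):
    """Multinomial logit route, with the draw as the baseline outcome.

    Each coefs tuple is (intercept, weight on home_xg, weight on away_xg).
    These are hypothetical fitted values supplied by the caller.
    """
    home_score = home_coefs[0] + home_coefs[1] * home_xg + home_coefs[2] * away_xg
    away_score = away_coefs[0] + away_coefs[1] * home_xg + away_coefs[2] * away_xg
    denom = 1.0 + math.exp(home_score) + math.exp(away_score)
    return math.exp(home_score) / denom, 1.0 / denom, math.exp(away_score) / denom
```

Both functions return (home win, draw, away win). The difference is that the Poisson route derives the result probabilities from the scoreline distribution, while the logit route learns the mapping from expected goals to results directly from historical matches.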

The results of that were fantastic. Here is a graph of the results of the Poisson-Logit Model (hereafter PLM):



There are a couple of nice features in this graph. Firstly, the PLM does not appear to be biased one way or the other. There is a small correlation between the predicted percentage and how far off the predictions are from reality, but that comes only from the blip around 0.8, which is almost certainly an anomaly. Looking at the graph, it seems as though the errors are just the result of noise and the predictions are quite good.

For comparison, I ran a simulation where the predictions were perfect and the sample size of each bin was the same as in the graph above. For example, the data set contains 189 matches where the PLM predicts that the home side will win with probability between 0.7 and 0.75, with an average prediction of 0.726. For the graph below, I used the RAND function in Excel to simulate 189 matches where the home side has a 72.6% chance of winning.
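
The same kind of simulation can be sketched in Python for the one bin used as an example above. This is a stand-in for the Excel approach, not the spreadsheet itself; the function name `simulate_bin` and the seed are arbitrary choices of mine.

```python
import random


def simulate_bin(n_matches, win_probability, rng):
    """Simulate n_matches in which the home side truly wins with the
    given probability, and return the observed home-win rate."""
    wins = sum(1 for _ in range(n_matches) if rng.random() < win_probability)
    return wins / n_matches


rng = random.Random(2009)  # fixed seed only so the run is repeatable
observed = simulate_bin(189, 0.726, rng)
print(f"predicted 0.726, simulated home-win rate {observed:.3f}")
```

With 189 matches and a true probability of 0.726, the standard deviation of the observed rate is about 0.03, so a gap of a few percentage points in a bin of this size is what chance alone would produce.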



From this graph you can see the effect of noise at these sample sizes. In the graph above, the predictions were 100% perfect, but because of variance and the sample sizes involved the blue line does not run exactly along the red line.

Comparing this to the PLM graph, I feel confident in the predictions of the new model. Adding up the squared differences between the predicted and actual percentages in each bin, the PLM actually did slightly better than the simulated perfect predictions! That's obviously just due to the sample sizes involved, as it's impossible to beat perfect predictions, but it's a strong sign that the PLM is very good at making predictions.

From now until further notice I will use the PLM instead of the Poisson model. I am now working on gathering data for other leagues to look at how good the predictions are early on in the year as well as whether or not it's better to eliminate goals in blowouts.
