Wednesday, September 30, 2009

Serie A review - matchday 6

Here were the results from the Serie A last weekend:

Catania 1 - 1 Roma
Chievo 1 - 1 Atalanta
Juventus 1 - 1 Bologna
Lazio 1 - 1 Palermo
Livorno 0 - 1 Fiorentina
Milan 0 - 0 Bari
Napoli 2 - 1 Siena
Parma 0 - 2 Cagliari
Sampdoria 1 - 0 Internazionale
Udinese 2 - 0 Genoa

Three matches warrant further comment. Sampdoria gave Inter their first loss of the season. It was one of those matches where the team that looks better ends up losing. Inter didn't dominate but, despite being without Motta and Sneijder, they were the better side on the day in every category except number of goals. They had the bulk of possession (64-36%), took 6 more shots and put 2 more on target, and picked up just 1 yellow card to Sampdoria's 5. Castellazzi, the Sampdoria goalkeeper, made several big stops. Sampdoria's goal came on what I would describe as a defensive collapse after a bad turnover. The first mistake was Inter LB Davide Santon making an errant pass right to Sampdoria midfielder Angelo Palombo about 30 yards from goal. Palombo made a short run down the right side and cut it back to Mannini just inside the penalty area. Because of the scramble after the turnover, there was no defender within a couple of yards of Mannini, who then passed across to Pazzini on his first touch. Pazzini, despite being about 10 yards from the goal, was similarly unmarked and made no mistake. After the goal Inter had a goal waved off on a legitimate offside call and had a tremendous free kick saved by the Sampdoria keeper.

In their matches, both Juventus and AC Milan could only manage draws at home against mid-table teams. For anybody that has followed the Serie A at all this season, AC Milan's result won't come as a huge shock. After being held scoreless, they have now scored just 3 goals in 6 matches and with just 8 points they find themselves closer to relegation than to the top of the table. There is a lot of football to be played, but it doesn't feel like it's too early to say that this will not be the season that they win their 18th scudetto. Juve allowed a stoppage-time equalizer in their 1-1 draw against Bologna. Bologna currently sit 16th in the table. Last year they finished in 17th, 3 points clear of relegation. This year it is likely that they will be in a relegation battle. It's only one match and Juve aren't anywhere near the kind of situation Milan are, but it's tough to win the league when you can't get 3 points against such teams at home.

Rankings:


GD rank: ranking by goal differential
S Rank: ranking by goals scored
D Rank: ranking by fewest goals conceded
EGD: expected goal differential if you started the season over and every team played all their matches at the level their results thus far have indicated

The table confirms what I said above: AC Milan have not been very good at all so far this season. The story in Naples is pretty similar. Those two teams are surely the biggest disappointments thus far in the campaign. While AC Milan weren't expected to be as good as their San Siro rivals, few predicted them as a mid-table team and they look worse than that if anything. This was supposed to be a breakout year at Napoli, but they are going to have to seriously improve on their results if they want to fight for a spot in Europe. At the top you have no real surprises other than perhaps the absence of those teams. Inter have been easily the best side.

Monday, September 28, 2009

English Premier League - September 28

The obvious place to start this week is with Chelsea, who lost 3-1 at Wigan. Wigan went up 1-0 on a Titus Bramble header off a short corner in the 17th minute. That lead held for the first half. Two minutes into the second half, Malouda made a nice short run and found Drogba through traffic well inside the area. Drogba shot it on his first touch and it looked like the Wigan goalkeeper had it (he certainly should have), but it managed to squirm in. The match really turned on its head 4 minutes later when Petr Cech was sent off after he took down Rodallega, who was running past him after doing the same to the Chelsea defense. Rodallega converted the penalty and Wigan never looked back. Chelsea actually went down to 9 men because they had used up their subs and Cole went off injured in the 86th minute. Wigan got the dagger in stoppage time when Scharner tapped in a cross from Figueroa. That and Burnley 1 - 0 Manchester United are surely the two most surprising results of the short season. This was Wigan's first Premiership win over any of the big four clubs.

The other results quite frankly were not surprising. Manchester United went top of the table with a 0-2 win in Stoke-On-Trent. That kind of win is big as City are looking like an actual Premiership side and will probably get some results against the big clubs at home. Berbatov and O'Shea were the scorers. Arsenal got a similar result with a 0-1 win against Fulham. Van Persie broke the deadlock early in the second half. Liverpool thrashed Hull City 6-1. Torres got a hat trick, Babel got two and Gerrard also made the scoresheet. Keane went one better than Torres, getting 4 goals in a 5-0 win for Spurs over Burnley.

In the match that delayed this report for a day, Manchester City beat West Ham 3-1 on Monday. Tevez put City up early. Petrov got through on the left and near the goal line just inside the penalty area played a great ball across to Tevez who was standing 5 yards in front of the goal. It went just in front of the keeper and would have been harder for Tevez to miss than make. The next few minutes it looked like it would be all over as Tevez had a couple great scoring chances, but neither shot went on goal. Despite getting thoroughly outplayed, West Ham managed to score a 24th minute goal off a set piece when City couldn't clear the ball out of the box. 7 minutes later Petrov put City back in front with a shot to the far post on a set piece. The goalkeeper may not have had a great view, but really should have done better as he got caught cheating to the near post. While there wasn't much in doubt, Tevez sealed it with a 61st minute goal on a header off a set piece. It was close, but in my view he was offside.

Other matches:
Birmingham 1 - 2 Bolton
Portsmouth 0 - 1 Everton
Blackburn 2 - 1 Aston Villa
Sunderland 5 - 2 Wolverhampton Wanderers

Power Rankings


GD Rank: offensive ranking
GA Rank: defensive ranking
EGD: expected goal difference if all teams started over and played a full season at the level of play of the results so far.

A couple teams I want to comment on. Firstly, I think the model underrates Everton. The reason for this is their schedule. They lost 6-1 to Arsenal, and since then have given up 4 goals in 5 matches. The model rates them as the worst defensive team because other than Arsenal they have played teams that are weak offensively. As a result of that, the model sees 10 goals conceded, a number around average, against what is easily the weakest set of opponents in terms of attacking strength that any team in the league has faced. This reveals a drawback of the model I touched on before - it considers all goals equally because it uses aggregate data. I'm still working on a better model on that end of things. Based on the results, I'd say Everton are a mid-table club. Chelsea is a bit surprising at 6th considering they are tied on points for first in the league standings. Their schedule has been the weakest of the top 6, so it makes sense.

If you are a fan of any club other than Manchester United, you can't like the rankings much as they look head and shoulders above the rest of the pack. They are the best attacking and defending team according to the model and have an expected goal difference that is 15 goals higher than second-best Arsenal. They'll definitely slip up and the expected goal differential will be somewhere more reasonable. It's also clear, however, that Arsenal and Liverpool will slow down and their incredible scoring rates will drop as well. I'm certainly not calling the league for Manchester United right now or anything close to that, but they have been the best team in the league over the seven matches.

Weekly Review: Spanish Primera División

Because there is a Premiership match on Monday, I am swapping Britain and Iberia and starting this week off with a review of the Spanish Liga.

The big name matches were on Saturday this week. Real Madrid took care of business with a 3-0 win over the islanders from Tenerife. The match was closer than the score would indicate. Tenerife were certainly in it in the first half and went in even at 0-0. Kaka and Guti came on at the half and that seems to have made a difference. The goals came from Benzema, who got the first two, and Kaka. Barcelona continued their roll with a 0-2 win in Malaga. The Catalans controlled the match from start to finish, with nearly 75% possession. The goals came from Ibrahimovic in the first half and Pique in the second.

Sevilla dominated Athletic de Bilbao 4-0. Renato scored in the fifth minute and they never looked back. Also scoring were Negredo, Kanoute and Jesus Navas. In a much more competitive match, Valencia and Atletico de Madrid finished 2-2. Aguero opened the scoring for Atletico in the 7th minute but goals in the 25th and 27th minutes by Pablo and Villa put the Che out front 2-1. It stayed that way until stoppage time in the second half when Maxi Rodriguez put away a cross from Antonio Lopez with his right foot.

Sunday's results I'll just list:
Almeria 2 - 2 Racing
Osasuna 1 - 0 Sporting
Mallorca 3 - 0 Valladolid
Espanyol 0 - 0 Xerez
Zaragoza 3 - 0 Getafe
Deportivo 1 - 0 Villarreal

Best Result:
There were a lot of big wins this week, but I'll go a little unorthodox and say that Atletico de Madrid getting a draw in Mestalla was the best result of the week. Since qualifying for the Champions League the team has not played well at all. They got destroyed by Barcelona, only managed two draws at home against Racing and Almeria, and got a 0-0 draw at home against the weakest team in their Champions League group. To make matters worse, the club itself has been in total chaos due to issues surrounding the Gil family's ownership. Given the opponent, getting a draw in Valencia was big for Atletico, but things should still be pretty crazy for a while (ever?).

Worst Result:
Once again I will opt not to choose from the list of teams that were blown out, or even go with the other team from above, even though it could be argued that Valencia getting less than a win at home against an Atletico de Madrid in full crisis is a bad result. The worst result of the week, however, surely belongs to Espanyol. Xerez is the overwhelming favorite to finish worst in the league and will get very few points away from home this season. Managing just a draw is pretty close to giving away 2 points to whoever the rivals might be. Espanyol were expected to be a club that could challenge for a Europa League spot. That's going to be impossible to manage with results like yesterday's. I didn't see the match, but all signs point to domination and poor finishing. According to Soccernet they were better than 60/40 on time of possession, got 10 corners compared to only 2 for Xerez and got 16 shots while the Andalusians only managed 4. Luck was obviously a factor, but not getting a win out of that effort is the worst result of the week.

Power Rankings

Here is the first edition of my power rankings. These rankings are based on the results and schedule. The methodology is the same as the Poisson model and the first part of the PLM. I get the scoring and defending factors and use them to determine how many goals for and against each team would have on average if you started a new season and they played at the level shown thus far. Because there have only been 5 matches played for each team these aren't all that reliable as an indicator of which team is best, but in my view they are a decent indicator of how well these teams have played so far.
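As a rough illustration of that idea, here is a minimal Python sketch. It assumes each team has a multiplicative scoring factor and defending factor (a lower defending factor meaning fewer goals conceded), that a side's expected goals in a match are its scoring factor times the opponent's defending factor, with a home-field factor applied to the home side only, and that the new season is a double round-robin. The function name, the home-field value and the factors in the usage lines are hypothetical and are not taken from the actual spreadsheet.

```python
def expected_goal_difference(scoring, defending, home_factor=1.0):
    """Expected goal difference per team over a home-and-away season.

    scoring / defending: dicts mapping team name -> multiplicative factor.
    home_factor: hypothetical multiplier applied to the home side's goals.
    """
    egd = {}
    for team in scoring:
        goals_for = 0.0
        goals_against = 0.0
        for opp in scoring:
            if opp == team:
                continue
            # Home match against this opponent.
            goals_for += scoring[team] * defending[opp] * home_factor
            goals_against += scoring[opp] * defending[team]
            # Away match against this opponent.
            goals_for += scoring[team] * defending[opp]
            goals_against += scoring[opp] * defending[team] * home_factor
        egd[team] = goals_for - goals_against
    return egd


# Made-up factors for a three-team league, purely for illustration.
example_scoring = {"A": 2.0, "B": 1.0, "C": 1.0}
example_defending = {"A": 0.8, "B": 1.0, "C": 1.2}
example_egd = expected_goal_difference(example_scoring, example_defending, 1.3)
```

Because every expected goal scored by one team is an expected goal conceded by another, the values across the league sum to zero, which is a handy sanity check.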



GF rank: the team ranking in expected goals for
GA rank: the team ranking in expected goals against
GD rank: the team ranking in expected goal difference (goals for minus goals against)
EGD: expected goal difference if you started a new season and every team played at the level shown so far.

I think Athletic de Bilbao might be a bit low and certainly both Real Madrid and Barcelona's numbers will drop, though probably not their position in the rankings. Villarreal in the middle of the table is one thing that jumps out. They have performed poorly and currently sit second to last in the table but they have played the toughest schedule in the Liga thus far. Things should improve for them in the next two matchdays as they host Espanyol and then travel to Xerez after the international break. Atletico de Madrid also gets a schedule boost, since they've now played both Barcelona and Valencia, but it's not as significant so they remain near the bottom.

At the top, Madridistas will obviously be happy to find their club ranked above Barcelona. In my opinion, the best indicator of all for them is that they rate second only behind Sevilla when it comes to keeping the other team from scoring. As I've argued, for Real Madrid defense will be the key if they are going to win the league. Barcelona are obviously on a roll as well. As for my Sevilla, the ratings are eerily similar to what we saw last year - very solid defense but not enough scoring. It's hard to knock the scoring of a team with 12 goals in 5 matches. The model does just that because it rates four of their opponents as weak. I think this will change in the next few weeks and their offensive ranking will move up. It'll be interesting to see if Depor and especially Mallorca can keep up this level.

Saturday, September 26, 2009

A Future Schedule

I think it would be best to organize things into a weekly schedule. This will apply to all weeks that have the usual structure where matches are played on the weekend with cup competitions possibly taking place in the middle of the week. It will shift around when World Cup qualifying gets going again.

I tend to write all articles late at night here on the West Coast of the United States. That makes the schedule easier to present because no matter where you are, except perhaps Hawai'i, the articles will be posted on the same day.

Monday the focus will be on England and Scotland. I will review the week in the English Premier League, the Scottish Premier League and keep you updated on the Football League Championship, which is the second flight in England.

Tuesday I will move to the Iberian Peninsula and cover the Spanish Primera División and often the Portuguese Liga and Spanish Segunda División.

Wednesday I will cover the Italian Serie A as well as the Serie B on occasion. The plan for now is to also cover another league or two on some kind of rotating basis. These will include the MLS, the Primera División de México, the Primera División de Argentina, the Dutch Eredivisie and the Brazilian Série A. If you want to see one of these leagues covered regularly then let me know by leaving a comment. Same goes if you want me to cover a league I haven't mentioned here.

Thursday will be the time to discuss the German Bundesliga and the French Ligue 1. I will also cover the second division for both of those at times.

Friday I will post my weekend preview, focusing on what I feel is the match of the week.

International Competitions

World Cup qualification resumes October 10th. They also play qualifying matches October 14th. After next week I will again shift my focus to national teams and discuss qualification from each region. I will also provide an update for each confederation in between those two matchdays and discuss the playoff round after.

Champions League and Europa League

I am slowly but surely adding UEFA countries to my spreadsheets (which at some point will hopefully be converted to databases) so that I can properly analyze the continental competitions. I won't get there by this week for sure, but can probably get there by the third matchday. Weeks where there are matches I will be posting a Champions League preview, similar to my weekend preview, each Monday and a review on Thursday. Time permitting I will also make a Europa League post on Tuesday.

The articles discussing each league will feature my power rankings, which I will unveil next week starting with the Premier League. Using the same technique as the Poisson model, the power rankings take into account schedule and results to determine a ranking of teams. Once I have finished adding UEFA countries, I will also give weekly rankings of all European clubs.

Friday, September 25, 2009

Clássico Injury Update and a Correction

There are unfortunately more injuries to report.

For Porto, not only is Varela going to be out as I mentioned above, but Christian Rodriguez is also not able to play. He is a forward who has made three appearances as a sub thus far in the league and he started against Chelsea. He most certainly would have replaced Varela had he been able.

For Sporting, Marco Caneira was apparently hurt today in training. He had started one Liga match and come on as a sub in two others as well as starting a Europa League match. It's also being reported that they will be without defenders Pedro Silva and Andre Marquez.

It seems that Porto have injury problems up top while Sporting do at the back. Tough to say how that will affect things.

I also need to issue a correction. I said both teams would likely get out of the group stage of the Champions League. This is, of course, incorrect. Sporting were eliminated by Fiorentina on away goals in the playoff. They will be playing in the Europa League - formerly known as the UEFA Cup. They are surely huge favorites to get out of that group stage.

Thursday, September 24, 2009

Weekend Preview: the Clássico (Porto - Sporting Lisbon)

The biggest match of this weekend for the neutral fan is in the Portuguese Liga: F.C. Porto - Sporting Lisbon. The match kicks off at 7:30 PM local time, 8:30 for the central Europeans, 2:30 Eastern in the United States. Unfortunately for Americans, the game will not be broadcast here, at least through traditional means.

A look at the last four years shows why this should be a great match. Each of those seasons, Porto won the league and Sporting finished second. Two of those seasons Porto finished well clear but in two others the point gaps were just 4 points (last season) and a single point (2006-2007).

Summer Transfers

This summer was very big, particularly for Porto. Unfortunately for Porto fans, the traffic was overwhelmingly outgoing; they lost some big names. The biggest transfer was Argentine winger Lisandro López, who went to Lyon for 24 million euros. He scored 16 goals for them in 35 matches in all competitions. Perhaps more important was the loss of Lucho González. The influential captain, who usually played in right midfield, was sold to Marseille for 18 million euros. The other big name was Aly Cissokho, who also went to Lyon. He got them 15 million euros. The left back only played 19 matches for Porto last season. To replace them and the other players that left they went the usual route and brought in lower-profile players, including several from South America. The only name I recognized in the list was Uruguayan left wing back Álvaro Pereira, though apparently Falcao, the Colombian forward from River Plate, had garnered the interest of other European clubs. Since he's already scored 4 goals this young season, that's not looking like a bad deal at under 4 million euros.

Down in Lisbon they were far less busy. The only players that left were lost because their contracts were up. Two of those played a significant amount: right midfielder Fábio Rochemback and forward Derlei. Both Brazilians played 18 matches for them last season, Derlei put in 7 goals. To help their attack they brought in attacking midfielder Matías Fernández who was at Villarreal.

Location, Location, Location

As you can see from the listing, the match is going to be in Porto. Last year Porto actually played better on the road. They were 9-5-1 at home and 12-2-1 away. They also scored 15 more goals, exactly one per match, away from home! While I'm sure they are a better team at home, that shows you how weird things can look when you have small samples at play. Sporting were a very good away team last year, second only behind Porto in that category. In comparison to their home matches, they appear to have been more defensive, they scored 13 fewer goals and allowed 6 fewer. Not much to read into it for either team. Porto should have an edge, but if anything smaller than what one might expect for a home side.

Form, Injuries and Suspensions

Despite the teams having the same records 5 matches into the season, they seem to be in different places as far as form goes. After drawing in their first match Porto won three in a row but lost last Saturday in Braga. They also lost 1-0 at Stamford Bridge the midweek prior. Sporting started out with a draw and then lost at home against the same Braga team that would later beat Porto. Since then they have won their other three matches. Last Monday was particularly confidence boosting as they won after being down 0-2 to Olhanense. While it may be a bit of a reach, I'd give the slight edge to Sporting here. It may be a little weird given that we're talking about one loss, two if you count the one to Chelsea, but it's also weird that players would be asked if the team is in crisis because of it, as has happened here.

Both teams are relatively injury free. Porto will be without the services of Varela. The left winger, who arrived at the club this last summer, had played in all five matches this season. Sporting will not have left forward Postiga, who strained his thigh last Monday. They will also not have Caicedo, who sprained his knee when he came on as a substitute for Postiga.

Predictions

As last week showed, the model probably isn't all that reliable at this stage since not a lot of matches have been played - just 40 for the entire league thus far and 5 for each club. So take this with a grain of salt.

The predictions of the PLM are drastically different than what we saw last week. While it had the two Manchester clubs rated as having excellent defense and good attacking power, it rates both these teams as very strong at scoring and mediocre at preventing goals. Porto rank second best at scoring and Sporting just behind them. In case you're curious, Benfica is the team rated as having the best attack. Defensively Porto rank 8th and Sporting 10th. As a result, the model predicts a fun, high-scoring match. It also suggests that Porto should win. It gives them a 78% chance, Sporting only a 9% chance with just a 13% chance of a draw. The most probable scoreline according to the prediction is 2-1 to Porto.

Personally, I have no prediction. To be honest, I have not seen either team this season. I'm hoping I can, ahem, find it somewhere either during or after because it should be good and I'll be interested to see how these teams look given that they may well both get out of the Champions League group stage. While Benfica and Braga have gotten good results, I suspect these two will take the top places again in some order. If you pinned me down and asked for a prediction, I'd just go with the PLM one and say 2-1 Porto.

Tuesday, September 22, 2009

A Look at Predictions

I've been making a lot of predictions using the Poisson model. Now I'll look at how good those predictions are.

Determining how good the predictions are is difficult because the predictions are probabilistic (i.e. 40% chance of a home win, 30% of an away win, 30% of a draw) but the final result is deterministic (the home team won). I certainly welcome suggestions if you have other good ideas, but here is how I think it is best to analyze how good a given model is at making predictions. The first step is to get a large data set and make predictions using only data available at the time. After that, I put the predictions into bins. All matches in which the model gave the home side less than a 5% chance of winning are put together, all those where the predicted chance of a home win is between 5 and 10% are put together, and so on up to those between 95% and 100%. For each of these bins, Excel determines the average predicted home-win percentage and how often the home team actually won. For each bin I record how far off the prediction was, and the absolute values of these numbers are added up. The smaller the total, the better.
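The binning steps above can be sketched in Python roughly as follows. This is only an illustration of the procedure, not the spreadsheet itself: the function names are made up, predictions are assumed to be probabilities between 0 and 1, and a prediction that falls exactly on a bin edge is simply placed by truncation.

```python
def calibration_bins(predicted, home_won, n_bins=20):
    """Group matches by predicted home-win probability (20 bins = 5% each).

    Returns one row per non-empty bin:
    (average predicted probability, actual home-win rate, number of matches).
    """
    sums = [0.0] * n_bins
    wins = [0] * n_bins
    counts = [0] * n_bins
    for p, won in zip(predicted, home_won):
        i = min(int(p * n_bins), n_bins - 1)  # a prediction of 1.0 goes in the top bin
        sums[i] += p
        wins[i] += 1 if won else 0
        counts[i] += 1
    rows = []
    for i in range(n_bins):
        if counts[i]:
            rows.append((sums[i] / counts[i], wins[i] / counts[i], counts[i]))
    return rows


def total_absolute_error(rows):
    """Add up how far off each bin's average prediction was."""
    return sum(abs(avg_pred - actual) for avg_pred, actual, _ in rows)


# Six made-up matches: four predicted at about 42-44%, two at about 71-72%.
rows = calibration_bins(
    [0.42, 0.44, 0.43, 0.41, 0.71, 0.72],
    [True, False, True, False, True, True],
)
error = total_absolute_error(rows)
```

With so few matches per bin the error is mostly noise, which is why a large data set is needed before the total means much.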

If that explanation didn't make sense, I'll show it in graphical form here for the Poisson model. My data set is every season of the Spanish Primera Division from 1999-2000 through last season. I started the predictions early - after the third matchday. Because the predictions don't really work if one team has either not yet scored or conceded a goal, I did have to throw out some data from the early weeks.

The horizontal axis represents the predicted percentage of a home win, while the vertical axis represents the percentage of the time that the home teams for a given prediction probability actually won. For convenience, the red line has slope 1 and represents perfect predictions. Here is the graph for the Poisson model:



This is in line with what I have written before: the model does a good job with the predictions when the teams are relatively even but overpredicts wins for underdogs and underpredicts them for heavy favorites.

An Improvement: the Poisson-Logit Model

In an effort to improve the predictions, I tried a new technique. The strategy was to take the expected goals for each team that the Poisson model spits out (for the home team their scoring factor times the away team's defending factor times the home-field factor) and then run what is called multinomial logit to get the result probabilities, instead of applying the Poisson model to estimate them.
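To make the starting point concrete, here is a small Python sketch of turning expected goals into result probabilities the plain Poisson way. It assumes each side's goals are independent Poisson variables, caps the score grid at 15 goals per side, and the factor values in the usage lines are made up. The function names are hypothetical, and the multinomial logit step is not shown because it has to be fitted on historical data.

```python
import math


def expected_goals(scoring, opp_defending, home_field=1.0):
    """Scoring factor times the opponent's defending factor times home-field factor."""
    return scoring * opp_defending * home_field


def poisson_pmf(k, mean):
    """Probability of exactly k goals when the expected number is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def outcome_probabilities(home_mean, away_mean, max_goals=15):
    """Return (home win, draw, away win) by summing over every scoreline."""
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_mean) * poisson_pmf(a, away_mean)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win


# Made-up factors: home side scores 1.5, visitors defend 0.9, home-field 1.2.
home_mean = expected_goals(1.5, 0.9, 1.2)
away_mean = expected_goals(1.1, 1.0)
probabilities = outcome_probabilities(home_mean, away_mean)
```

The new technique keeps the expected goals but swaps that last function for a fitted multinomial logit.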

The results of that were fantastic. Here is a graph of the results of the Poisson-Logit Model (hereafter PLM):



There are a couple of nice features in this graph. Firstly, the PLM does not appear to be biased one way or the other. There is a small correlation between the predicted percentage and how far off the predictions are from reality, but that's only from the blip around 0.8 that is almost certainly an anomaly. Looking at the graph, it seems as though the errors are just the result of noise and the predictions are quite good.

For comparison, I ran a simulation where the predictions were perfect and the sample sizes of each bin were the same as those for the graph above. For example, the data set contains 189 matches where the PLM predicts that the home side will win with probability between 0.7 and 0.75, with an average prediction of 0.726. For the graph below, I used the RAND function in Excel to simulate 189 matches where the home side has a 72.6% chance of winning.
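The same kind of simulation can be sketched in Python instead of Excel. The assumptions here are that Python's random module stands in for RAND, and that the seed and function name are arbitrary choices made for this example.

```python
import random


def simulate_bin(n_matches, win_probability, rng):
    """Simulate one bin with perfect predictions.

    Each match is a home win with exactly `win_probability`; the return value
    is the share of simulated matches the home side won.
    """
    wins = sum(1 for _ in range(n_matches) if rng.random() < win_probability)
    return wins / n_matches


# The bin from the text: 189 matches with an average prediction of 0.726.
rng = random.Random(2009)  # fixed seed so the run can be repeated
simulated_rate = simulate_bin(189, 0.726, rng)
```

Running this with different seeds gives a feel for how far a bin of that size can drift from 72.6% through variance alone.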



From this graph you can see the effect of noise with this sample. In the graph above, the predictions were 100% perfect but because of variance and the sample sizes involved the blue line does not just run along the red line.

Comparing this to the graph above, I feel confident in the predictions of the new model. Adding up the squared differences between the predicted and actual percentages in each bin, the PLM actually did slightly better than perfect predictions! That's obviously just due to the sample sizes involved as it's impossible to beat perfect predictions, but it's a strong sign that the PLM is very good at making predictions.

From now until further notice I will use the PLM instead of the Poisson model. I am now working on gathering data for other leagues to look at how good the predictions are early on in the year as well as whether or not it's better to eliminate goals in blowouts.

Monday, September 21, 2009

Manchester Derby Post-Match Summary and Thoughts

Having just finished rewatching the match, here is a summary of the derby and some thoughts.

For those who didn't see it, the match got off to an extremely shaky start for City. Bridge had a horribly headed clearance in the first minute. The next minute was even worse as Evra ran into the box completely unmarked on a throw-in. He took the ball down to the goal line and played it back in to Rooney at the corner of the six-yard box. There were a couple defenders near him, but he had space to play the ball between their desperate sliding challenges and bury it. For the next 12 or so minutes, it went back and forth with United looking the better of the two sides. In the 16th minute, United keeper Foster tried to bring a ball from outside the box back into the area to pick it up, and Carlos Tevez stole the ball off him. Tevez played the ball back to Barry, who put it between the near post and Vidic, who was standing where you would normally see the goalkeeper. For the rest of the first half things were pretty even and each team failed to convert a clear chance, with Berbatov heading over the bar right in front of goal and Tevez hitting the post from inside the box.

The second half started similarly to the first. United scored three minutes in on a header by Fletcher on a floating cross from Giggs. A few minutes later after Given made a nice save on a great scoring chance by Giggs, City drew level on a fantastic shot by Bellamy from just outside the area. The next 28 minutes United completely dominated getting several clear chances. Given made three big reaction saves stopping two Berbatov headers and a very powerful Giggs shot from the edge of the area. United finally broke through in the 80th minute on another header by Darren Fletcher putting in a cross from Giggs - this time off a set piece. City didn't look like equalizing until the 89th minute when Rio Ferdinand played an absolutely terrible ball right to Ireland (I think, couldn't quite tell) who played it through to Bellamy who sprinted past Ferdinand and slotted it home behind Foster. Deep into stoppage time (some would argue too deep, see below) after a free kick was cleared, Giggs spotted substitute Michael Owen open in the box. Owen put it away and sealed the 4-3 win for United.

Initial Impressions

As I mainly follow the Spanish league, I don't usually watch a lot of Premier League games. I suspect that will change due to writing this blog. When I do, I'm always impressed with the atmosphere and this match was no exception. In Spain the singing and chanting is mostly done by the home side's ultras, whereas it seems to be everyone in England. Another common thought I have when watching English matches is how much more direct the style of play is compared to Spain. As for the match itself, I thought both teams made a lot more mistakes than I had expected. All four of United's goals featured awful defending by City. Evra was allowed to just run into the area unmarked on a throw-in, Fletcher twice wasn't properly marked, nor was a defender in good position to clear the cross, and Owen was left alone in front of goal with two defenders 5 yards away from him ball watching. City's first goal came as a result of a goalkeeper mistake that should not happen even at the youth level. Foster tried to dribble the ball into the area to pick it up when he clearly should have just banged it out. When Tevez is on the other team you can't mess around back there because the guy doesn't stop chasing the ball. The second goal for City was the only one of the match that didn't result from a bad play by the other team. City's third goal came as a result of Ferdinand giving away the ball and then failing to chase down Bellamy. As I said in the preview, I think these teams are both good defensively. United last year were in my view the best in the world defensively. It was pretty shocking to see that kind of display.

On the positive side, it's clear that both sides are very good. I have a lot of respect for Rooney and Tevez. I can't think of two harder-working attacking players. I think City will be a lot better once Adebayor comes back and they can play Tevez in a deeper role. Bellamy obviously had a great game as well. For United, I thought Evra was particularly strong down the left. He also pressured Tevez late in the half and I'm sure played a role in his shot going into the bar instead of the goal. To be honest I didn't think Giggs was all that good in the match but he played a huge part in the last three United goals with two crosses that were headed in and then picking out Owen for the last goal. If I had to choose a man of the match it might be Shay Given. It's strange to give it to a goalkeeper when his side gives up 4 goals but I don't think you can say any were his fault and he kept his team in the match during the second half which United otherwise dominated. Obviously Fletcher is the easy choice with two goals in his team's win.

As far as the referee goes, I think he did a good job overall. Other than the timing controversy at the end, he was under the radar, which is what you're looking for. Something that needs mentioning is that the assistant referee should have signaled that Berbatov was offside on a great scoring chance in the first half. Berbatov put it over but was in great position and would have scored if he'd put it almost anywhere on target. The replay showed that he was well offside. Since it was a set piece, it should have been an easy call. It's the sort of thing that isn't considered a big deal because the ball didn't go in, but not getting that call right is terrible because a goal is scored in that situation frequently. We shouldn't give him a pass because Berbatov missed the header.

That brings us to the controversy. In the 89th or 90th minute the fourth official signaled four minutes. Michael Owen's goal came a full five and a half minutes into stoppage time. Here's a rundown of what happened in the last stages:
90:00 - Bellamy scores for Man City
92:00 - Carrick is subbed on for Anderson
94:45 - a United player is fouled about 7 yards on the United attacking side of the center line
95:13 - Rooney takes a free kick
95:15 - it is cleared by Stephen Ireland between the penalty area and the 10-yard arc for penalty kicks
95:17 - it is headed further by a City player which I think is Tevez. Given the grass patterns he appears to be 36 yards out.
95:19 - Rooney sends it back into the area
95:22 - the ball is headed out by a City defender just outside the area, it goes to Giggs roughly 33 yards out
95:23 - Giggs plays the ball to Owen who is open in the area
95:27 - Owen scores

The amount of stoppage time to be added is at the sole discretion of the referee. In my view he should have added an extra minute to stoppage time as a result of the goal and substitution. I also think he should have extended the match to allow United to take the free kick as Rooney did at 95:13. I think that is pretty standard and most referees would have done the same. I do think he should have ended the match once that free kick was cleared, and I think it would be called there most of the time by other referees. If not there then the second clearance at 95:22 would have been the standard place to whistle it dead. Obviously everything would be different with a different referee, but I think well over half the time a random ref would have ended the match before the last goal.

Having said that, I don't think it was completely outrageous. As I said, it's up to the referee's discretion. There are differences when it comes to how much time individual refs add. Also important is that he may originally have wanted to add something closer to four and a half minutes than four. There's nothing in the rules saying that an integer number of minutes must be added, but they do only display whole numbers on the board. If that's the case, then you can throw out everything I said above because the extra goal and substitution would push stoppage time up to 95:30 instead of 95:00. I am a traditionalist and would like for the timing of the match to remain as it is now, but controversies like this are the downside of having this timing system instead of something like what is used in basketball.

Sunday, September 20, 2009

Why Did the Poisson Model Get the Manchester Derby So Wrong?

I'll warn you in advance that this article is very much a thought-process piece. I work through why the Poisson model made the predictions it did and what potential issues there could be. If you'd rather not see how the sausage is made then I suggest you pass on this one and wait for my next article, which should be out tomorrow (tonight if you're in Europe). In that article I will actually discuss the match itself and give my subjective thoughts.

The Poisson predictions for the Manchester derby seem to have been pretty far off - the model predicted a low-scoring affair with 0-0 the most likely result. It's possible that it was a fluke and if these teams played a million times they would mostly be low scoring. It's also early in the season, and the predictions of the model with just four or five matches played probably aren't all that reliable. The small sample does help, though, as it makes it easier to work through how the model works.

Where did the low-scoring prediction come from?

The model takes into account home-field advantage and higher order relationships with the schedule - how good your opponents' opponents are and so on, but I will simplify things by looking only at how well these teams did against their opponents compared to other teams.

Before this match began (when the prediction came about), Manchester United had played 5 games. Their opponents were Arsenal, Birmingham, Burnley, Tottenham and Wigan. Against those opponents they averaged a respectable 2.2 goals per match and conceded just 0.6. Man City had played against Arsenal, Blackburn, Portsmouth and Wolverhampton. They averaged 2 goals for and 0.5 goals against.

Let's look at how many goals these teams scored and allowed on average before this weekend against their other opponents:

Manchester United (2.2 for 0.6 against) versus:
Arsenal - 4 goals for, 2 against per match
Birmingham - 0.5 for 0.75 against
Burnley - 0.25 for 2.5 against
Tottenham - 2.75 for 1 against
Wigan - 1 for 1.75 against
Average - 1.7 for 1.6 against

Manchester City (2 for 0.5 against) versus:
Arsenal - 3.67 for 1.33 against
Blackburn - 1.33 for 1 against
Portsmouth - 0.75 for 2.25 against
Wolverhampton - 0.75 for 1.5 against
Average - 1.625 for 1.52 against

Note that Arsenal's numbers differ between the two lists: United's list includes Arsenal's match against City, and City's list includes Arsenal's match against United. Note also that the average team right now scores (and concedes) about 1.42 goals per match.

The point here is that the Manchester clubs played opponents that, taken collectively, scored and conceded at a rate above the league average. This made their well below average goals conceded look very impressive. It also made their strong scoring numbers look a bit less impressive. As a result, the model concluded that two elite defensive teams that are good at scoring were playing a match. Two teams that are world-class defensively and good but not great at scoring when compared to the best clubs in the world is a common situation in the knockout stage of the Champions League. The outcome tends to be what the model suggested of this match - both legs of the tie tend to be low-scoring and often one of them is 0-0.
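The schedule averages above are easy to reproduce. The sketch below is only an illustration of that arithmetic, not the model itself: the per-opponent figures are copied from the lists above, while the function names and the "relative rate" ratios are my own illustrative choices.

```python
# Goals for / against per match of each opponent in their other matches,
# copied from the two lists above.
united_opponents = {
    "Arsenal": (4.0, 2.0),
    "Birmingham": (0.5, 0.75),
    "Burnley": (0.25, 2.5),
    "Tottenham": (2.75, 1.0),
    "Wigan": (1.0, 1.75),
}
city_opponents = {
    "Arsenal": (3.67, 1.33),
    "Blackburn": (1.33, 1.0),
    "Portsmouth": (0.75, 2.25),
    "Wolverhampton": (0.75, 1.5),
}


def schedule_average(opponents):
    """Average goals scored and conceded per match by a set of opponents."""
    n = len(opponents)
    avg_for = sum(gf for gf, _ in opponents.values()) / n
    avg_against = sum(ga for _, ga in opponents.values()) / n
    return avg_for, avg_against


def relative_rates(team_for, team_against, opponents):
    """Compare a team's rates with what its opponents usually allow and score.

    Scoring is measured against what the opponents concede on average;
    conceding is measured against what the opponents score on average.
    """
    opp_for, opp_against = schedule_average(opponents)
    return team_for / opp_against, team_against / opp_for


# United averaged 2.2 for and 0.6 against; City 2.0 for and 0.5 against.
united_scoring, united_conceding = relative_rates(2.2, 0.6, united_opponents)
city_scoring, city_conceding = relative_rates(2.0, 0.5, city_opponents)
```

A conceding ratio well below 1 combined with a scoring ratio only somewhat above 1 is what makes both clubs look like elite defenses with merely good attacks.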

So that explains why the model (naturally I'll be saying "the model" made the predictions when they went like this and that "I" made the predictions when they are spot on) thought this would be a low-scoring match: it was two teams that are incredibly good at stopping the other team from scoring and merely good at scoring themselves. That raises another important question: should we really conclude that those adjectives describe these teams based on the results? That's a tough call. On one hand, these teams did allow roughly a third of the goals that their opponents' opponents did while scoring only 50% more. That certainly points to great defense and good offense.

On the other hand, Arsenal dominates much of the opponents' average goals scored and they each played a very weak team that is responsible for a large percentage of the goals their opponents have allowed. Of Manchester United's 5 matches, 2 were against teams that, in their small sample of other matches, had not been very good at scoring having scored just half a goal and 3/4 of a goal per match. They played two good attacking teams, Arsenal and Spurs, and a team a little below average scoring. City faced Arsenal, a team about average in scoring and two teams that averaged only three-quarters of a goal per match. Looking at those numbers, it seems that each team had a couple soft spots on their schedule where a decent team would be capable of keeping the sheet clean a lot of the time. For both clubs their goal-conceding rates were impressive against those opponents, but not necessarily amazing. This is a place where sample size kicks in because any team can go on a four or five match run where it seems impossible to score on them. That's not my point though. I'm saying that if you look at the result of individual matches instead of aggregating it all, their defending records don't look quite as impressive.

The story at the other end is similar. Both teams played some fairly stingy opponents, but a couple teams that are bad defensively push the aggregate numbers up. It's not as bad on that end though.

Suggestions for a Better Model

What I've described is a simplification of what the model actually does. It doesn't just average the goals allowed by the opponents. It does something pretty close though. Each team's scoring factor is their total goals scored divided by the sum of their opponents' defensive factors. For that sum, the opponents they played at home have their factor multiplied by the home-field factor. As a result, a team playing against a good opponent and a weak one will appear to have a similarly difficult schedule as a team that plays two mediocre opponents.
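Here is a minimal sketch of how factors of this kind can be computed by repeated substitution. It is an assumption-laden illustration rather than the actual implementation: the match list is made up, the home-field factor is fixed at a hypothetical 1.3 instead of being estimated, and the defensive factor is assumed to mirror the scoring factor (goals conceded divided by the sum of the opponents' scoring factors).

```python
# Hypothetical results: (home team, away team, home goals, away goals).
matches = [
    ("A", "B", 2, 1), ("B", "A", 1, 1),
    ("A", "C", 3, 0), ("C", "A", 1, 2),
    ("B", "C", 2, 1), ("C", "B", 0, 1),
]


def fit_factors(matches, home_factor=1.3, iterations=100):
    """Alternate between updating scoring and defensive factors."""
    teams = sorted({team for m in matches for team in m[:2]})
    scored = {t: 0 for t in teams}
    conceded = {t: 0 for t in teams}
    for home, away, hg, ag in matches:
        scored[home] += hg
        conceded[home] += ag
        scored[away] += ag
        conceded[away] += hg

    attack = {t: 1.0 for t in teams}
    defense = {t: 1.0 for t in teams}
    for _ in range(iterations):
        # Scoring factor: goals scored / sum of opponents' defensive factors,
        # with the factor boosted when the team was playing at home.
        denom = {t: 0.0 for t in teams}
        for home, away, hg, ag in matches:
            denom[home] += defense[away] * home_factor
            denom[away] += defense[home]
        attack = {t: scored[t] / denom[t] for t in teams}

        # Defensive factor (assumed symmetric): goals conceded / sum of
        # opponents' scoring factors, boosted when the opponent was at home.
        denom = {t: 0.0 for t in teams}
        for home, away, hg, ag in matches:
            denom[away] += attack[home] * home_factor
            denom[home] += attack[away]
        defense = {t: conceded[t] / denom[t] for t in teams}
    return attack, defense


def expected_goals(attack, defense, home, away, home_factor=1.3):
    """Expected goals for the home and away side in a single match."""
    return (attack[home] * defense[away] * home_factor,
            attack[away] * defense[home])
```

Because only the sum of the opponents' factors enters each update, one strong and one weak opponent can produce the same denominator as two mediocre ones, which is exactly the schedule-blindness described above.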

This makes worse the previously discussed simplification that all goals are equally important according to the model. While I have argued that over the course of a season goal differential is a very strong indicator of points, it seems bad in the short run to rate two teams equal when one wins 4-0 against a bad team and loses 0-2 versus a good team and the other gets two 2-1 wins over average teams. I think this is a situation where things will even out as more matches are input, but at least with these small samples it can be a big problem and rate teams that have unbalanced schedules pretty badly.

I am still working out the kinks of a model that I think could be an improvement. The idea is to look at how well each team does in each match compared to how many goals on average their opponent scored and allowed. To make predictions, these are then combined with the average goals scored and allowed by the opponent. I'll work that all out and compare its predictions to that of the Poisson model.

Friday, September 18, 2009

Team and League Roll Call

I will be doing power rankings and giving previews and predictions starting soon. I will definitely cover the Spanish Primera Division, the English Premier League, the Italian Serie A and the German Bundesliga every week. Within the next few days I will also do a writeup for the MLS here in the US.

I would like to know from you readers what teams you support and leagues you follow. For leagues I'm mainly interested in those I didn't list, but for clubs it would be helpful to know even if they are a big club and/or in one of those leagues.

Please add a comment stating which club(s) you support and any leagues you follow that are not one of the five listed above.

Weekend Preview: Manchester Derby

The biggest match of the weekend in my view is Manchester United versus Manchester City at Old Trafford on Sunday. The match kicks off at 1:30 in the afternoon locally, that's 14:30 for most of continental Europe, 8:30 in the morning for those on the East coast of the United States and bright (or still dark perhaps) and early at 5:30 in the morning for those of us on the left coast. Unfortunately if you're in the United States it's only available on Setanta so it may be a good idea to hit up a pub for breakfast.

Because the season is so young, much of my preview of the match is based on last year.

Home and Away

Old Trafford was good to United last year. They were the best club in the Premiership at home and "only" third best on the road. While they actually conceded two more goals at home compared to the road on the whole season, they scored 18 more, nearly one more per match, in front of the home crowd. City last year were the club with the biggest difference between home and away results. They were the third best team at home and 17th best on the road. If there was something to that other than just variance and it carries over to this year, it points to Manchester United starting out with an edge based solely on the location of the match.

Recent Form

At this point form is just all results so far.
United: WLWWW
City: WWWW

Both teams are doing pretty well. The only blemish was a surprising 1-0 defeat for United at newly-promoted Burnley. Cup competitions have gone similarly - City won 0-2 at Crystal Palace in the League Cup while United won 0-1 against Besiktas in the Champions League. Nothing surprising here.

Something important for this match is that Adebayor will not play. He was given a three-match ban for stomping on van Persie and for his provocative goal celebration, where he ran the length of the pitch to kneel in celebration in front of the Arsenal supporters.

Scoring and Conceding

Last season Manchester United were tied with Chelsea as best defensive team and were the second best scoring team, 9 goals behind Liverpool. Manchester City were in the middle of the pack defensively and scored the fifth most goals. City scored 10 fewer goals than United for the season, 0.26 per match, and allowed 26 more goals, 0.68 per match. In short, Manchester United were incredibly good defensively, the best in Europe in my opinion, and very good at scoring while City were good offensively but mediocre defensively.

Last summer there were a lot of changes at both clubs. Cristiano Ronaldo moved from Manchester United to Real Madrid. It's very tough to replace a player of his quality and they didn't do so by bringing in another big name. Sir Alex Ferguson added Michael Owen for free, paid out 18 million euros for winger Antonio Valencia and picked up promising 20-year-old winger Gabriel Obertan for 3.5 million. City were a lot busier. They brought in three defenders, Sylvinho, Lescott and Toure, midfielder Barry and three attacking players in Adebayor, Tevez and Roque Santa Cruz. As far as Man City goes, I think the situation is similar to what I wrote about Real Madrid. Bringing in Tevez and Adebayor was huge and got all the attention, but Toure, Lescott and Barry will be the signings that make a difference if City are to compete for the league and finish in a Champions League spot.

Looking at the short season thus far, the two teams are the best defensive teams in the league. The Poisson model puts United best and City second best. In scoring, I have Manchester United fifth best and City 8th. Take those figures with a grain of salt as the sample sizes are ridiculously small and an extra goal here or there would make a huge difference at this point. Based on last year's results, this year's results so far, and the summer transfers I'd say that both teams should be near the top both offensively and defensively.

Predictions

As I said, I'm not sure this prediction is reliable given how few matches have been played. The Poisson model predicts a low-scoring affair with 0-0 the most likely scoreline. It gives United a 39% chance to win, City a 21% shot and predicts a draw 40% of the time.
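For readers curious how win, draw and scoreline probabilities fall out of a Poisson model, here is a minimal sketch. It assumes each side's goals follow independent Poisson distributions; the expected-goal rates of 0.75 and 0.5 are hypothetical values chosen only to produce a low-scoring match, not the rates the model actually used, so the output will not reproduce the 39/21/40 split.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k goals when `rate` goals are expected."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def match_probabilities(home_rate, away_rate, max_goals=10):
    """Return (home win, draw, away win, most likely scoreline).

    Scorelines above `max_goals` per side are ignored; for low rates the
    probability mass left out is negligible.
    """
    home_win = draw = away_win = 0.0
    best_score, best_prob = None, -1.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_rate) * poisson_pmf(a, away_rate)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
            if p > best_prob:
                best_score, best_prob = (h, a), p
    return home_win, draw, away_win, best_score


# Hypothetical rates for a tight match between two strong defenses.
home_win, draw, away_win, likeliest = match_probabilities(0.75, 0.5)
```

When both expected-goal rates are below 1, the single most likely scoreline is 0-0, which is why two stingy defenses push the prediction toward a goalless draw.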

I'll predict 1-0 to Manchester United.

Thursday, September 17, 2009

CONMEBOL Qualification: Post Matchday 16

Sorry for the delay. I got caught up in some of the other stuff. In the future I'll try to get things up within the next day or two.

I watched most of Paraguay - Argentina. The 1-0 scoreline does no justice to how much better Paraguay looked. My view before was that Argentina were struggling a bit but in no real danger of failing to qualify. Now they sit in the playoff spot and frankly do not look able to pull themselves out of it. Costa Rica have been struggling as well, but while Argentina would clearly be favorites I don't think you can consider a playoff close to a lock. As a player I love Diego and think he was the best to ever play the game. I'm far from sold on his skills as a manager, though. It's easy to read too much into short-term results (many, including me, thought this about Mexico, for example), but it seems to be more than just variance.

Results:

Brazil 4 - 2 Chile
Venezuela 3 - 1 Peru
Paraguay 1 - 0 Argentina
Uruguay 3 - 1 Colombia
Bolivia 1 - 3 Ecuador

Ignoring the Argentina match, the teams that needed strong results to stay in contention got them. Ecuador's win put them into fourth place. With the solid home win, Uruguay have turned it into a 3-horse race for two spots: one automatic, one a playoff, most likely with Costa Rica.

Here is the table:



With two matches to go, here are the schedules:

Chile: Colombia away, Ecuador home
Ecuador: Uruguay home, Chile away
Argentina: Peru home, Uruguay away
Uruguay: Ecuador away, Argentina home

Ecuador and Uruguay clearly have the harder schedules out of the three. Because Uruguay plays both, all three teams control their own destiny in the sense that they will at least make the top 5 if they win both of their matches.

Poisson Predictions

Brazil and Paraguay are in. Chile are also effectively in. They made the top 4 in 9,975 out of 10,000 simulations. The Chilenos are guaranteed at least a playoff spot and any draw by them or draw or loss for Argentina sees them through in the top 4. On the flip side, Bolivia and Peru are eliminated and Colombia and Venezuela very close to it. Both have a less than 1% chance at a top 4 finish and between 2 and 3% for fifth place.
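Figures like "9,975 out of 10,000 simulations" come from playing out the remaining fixtures many times and counting how often each team lands in a given range of places. The sketch below shows the idea only. Everything in it is a placeholder: the team names, points, fixtures and scoring rates are hypothetical, each team's rate ignores the strength of its opponent, and ties on points are broken at random rather than by goal difference.

```python
import math
import random


def poisson_sample(rng, rate):
    """Draw a Poisson-distributed goal count (Knuth's multiplication method)."""
    limit, k, p = math.exp(-rate), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_top_n(points, fixtures, rates, top_n, n_sims=10_000, seed=0):
    """Estimate each team's chance of finishing in the top `top_n` places."""
    rng = random.Random(seed)
    hits = {team: 0 for team in points}
    for _ in range(n_sims):
        table = dict(points)
        for home, away in fixtures:
            home_goals = poisson_sample(rng, rates[home][0])
            away_goals = poisson_sample(rng, rates[away][1])
            if home_goals > away_goals:
                table[home] += 3
            elif home_goals < away_goals:
                table[away] += 3
            else:
                table[home] += 1
                table[away] += 1
        # Simplification: ties on points are broken at random.
        ranked = sorted(table, key=lambda t: (table[t], rng.random()),
                        reverse=True)
        for team in ranked[:top_n]:
            hits[team] += 1
    return {team: hits[team] / n_sims for team in points}


# Placeholder inputs, not the real CONMEBOL table.
points = {"Team A": 24, "Team B": 23, "Team C": 22, "Team D": 21}
fixtures = [("Team A", "Team B"), ("Team C", "Team D"),
            ("Team B", "Team C"), ("Team D", "Team A")]
# (expected goals when at home, expected goals when away) for each team.
rates = {team: (1.4, 1.0) for team in points}

chances = simulate_top_n(points, fixtures, rates, top_n=2, n_sims=2000, seed=1)
```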

I'll focus on the middle three. Here are each team's chances of qualifying automatically by taking the 4th spot:

Uruguay - 35.2%
Ecuador - 34.7%
Argentina - 29.1%

Here are the chances of finishing 5th:

Argentina - 54.5%
Ecuador - 24.6%
Uruguay - 15.3%

Finally, here are the overall qualification percentages, assuming the playoff is a coin flip. As I hinted at above and mentioned in a previous article, I think any of these teams would be a favorite against Costa Rica. I think for Argentina you can add as much as 10% because they are so likely to make the playoff.

Argentina - 56.4%
Ecuador - 47.0%
Uruguay - 42.9%
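The overall figures combine the two lists above: the chance of finishing 4th plus the chance of finishing 5th times the chance of winning the playoff. A small sketch of that arithmetic, using the percentages quoted above (the function name and the `playoff_win_prob` parameter are my own labels):

```python
# Percentages copied from the two lists above.
fourth_place = {"Argentina": 29.1, "Ecuador": 34.7, "Uruguay": 35.2}
fifth_place = {"Argentina": 54.5, "Ecuador": 24.6, "Uruguay": 15.3}


def overall_chance(team, playoff_win_prob=0.5):
    """Chance (in percent) of qualifying directly or through the playoff."""
    return fourth_place[team] + playoff_win_prob * fifth_place[team]


coin_flip = {team: overall_chance(team) for team in fourth_place}
```

Because Argentina are so likely to finish 5th, their total is the most sensitive to the playoff assumption: moving `playoff_win_prob` from 0.5 to roughly 0.68 adds close to 10 percentage points for them.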

Wednesday, September 16, 2009

Does Defense Win Champions Leagues?

An old saying, at least for American sports, is that "offense wins games, but defense wins championships". In soccer the claim is often made that in cup competitions, particularly those like the Champions League that use the away-goals rule, strong defense is more important than the ability to score. The argument, which I've used myself, is that it's tough to advance if you can't win the home leg 1-0. There is an obvious flaw in this - advancing in the knockout stage is a zero-sum game. If you are very good at scoring then it's going to be tough for the other team to beat you 1-0 when they are at home. If keeping the other team from scoring an away goal is vital then a team that is very rarely held scoreless should have a similar advantage as one that is rarely scored on. Like a lot of cliches and common wisdom, this can be tested with data.

My work on this is preliminary. There is much more that I can and will do on this in the future, but I thought I'd share my findings thus far.

To test for this, I created a points system similar to the one used by FIFA for World Cup seeding. I awarded 32 points to the champion, 31 for the runner-up, 29.5 to the semifinal losers, 26.5 for those who busted out in the quarterfinals, 20.5 to the teams that were eliminated from the first knockout round, 12.5 to those finishing third and 4.5 for those finishing last in their group. I did not look at teams that were eliminated before the group stage. This scoring system may not be perfect, it's just meant to give a numerical representation of how well a team did in the Champions League for the season. Starting with the 2003-2004 Champions League, the first that featured 32 teams and 16 teams in the knockout round, I recorded Champions League points according to my system and points, goals for, goals against and goal differential per match for each team in their domestic league that season. I recorded this info for the 5 biggest leagues - English Premier League, Spanish Primera Division, Italian Serie A, German Bundesliga and French Ligue 1.
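One way to read this scale, and this interpretation is an observation about the numbers rather than something stated above, is that each team receives the average of the finishing positions shared by everyone eliminated at the same stage, counting from 1 for the worst of the 32 teams up to 32 for the champion. The sketch below rebuilds the values on that assumption; the stage labels are my own.

```python
def champions_league_points():
    """Average finishing rank (1 = worst, 32 = champion) for each exit stage."""
    # (stage, number of teams eliminated at that stage), worst to best.
    stages = [
        ("fourth in group", 8),
        ("third in group", 8),
        ("round of 16", 8),
        ("quarterfinal", 4),
        ("semifinal", 2),
        ("runner-up", 1),
        ("champion", 1),
    ]
    points = {}
    next_rank = 1
    for stage, count in stages:
        ranks = range(next_rank, next_rank + count)
        points[stage] = sum(ranks) / count
        next_rank += count
    return points


cl_points = champions_league_points()
```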

To test whether strong offense or defense is more important, I used the correlation coefficient as I did in the previous article on the role of luck in close matches. Here are the correlations:

Correlation coefficient between Champions League points (as defined above) and:
League points per match: 0.512
Goal difference per match: 0.567
Goals for per match: 0.390
Goals against per match: -0.501

All correlations are significantly different from zero and go in the expected direction. Teams that performed better in their league a given year, whether defined by league points or goal difference, tended to perform better in the Champions League than teams from these leagues that did worse domestically. Teams that scored more goals domestically tended to do better in the Champions League and those that conceded fewer also did better.
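For anyone who wants to run the same kind of check on their own numbers, here is a minimal sketch of the Pearson correlation coefficient. The five data points are invented purely to show the calculation and are not taken from the data set described above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented example: Champions League points and league goals conceded per match.
cl_points_sample = [32, 26.5, 20.5, 12.5, 4.5]
goals_against_sample = [0.7, 0.9, 1.0, 1.2, 1.4]
r_defense = pearson(cl_points_sample, goals_against_sample)
```

A negative value for goals against is the "expected direction": teams that concede less tend to collect more Champions League points.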

By using the results from a different competition, I'm essentially using domestic results as a proxy for skill. Teams that concede a lot of goals domestically will tend to be worse than those who allow fewer. It's not perfect, but it's a good way to represent offensive and defensive skill with a number.

Unfortunately, because the data set isn't large enough I can't say that the correlation between Champions League results and goals against is stronger than that for goals for. The 95% confidence intervals for the two correlations not only overlap, but each actually contains the other value. Having said that, the correlation for goals against does appear to be stronger. I would say based on these results that it is likely that defense is in fact more important than offense when it comes to getting results in the Champions League, but further study is needed.
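A common way to get a confidence interval for a correlation is the Fisher z-transformation, sketched below. Two caveats: the method actually used for the intervals above isn't stated, and neither is the sample size, so the `n = 100` here is purely a hypothetical stand-in.

```python
import math


def correlation_interval(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a correlation (Fisher z)."""
    z = math.atanh(r)                 # transform r to an approximately normal scale
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


n = 100  # hypothetical sample size, not taken from the article
# Compare magnitudes, so the goals-against correlation is entered as 0.501.
goals_for_interval = correlation_interval(0.390, n)
goals_against_interval = correlation_interval(0.501, n)
```

With a sample of about that size, each interval comfortably contains the other correlation, which is the situation described above.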

Correlation and Causation

A quick word about correlation. A common mistake people make in looking at data is believing that correlation implies causation. Two sets of numbers A and B can be correlated if A causes B, B causes A or some other thing causes both A and B. As an example, team goals scored and conceded tend to be negatively correlated - teams that score a lot of goals tend to give up few. This is because overall differences in quality cause good teams to both score more goals and concede fewer goals than their weak opponents. It would be incorrect to say that because they are negatively correlated, playing open all-out attacking football would lead to conceding fewer goals than playing in a more balanced way. Similarly, packing 11 guys in the box isn't going to have you scoring 5 goals per match.

In this case though, I think correlation implies causation, if anything underestimating it. There is no reason to think that performing well in the Champions League would cause a team to become better or worse defensively. However, doing well in the Champions League could negatively affect league results since teams going deep will likely have to rest their best players in league matches more often.

Conclusion

This is a preliminary study that shows conclusively that both scoring and defending skill are important in the Champions League. There is nothing surprising there. The results also indicate, though not conclusively, that defense is more important than offense. Similarly, goal difference appears to be a better indicator for how well a team will do in the Champions League than league points.

In the future, I will study this further. I will expand the points system to include previous editions of the Champions League. Also, in addition to looking at correlations I will run regressions that will take into account country effects, which I ignored here. Finally, at some point I hope to incorporate the Poisson model and use the scoring and conceding ranks it gives instead of just using league results.

Monday, September 14, 2009

Is Performance in Close Matches Luck or Skill? (Goal Diff and Points - Part 3)

Note: Here are parts one and two of this series. I strongly recommend reading those before reading this, though I give a summary at the beginning if you just want to get to the business of which teams have gotten lucky and/or been clutch over the last several years.

In the first article I discussed how well a team's number of league points at the end of the season corresponds to their goal differential. The answer was extremely well. In the second article I discussed why a team would over or under perform relative to their goal differential. This comes down to how good their results were in close matches compared to teams that are similarly skilled, at least as judged by their goal differential. I then discussed why some teams would get better results than others.

The question at hand is whether there is a "close-match skill" that causes some teams to play better when it matters most compared to teams of the same skill level otherwise.

Correlation

To test this, I used the same data set as before. It includes every season of the Spanish Primera Division from 1987-1988 to 1994-1995 and from 1997-1998 to 2008-2009. I also have the English Premier League for all seasons since 1995-1996 and the Serie A since 2004-2005. I believe this to be all seasons for the three leagues in which they had 20 teams. For the first 8 seasons of La Liga I converted the points so that a win would be worth 3 points as it is today instead of 2 as it was then.

Using this data I looked at the correlation between how many points above or below goal-differential expectation each team gets a given season and the next season. In footballing terms this statistic will tell us if a team's performance in close matches is consistent from one season to the next. A large positive figure would indicate that teams doing particularly well in important situations this year will tend to also do so next year. A large negative value would mean that teams that do well for their skill level this year will tend to be bad at that next year. A value close to zero would tell us that performance in close matches this year means nothing as far as next year goes. In other words, values close to zero mean that there is no evidence that performing especially well in crucial situations is anything other than luck.
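The season-to-season test boils down to pairing each team's over- or under-performance in one season with the same figure for the following season, then correlating the two columns. Here is a minimal sketch with invented numbers; the club names, years and values are hypothetical, and a team only contributes a pair when it appears in two consecutive seasons.

```python
import math


def consecutive_pairs(residuals):
    """Pair each season's value with the next season's for the same team.

    `residuals` maps team -> {season start year: points minus expectation}.
    """
    pairs = []
    for by_season in residuals.values():
        for year, value in by_season.items():
            if year + 1 in by_season:
                pairs.append((value, by_season[year + 1]))
    return pairs


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread = math.sqrt(sum((x - mean_x) ** 2 for x in xs)
                       * sum((y - mean_y) ** 2 for y in ys))
    return cov / spread


# Invented data: Club B has a gap (for example, a season outside the top flight).
residuals = {
    "Club A": {2005: 2.0, 2006: -1.5, 2007: 0.5},
    "Club B": {2005: -3.0, 2007: 1.0},
    "Club C": {2006: 1.0, 2007: 2.5},
}
pairs = consecutive_pairs(residuals)
this_season = [a for a, _ in pairs]
next_season = [b for _, b in pairs]
r_persistence = pearson(this_season, next_season)
```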

Here is the correlation coefficient for each of the three leagues:
English Premier League: -0.0218 (sample size 240)
Spanish Primera Division: 0.05579 (sample size 357)
Italian Serie A: 0.02312 (sample size 73)

As you can see these values are all very close to zero. In fact, the correlation coefficient for the Premiership is negative! For the fellow nerds out there, the p-values for the one-sided t-test were 0.147 for the Spanish league and 0.423 for the Serie A. The correlation for the Spanish league is in the ballpark of being statistically significant but is not at the usual 5% level, or even at 10%. Given the large sample size, it's safe to conclude that the correlation is effectively 0. There is no evidence that teams overperforming their goal-differential expectation can be attributed to anything other than luck, or at least anything that would carry over from one season to another.
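The p-values can be checked approximately from the correlation and sample size alone. The sketch below converts r into a t statistic and then, as a simplifying assumption, uses the normal distribution in place of the exact t distribution, so results may differ slightly in the third decimal place from the figures quoted above.

```python
import math


def one_sided_p_value(r, n):
    """t statistic for a correlation and an approximate one-sided p-value.

    Tests whether the true correlation is greater than zero. The p-value
    uses the normal approximation, which is close for samples this large.
    """
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    p = 0.5 * math.erfc(t / math.sqrt(2))
    return t, p


t_spain, p_spain = one_sided_p_value(0.05579, 357)
t_italy, p_italy = one_sided_p_value(0.02312, 73)
t_england, p_england = one_sided_p_value(-0.0218, 240)
```

A negative correlation, as in the Premier League, gives a one-sided p-value above 0.5, so it offers no support at all for a positive carry-over effect.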

A Look At Individual Clubs

While that is very conclusive, let's look at how different teams have done compared to their goal-differential expectation. For all teams in England with at least 10 seasons in the sample, here is a chart with the average difference between actual points and goal-differential expectation each season.



Liverpool have gotten an average of 1.5 fewer points than their goal-differential suggests that they should. At the other end, West Ham have managed to outperform their expectation by nearly 3 points a season. That Manchester United have done better than expectation might be the least shocking thing I have uncovered in all of my research for this blog.

Looking at the numbers, I think the chart is more evidence that there isn't skill involved in outperforming your goal-differential - that it's just luck. One reason for this is that the top teams are all over the place. Admittedly, Manchester United are near the top and are easily the best club in England over the sample. Other than them though, Chelsea and Arsenal have run close to average, while Liverpool are well below average.

On the flip side, West Ham United are the team that has been most successful compared to their skill level! While they haven't been the worst club to see the top flight, a look at them over the years of the sample gives one no reason to think that they have something that makes them the best team in close matches compared to their quality. They went through 4 managers during that time and were relegated. The relegation isn't really relevant as absolute level isn't what we're looking at but how well a team does in close games compared to their skill level. To me going through four managers is important. It would be easy to point to Alex Ferguson as the reason for Manchester United outperforming their goal-differential expectation, but then why would a club that sacked four managers in that time do even better compared to their expectation?

Here is the list for La Liga. I included all teams with at least 10 seasons in the sample:



Once again, we have a similar pattern. Strong teams such as Athletic Bilbao and Real Madrid are near the top but then again there are big clubs like Barcelona and Atletico de Madrid at the other end. Sporting Gijon is similar to West Ham. Again, just glancing at the list of teams that over and underperformed, it seems to be mostly luck, if not all.

For completeness here's the chart for the Serie A:



Just like the others it features a huge club at or near the top and bottom. Reggina plays the role of club that has seriously over performed in close matches for no obvious reason other than luck. Looking at both this and the Premiership data, I found it interesting that Liverpool and AC Milan have under performed in their leagues. Both have a reputation for grinding out wins in cup competitions, most notably the Champions League, even when playing against superior opponents. Apparently those reputations didn't help them any in those spots in league fixtures.

Conclusion

Looking at points earned compared to the goal-differential expectation from one season to the next there is no correlation. Teams that over perform in close matches this season are no more likely to do so next year than those that under performed. In other words, there is no evidence that some teams are able to step up in important situations better than others. Looking at which teams have historically done better or worse than their goal-differential expectation helps to confirm this. While for each of the three leagues there is a team or two near the top that one might think could have this ability to step up, there are other teams at the bottom that it seems should also have it. As importantly, there are teams that have overperformed that have lacked consistency in management and squad so it seems they couldn't have done so if the ability to rise to the occasion were real, or at least an important factor.


Scoring and being scored on are based on a combination of skill and luck, but having the goals come at the right time seems to be all luck.

Sunday, September 13, 2009

What to Expect from this Blog in the Near Future

Now that we're winding down from the World Cup qualification craziness and I have a lot of new readers, I want to give a quick description of what you should expect for the next few weeks.

Here are articles coming down the pipeline:

- a CONMEBOL update after the matches last week. I apologize for not having written that earlier. I also will write an article discussing qualifying from Africa.

- more in the series on goal differential and points (see 1 and 2). The next article will discuss whether the data suggests that some teams use their goals more efficiently than others by performing better than similarly skilled teams in important situations or if it just comes down to luck. I will then write an article where I look at managers and whether there is evidence that some are better than others in important situations, meaning they will outperform what their goal differential would suggest.

- more in-depth work on modeling and predictions. I want to study the Poisson model in depth before considering alternatives. The reason for that is that it is very easy to implement since it only uses aggregate data. I'll look into altering it while keeping the main structure in place so that I can quickly look at any league or competition and make sound predictions. At some point I'll introduce an alternative model that uses individual-match data. I suspect that it will be an improvement over the Poisson model, though more cumbersome.

I will attempt to address these questions:
- How much do match stats like time of possession, corners, fouls and bookings correlate with goal differential? I will also discuss whether we can conclude that a causal relationship exists for each.

- Do scoring-oriented or defensive teams do well in the Champions League and domestic cups? Conventional wisdom, it seems, says defense wins championships in any sort of football. I'll look at this claim.

- What is the effect of the timing of goals on the match outcome? Are goals scored at the end of the half more valuable for psychological reasons? Again, conventional wisdom says yes. We'll see what the data says.

- How does the score affect the likelihood of a goal for either team? Teams that are behind tend to attack more and those ahead are more careful defensively. The flip side of that is that the team that is behind is more vulnerable defensively, but the team that is ahead is less eager to throw it forward. How this all impacts scoring rates isn't clear because the effects run in opposite directions.

- How well do the results from one season correspond with those from the next? Can results from the past be used to predict the current season? I will also look at this for teams that have been promoted or relegated, though that seems more tricky.

Further on up the Road

My overall plan is to work on the modeling stuff now while the seasons are young in most leagues. Once they have played enough matches for there to be sufficient data, I'll apply the best model I have to make predictions. The timing was unfortunate in that the World Cup qualifiers went off just after I started this blog so I had to use the basic Poisson model for the predictions without having properly analyzed it. By the time October 10th comes around I'll hopefully have an improved model to work with. In general I think it will be pretty standard for me to work on improving the models or at least assess how well they're doing but use the best one available to make predictions, even if it is not completely perfect. Over time the predictions should become better and better.

Once the leagues have played enough, I'll write weekly articles about the biggest leagues and frankly any league that you readers would like me to discuss. These articles will be pretty similar to what you saw last week with World Cup qualifying.

Comments and Suggestions

Once again I want to plead for your comments and suggestions. The above is a partial list of topics I have in mind to write about. I'm open to looking at other things and have no schedule in mind for when I will write what. If you really want to read about a subject I mentioned above, or one of them doesn't seem interesting at all, add a comment to let me know and I'll take that into consideration when deciding what to write about next.

I'd also love any feedback on the blog in general. Do you like the types of articles I write? Hate them? What about the overall style of the blog? Anything you have to say, positive or negative, would be greatly appreciated.

The Poisson Model - a Deeper Look, Part 1

I've been using the Poisson model to make my World Cup qualification predictions. Let's take a closer look at it. I'm not sure how many of you will be interested in this sort of article since it's a behind-the-scenes kind of thing. I will be making these kinds of posts on occasion for a couple of reasons. For one, I think it's necessary to give justification for my predictions. If I were making picks based on my subjective opinion from watching matches I would give some reasoning. The same applies to using a model to make predictions, though in that case it would be justification for the model more than the individual picks. The second reason is that I'll be working to improve the models and writing these posts will help me with that.

Poisson Distribution Overview

The model is based on the Poisson distribution. Roughly speaking, the Poisson distribution describes how many occurrences of something you will have when they are infrequent but have a lot of chances to occur. This seems applicable to football because there are a lot of minutes (or seconds if you like) in a match and the probability of a goal any given minute is low.

The major assumption that the Poisson distribution makes, when applied to football, is that the probability of a goal being scored by a given team is the same every minute. In other words, it doesn't matter what the score is, when the last goal was scored or how much time is left. This assumption greatly simplifies things, but I can't imagine any fan of the game thinking it is correct. When a goal is scored things change for both teams. I suppose it's possible that things perfectly balance to cancel each other out - A attacks more because they are behind but B is more defensive as well so they have the same chance of scoring as when they were tied. That does not seem plausible though. I'll write more in the future on how the scoreline affects each team's chances of scoring. A preliminary look at a single season of Premier League data suggests that there are significant differences in how frequently goals come for either team with different scorelines.

There are 4 inputs for the Poisson model: goals scored for each team, goals conceded by each team, a list of which teams played at home and away, and how many goals in total have been scored at home. Note that it does not use results from individual matches. Other than a list of fixtures, all inputs for the model are aggregate. This is possible precisely because of the above assumption. If the chance of a goal does not depend on the scoreline then you don't need to worry about how many goals were scored when the team was ahead or behind or how much time each team spent with the score a certain way.
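To make the idea concrete, here is a minimal sketch of how aggregate scoring rates could be turned into match probabilities. The exact formula behind my predictions isn't spelled out here, so the expected_goals function below is an assumption (a common attack-times-defence construction), the rates are hypothetical, and the two teams' goal counts are treated as independent.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k goals when lam goals are expected."""
    return exp(-lam) * lam ** k / factorial(k)

def expected_goals(attack_rate, opp_concede_rate, league_rate, venue_factor):
    """Assumed formula: scale the team's scoring rate by how leaky the
    opponent is relative to the league average, then adjust for venue."""
    return attack_rate * (opp_concede_rate / league_rate) * venue_factor

def outcome_probs(lam_home, lam_away, max_goals=10):
    """Home win, draw and away win probabilities from independent Poissons."""
    home = draw = away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    total = home + draw + away  # renormalise for the truncated tail
    return home / total, draw / total, away / total

# Hypothetical per-match rates, not taken from any real league table
lam_home = expected_goals(1.6, 1.3, 1.25, 1.15)
lam_away = expected_goals(1.1, 1.0, 1.25, 0.85)
print(outcome_probs(lam_home, lam_away))
```

Note that nothing in the sketch looks at individual match results. Only per-team rates and a venue adjustment go in, which is exactly what the constant-scoring-rate assumption allows.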

How Well Does It Describe Football Scorelines

If you stopped reading there, you'd probably think that the Poisson model is useless because it is based on flawed assumptions. Before you jump to that conclusion, let's look at how well the model explains the game. In future articles I will discuss how well it predicts the future. That is a lot more difficult, in part because the sample sizes are smaller. Here I am discussing how well the model does with a full season of data. In other words, for every season in my sample I use the end-of-the-season information to run the model and then look at how well the model fits the data.

The data I used was the previous 9 seasons in the Spanish Primera Division. To test how well the model fits, I compared the frequency the model assigns to each result (home win, draw, away win) and to each number of goals for the home and away team with how often they actually happened.

Match Outcomes

The first thing I looked at is how often the model indicated the match would result in a home win, draw and away win compared to how often those actually happened. Here is a graph of the predictions. I went with a line graph instead of the more typical bar graph so that you can more easily see how the differences vary by result. The first node is for the home win, the second for a draw and the third the away win. The vertical axis is the percentage of the time they actually occurred or the model claimed they would. The blue line is how often they actually happened, the red line represents the estimates of the model.



As you can see, the model underestimates the percentage of home wins and draws and overestimates the likelihood of an away win. Because the sample size is pretty large, these differences are statistically significant. The p-value for the Chi-squared goodness-of-fit test is less than .02. In other words, looking at the actual results compared to the estimates from the model, we can conclude that the predictions of the Poisson model are off. Despite that, the estimates are pretty good.
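For those curious about the mechanics, a Chi-squared goodness-of-fit test on the three match outcomes can be sketched as follows. The counts are hypothetical and are not the figures behind the graph. The sketch also assumes 2 degrees of freedom (three categories minus one), in which case the p-value has the simple closed form exp(-x/2).

```python
from math import exp

def chi_squared_stat(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_df2(stat):
    """Upper-tail probability of a Chi-squared variable with 2 degrees of
    freedom, for which the survival function is exactly exp(-x / 2)."""
    return exp(-stat / 2)

# Hypothetical counts over 9 seasons of 380 matches (3420 in total)
observed = [1650, 880, 890]   # home wins, draws, away wins that happened
expected = [1600, 860, 960]   # counts implied by the model's probabilities

stat = chi_squared_stat(observed, expected)
print(f"Chi-squared = {stat:.2f}, p-value = {p_value_df2(stat):.3f}")
```

With thousands of matches even gaps of a percentage point or two produce a small p-value, which is why a model can be statistically "off" and still give pretty good estimates.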

Goals

Let's now look at how often the model estimates each number of goals will occur in a match. The first panel is goals for the home team, the second the away team and the last all goals. Again, the blue lines represent the actual frequency and the red lines those predicted by the model.



Looking at the charts, the model overestimates the frequency of 0 goals and underestimates how often teams score 1 goal for both home and away teams. These differences are all statistically significant at the standard 5% level of significance. It also appears that the model underestimates how often the home side scores 2 goals and 3 goals, but more data would be needed to confirm that. Similarly the model may overestimate how often a team scores 4 or more goals for both home and away teams. The Chi-squared p-value is less than .01 so again we can conclude that the model does not fully fit the data.

You can see why the Poisson model overestimates how often the away team wins. The model is more off for the home team than away when giving the frequency of going goalless. It also seems to underestimate how often the home team scores two goals, which it does not do for the away side. In other words, the model is a bit off whether you look at the home team or the away team, but it is a bit harsher on the home side. I should say that these differences aren't statistically significant, but things appear that way and statistical significance would require an insanely large sample size.

Conclusion

Looking at the figures, the Poisson model does a pretty good job but is not perfect. Based on the difference between home and away goals, it appears that it is off in a systematic way - it underestimates scoring by a larger margin if the expected number of goals is higher. As a result, it will overrate teams that score few goals, concede a lot and/or are playing away and underrate those that score a lot, are stingy defensively and/or are playing at home. This is good and bad news. On one hand the model being wrong in a systematic way means that certain teams will consistently be rated incorrectly and their chances over or underestimated. On the other hand it means that adjustments can likely be made to improve the model.

In the next article in this series I will tweak the model to hopefully deal with these issues. I will then compare the mid-season predictions made by the vanilla model and the alternative version(s) to see how much we can improve the Poisson model. I think and hope that improvements can be made.

Friday, September 11, 2009

Goal Differential and Points - Part Deux

In a previous article I discussed goal differential and how well it explains how many points a team gets in a season. I would highly recommend reading that article for those who haven't. The short of it is that goal differential almost perfectly explains how many points a team gets. Other than goal differential, how well a team does in close matches compared to blowouts determines how many points they will get. If a team wins a lot of games by a single goal and then gets beaten several times by 3 or more goals then they will overperform. In other words, they will get more points than their goal differential indicates that they "should".
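A tiny worked example shows how much room there is for this. The three made-up teams below all finish a four-match stretch with a goal differential of zero, yet their points totals differ widely depending on when the goals came.

```python
def summarize(results):
    """Return (points, goal_differential) for a list of (goals_for, goals_against)."""
    points = sum(3 if gf > ga else 1 if gf == ga else 0 for gf, ga in results)
    goal_diff = sum(gf - ga for gf, ga in results)
    return points, goal_diff

# Hypothetical four-match runs, each with a goal differential of zero
narrow_winners = [(1, 0), (1, 0), (1, 0), (0, 3)]   # three tight wins, one blowout loss
draw_merchants = [(0, 0), (0, 0), (0, 0), (0, 0)]   # four goalless draws
flat_track     = [(3, 0), (0, 1), (0, 1), (0, 1)]   # one blowout win, three tight losses

for name, run in [("narrow winners", narrow_winners),
                  ("draw merchants", draw_merchants),
                  ("flat-track side", flat_track)]:
    points, goal_diff = summarize(run)
    print(f"{name}: {points} points, goal differential {goal_diff}")
```

The first team overperforms its goal differential and the last underperforms it, even though all three scored and conceded the same net number of goals.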

The question at hand is whether this is based on luck, skill or perhaps something else entirely. Here are some possible explanations.

Luck

The main argument for luck is that it plays a role in every single goal that is scored. The difference between a perfectly played ball and one that goes wildly into touch is probably less than a millimeter in where the foot strikes the ball or a tiny change in the amount of force applied. The team aspect of the game also throws an element of luck into the build up and scoring chances. Sometimes your perfect through ball will be run onto by your teammate, other times he won't get there because he was expecting you to play it directly to him. The referee also adds some variance. Sometimes the linesman will miss a player just offside, other times he'll get it right and on occasion he'll rule that a player that was onside was off.

Over the course of a season, these things will even out so that the number of goals scored and allowed will be pretty close to representing how much skill teams have at each end of the field. On the other hand, luck still will have played a big role in the timing of those goals. Some matches it will all be clicking and you'll beat a team by 3 goals, netting two you didn't really need. Other days you'll play an inferior opponent but will be held to a 0-0 draw because the bounces weren't going your way. If things happen to go your way when the score is close more often than they do when you are well ahead or behind then you will do better than teams with the same goal differential and vice-versa.

Luck certainly plays some role; I don't think anyone could reasonably argue otherwise. Whether it's a small factor or the only thing that matters is impossible to prove as it is the null hypothesis. In other words, we can only show that luck is the only factor by showing that nothing else could explain it.

Player Skill

The argument for player skill is that in addition to football skills there is a mental skill that allows some players to play better than they usually do (or to drop off less than others do) in more important game situations. In American sports, this is often referred to as a player being clutch. In soccer this could appear in a number of ways. The most obvious example is penalty kicks. When you're watching it certainly seems like some players are better than others at not letting nerves get the best of them and just burying it as they should. In the run of play, you could see this if an attacking player is more likely than others of the same skill level to score in the dying minutes when his team needs a goal to even the score or go ahead.

Unfortunately, because football is a team sport evidence of this is going to be hard to come by. Perhaps one could look at attacking players' scoring rates in various situations. That doesn't seem like it would work though as there would be serious sample-size problems and it would be hard to untangle other possible explanations like tactics and the play of teammates.

Manager Skill

Differences in tactics could also play a role. Suppose a manager has a tactic that drastically improves his team defensively, but that also makes them unlikely to score. If he is ahead by a goal near the end then he will use it and as a result his team will give up fewer equalizing goals when they are ahead. They will also score fewer goals in that spot. This has a double effect. They will perform better than the average team of their skill level in close games and also score fewer useless goals, making their goal differential less than other teams of their skill level. This will cause them to overperform relative to their goal differential. Similarly, if a manager has a tactic that drastically increases his team's chances of scoring but also of being scored on then the same will hold because they'll draw even more often when behind in a one-goal match and will also give up more extra goals.

This and player skill are difficult to untangle, but the argument could be made that one managerial skill is picking players who are able to step up in important situations. To look for this, I will compare how well different teams perform compared to their goal differential through various seasons. If the manager or group of players is important then some teams should consistently outperform those with the same goal differential.

Other Factors

The above explanations focused on doing better or worse than teams of a similar skill level in close games. It could also be the case that some teams do better or worse than those of the same skill level when they are well ahead or behind another team in a match. On the positive side of things, if a team has a lot of competition for starting spots then the players might be more willing to play with full effort even when the result has been decided in order to impress the manager. If this is the case, then they will get more goals on average when they are already ahead by two or three goals and will underachieve compared to their goal difference. Another way to think about it is that their goal differential would be higher than their actual skill level so they would get fewer points than those with a similar difference in goals. On the negative side, a team may give up more easily than others. This could be due to a lack of mental toughness or players smartly backing off a bit in order to conserve energy for the long season and prevent injury. To clarify, it not only has to be the case that a team does this, but they would have to do so more than the average team for it to matter.

Future Work and Discussion

In the next article in the series I will look into these explanations and examine whether some teams seem to perform better in important situations than others with similar skills.

In the meantime, I'd love to know your opinion. Do you think it's all luck or is player and manager skill important? Do you think it might be something else that I didn't mention? Please give your thoughts below.

Thursday, September 10, 2009

World Cup Qualifying - CONCACAF

Things are starting to shape up and return to normal in the hex. The top two are now the United States and Mexico. The US won a pretty standard CONCACAF road game. It wasn't pretty, they didn't dominate and even got outplayed for stretches but managed to grind out a very important win. It's something vital at this stage of qualifying and something you see a lot of teams fail to do, Costa Rica yesterday being a good example.

Results:
Trinidad and Tobago 0 - 1 United States
Mexico 1 - 0 Honduras
El Salvador 1 - 0 Costa Rica

Here's the table:


Assuming no crazy 10-goal wins happen, here are the scenarios for each team to qualify automatically with a top-3 finish.

United States:
The Yanks go marching in if they:
- win against Honduras or
- get a draw or win against Costa Rica

They would also be in if one of these happens:
- Costa Rica lose or just get a draw at home against Trinidad and Tobago
- they draw with Honduras and Honduras do not win at El Salvador
- Mexico get only one point combined against El Salvador and Trinidad and Tobago

Mexico
Mexico qualify automatically if they:
- win either of their next two matches

If they manage two draws then they would need for one of these to happen:
- Costa Rica draw or lose in either of their last two matches
- Honduras lose either of their last two
- the US lose both of their last two

With just a single point from their last two matches they need one of these:
- Costa Rica draw or lose in either of their last two
- Honduras fail to win either of their last two

If Mexico lose both then they can still make it if:
- Costa Rica lose a match or fail to win either match
- Honduras get a draw and a loss or two losses

Honduras
Honduras are in if they win both matches.

If they get a win and a draw they need one of these to happen:
- Costa Rica get a draw or loss in one of their last two matches
- Mexico fail to win either of their last two matches
- the win is against the US and the US also lose to Costa Rica

If they get a win and a loss they need one of these to happen:
- Costa Rica get a draw or loss in one of their last two matches
- Mexico get a draw and a loss or two losses in their last two matches
- the win is against the United States and the US also lose to Costa Rica

Costa Rica
If Costa Rica get two wins then they are in.

If they get a win and a draw then they also need one of these to happen:
- Honduras win neither of their last two matches
- Mexico lose both of their last two

If they get a win and a loss then they need for Honduras to get either a draw and a loss or two losses in their last two matches.

If they get two draws then they need for Honduras to lose both of their last two matches.
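Scenarios like these can be checked by brute force, since six remaining matches give only 3^6 = 729 win/draw/loss combinations. The sketch below is one way to do it. The points totals are an assumption inferred from the scenarios (the table itself is not reproduced in the text), and ties are not broken on goal difference but simply flagged as "tiebreak".

```python
from itertools import product

# Points before the final two matchdays. The table is not reproduced in the
# text, so these totals are an assumption inferred from the scenarios above.
POINTS = {"USA": 16, "MEX": 15, "HON": 13, "CRC": 12, "SLV": 8, "TRI": 5}

# Remaining matches named in the scenarios. The order inside each pair is
# only a label and says nothing about who plays at home.
FIXTURES = [("HON", "USA"), ("USA", "CRC"), ("MEX", "SLV"),
            ("MEX", "TRI"), ("CRC", "TRI"), ("HON", "SLV")]

def final_tables():
    """Yield (results, points) for every win/draw/loss combination."""
    for combo in product(("first", "draw", "second"), repeat=len(FIXTURES)):
        points = dict(POINTS)
        for (first, second), outcome in zip(FIXTURES, combo):
            if outcome == "first":
                points[first] += 3
            elif outcome == "second":
                points[second] += 3
            else:
                points[first] += 1
                points[second] += 1
        yield dict(zip(FIXTURES, combo)), points

def status(team, points):
    """Classify a team's finish using points only."""
    others = [p for t, p in points.items() if t != team]
    if sum(p >= points[team] for p in others) <= 2:
        return "in"        # top three regardless of tiebreakers
    if sum(p > points[team] for p in others) >= 3:
        return "out"       # at least three teams finish clear of them
    return "tiebreak"      # level on points, decided by goal difference

def statuses(team, condition):
    """Set of statuses the team can end up with when `condition` holds."""
    return {status(team, pts) for res, pts in final_tables() if condition(res)}

def honduras_win_both(res):
    return res[("HON", "USA")] == "first" and res[("HON", "SLV")] == "first"

print(statuses("HON", honduras_win_both))
```

A result containing only "in" means the team finish in the top three no matter what else happens. Any scenario that returns "tiebreak" as well is one where the no-crazy-10-goal-wins assumption is doing some of the work.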

Poisson Predictions

Here are the updated percentages for automatic qualification according to the Poisson model:
Mexico - 94.1%
United States - 91.6%
Honduras - 89.8%
Costa Rica - 24.3%

To be honest I'm surprised Costa Rica is that high since the model does not take into account that the US might have nothing to play for.

Here are the overall qualification percentages assuming the playoff would be a coin flip:
Mexico - 96.8%
United States - 95.8%
Honduras - 94.8%
Costa Rica - 60.9%

I think that severely overrates Costa Rica's chances. The Ticos are in a pretty serious tailspin at the moment and even though Argentina doesn't look good and currently sits fifth in CONMEBOL I think Costa Rica is an underdog in the playoff round. I'd say they are more like 45% to make it.
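The two lists above can be combined to back out what the model thinks each team's chance of finishing fourth is, and what playoff win probability my 45% guess for Costa Rica implies. This is just arithmetic on the quoted percentages, using the relation overall = automatic + (playoff win probability) x (chance of finishing fourth).

```python
# Percentages quoted in the text
automatic = {"Mexico": 94.1, "United States": 91.6, "Honduras": 89.8, "Costa Rica": 24.3}
overall = {"Mexico": 96.8, "United States": 95.8, "Honduras": 94.8, "Costa Rica": 60.9}

PLAYOFF_WIN = 0.5  # the coin-flip assumption stated in the text

# overall = automatic + PLAYOFF_WIN * P(finish fourth), so solve for P(fourth)
fourth = {team: (overall[team] - automatic[team]) / PLAYOFF_WIN for team in automatic}

for team, pct in fourth.items():
    print(f"{team}: {pct:.1f}% chance of finishing fourth")

# Playoff win probability that would give Costa Rica 45% overall
implied = (45.0 - automatic["Costa Rica"]) / fourth["Costa Rica"]
print(f"implied playoff win probability: {implied:.2f}")
```

On these numbers Costa Rica finish fourth about 73% of the time, so a 45% overall figure corresponds to winning the playoff a bit under 30% of the time.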

2010 World Cup Seeding, Summary Version

Note: This article was written with the assumption that FIFA would use a formula similar to the one they had used for the previous 3 or 4 World Cups. They instead went purely off of the October FIFA rankings making all of this incorrect. Here is a recent article on the draw setup.

I tend to write too much so as a service I'll at times have a summary article for those that just want the damn point instead of reading through all the methodology stuff. Here is the original article. Read it if you want the methodology.

Being seeded is a big edge because it means your opponents are not as strong.

The following teams will definitely be seeded:
1. South Africa
2. Brazil
3. Italy
4. Spain
5. England

The last three spots go in this order:
1. Germany
2. Argentina
3. France
4. Portugal
5. The Netherlands
6. Mexico
7. The United States

So Germany, Argentina and France will all be seeded if they qualify. Portugal will be seeded if they qualify and one of those teams slips up. The Netherlands need for two of those teams, including Portugal, to fail to qualify in order to be seeded. Mexico need for three of Germany, Argentina, France and Portugal to not make it to South Africa and the United States need for all four of those teams to fail to qualify.
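The rule in that last paragraph is simple enough to write down: the five locked teams are seeded, and the remaining three spots go to the first three teams in the queue that actually qualify. A minimal sketch, under the formula assumed in this article (which, per the note at the top, FIFA did not end up using):

```python
LOCKED = ["South Africa", "Brazil", "Italy", "Spain", "England"]
QUEUE = ["Germany", "Argentina", "France", "Portugal",
         "The Netherlands", "Mexico", "The United States"]

def seeded_teams(qualified, open_spots=3):
    """Locked seeds plus the first `open_spots` queued teams that qualify."""
    extra = [team for team in QUEUE if team in qualified][:open_spots]
    return LOCKED + extra

# Hypothetical outcome: Argentina and Portugal miss out, everyone else qualifies
qualified = set(QUEUE) - {"Argentina", "Portugal"}
print(seeded_teams(qualified)[-3:])
```

In that hypothetical the last three seeds would be Germany, France and The Netherlands, matching the statement that the Dutch need two of the teams above them to fail.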