*I want to thank Kenny Rudinger for helping me check and correct some of my numbers on this analysis. Kenny is a physics graduate student at UW-Madison, a great Ultimate Frisbee teammate, and a Bills fan. His website is here.*

Abstract

A subtle result of the (relatively) short football season is that teams play a very small sample of games from which to determine how good they are. The fact that the outcome of an individual game can hinge on a few crucial plays adds an additional element of chance any given Sunday. Combine these features with a fairly inclusive playoff system and you have a recipe for upsets. In this post I'll talk about a theoretical model which attempts to produce a handicap, giving better teams a starting point advantage to offset this randomness.

Introduction

Generally the ideas for this blog come while I'm watching NFL games; sometimes they come from discussions with friends. But recently I came across an article in Slate by Neil Paine that got me thinking. In case you're not a fan of following links (your loss, Slate is generally awesome), the basic premise is that the current playoff structure of the NFL, with only a minimal nod to regular season performance, doesn't properly reward teams who've played statistically better during the season. Paine's insane (his words, not mine) suggestion is to give the supposedly superior team a starting point advantage.

Putting aside whether such a handicapping system would be good for the game (as Paine himself argues it's probably not, given that it's fun to see unlikely outcomes and boring to watch your team start a game with a 20-point deficit), how would you go about computing how many points to spot the better team? Paine advocates an approach based on assuming that win-loss records and point differentials are governed by normal distributions.

If you're interested in the full calculation you can get all the details from the links in his article, but the gist is fairly simple. You start with the raw win-loss records for each team, then adjust them to account for the fact that 16 games isn't enough to properly sample a team's actual talent level. The next step is to convert the adjusted winning percentages into an estimate of each team's likelihood of winning the game. Finally, this win expectancy can be converted into a point value, where broadly a higher likelihood translates into more points.

The result is a prediction of how many points the team with the better winning percentage would need to start off with in order to win with the same probability that they are truly the better team – for instance, a team with a 65% probability of being the better squad would be spotted enough points so that statistically they should have a 65% chance of winning the game.
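The pipeline above can be sketched in a few lines of Python. To be clear, this is an illustrative sketch rather than Paine's actual numbers: the amount of regression toward .500, the use of the Log5 formula for the head-to-head step, and the ~13.9-point standard deviation for scoring margins are all assumed values I've plugged in, not figures taken from his article.

```python
from statistics import NormalDist

def adjusted_win_pct(wins, games, regression_games=11.0):
    """Regress a raw record toward .500 by adding 'phantom' .500 games.
    The amount of regression (11 games) is an assumed value."""
    return (wins + 0.5 * regression_games) / (games + regression_games)

def win_expectancy(pct_a, pct_b):
    """Log5 estimate of the chance team A beats team B, used here as a
    stand-in for the head-to-head step of the method."""
    return (pct_a * (1 - pct_b)) / (pct_a * (1 - pct_b) + pct_b * (1 - pct_a))

def expectancy_to_points(p, margin_sigma=13.9):
    """Convert a win probability into a point handicap by inverting a
    normal CDF for the game's scoring margin (sigma ~13.9 points is an
    assumed league-wide value)."""
    return NormalDist(0.0, margin_sigma).inv_cdf(p)

# Under these assumptions, a 65% favorite gets a head start of a few points:
handicap = expectancy_to_points(0.65)
```

A 50% favorite correctly gets zero points, and the handicap grows as the win expectancy climbs toward 1.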

Paine's method relies on several assumptions which, while not unreasonable, have not all been well tested. The normality of winning percentages and point spreads has been checked fairly thoroughly, but his analysis is also predicated on the premise that the relative skill of the two teams has no bearing on the outcome of any individual game; in other words, that each game is effectively a coin flip. That's quite a claim, and one that I think is worth testing.

Data

I grabbed game records from 2002-2011 from my copy of the Armchair Analysis database, then computed running winning percentages for all teams over the course of each season. Using Paine's methodology I computed the pregame win expectancy for the team with the better record. I then converted these percentages into points, using the same values for home-field advantage and scoring variance as Paine does.

Before getting into the real meat of the data, I first wanted to test another assumption Paine makes, which is that the variance of team win-loss records is equivalent to that from repeated coin flips. Since NFL scheduling is definitely not random I was skeptical about this assumption, but a quick Monte Carlo analysis of randomly generated standings indicated that the details of how match-ups are determined is not strongly biasing these statistics.
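A minimal sketch of that kind of Monte Carlo looks like the following. Note the random weekly pairings here are my simplification standing in for a real NFL schedule.

```python
import random
import statistics

def season_win_spread(n_teams=32, n_weeks=16, n_seasons=2000, seed=1):
    """Monte Carlo: each week, randomly pair off the teams and decide each
    game with a fair coin flip. Returns the average standard deviation of
    season win totals across the simulated seasons."""
    rng = random.Random(seed)
    spreads = []
    for _ in range(n_seasons):
        wins = [0] * n_teams
        for _ in range(n_weeks):
            order = list(range(n_teams))
            rng.shuffle(order)
            for a, b in zip(order[::2], order[1::2]):
                winner = a if rng.random() < 0.5 else b
                wins[winner] += 1
        spreads.append(statistics.pstdev(wins))
    return statistics.mean(spreads)
```

For 16 independent fair coin flips the binomial prediction is sqrt(16 * 0.5 * 0.5) = 2 wins, and the simulated spread lands close to that, consistent with the scheduling details not strongly biasing the statistics.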

There's no point in having data from the first week of any season, so I remove all of those games from my data set. (In fact I ignore games from the first three weeks to minimize statistical fluctuations.) To remove the possibility of bias from teams which have already qualified for the playoffs but are resting their starters, I also exclude data from weeks 16 and 17. I do, however, include playoff games in my analysis.
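The filter above boils down to a one-line predicate. The week-numbering convention here (regular-season weeks 1-17, with a separate playoff flag) is my assumption about the schema, not necessarily how the database stores it.

```python
def usable_game(week, is_playoff):
    """Keep playoff games; for the regular season, drop weeks 1-3 (records
    are still too noisy) and weeks 16-17 (starters may be resting)."""
    return is_playoff or 4 <= week <= 15
```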

Now that I have the win expectancy and how many extra points that translates into, I can compare how often teams actually win to how often they should (assuming we've accurately measured a team's true skill, of course). To help preserve signal in the data I binned the games by win expectancy, with roughly equal numbers of games in each bin.
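Equal-count binning is just quantile binning, which the standard library can do directly. This is a minimal sketch; the bin count you'd actually use depends on the sample size.

```python
from statistics import quantiles

def equal_count_edges(values, n_bins):
    """Interior cut points that split `values` into n_bins bins holding
    roughly equal numbers of games each."""
    return quantiles(values, n=n_bins, method='inclusive')

def bin_index(value, edges):
    """Index of the bin that `value` falls into."""
    return sum(value > e for e in edges)
```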

First, to provide a point of comparison, in Figure 1 I plot the winning percentage for each bin assuming that win expectancy translates directly into winning percentage. These points are what we should be shooting for with any handicapping model, as they represent the pure translation between win expectancy and actual winning percentage.

Figure 2: Same as Figure 1, but now with the raw winning percentages of the games in our sample shown as the blue histogram. The errors come from counting statistics.
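Those counting-statistics errors are just binomial standard errors on each bin's winning percentage; as a quick sketch:

```python
from math import sqrt

def observed_rate(wins, games):
    """Winning percentage for a bin along with its binomial
    (counting-statistics) error bar: sigma = sqrt(p * (1 - p) / n)."""
    p = wins / games
    return p, sqrt(p * (1 - p) / games)
```

So a bin with 50 wins in 100 games comes out to 50% with a 5-point error bar, and the error shrinks as the bins collect more games.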

Next, let's look solely at how raw winning percentage trends with win expectancy, which is shown in Figure 2. You can see that while there is a positive correlation between the win expectancy of the statistically better team and their likelihood of actually winning, it's not enough to produce a one-to-one relationship. Finally, in Figure 3 I've added the winning percentage that would result if the "better" team was spotted the number of points given by the model.

Figure 3: Same as Figure 1, but now showing what would happen if teams were given the handicap suggested by Paine as the red histogram.

Discussion and Conclusions

It's pretty clear from Figure 3 that Paine's model, while better than nothing at producing game results in line with the pregame win expectancy, will actually give the team with the better win-loss record a slight additional advantage. Considering the data show that individual games aren't pure coin flips, one of the model's stated assumptions, this result is unsurprising.

That's not to say that Paine's scheme has some sort of fundamental problem. It's clear that individual games are not fully representative of which team is better overall, and there's nothing wrong with the idea of using statistical game results to construct a correction (again, ignoring the issue of whether or not to actually apply such a correction). Games aren't truly random events (another example of reality getting in the way of beautiful, pure statistics), and so any putative correction must take this into account.
