PhD Football

PSA: Moving to a new location

2021-01-24T08:41:00.001-06:00

Hey folks! I'm starting to do these kinds of analyses again, but any future work will be published at my new blog, https://skeweddata.github.io/. Why? Two main reasons:

I find myself interested in writing about more than just football analytics, and this site was really designed just for that one purpose.
I wanted to have more control over the site itself. Blogger is great for quickly generating prose content, but for more complex things (fancy formatting, javascript) it's not so wonderful.

So take a look. You can filter to just football content if you want, and I just put up a polemic about the NFL's "Next Gen Stats" initiative. Hopefully soon that will be joined with more interesting analysis and discussion. Happy reading!

NFLDash: A responsive dashboard for play-by-play data

2016-12-13T20:03:00.000-06:00

Between the regular season and the playoffs, there are about 44 thousand plays run in the NFL each season. That's an enormous number, with so much happening not only in each play itself but also in the broader game state. There are lots of cool insights in this data, but you could spend your entire life slicing it in different ways looking for them - especially if you need to do separate analyses for each line of inquiry.

So I built an interactive dashboard to parse just about every play run in the NFL since 2009 (excluding the preseason, because seriously, who cares?). That's over 300,000 plays, with detailed information on each from nfldb, and even win probability estimates through NFLWin. Oh, and I put it online for everyone to use. Feel free to skip to the bottom if you want to go check it out right away, but since it's a pretty big sandbox here are a couple of ideas:

Find the times when runs on 3rd and long resulted in first downs:

Remind yourself of some of Tom Brady's greatest moments:

Get a list of all the worst safeties quarterbacks have taken:

The site's here - go nuts! If you need more inspiration, there are more ideas (as well as details about how it all works) in the about page.

Introducing NFLWin: An Open Source Implementation of NFL Win Probability

2016-09-01T20:30:00.000-05:00

tl;dr: I made a Python package to compute NFL Win Probability - given a specific game state, what are the odds the offensive team will go on to win the game? Code on GitHub, documentation on Read the Docs, or just 'pip install nflwin'.

One of the most common advanced statistics used by NFL analysts is Win Probability. Put simply, Win Probability (WP for short) is an estimate of the likelihood that, given a specific game state, one team will go on to win the game. For example, at the very start of a game between evenly matched opponents each team's WP will be very close to 50%, while a team up by 20 points with a minute left to go will have a WP of essentially 100%. Down, distance, field position, and other variables can also be added to the model in order to produce an extremely granular WP estimate.

While WP alone is a useful tool for condensing the myriad variables surrounding the game state into a single, easily interpretable number, it becomes even more useful when compared across plays. The difference in WP between two plays (also known as Win Probability Added, or WPA) provides a way of measuring how effective a given play was at helping your team win. Instead of grading a running back's performance based on rushing yards or yards-per-attempt, for instance, summing the WPA from each rushing attempt automatically produces a statistic which gives more importance to a 2 yard rush on a critical fourth-and-one than for a 7 yard draw play on third-and-18.
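To make the WPA bookkeeping concrete, here is a minimal sketch in Python. The play records, the `wp_before` and `wp_after` field names, and all of the numbers are invented for illustration; they don't come from NFLWin or any real model.

```python
from collections import defaultdict

# Hypothetical plays: the offense's WP before and after each rushing attempt.
plays = [
    {"rusher": "RB A", "yards": 2, "wp_before": 0.48, "wp_after": 0.55},  # converts a 4th-and-1
    {"rusher": "RB B", "yards": 7, "wp_before": 0.30, "wp_after": 0.28},  # draw on 3rd-and-18
    {"rusher": "RB A", "yards": 4, "wp_before": 0.55, "wp_after": 0.56},
]

def total_wpa(plays):
    """Sum WPA (WP after the play minus WP before it) for each rusher."""
    totals = defaultdict(float)
    for play in plays:
        totals[play["rusher"]] += play["wp_after"] - play["wp_before"]
    return dict(totals)

wpa = total_wpa(plays)
```

In this made-up sample the 2 yard conversion is worth far more than the 7 yard draw, even though the draw gained more yards.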

Despite its easy interpretability, which is relatively rare in the world of advanced statistics, WP is not a straightforward calculation like yards-per-rush or even QB rating. WP isn't based on a simple formula; rather it requires one to build a detailed model based on historical data. This model can be quite complex, both in terms of the specific data used to construct it and in the choice of model itself. As a result, computing WP from scratch is not feasible for a large number of would-be analysts. That's why I built NFLWin.

NFLWin is a Python package designed to make estimating WP robust yet simple. It provides a simple interface for pipelining raw data through all the steps necessary to compute WP along with great documentation that covers installation and use. The code is fully open-source so anyone can inspect its guts or modify it to suit their purposes, and while it includes a WP model to make it easy for anyone to get going right away, NFLWin also includes utilities and instructions for rolling your own model if you so choose.

NFLWin is far from the first effort to compute Win Probabilities for NFL plays. Brian Burke at Advanced NFL Analytics was one of the first to popularize WP in recent years, writing about the theory behind it as well as providing real-time WP charts for games. Others have picked up on this technique: Pro Football Reference (PFR) has their own model as well as an interactive WP calculator, and the technique is offered by multiple analytics startups.

So why create NFLWin? Well, to put it bluntly, while there are many other analysts using WP, they're not publishing their methodologies and algorithms or quantifying the quality of their results. This information is critical in order to allow others both to use WP themselves and to validate the correctness of the models. Brian Burke has never discussed any of the details of his WP model in any depth (and now that he's at ESPN, that situation is unlikely to improve any time soon), and analytics startups are (unsurprisingly) treating their models as trade secrets. PFR goes into more detail about their model, but it relies on an Estimated Points model that is not explained in sufficient detail to reproduce it.

Possibly the best description of a WP model comes from Dennis Lock and Dan Nettleton, who wrote an academic paper outlining their approach and results. Lock and Nettleton's paper provides information regarding the data source used to train the model, the type of model used, the software used to build the model, and some statistics indicating the quality of the model. It even includes a qualitative comparison with Brian Burke's WP estimates. This is far and away the most complete, transparent accounting of the guts of a WP model and is laudable. However, as often happens in academia, none of the code used to build and test their WP model is available for others to use; while in principle it would be possible for anyone to recreate their model to build on or validate their work, this would require building their entire pipeline from scratch based off of dense academic prose.

"But Andrew", you may say, "What about the PFR online WP calculator you mentioned only two paragraphs ago? Surely we can just use that instead of having to create our own." Well, unfortunately there are two main problems with that approach:

If you ever want to programmatically compute WP you'll need to write a web-scraping algorithm to do so. The end result will require the user to be online, and, like most web-scraping, be fairly brittle - if PFR changes their website your scraper has a good chance of breaking. Not optimal.
There is something obviously wrong with the PFR calculator. Go to the calculator page and ask it to tell you the WP for a tie game with zero point spread, with 5:01 to go in the 4th quarter and the offense at first-and-goal from the 5. You'll see that their model gives the offense a 50% chance of winning the game. Now compute the WP for the exact same situation but with 5 minutes left to play - one fewer second than before. Suddenly the WP prediction has jumped to 76.69%, an increase of over 25 percentage points just from having one fewer second on the clock!

While the first issue is unpleasant, the second is a huge problem. I don't know whether it's a buggy implementation or a bad underlying model, but this discontinuity makes no sense. If PFR posted its algorithms publicly it would be possible to diagnose the problem. If their code were on GitHub I could even patch it and contribute the fix back.

This lack of transparency is endemic in the field of sports analytics. By not publishing their methodologies and the code behind them, analysts are failing the reproducibility test as well as the readers who trust them to provide honest and unbiased stats. I get that controlling access to these algorithms can represent a competitive advantage, but frankly it's impossible to trust any analysis when there's no way to assess its accuracy or even verify that it's not flat-out wrong. How correct is Brian Burke's model for a given game state? Is the PFR model buggy just in this one case or is it pathologically incorrect? There's no way to tell.

NFLWin doesn't have that problem. Anyone can inspect the code to look for bugs, and accuracy measurements are built into the model. To be completely honest the default model in this initial release isn't particularly good - plotting the expected WP based on an aggregated validation set against that predicted by the model shows clear deviations from perfection (see below) - but if you want to use it you can see exactly how much you should trust the model, and it's now possible to quantify improvements made as time goes on and the model is iterated upon.

The default model in Version 1.0.0. Note the deviations from perfect predicted WP.
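The kind of check behind that plot can be sketched as follows: bin plays by predicted WP, then compare each bin's mean prediction with the fraction of plays in it where the offense actually went on to win. This is a generic calibration check in plain Python with invented inputs, not NFLWin's own validation code, and the function name is hypothetical.

```python
def calibration_table(predicted_wp, offense_won, n_bins=10):
    """Return (mean predicted WP, observed win fraction, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for wp, won in zip(predicted_wp, offense_won):
        # Clamp the index so that a prediction of exactly 1.0 lands in the last bin.
        index = min(int(wp * n_bins), n_bins - 1)
        bins[index].append((wp, won))
    table = []
    for members in bins:
        if not members:
            continue
        mean_predicted = sum(wp for wp, _ in members) / len(members)
        observed = sum(won for _, won in members) / len(members)
        table.append((mean_predicted, observed, len(members)))
    return table

# Invented example: four plays that fall into two of five WP bins.
predicted = [0.12, 0.18, 0.81, 0.89]
outcomes = [0, 1, 1, 1]  # 1 if the offense won the game, 0 otherwise
table = calibration_table(predicted, outcomes, n_bins=5)
```

For a perfectly calibrated model the first two entries of each row would agree, up to sampling noise.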

The OSS community has shown time and time again the value to be gained from open development - not only is there direct benefit to the public but having more eyes on the project leads to better code. By creating NFLWin I hope to not only empower others to produce robust, reliable WP estimates but also to use the knowledge of others to build a better tool than I could construct on my own.

So check NFLWin out. Read through the documentation. Install it and play around. Post an issue if something is missing or wrong. And, of course, contributions are welcome :).

Isolating Player Movement by Eliminating Camera Motion: An Ongoing Project

2014-09-03T11:51:00.000-05:00

Note: This post departs from the general format used on this blog. That's because the post is not about a specific analysis I've done but rather a demonstration of a tool I've been developing to make more detailed studies possible.

For those not counting at home, this is the 18th post on this blog. In the last 17 entries I've investigated a wide variety of topics, from whether or not it's a good idea to start rookie QBs, to home field advantage's dependence on time zone shifts, to crowd noise's impact on penalties. All of these analyses were performed using the standard statistics that the NFL keeps, chiefly play-by-play data from Armchair Analysis. These stats are very rich in content, and so it's not too surprising that I (among many, many others) have been so successful in exploiting them.

At this point, however, I believe that the returns from this kind of data are rapidly diminishing, and fewer and fewer cool new results will be forthcoming. The community has explored a significant fraction of the power of the existing data; most of the really interesting questions that can be satisfyingly answered with play-by-play data have already been addressed. That's not to say this resource is fully exhausted, but I strongly believe that smart people have been using these statistics for long enough now that new work will be increasingly difficult to find and perform and will require even greater care to ensure its validity.

That doesn't mean that there's nothing more to be done with advanced football analytics. In fact, I would argue that everything done up to now has only scratched the surface of what could be possible with better data.

I'm talking, of course, about player tracking systems.

It's taken many years of innovation, but technology has finally progressed enough to allow the positions of athletes to be monitored with high accuracy during a game or match. This capability opens up an entire new world of possibilities for improving our understanding of sports as it eliminates the reliance on simple statistics that can be easily tracked by humans. (Of course, if you like looking at the old statistics because they're easy to understand, don't worry: those will stick around, at least for the foreseeable future. In fact, they'll likely be improved, as player tracking systems can be used to automatically reconstruct all of the common statistics and thus eliminate transcription mistakes.)

While the NFL has generally taken a rather dim view on modern technology, in the last couple of years things have started to change – slowly. Here's an article from the league's website breathlessly describing how teams have finally started keeping their playbooks on computers – from just two years ago. This year the NFL has finally relaxed its rules prohibiting new tech on the sidelines, only to force teams to use old Surface tablets which have been modified to prohibit anything except viewing photos of plays from the prior drive.

Given this spotty track record, I was pleasantly surprised (stunned, really) to read the news that the NFL is jumping on-board the player tracking train in a big way, outfitting 17 stadiums with the technology to read RFID chips mounted in the shoulder pads of the players. (For a less pleased – but quite amusing – take on the announcement, see here.) Once statisticians and commentators get used to the new data at their disposal I'm sure we'll see a bevy of interesting new stats during games and in write-ups.

I want to be clear that I see the NFL's unexpected embrace of player tracking as an undeniable good: obviously the stats viewers and fans are provided will only get better once the system is up and running. However, given the limited nature of the stats the league provides on their website I would be shocked if the raw data from this system was provided to fans. I imagine that instead we'll be drip-fed highly aggregated statistics of minimal use for deep analysis, similar to the exceedingly limited results provided to the public by the NBA from their similar system.

I'm not sure why major sports leagues are so stingy with their statistics. It's possible they just don't see the demand to justify adding that capability to their websites, although the cynic in me thinks they're worried that with free access to the data the general public will start putting their own analysts and pundits to shame.

Regardless of why this information tends to be rationed, it's pretty clear (to me, at least) that despite the NFL's new commitment to this technology there is still value in developing an open system for player tracking. It's also clear that the broadcast footage is woefully inadequate for this task, as it's far too zoomed-in to see what players are doing away from the ball.

Fortunately, despite some ridiculous objections, the NFL has recently decided to make All-22 game footage available to the public. For those of you who don't follow my hyperlinks, the All-22 footage refers to two very wide camera angles, designed to show all 22 (hence the name) players on the field rather than focusing on the football. One camera shoots from the sideline at midfield while the other one films from one of the endzones, both from a very high vantage point. If you want to do any kind of player tracking based on game video, this is the footage you want to use.

Thanks to the excellent nflvid module, I was able to download the All-22 footage for the 2013 Jets-Falcons game to use as a guinea pig. My first discovery was that, despite the name, the All-22 film does not show every player for the duration of each play. Rather, the All-22 cameras function much like the regular cameras you see during the broadcast, except from a wider angle. For running plays, it's not a terrible issue, as the play is fairly compact and action away from the ball is of marginal importance:

But for passing plays, especially longer ones, the camera tends to lose the other receivers (as well as the linemen and QB) after the catch, when it zooms in to follow the ball:

(Note that the quality in the actual video is significantly better than in these GIFs, as I didn't want this post to take hours to load so I compressed them by a fair amount.)

A proper system would cover the whole field at once, allowing the viewer to watch every player for the full duration of each play. Unfortunately this is clearly not the case for the All-22 footage (the endzone view is actually worse in this regard, for reference), which makes identifying and following players between video frames much more challenging.

So, rather than being able to use the raw All-22 footage for player tracking, first we need to remove the camera motion. This is necessary both to get absolute position shifts for players and to keep track of which player is which between frames. The gist of the technique is simple: find easily identifiable regions in each frame, and then compare their locations between frames to compute how the camera has moved (the homography, if you want to sound smart). Once you know the motion of the camera, you can correct the frame for it, effectively removing all camera motion.
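As a rough illustration of the homography step, independent of any particular implementation, a homography can be fit to already-matched points with the direct linear transform. The sketch below uses NumPy and made-up coordinates; it assumes the matching has been done and that coordinates are scaled to be of order one, which keeps the linear system well conditioned. The function names are hypothetical.

```python
import numpy as np

def fit_homography(src, dst):
    """Fit a 3x3 homography H with dst ~ H @ src via the direct linear transform.

    src, dst: arrays of shape (N, 2) holding N >= 4 matched points.
    """
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]

def apply_homography(h, points):
    """Map (N, 2) points through h, dividing out the projective scale."""
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    mapped = homogeneous @ h.T
    return mapped[:, :2] / mapped[:, 2:3]

# Made-up example: the camera pans, shifting every field marker by (0.12, -0.03).
frame_a = np.array([[1.0, 2.0], [4.0, 2.1], [3.9, 5.0], [1.1, 4.8], [2.0, 3.0]])
frame_b = frame_a + np.array([0.12, -0.03])

h = fit_homography(frame_a, frame_b)
# Mapping frame_b through the inverse homography undoes the camera motion.
stabilized = apply_homography(np.linalg.inv(h), frame_b)
```

Real footage has noisy and occasionally wrong matches, so a practical version would typically add a robust estimator such as RANSAC on top of this fit.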

Football is actually really well-suited for this process, as there are so many lines and markers painted on the field (soccer, for instance, would be significantly more challenging). The program I wrote to do this is relatively straightforward, and can be found on GitHub. You can compare the original footage with the motion-corrected video for a run:

and a pass:

I personally just prefer watching with the camera stationary – it's much more like actually being at the stadium, where your attention isn't controlled by the camera operator. Of course, it's not perfect; the NFL shield now moves, and you can tell from the pass play that the re-projection isn't exactly accurate on a few of the frames. The program also tends to have trouble when the ball is near the sidelines as well as at the end of most plays, when the camera has zoomed in really far. It isn't very fast (it takes about 10 minutes or so to process most plays) and can require several gigabytes of RAM since it currently stores all the motion-removed frames in memory as it processes the play.

Overall, however, it generally works fairly well, especially for a first attempt. From this point you can begin testing player identification and tracking algorithms for a variety of different plays, while continuing to iterate on the robustness of the camera motion program. Everything is open source, so if you're interested in contributing, or just want to try out the code for yourself, feel free to grab it and go nuts!

Classifying WRs, TEs, and RBs by Where They Catch the Ball

2014-03-23T20:54:00.000-05:00

Abstract
Principal Component Analysis (PCA) is a useful tool to simplify complex datasets. The results of the PCA can be then used either to reconstruct the original data or to classify it into different groups. In this post I apply PCA to reception data for a sample of 150+ NFL receivers. I find that PCA generally does a good job of discriminating between wide receivers, tight ends, and running backs. A few tight ends, however – generally ones known more for their use as receivers than blockers – have significant overlap with the wide receivers. This result indicates that PCA may be useful for determining how to designate players, Jimmy Graham for example, for the franchise tag.

Introduction

Alright, so I lied. Well, partially – I am very busy with job applications, but I've also been teaching myself some new machine learning techniques (mostly from this excellent textbook) and they're just so damn cool that it's been hard not to think of ways to apply them to NFL data.

One of these methods is called Principal Component Analysis (PCA for short), and it's designed to reduce a large, complex dataset down into its most important pieces. These pieces (the 'component' part of PCA) can be used as basis functions to reconstruct the original data with minimal information loss, providing a form of data compression. Or, the coefficients for a given component can be compared between all the observations in a dataset, and trends in these coefficients may be used to classify the data into groups.

One of the great things about PCA is that it relies on no assumptions about how the data are distributed. This means that PCA can be used on just about anything. Something that is especially appropriate for a PCA is the distribution of yardage gained by a player every time they touch the ball. Credit where credit is due: the analysis in this post is partly inspired by Brian Burke of Advanced NFL Stats, who looked at the distribution of yards gained (or lost) on rush attempts in an effort to distinguish between power running backs and smaller, faster RBs. Burke (largely visually) compared the raw yardage histograms, and found that there were only small differences between each type of back. Burke suggests using a gamma distribution to parameterize these gains, although given the distinct rush distribution for each player he shows it seems unlikely that every running back will be well-represented by such a (relatively) simple model (to his credit Burke himself is quite upfront about this). PCA allows us to produce accurate representations of such data without choosing a distribution a priori, which means we don't have to worry about limiting or biasing ourselves by such a decision.

For this post I'll apply PCA to reception statistics rather than rush attempts. One reason for my choice is that Burke's analysis (despite the limitations I mentioned earlier) is pretty thorough, and I prefer to break new ground when I can. The other (more interesting) reason is that while most rush attempts come from a single position group (RBs), the target for a pass attempt can be a WR, TE, or RB. So in addition to looking for differences between possession receivers and home run threats it's also possible to see how the different positions are utilized on passing plays.

Data and Model

I queried my copy of the Armchair Analysis database (which spans the 2000-2011 seasons) and grabbed the yardage gained from every reception for each player with 200+ catches in the database. (I impose this reception threshold to ensure that statistical noise doesn't dominate the data.) The final sample consists of 114 wide receivers, 37 tight ends, and 33 running backs. The reception distribution of the total dataset is shown in Figure 1.

Figure 1: Distribution of all receptions in the sample. It has a strong peak at a gain of around 7-10 yards, with a long tail showing big passing plays.

I next computed the reception distribution for each player, then ran the PCA. The details of exactly how PCA works are beyond the scope of this blog, but I'll give a brief overview of the method here so that at least the general concept is (hopefully) clear.

First off, each player's reception distribution is normalized so receivers with more catches don't bias the analysis, and the mean yardage distribution for the whole dataset is subtracted. From this point the algorithm gets to work, computing a function which minimizes the variation in the residual data. This process is repeated, and each successive iteration accounts for more and more of the fine details of the dataset. Eventually (when the number of iterations approaches the number of players in the sample) the PCA will perfectly reproduce the original data. Of course, that sort of exact duplication isn't the point of PCA; rather since the most variation is explained by the first components, the goal is to truncate the algorithm after only N iterations, where N is much smaller than the number of players in the dataset.
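A minimal sketch of these steps with NumPy follows. It uses the singular value decomposition, one standard way to compute principal components, on randomly generated stand-in histograms; it is not the script used for this post, and the array shapes and the choice of N = 3 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 50 "players", each with catch counts in 40 yardage bins.
counts = rng.poisson(lam=5.0, size=(50, 40)).astype(float)

# Normalize each player's histogram to sum to 1 so catch totals don't matter.
distributions = counts / counts.sum(axis=1, keepdims=True)

# Subtract the mean distribution of the whole sample.
mean_distribution = distributions.mean(axis=0)
residuals = distributions - mean_distribution

# Rows of vt are the principal components, ordered by variance explained.
_, singular_values, vt = np.linalg.svd(residuals, full_matrices=False)

n_components = 3  # truncate after N components
components = vt[:n_components]

# A player's coefficients are the projections of their residual onto the components.
coefficients = residuals @ components.T

# Approximate reconstruction from the mean plus the first N components.
reconstruction = mean_distribution + coefficients @ components

# Fraction of the total variance captured by the first N components.
explained = (singular_values[:n_components] ** 2).sum() / (singular_values ** 2).sum()
```

Keeping every component instead of truncating reproduces the input exactly, which is the "perfect reproduction" limit described above.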

The script I wrote to do this analysis can be found here. It's a fairly long program, but a large chunk of it is just to make the diagnostic plots to show how well the PCA worked – the meat of the PCA happens between lines 107-115.

Results

Figure 2: The first four components of the PCA as a function of reception yards. The first component is the average of all the players, while subsequent components have been computed by the PCA algorithm. Components beyond the second and third are very jagged, signs that they are fitting individual player variation rather than useful information.

I ran the PCA on the reception data to N = 15, but a look at the first four components (Figure 2) indicates that after the first couple of iterations the PCA is mostly fitting differences in the reception distributions for individual players. I can (hopefully) prove this to you via Figure 3.

Figure 3: Sample PCA reconstruction for Anquan Boldin, showing both the original reception distribution (in black) and a reconstruction using the first three PCA components (red). The reconstruction generally does a good job of mimicking the data even with only three components.

In addition to providing the maximal reduction in variance, the PCA also provides a list of coefficients for each component. These coefficients can be used with the components to produce a reconstruction of the original data – in the case of Figure 3, for Anquan Boldin. You can see that just the first three PCA components are required to recover a fairly good representation of Boldin's catch distribution – consistent with what the shape of the components indicated in Figure 2.

Now that we have verified that the PCA is working as intended, we can get to the good stuff – using the PCA to differentiate between players. As I mentioned earlier the data contain WRs, TEs, and RBs. A plot of the coefficients of the first PCA component (PCA1), color-coded by player position, is shown in Figure 4.

Figure 4: The distribution of the first PCA coefficient. Note how WRs are cleanly separated from RBs, while TEs partially overlap with WRs.

This figure is quite striking – running backs all cluster with (relatively) large coefficients, while nearly every wide receiver has negative values for PCA1. Tight ends tend to fall in the middle, although there is substantial overlap with the wideouts. What this means is that there is something inherently different about where each position grouping tends to catch the ball (and by extension, what routes they run). This is not inherently surprising, given it's fairly easy to see this just by watching how players at the different positions move during a game.

Discussion and Conclusions

What is interesting, however, is the fact that tight ends and wide receivers aren't as cleanly separated from each other as they are from running backs. In fact, while TEs and WRs are clearly not drawn from the same distribution, there is definitely some overlap. This implies that some TEs are being used more like wideouts. Additional evidence for this hypothesis comes from looking at which tight ends are most and least 'wide receiver-like'. Table 1 lists the top and bottom five TEs, sorted by PCA1.

Table 1: Tight Ends with Extreme PCA1 Values
Most WR-like		Least WR-like
Name	PCA1	Name	PCA1
Owen Daniels	-3.8x10^-2	Steve Heiden	6.8x10^-2
Antonio Gates	-2.9x10^-2	Donald Lee	5.2x10^-2
Tony Gonzalez	-2.4x10^-2	Bubba Franks	4.6x10^-2
Marcedes Lewis	-1.4x10^-2	Eric Johnson	3.9x10^-2
Tony Scheffler	-7.8x10^-3	Freddie Jones	3.7x10^-2

The left side of Table 1 generally contains TEs, most notably Antonio Gates and Tony Gonzalez, who are able pass-catchers. The right-hand side, however, consists of players generally not known for their receiving ability. It seems prudent to reiterate here that I'm not claiming that PCA1 is a predictor of skill in any way; rather it merely indicates that some tight ends are being used more like wide receivers than others.

Aside from being a cool result on its own, it also provides a way to classify players based on a statistic that's directly comparable between positions. This is especially relevant right now, as New Orleans Saints TE Jimmy Graham attempts to be treated as a wideout for the purposes of contract negotiation. You can read up on the details for yourself, but the upshot is that if Graham can get himself classified as a WR he can earn himself an extra $5 million over what he would get as a TE. A lot of the discussion has centered around statistics, such as where Graham lines up before the snap or how many receptions he had last year, that aren't directly comparable between wideouts and tight ends.

Unfortunately my data isn't current enough to actually include Graham in this sample (ditto Rob Gronkowski, just FYI), but I would bet that he winds up in the same regime as Gates and Gonzalez. Regardless of whether my intuition is correct, however, PCA provides a way to directly compare players at different positions based only on very basic data, and therefore it could be a very useful tool for position disputes like these.

Hiatus

2014-03-17T20:16:00.002-05:00

Hi everyone!

As some of you may know, I'm on the hunt for a new job – outside of academia. Unfortunately, sending out applications takes a lot of time, and despite my best efforts I can't keep doing football analytics and give the job search the attention it needs. So I'm putting PhD Football on hiatus while I figure out what I'm doing next. Once I get my work situation in order I'll get back to the blog – hopefully soon! (And if you happen to know of anyone hiring science PhDs I'd love to hear about it.)

First Down Probability

2014-03-03T22:29:00.000-06:00

Abstract
In this post I compute the First Down Probability metric, which predicts how likely a drive is to produce at least one more first down from a given down and distance. I find similar overall first down conversion rates to prior studies in the literature, including that third down rushing plays are significantly underutilized. Unlike previous studies, however, I break down these rushing plays by the position of the ballcarrier, and find that a significant portion of this discrepancy comes from rushes by the quarterback, likely from scrambles on broken passing plays. More puzzling is the fact that QB runs on first and second downs don't show this trend, a result that is difficult to convincingly explain.

Introduction

During the course of a football game a fan gets a lot of statistical information. These numbers – QB rating, a running back's average yards per carry, time of possession, etc. – generally lack any kind of contextual information about how the game is actually going. At best these statistics are incomplete (showing a WR's average yards per catch after a 99-yard completion, for instance); at worst, they're downright misleading (that QB just had 5 completions in a row... but they were all screens for minimal yardage).

A better statistic is one that takes the game situation into account. For instance, a 5-yard completion should count for more on third and 4 than on third and 16. There are several such statistics already in existence, such as Football Outsiders' DVOA metric or Expected Points. These sorts of metrics generally depend on using historical play-by-play data to compute average outcomes for plays at any given down and distance. This approach is (unsurprisingly) more computationally complex, and often can appear opaque to the casual fan. Some of these stats, such as DVOA, are intricate enough that their creators have decided to keep the full details of their computation private.

A direct and (relatively) simple context-sensitive statistic is Brian Burke's First Down Probability, which I will abbreviate as FDP. That link has more details, but the core insight of this metric is that the average odds of converting the next first down in a series can be estimated for any given down and distance. With this information in hand, it's possible to evaluate the result of a play based on whether it improves or harms the offense's chance of eventually getting a first down.

In this post I'm going to compute the FDP for the plays in the Armchair Analysis database. One may ask why I would recompute this quantity when Burke has already done quite a good job of it. One reason is to ensure the reproducibility of results – while I trust Burke's analysis, everyone makes mistakes. A more basic reason is that while Burke produces a nice visualization of his computed FDP he doesn't provide his data in a tabular form, which makes using his FDP values difficult (at best). I can also extend the FDP calculation to all four downs (Burke only considers second and third downs in his post). Finally, I can (spoiler alert) start using FDP to generate new insights about how teams approach different down-and-distance situations.

Data

As I mentioned before, I'm using the Armchair Analysis database, which covers the 2000-2011 NFL seasons. I grabbed the play-by-play data for all regular season and playoff games, then filtered out plays for several reasons. Plays inside the two minute warnings were discarded because teams play differently in those situations; I removed plays when the game wasn't close (defined as one team being up by more than 16 points) for the same reason. I cut out all punts and field goals as well as penalties (although I keep the results of the penalties in the data: if a team runs for -5 yards on second down but then is the beneficiary of a 15-yard roughing the passer call on third down, the second down play would be considered as ultimately resulting in a first down for the purposes of this analysis). Finally, to avoid biasing the data based on field position I only include plays between the offense's own 10-yard line and the redzone.

Ultimately this results in a dataset of 262,601 plays, split 56%-44% in favor of passes over runs. I bin these plays as a function of current down and yards to go, eliminating bins with fewer than 200 plays in my dataset. This cut ensures that there are no bins with conversion rates dominated by sampling error. The Python script I used to do this data querying and processing (as well as produce the plots in later sections) can be found here.
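To make the binning step concrete, here is a minimal sketch of how the conversion rate in each down-and-distance bin could be computed. It is not the script linked above: the `(down, yards_to_go, converted)` record layout, the function name, and the toy data are all assumptions made for illustration. Only the 200-play cutoff comes from the text.

```python
from collections import defaultdict

MIN_PLAYS = 200  # bins with fewer plays are dropped, as described in the text


def first_down_probability(plays, min_plays=MIN_PLAYS):
    """Return {(down, yards_to_go): conversion rate} for well-populated bins.

    `plays` is an iterable of (down, yards_to_go, converted) tuples, where
    `converted` is True if the series eventually produced a first down.
    This record layout is hypothetical, not the Armchair Analysis schema.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [conversions, total plays]
    for down, yards_to_go, converted in plays:
        counts[(down, yards_to_go)][0] += int(converted)
        counts[(down, yards_to_go)][1] += 1
    return {
        key: conversions / total
        for key, (conversions, total) in counts.items()
        if total >= min_plays
    }


# Toy data: 300 plays on 3rd and 1 (210 converted), only 50 on 3rd and 15.
toy_plays = (
    [(3, 1, True)] * 210
    + [(3, 1, False)] * 90
    + [(3, 15, True)] * 10
    + [(3, 15, False)] * 40
)
fdp = first_down_probability(toy_plays)
```

In the toy data the 3rd-and-15 bin falls below the cutoff and is dropped, which is why the run and pass curves in the figures do not always cover the same range of yards to go.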

Results

Figure 1: FDP as a function of down and distance. The colors denote different downs, while the line styles break down success if the next play in the drive is a run or a pass. In some cases the data for the individual types of plays does not cover the same range of yards to gain. This is due to the minimum play cutoff detailed in the Data section.

Figure 1 shows the raw results, split by down and distance. For the benefit of anyone looking to check my results or to build on them I have also tabulated these results in text files, which can be obtained from my GitHub repository. Feel free to use them as long as you explain where you got them from (and a link back here would be nice as well!).

Anyway, the first thing to do is to check my results with what Burke obtained. It's a bit difficult to compare directly since I can only eyeball our plots, but in general my results seem to be fairly copacetic with his. The data between downs look fairly similar, with a ~15% shift each down as you go from first and N to second and N, increasing to ~20% from second to third down. There's not much data on fourth down, but I see no reason why it wouldn't resemble the other downs for conversion attempts beyond 2 yards.

More interesting is what happens when you break the conversion percentages down by type of play. Note that when comparing the FDP of runs versus passes at a given down and distance, a higher conversion rate for e.g. a pass doesn't necessarily mean you should always throw the ball in that situation; rather, it implies that currently NFL teams are not playing at the Nash equilibrium. This means that NFL teams should call more passing plays in that situation than they currently do; as defenses adjust to this new reality, there should be more opportunities for successful rushing plays, and eventually the FDP of both types of plays will equalize. Burke has some more detailed discussion of this in his breakdown of first down probability for runs and passes (although he restricts his analysis to third downs).

So again we are treading on old ground, and again it makes sense to compare results. Here we find a bit of a discrepancy, with Burke's rushing FDP on third and short ~5% lower than mine. It's not clear why this would be, although it might be due to the fact that Burke's data only goes through the 2007 season or how he considers sacks (the Armchair Analysis database considers sacks to just be really crappy passes). Regardless, things appear similar enough to proceed.

It's clear that teams aren't passing enough on first and second downs with more than 5 yards to go. Considering teams are already passing a lot in those situations, especially in second and 10+, this would imply that even the occasional rush in such circumstances is too much.

In short yardage, however, things are reversed. On second and 3 or less teams are running less often than they 'should', although the difference is only at about 7% or so. Third down is even more striking: whenever there are fewer than 9 yards to go the data indicate that teams should be running more. This is an even larger discrepancy than Burke finds, and is downright shocking given how unusual 7+ yard runs are under normal circumstances.

But there are two kinds of runs – designed runs and aborted passing plays. Burke considers the latter category to be rare enough to be inconsequential, but I wasn't so certain. So I modified my program to separate out rushes by the position of the ballcarrier – it can't tell if a QB rush was designed that way or if it was improvised, but it's better than nothing.

Figure 2: FDP, corrected for the influence of QB runs (the uncorrected rushing percentages are shown in gray to facilitate direct comparison).

Figure 2 shows the result, and it turns out that without the QB involved a third down rush becomes a much worse proposition. Indeed, now teams should only be running more on third and 3 or less, consistent with what the data show for second down.

While teams are generally doing better at finding the equilibrium between passes and rushes with RBs, these results indicate that teams are letting their signal-callers run the ball far too infrequently. The conversion rate for QB scrambles alone is generally 10% or more higher than that of a rush from a running back in the same situation! Even more interesting is that this offset only applies on third down. On first and second down a QB scramble appears to have a similar conversion rate to a regular rush.

Discussion and Conclusions

First of all, the fact that QB rushes are so underused compared to other types of plays is quite interesting. Given the fact that teams generally do not want their prize passers taking hits down the field, most of these successful conversions are likely due to scrambles on passing attempts. But given how high the conversion rate is perhaps coaches should consider running a few more QB draw plays, especially with all the mobile passers entering the league.

But what's really weird is that QB rushes aren't more successful than the regular variety on earlier downs. A possible explanation is that defenses are more keyed toward stopping shorter-yardage plays on second down, whereas on third down they sit back and follow the WRs down the field. But in that case you would expect third down rushes to be equally successful, regardless of the runner. I think it's more likely that on second down a QB under pressure isn't concerned with making the sticks, but rather simply looks to get out of trouble. On third down, however, the consequences of playing it safe are much more clear, which encourages passers to scramble for every last yard.

Of course, I'll be the first to admit that this is just speculation. A definitive analysis of this phenomenon would probably require deep analysis of individual quarterback scrambles, which is way beyond the scope of this work. But it is a cool result from a (relatively) simple metric, and illustrates how deep insights can be gleaned from just a little bit of intelligent digging.

What Positions Do Teams Value in the Draft?

2014-02-17T20:08:00.001-06:00

Abstract

Where players are taken in the NFL draft is based not only on their raw skill and potential, but is highly influenced by the perceived value of the position they play as well as the overall supply of players at that position. While quarterback is clearly the most important position on the team, investigating where players at other positions get drafted may provide insights into how NFL teams evaluate the relative importance of those positions. In this post I do a couple of simple analyses of where players get drafted, breaking the data down by position groupings and finding that there are slight variations in where players at different positions can expect to be taken, although more draft data and/or deeper analyses would be necessary to decisively show disparities between the positions.

Introduction

It's pretty clear that quarterback is the most important position on a football team, and their value is reflected in the draft – in the last decade 13 QBs have been taken in the top five picks. But exactly how much are QBs favored by GMs and coaches come May? And what about the perceived value of other NFL position groups?

Data

This one's fairly easy, as Armchair Analysis lists where each player was drafted as well as what position they play in the same table. I downsampled the data to include only players drafted since 2001 (up to 2011, the last year in the database), because the table contained only partial records before that year. You can find the script I used here.

There is also a limit to the granularity of the positions in the database. For instance, no distinction is made between any of the players on the offensive or defensive lines. This does limit how detailed I can make my analyses, and there may be significant differences in the valuations between positions on the O-line (for instance, a left tackle–protecting a passer's blind side–is likely to be more highly sought-after than a right tackle, although things are never that clear-cut).

Results

I first took a look at the raw data, plotting what fraction of picks go to each position grouping in Figure 1. To improve the signal-to-noise of the data (and just make things easier to visualize) I binned the data in groups of 10 picks.

Figure 1: Percentage of players drafted at each position, as a function of draft position. Kickers added for scale.

While colorful, the stacked nature of this figure makes it somewhat difficult to parse. Figure 2 shows where players of each position get drafted, independent of any other position groups in the sample. While there are still a bunch of overlapping lines, it's now easy to see if and where teams prefer to draft players at each position.

Figure 2: Where players in each position grouping get picked. Most of the positions have flat distributions. The only notable exception is QBs, whose distribution is highly peaked near the first few picks.

What stands out most is the large (and unsurprising) upturn in QB picks in the first bin. Otherwise, however, there don't appear to be any obvious trends. But the eye can deceive, so let's try to be just a bit more rigorous. Toward that end I computed the expectation value for each position. You can get more detail about the expectation value from Wikipedia, but in this case it's basically just the average place players at each position get drafted. You can't make a pretty graph with it, but you can see the results in Table 1.

Table 1: Draft position as a function of player position
Position	QB	RB	WR	TE	OL	DL	LB	DB	K
Expected Draft Position	102±8.1	126±5.1	117±4.4	131±5.8	122±4.0	114±3.9	118±4.2	119±3.2	164±9.1

Table 1 starts off by confirming what we saw visually in the figures – quarterbacks are by far the most sought-after position, with an expected draft position 12 picks higher than any other position (although the bootstrapped standard deviations are just consistent with defensive linemen being equally valued). Wide receivers are drafted slightly earlier than running backs, a trend that's been picking up steam in recent years as teams realize that RBs tend to have short careers and therefore don't provide as much value for an early pick. Linemen are the highest-drafted of all defensive position groupings, probably driven by teams' desire to press the point of attack – the general wisdom is that QBs have an advantage against the secondary due both to their skill and to rules restricting contact on WRs (although the Seattle Seahawks would beg to differ), and that generating pressure and sacks is the best way to defend the pass. More data would be required to definitively prove that these differences are real, however.
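For anyone who wants to reproduce numbers like those in Table 1, the following is a rough sketch of an expected draft position with a bootstrapped uncertainty. The function name, the number of resamples, and the toy pick list are assumptions for illustration; the actual script may differ in its details.

```python
import random
import statistics


def bootstrap_mean(picks, n_resamples=1000, seed=0):
    """Return (mean, bootstrap standard deviation of the mean) for `picks`."""
    rng = random.Random(seed)
    # Resample the picks with replacement and record the mean each time.
    resampled_means = [
        statistics.fmean(rng.choices(picks, k=len(picks)))
        for _ in range(n_resamples)
    ]
    return statistics.fmean(picks), statistics.stdev(resampled_means)


# Made-up draft slots for one position group, for illustration only.
toy_picks = [1, 3, 12, 35, 64, 88, 110, 150, 199, 240]
expected_pick, uncertainty = bootstrap_mean(toy_picks)
```

The spread of the resampled means is what plays the role of the ± values in the table: two positions whose intervals overlap cannot be confidently separated with this much data.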

Discussion and Conclusions

While it's unsurprising that QBs are the hottest position in the draft, it's nice to see it confirmed with numbers. More interesting are the expectation values for the other positions, which while much closer to each other could be used as signifiers of broad NFL trends in how talent is evaluated between positions. The analyses presented here are pretty simplistic, but they are indicative of the potential power of draft data.

Do Defenses Get Tired?

2014-02-03T20:42:00.001-06:00

One of the hallmarks of good science is reproducibility – the ability for other researchers to repeat (and thus verify or disprove) your work. While I hope I have laid out enough details in each of my posts for anyone interested to check my analyses, I am happy to report that I will now be uploading my code for each post to GitHub. Check it out!

Abstract

One of NFL announcers' favorite statistics is the time of possession, which is usually discussed in the context of how tired the defense must be when they've been on the field for a long time. But do defenses actually get fatigued over the course of the game? To answer that question I used the raw number of plays a defense is on the field (rather than the less accurate time of possession) and computed the probability that the offense will score as a function of this number. Ultimately, even after 70+ plays there is no increase in the offense's point production – a clear indication that defensive players have plenty of endurance to make it through even the longest games.

Introduction

A common statistic to see quoted during a game is time of possession (lazily referred to as ToP in the rest of this post). Usually referenced between quarters or near the end of the game, commentators generally talk about ToP in the context of noting how long one team's defense has spent on the field. (Offenses generally have more flexibility in keeping their players fresh through skill package substitutions.) The not-very-subtle implication is that the defense is getting worn down by the amount of time they've been playing and will therefore be more likely to allow points.

This is, of course, largely bullshit. Since so much more time is spent between running plays than actually ticks by while the football is in motion, ToP is really only a good indicator of how much standing around the teams are doing. Additionally, since the game clock stops for an incomplete pass (and pass-heavy offenses tend to pick up yards in chunks and have shorter drives as a result) ToP is naturally skewed towards favoring rushing offenses. If ToP was only collected during a play it might have some value, or better yet just strap some pedometers onto the players and figure out how much they're really running around on the field.

The idea at the core of ToP, however – that a defense spending more energy on the field may eventually show signs of fatigue and therefore allow more points – is not unreasonable. The current ToP statistic is just a terrible way of measuring it. This question is especially interesting because if defenses do get tired over the course of a game it would add more value to a strong rushing attack, a facet of the offense that has come under significant fire in recent years as being strictly inferior to the passing game.

While perhaps not quite as good as my earlier pedometer suggestion, the raw number of plays run should be a much better proxy than ToP for investigating whether defenses get tired. Comparing the results of drives as a function of the number of plays already run should therefore indicate whether defenses ever become fatigued enough to affect play.

Data

I started with all the play-by-play data in the Armchair Analysis database, and computed the beginning and end of each drive as well as whether any points were scored. By separating this data out between the home and away teams for each game I constructed a running tally of the number of plays run by the offense at the start of each drive.
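The running tally can be sketched in a few lines. This assumes the drive lengths for one team in one game are already available as an ordered list, which is a hypothetical input format rather than the database's actual layout.

```python
from itertools import accumulate


def plays_before_each_drive(drive_lengths):
    """Return the offense's cumulative play count at the start of each drive.

    `drive_lengths` lists the number of plays in each of one team's drives,
    in game order (a hypothetical input format).
    """
    # accumulate(..., initial=0) yields 0, then the running totals; the last
    # total is the count after the final drive, so it is dropped.
    return list(accumulate(drive_lengths, initial=0))[:-1]


tally = plays_before_each_drive([3, 8, 5, 12])
```

Each entry is the number of plays the opposing defense has already faced when that drive begins, which is the quantity the drive results are later compared against.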

Before getting into the results it's important to note that for this analysis the devil really is in the details. The data can be biased in many ways, some subtle and some not. First and perhaps most obvious is that while all games can be expected to start in a similar way, a drive in the 4th quarter of a blowout is going to look much different than one in a close game. To avoid this problem I restricted my sample only to games where the final tally is within one score (8 points). I also throw out special teams plays, as I am most focused on how the defense plays as a unit (although note that on most teams at least some special teams players will see snaps on the offense or defense).

Another issue is penalties. Most infractions are only called after the play is over, and even though (if the penalty is accepted) the original play doesn't count for statistical purposes I still want to count it for this analysis. Some penalties, however, result in the refs immediately blowing the play dead (the most notable examples of this being false starts and encroachment). These penalties I strip out from the final play-counts. Occasionally a penalty occurs after the play is over (e.g. many unsportsmanlike conduct calls). A dead-ball foul should be purged from the data; unfortunately (as far as I can tell) there is no indication in the database whether a penalty is a dead-ball infraction or not, so I choose to leave all of these penalties in my sample. Fortunately these types of penalties are relatively infrequent, and therefore shouldn't significantly affect the results.

Lastly, drives near the end of halves create significant additional bias as well, since many of them are kneel-downs or result in unusual play-calling (Hail Mary passes, record-setting field goals, etc). I cut out the result of any drive that starts within the 2-minute warning of either half, although I include the plays run on those drives in the running totals of plays run during the game.

It is also worth noting that occasionally there are errors in the database, where the down sequence counter I use to determine the length of each drive is not reset between possessions. This issue is most obvious in the existence of some unusually long (20+ play) drives, although it likely affects shorter drives as well. Generally the incidence of these errors is very low (there are only ~10 of these very long drives in the entire sample, for instance), so I do not believe they will bias the results – especially not for shorter drives, where the sheer number of actual drives should drown out the few erroneous ones.

Results

Before diving into the full analysis, I think it's interesting to look at some raw numbers about NFL drives that aren't usually discussed. Take a look at the distribution of drive lengths in Figure 1, and the distribution of drives per game in Figure 2. The plurality of drives take 3 plays, which makes sense as these are 3-and-out possessions. The occurrence of drives longer than ~6 plays is fairly well described by a power law (a straight line on this log-normal plot) with a cutoff at 21 plays (the few drives above this threshold are likely all spurious results, as mentioned above). Note that, assuming a team would punt on any 4th down, the maximum number of plays an NFL drive could take would be 30.

Figure 1: Distribution of drive lengths. After about 5 plays the frequency of drive length decreases quickly, and very few drives take more than ~15 plays.

While Figure 1 has home and away drives lumped together, I've left them separate in Figure 2 – it's pretty clear that there's no significant difference in the number of drives per game between the home team and the visitors. The distributions are well fit by a Gaussian distribution with a mean of almost exactly 10 drives and a standard deviation of a little less than two drives. This indicates that in a normal game a team will have fewer than 12 chances to score points – not a lot of opportunities! (It also implies that a team scoring 40+ points in a game is reaching the endzone on at least half of their possessions.)

Figure 2: Distribution of drives per game, for both home and away teams. There is very little difference between the home and away histograms. Solid lines show Gaussian fits to the data, which peak around 10 drives.

With the basics out of the way, now let's delve into the good stuff. I have the number of plays already run by the offense at the start of every drive lined up with the result of that drive. From there it's fairly straightforward to calculate the fraction of drives that end in scores as a function of the number of plays that have been run, which is shown in Figure 3.

Figure 3: Fraction of drives resulting in scores as a function of plays run. No trend is observed.

The errors on Figure 3 come from simple counting statistics, and the bin widths are adaptively chosen to have similar errors. If defenses really did fatigue as they spend more time running around on the field, the percentage of drives ending with points should increase as a function of the number of plays, but there is no evidence for this trend. If you look at touchdowns or field goals individually the picture remains the same – even if an offense runs 70+ plays the defense doesn't budge an inch.
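As an illustration of the counting statistics involved, here is a minimal sketch that bins drives by the number of plays already run and attaches an error to each scoring fraction. It assumes a binomial standard error and fixed, hand-picked bin edges, whereas the figure uses adaptively chosen bins; the input layout and names are hypothetical.

```python
import math


def scoring_fraction(drives, bin_edges):
    """Bin drives by prior plays run and return (fraction, error) per bin.

    `drives` holds (plays_already_run, scored) pairs; `bin_edges` are the
    left-inclusive, right-exclusive edges of the bins. Both are hypothetical
    inputs. The error is the binomial standard error sqrt(p * (1 - p) / n).
    """
    results = []
    for low, high in zip(bin_edges[:-1], bin_edges[1:]):
        outcomes = [scored for plays, scored in drives if low <= plays < high]
        n = len(outcomes)
        if n == 0:
            results.append((None, None))  # nothing to measure in this bin
            continue
        p = sum(outcomes) / n
        results.append((p, math.sqrt(p * (1 - p) / n)))
    return results


# Toy drives: 35 of 100 score early in the game, 7 of 20 score late.
toy_drives = (
    [(10, True)] * 35
    + [(10, False)] * 65
    + [(65, True)] * 7
    + [(65, False)] * 13
)
fractions = scoring_fraction(toy_drives, bin_edges=[0, 40, 80])
```

Both toy bins have the same scoring fraction, but the late-game bin has a much larger error because it contains far fewer drives. That is the motivation for widening the bins at high play counts so the errors stay comparable.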

Discussion and Conclusions

It's pretty obvious from Figure 3 that defenses don't get fatigued during games. On a given drive the offense has a ~35% chance of scoring regardless of how much the defense has been on the field. If you look at how rushing averages change over the course of a game you reach the same essential conclusion, which is a good indication that my results are indeed accurate. While on the surface it seems totally reasonable that players would wear down as the game wears on, given the fact that the number of plays a team runs per game is a well known quantity it makes sense that players would have enough conditioning to make it well beyond even the longest of games. (It would be interesting to repeat this analysis for overtime games but my sample size is far too small.)

So what are the implications of this result? Well, for one it means that announcers should stop talking about how long the defense has been on the field over the course of a game! More importantly, it means that there's one less reason for teams to rely on running the ball – if a coach feels that throwing deep every play best suits the talent on his offense, he should feel free to do so without consideration for his defense.

Playoff Fairness Through Win-Loss Records?

2014-01-20T21:32:00.002-06:00

I want to thank Kenny Rudinger for helping me check and correct some of my numbers on this analysis. Kenny is a physics graduate student at UW-Madison, a great Ultimate Frisbee teammate, and a Bills fan. His website is here.

Abstract

A subtle result of the (relatively) short football season is that teams have a very small sample size of games from which to determine how good they are. The fact that the outcomes of individual games can strongly depend on a few crucial plays adds an additional element of chance any given Sunday. Combine these features with a fairly inclusive playoff system and you have a recipe for upsets. In this post I'll talk about a theoretical model which attempts to produce a handicap, giving better teams a point boost to reduce this variance.

Introduction

Generally the ideas for this blog come while I'm watching NFL games; sometimes they come from discussions with friends. But recently I came across an article in Slate by Neil Paine that got me thinking. In case you're not a fan of following links (your loss, Slate is generally awesome), the basic premise is that the current playoff structure of the NFL, with only a minimal nod to regular season performance, doesn't properly reward teams who've played statistically better during the season. Paine's insane (his words, not mine) suggestion is to give the supposedly superior team a starting point advantage.

Putting aside whether such a handicapping system would be good for the game (as Paine himself argues it's probably not, given that it's fun to see unlikely outcomes and boring to watch your team start a game with a 20-point deficit), how would you go about computing how many points to spot the better team? Paine advocates an approach based on assuming that win-loss records and point differentials are governed by normal distributions.

If you're interested in the full calculation you can get all the details from the links in his article, but the gist is fairly simple. You start with the raw win-loss records for each team, then adjust them to account for the fact that 16 games isn't enough to properly sample a team's actual talent level. The next step is to convert the adjusted winning percentages into an estimate of the expected likelihood for each team to win the game. Finally, this win expectancy can be converted into a point value, where broadly a higher likelihood translates into more points.

The result is a prediction of how many points the team with the better winning percentage would need to start off with in order to win with the same probability that they are truly the better team – for instance, a team with a 65% probability of being the better squad would be spotted enough points so that statistically they should have a 65% chance of winning the game.
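The last step, turning a win expectancy into points, can be sketched as follows. This is only an illustration of the general idea, not Paine's exact calculation: it assumes the final margin of an un-handicapped game is normally distributed around zero, it ignores home-field advantage, and the 13-point standard deviation is a placeholder rather than the value used in his article.

```python
from statistics import NormalDist

MARGIN_STD = 13.0  # placeholder spread of final margins, in points; not Paine's value


def point_handicap(win_expectancy, margin_std=MARGIN_STD):
    """Points to spot a team so it wins with probability `win_expectancy`.

    Assumes the un-handicapped margin is normal with mean zero (a coin flip)
    and standard deviation `margin_std`, so that
    P(win) = Phi(handicap / margin_std).
    """
    return margin_std * NormalDist().inv_cdf(win_expectancy)


handicap_65 = point_handicap(0.65)  # head start for a 65% win expectancy
```

Note that the zero-mean assumption is exactly the "individual games are coin flips" premise discussed in the next paragraph, so any skill gap that shows up in actual game results would make a handicap computed this way too generous.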

Paine's method relies on several assumptions which, while not unreasonable, have not all been well tested. Specifically, although the claim that winning percentages and point spreads are normally distributed has been fairly well established, his analysis is also predicated on the premise that the relative skill of the two teams has no bearing on the outcome of any individual game. That's quite a claim, and one that I think is worth testing.

Data

Before getting into the real meat of the data, I first wanted to test another assumption Paine makes, which is that the variance of team win-loss records is equivalent to that from repeated coin flips. Since NFL scheduling is definitely not random I was skeptical about this assumption, but a quick Monte Carlo analysis of randomly generated standings indicated that the details of how match-ups are determined are not strongly biasing these statistics.
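A toy version of this kind of check is sketched below. It is not the analysis I ran: the weekly random pairing is a stand-in for a real schedule, and the league size, season length, and number of simulated seasons are simply assumed values.

```python
import random
import statistics


def simulate_season(n_teams=32, n_games=16, rng=None):
    """Return each team's winning percentage after a coin-flip season.

    Each week the teams are shuffled and paired off at random, which is a
    simplification of the real NFL schedule.
    """
    if rng is None:
        rng = random.Random()
    wins = [0] * n_teams
    teams = list(range(n_teams))
    for _ in range(n_games):
        rng.shuffle(teams)
        for home, away in zip(teams[::2], teams[1::2]):
            winner = home if rng.random() < 0.5 else away
            wins[winner] += 1
    return [w / n_games for w in wins]


rng = random.Random(42)
spreads = [statistics.pstdev(simulate_season(rng=rng)) for _ in range(500)]
simulated_spread = statistics.fmean(spreads)
binomial_spread = (0.5 * 0.5 / 16) ** 0.5  # 0.125 for independent coin flips
```

Even though every game has exactly one winner and one loser (so the teams' records are not independent), the spread of winning percentages in the simulated standings stays close to the simple binomial value.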

I grabbed game records from 2002-2011 from my copy of the Armchair Analysis database, then computed running winning percentages for all teams over the course of each season. Using Paine's methodology I computed the pregame win expectancy for the team with the better record. I then converted these percentages into points, using the same values for home-field advantage and scoring variance as Paine does.

There's no point in having data from the first week of any season, so I remove all of those games from my data set. (I actually ignore games in the first three weeks to minimize statistical fluctuations.) To remove the possibility of bias from teams which have already qualified for the playoffs but are resting their starters, I also do not use data from weeks 16 or 17. I do however include playoff games in my analysis.

Now that I have the win expectancy and how many extra points that translates into, I can compare how often teams win to how often they should (assuming we've accurately measured a team's true skill, of course). To help preserve signal in the data I binned the games up in win expectancy to have a roughly equal number of games in each bin.
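The binning can be sketched like this, assuming each game has been reduced to a pair of the favorite's win expectancy and whether the favorite actually won. That pair layout, the function name, and the toy games are made up for illustration.

```python
def equal_count_bins(games, n_bins):
    """Split games into bins of roughly equal size, ordered by win expectancy.

    `games` holds (win_expectancy, favorite_won) pairs (a hypothetical layout).
    Returns one (mean expectancy, observed win fraction, n games) per bin.
    """
    ordered = sorted(games, key=lambda game: game[0])
    summary = []
    for i in range(n_bins):
        start = i * len(ordered) // n_bins
        stop = (i + 1) * len(ordered) // n_bins
        chunk = ordered[start:stop]
        if not chunk:
            continue  # more bins than games
        mean_expectancy = sum(g[0] for g in chunk) / len(chunk)
        win_fraction = sum(g[1] for g in chunk) / len(chunk)
        summary.append((mean_expectancy, win_fraction, len(chunk)))
    return summary


toy_games = [
    (0.55, True), (0.52, False), (0.60, True), (0.58, False),
    (0.70, True), (0.75, True), (0.80, True), (0.72, False),
]
binned = equal_count_bins(toy_games, n_bins=2)
```

Because the bins hold equal numbers of games rather than spanning equal ranges of win expectancy, their widths are nonuniform, which is why the ideal relationship in Figure 1 is not exactly linear.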

Figure 1: Predicted relationship between win expectancy and winning percentage if the two quantities are perfectly correlated. The relationship is not exactly linear due to the nonuniform size of the bins.

First, to provide a point of comparison, in Figure 1 I plot the winning percentage for each bin provided that the win expectancy directly correlates to win percentage. These points are what we should be shooting for with our handicapping model, as they represent the pure translation between win expectancy and actual winning percentage.

Figure 2: Same as Figure 1, but now with the raw winning percentages of the games in our sample shown as the blue histogram. The errors come from counting statistics.

Next, let's look solely at how raw winning percentage trends with win expectancy, which is shown in Figure 2. You can see that while there is a positive correlation between the win expectancy of the statistically better team and their likelihood of actually winning, it's not enough to produce a one-to-one relationship. Finally, in Figure 3 I've added the winning percentage that would result if the "better" team was spotted the number of points given by the model.

Figure 3: Same as Figure 1, but now showing what would happen if teams were given the handicap suggested by Paine as the red histogram.

Discussion and Conclusions

It's pretty clear from Figure 3 that Paine's model, while better than nothing at producing game results in line with the pregame win expectancy, will actually give the team with the better win-loss record a slight additional advantage. Considering the data show that individual games aren't pure coin-flips, an aforementioned assumption of the model, this result is unsurprising.

That's not to say that Paine's scheme has some sort of fundamental problem. It's clear that individual games are not fully representative of which team is better overall, and there's nothing wrong with the idea of using statistical game results to construct a correction (again, ignoring the issue of whether or not to actually apply such a correction). Games aren't truly random events (another example of reality getting in the way of beautiful, pure statistics), and so any putative correction must take this into account.

QBs Don't Have (Passing) Rhythm

2013-10-14T17:06:00.000-05:00

While I have generally been able to maintain a semi-weekly update schedule, the next few months are going to be quite busy (that thesis isn't going to write itself). I will try to keep updating regularly, but I'm not making any promises. Feel free to check the blog regularly to see if there are updates, but if you'd rather use a more efficient method to see when I update, put your email address into the box on the right (under 'Get Email Updates') and you'll automagically get an email each time I make a new post. If you're more into Twitter, following me @PhDfootball will also let you know when I post (and has the added bonus of giving you direct access to all of my delightfully informative thoughts and comments). Regular updates will hopefully resume after the new year, when the title of this blog will be even more accurate!

Abstract

'Rhythm' is an often-used buzzword in football circles, especially pertaining to a quarterback who is known for being inconsistent. To take a quantitative look at this concept I break down each pass as a function of the one thrown before it, looking for evidence that completing a pass can jump-start a passer into completing more. While this analysis is admittedly superficial, it's a good starting point to tackling this subject. Ultimately, there is no evidence that one pass completion begets another, an argument against the idea that QBs can get into a rhythm.

Introduction

Last season the Jets tried to run an offense involving two quarterbacks, with Mark Sanchez running the regular offense and Tim Tebow coming in to run wildcat-style plays. This was an unarguable failure.

A common reason given by announcers and sportswriters for this unconventional scheme's lack of success was that it never allowed one quarterback to "get into rhythm." That certainly seemed true enough; several times Sanchez would complete a couple of nice passes, then Tebow would come in and run for a few yards, then the drive would stall out once Sanchez came back in.

This is, of course, the same reason given for the failure of Tom Landry's plan to let Craig Morton and Roger Staubach alternate snaps for an entire game (a loss) during the 1971 season.

As usual, there's never any attempt by the announcers to explain what 'rhythm' is or how to tell if a quarterback is in it; this wishy-washy term is generally used as a catch-all to explain why a signal-caller is (or isn't) playing well.

But maybe there is some truth to the idea. There is plenty of anecdotal (and some scientific!) evidence for players getting into 'the zone' during a game, which certainly sounds similar to the concept of 'rhythm'. And football commentators have been using the term for as long as I can remember without any pushback or criticism.

Let's take a look at the concept of QB rhythm (I'll drop the pretentious quoting from this point onward), first attempting to define it in a quantitative manner and then looking at data to determine its validity.

Data

For this experiment I need play-by-play data, which (as usual) comes from Armchair Analysis. Next we need to decide which statistics could be used to quantify how much a QB is in rhythm. What would be an observable of a quarterback in rhythm?

The obvious choice is completions. Generally a QB who is in rhythm should be completing several passes in a row, while you would expect a passer who is out of rhythm to be very scattershot. It's difficult to look at completion streaks, as drives can be of variable lengths and we could accidentally bias ourselves towards looking only at very good quarterbacks, who are more likely to have long completion streaks in the first place.

Therefore we'll look at the effect a completion has on just the next pass. While not perfect, this will at least minimize the risk of bias. Additionally, to avoid including situations where one team is being blown out and throwing every play, we'll only include data from the first three quarters of games.
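The selection described above can be sketched in a few lines. This is only an illustration: the `Play` record and its field names are hypothetical stand-ins (the actual Armchair Analysis schema isn't shown in this post), and plays are assumed to arrive in chronological order.

```python
from typing import NamedTuple

class Play(NamedTuple):
    drive: int       # hypothetical drive identifier
    quarter: int     # 1-4
    kind: str        # "pass" or "run" (hypothetical labels)
    complete: bool   # only meaningful when kind == "pass"

def rate_after_completion(plays, adjacent_only):
    """Completion rate of passes that follow a completed pass in the same drive.

    adjacent_only=True  -> the previous *play* must be a completed pass.
    adjacent_only=False -> the previous *pass* must be complete (runs may intervene).
    Returns None if no pass qualifies.
    """
    hits = tries = 0
    prev_drive = None
    prev_complete = False
    for p in plays:
        if p.quarter > 3:            # first three quarters only
            continue
        if p.drive != prev_drive:    # a new drive resets the history
            prev_drive, prev_complete = p.drive, False
        if p.kind == "pass":
            if prev_complete:
                tries += 1
                hits += p.complete
            prev_complete = p.complete
        elif adjacent_only:
            prev_complete = False    # a run breaks adjacency
    return hits / tries if tries else None
```

Calling it once with `adjacent_only=True` and once with `adjacent_only=False` gives the two flavors of "next pass" one might compare against the overall completion rate.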

Results

First of all, over the entire sample the completion percentage is a healthy but unspectacular 56.8%. If a quarterback can get into rhythm by completing passes, we'd expect the overall completion percentage on passes attempted after a completed pass to be higher than this overall figure.

Interestingly, it turns out that the opposite is true. If you only look at passes thrown on the play directly after a completed pass, NFL QBs have a completion percentage of 56.2%. If you loosen your restrictions and check the completion rate for the next pass after a completion (even if there may have been several runs in between), the completion percentage is 56.3%.

Now, it might be that our data are somewhat biased to lower completion percentages because we have to throw out the first completion of each drive. Therefore it might be that we should expect a slightly lower completion percentage than the total 56.8% figure.

To check this possibility I did 1000 random resamplings of the data, keeping the drive data constant but shuffling the type of play (and the result). For both scenarios this test produced completion percentages of 56.8±0.2%, exactly the same as the overall completion percentage. So if anything, completing their previous pass seems to make quarterbacks more likely to misfire on their next.
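For the curious, a shuffle test of this kind might look roughly like the sketch below. It is a guess at the general shape rather than the exact procedure used here: each drive is assumed to be a list of outcome codes ("C" for a completion, "I" for an incompletion, "R" for a run), the drive lengths are held fixed, and the outcomes are shuffled across the whole sample. Only the looser "next pass" variant is shown.

```python
import random

def next_pass_rate(drives):
    """Completion rate on passes whose previous pass in the same drive was complete."""
    hits = tries = 0
    for drive in drives:
        prev = None
        for play in drive:
            if play == "R":          # runs are ignored in this variant
                continue
            if prev == "C":
                tries += 1
                hits += play == "C"
            prev = play
    return hits / tries if tries else float("nan")

def shuffled_rates(drives, n_iter=1000, seed=0):
    """Shuffle play outcomes across the sample, keeping each drive's length fixed."""
    rng = random.Random(seed)
    pool = [play for drive in drives for play in drive]
    lengths = [len(drive) for drive in drives]
    rates = []
    for _ in range(n_iter):
        rng.shuffle(pool)
        it = iter(pool)
        fake = [[next(it) for _ in range(n)] for n in lengths]
        rates.append(next_pass_rate(fake))
    return rates
```

The mean and standard deviation of the returned list (for example via `statistics.fmean` and `statistics.stdev`) give the kind of "value ± scatter" figure quoted above.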

Discussion and Conclusions

So what gives? While I'll be the first to admit that this analysis is by no means perfect, it seems pretty clear that this line of inquiry doesn't show any evidence for getting into rhythms. At the very least we can now say that just because a QB has completed a couple passes in a row he's not about to keep up the trend.

One important caveat, especially for the Tebow-Sanchez and Morton-Staubach situations, is that this analysis covers drives where, the vast majority of the time, the QB stayed on the field for every play. Even for wildcat plays the quarterback usually lines up at wide receiver rather than going to the bench - in this way the surprise of the playcall is preserved until the offense breaks their huddle.

With the data currently at my disposal, I can't distinguish between plays where the QB is on the field and those where he is not. Even with that information, there are so few instances where the QB does leave the field during a drive that finding any signal amongst the noise would likely be impossible.

Despite these (very reasonable) concerns, the case against QB rhythm seems fairly strong. While I could believe that quarterbacks get into zones over the course of a season, it doesn't appear to happen on a drive-by-drive basis.

Not All Fumbles Are Created Equal

2013-09-30T07:46:00.001-05:00

Abstract

A fumble can be a key play in a football game, where just a single turnover can be the difference between a win and a loss. Recovering a fumble is therefore a hugely critical act. While the recovery itself is at least a mostly random event, the location of the fumble can significantly alter the odds that the defense will recover it. Fumbles behind the line of scrimmage are more likely to be recovered by the offense, while fumbles after a successful rush or pass are more likely to get scooped up by the defense.

Introduction

Nothing in football can change the momentum of a game faster than a turnover. A positive turnover differential is highly correlated with winning, so it's no wonder that teams are constantly talking about making fewer of them. While interceptions are generally directly caused by poor decision-making by the quarterback, the apparent random nature of fumbles makes them so much more exciting (and vexing, when your team is the one doing the fumbling).

Of course, fumbles aren't really random. Usually a player doesn't just accidentally drop the football, and defensive players are taught to hold offensive players up while their teammates attack the ball. However, the act of recovering a fumble is generally considered to be a random event, one that is entirely based on luck. (I'm not quite as convinced of this assertion as the sites I just linked; I've seen too many players try to pick the ball up when they should have fallen on it, or fall on it only to have the ball squirt away. But testing this is not the focus of this post so I'll leave it be for now.)

It's important to recognize that this does not mean that all fumbles have the same probability of being recovered by a certain team—you wouldn't want to use fumble recoveries as a random number generator, for instance. The more players on the defense near the fumble, the more likely one will make the recovery. Conversely, if only the fumbling player is aware that he's fumbled (such as on the quarterback-running back exchange), the offense will be more likely to recover. By this logic, a team's chance of recovering a fumble should be strongly dependent on where the fumble occurs relative to the line of scrimmage.

Data

Data come from the Armchair Analysis database, which I queried for all plays which resulted in fumbles, as well as all subsequent plays (to determine whether the fumbling team maintained possession). To avoid potential errors in this method of determining the recovering team, I excluded fumbles occurring on fourth down. To avoid biases from teams altering their strategy at the end of a half, I only used data from the first and third quarters. As usual all errors are bootstrapped.

Results

First I selected only fumbles made by offensive players—specifically QBs, RBs, and WRs (I lumped TEs in with the wide receivers). From here, computing the fraction of fumbles recovered by the defense is relatively simple, and it turns out that overall the defense recovers 54.80±1.002% of all offensive fumbles—slightly (but statistically significantly) more than half. This is not hugely surprising, given that the defense is much more focused on whoever has the ball than offensive players are.
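Since bootstrapped errors come up in nearly every post, here is a minimal sketch of how such an uncertainty can be estimated for a recovery fraction. It assumes each fumble has been reduced to a 1 (defense recovered) or 0 (offense kept it); the function name and the number of resamples are illustrative choices, not the settings used for the numbers above.

```python
import random
import statistics

def bootstrap_fraction(outcomes, n_boot=2000, seed=0):
    """Return (fraction, bootstrap standard error) for a list of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    point = sum(outcomes) / n
    resampled = [
        sum(rng.choices(outcomes, k=n)) / n   # resample with replacement
        for _ in range(n_boot)
    ]
    return point, statistics.stdev(resampled)
```

The scatter of the resampled fractions serves as the error bar on the measured fraction.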

Figure 1: Breakdown of fumble recovery probability as a function of position relative to the line of scrimmage. The horizontal red bar shows the overall defensive fumble recovery rate, while the bins are shaded proportionally to which offensive positions are responsible for the fumbles.

Figure 1 shows the defense's fumble recovery rate as a function of field position, split up into bins with roughly even numbers of fumbles per bin to maintain a constant signal-to-noise ratio. The histogram has also been split up into positions, showing who is responsible for the lost fumbles.
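Binning with roughly equal counts, as described for Figure 1, can be sketched as follows. The input format (pairs of yards relative to the line of scrimmage and a 0/1 defensive-recovery flag) is an assumption made for illustration.

```python
def equal_count_bins(fumbles, n_bins):
    """Split (yards_from_los, defense_recovered) pairs into bins of near-equal size.

    Returns one (min_yards, max_yards, defensive_recovery_rate) tuple per bin.
    """
    ordered = sorted(fumbles)
    n = len(ordered)
    out = []
    for b in range(n_bins):
        chunk = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:                 # more bins than fumbles
            continue
        rate = sum(rec for _, rec in chunk) / len(chunk)
        out.append((chunk[0][0], chunk[-1][0], rate))
    return out
```

Note that this simple version lets fumbles at the same yardage straddle a bin edge; a more careful binning would nudge the edges to whole yards.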

The most striking feature in Figure 1 is the clear dichotomy between fumbles that occur just behind the line of scrimmage and the ones that occur after positive yards have been gained. This makes intuitive sense: most of the fumbles behind the line of scrimmage are likely occurring in the center-quarterback or quarterback-running back exchange, which happen before the defense has had a chance to get into the backfield. (The uptick in defensive recoveries more than ~10 yards behind the line of scrimmage is almost certainly due to strip-sacks.) Once the offense gets beyond the line of scrimmage, however, most fumbles are going to be directly caused by the defense, in a region of the field where defensive players greatly outnumber the offense.

Interestingly, a fumble on a very successful play, one that gains more than 20 yards, isn't more likely to be recovered by the defense than the average play. I'm not sure exactly why that is, but it may be that a larger proportion of long plays end up near a sideline, and therefore any fumbles have a higher likelihood of going out of bounds. Since in this analysis a fumble out of bounds counts as an offensive recovery, it could be artificially depressing the defensive recovery rate.

Discussion and Conclusions

It's clear that the location of a fumble is of significant importance, as there is a ~20% swing in a defense's chance of recovery with just a few yards' change in position. A quarterback that drops the snap from the center will generally only be responsible for a wasted down, but a receiver who catches a 5-yard quick slant and can't hold on is likely to be the direct cause of a turnover. It's no wonder that running backs who fumble rarely last very long in the NFL; most of their runs will end up right in the range where the D is most likely to come up with a recovery.

Penalties II: Crowd Noise

2013-09-16T17:05:00.000-05:00

Abstract

Crowd noise is generally considered to be a contributing factor in causing false-start penalties on visiting teams. Some stadiums are well-known for focusing deafening amounts of noise on the field, usually near the endzones. To determine if crowd noise really does cause false-starts, I compared the discrepancy between false-starts called on the home and away team as a function of distance from an endzone. While distance from midfield and likelihood of visiting team false-start penalties are correlated, the correlation is not statistically significant.

Introduction

In Part I of my investigation into penalties, I found that there was a statistically significant discrepancy in the number of false-start penalties between the home and away teams. At the time I chalked this result up to crowd noise and focused my analysis on other types of penalties, but I later realized that, while frequently quoted as fact, I've never seen any hard evidence on the subject.

If crowd noise does affect a visiting offense's snap count, it's logical to expect that the effect will be largest when fans surround the field on three sides—near the endzones. This provides a way to isolate crowd noise from other variables surrounding false-starts, e.g. the possibility that traveling to away games makes teams more prone to jumping before the snap, or that referees are somehow biased even for such apparently cut-and-dried calls.

Data

As usual the data come from Armchair Analysis. For this project I created a new table containing information on the field location of all plays as well as whether a penalty was called. Because of the huge number of plays in the database, this query took ~8 hours and therefore was not feasible to do on the fly during the analysis.

Results

The percentage of plays that result in false-start penalties as a function of field position is shown in Figure 1. I'm not sure what's going on when the offense is backed up by their own endzone; I've checked the data and didn't find anything out of the ordinary. It's possible this is due to the relatively small number of samples that close to the goal line—regardless of the cause, it doesn't appear to affect the rest of the data, so I will simply exclude this bin from the remainder of the analysis.

Figure 1: Percentage of plays which result in false-starts as a function of distance from the offense's goal line. Black points are for the home team, while red points show penalties committed by the away team. X error bars show the range of yards included in each bin, while y error bars are the bootstrapped uncertainties.

At just about every point on the field the away team commits more false-starts, which is unsurprising given what we already knew. If you squint just right, however, it does appear the away team gains 'penalty parity' around midfield.

As always, trusting your eyes is a poor way to do statistics. Therefore I ran a correlation analysis, folding the data at the 50 to test raw distance away from the nearest endzone rather than position on the field. The result is a fairly weak correlation (Spearman ρ of 0.28) that is far from statistically significant (p-value of 0.24).
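A rough sketch of the folding step and the rank correlation is below. It assumes yard lines are measured 0-100 from the offense's own goal line, and it computes only Spearman's ρ (not the p-value); which quantity is paired with the folded distance is left to the caller, since the post doesn't spell that out.

```python
import statistics

def fold_at_midfield(yardline):
    """Distance to the nearest goal line for a yard line measured 0-100."""
    return min(yardline, 100 - yardline)

def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

In practice a statistics library would supply both ρ and its p-value; the hand-rolled version is only meant to show what is being computed.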

Discussion and Conclusions

Our eyes do lie, apparently, at least about the statistical significance of the correlation. It's difficult to figure out what to make of this (non) result—the original hypothesis certainly seemed reasonable, and I don't believe that crowd noise has zero effect on false-starts. You can certainly make a strong argument for a correlation based on watching some of these false starts happen.

So why no significant correlation? Well, it certainly is possible that crowd noise actually isn't playing a huge role after all, although I don't have another explanation for why referees would be more likely to call a false-start on the visiting team. It's also possible that my original assumption, that crowd noise is amplified near endzones, is incorrect. Another option is to look at Figure 1 and note that the correlation appears to be more significant in the offense's own end of the field: maybe fans are more vocal when the away team is starting a drive, but as the offense moves down the field they get quieter.

Ultimately, the only way to know for sure just how crowd noise affects the game would be to attach sound meters to the players. The NFL may already do this, for all I know; they already mic the players and coaches, so at the very least crowd noise information should be (in principle) recoverable from these recordings. If I ever get my hands on this sort of data I will certainly try this sort of analysis again, but for now the stats I have on hand are just too rough to make any definitive conclusions.

Quarterback Rating II: Let the Rookie Sit

2013-09-02T07:38:00.000-05:00

Abstract

Many franchises use high draft picks on quarterbacks, rightly understanding their importance to a well functioning team. There is enormous pressure to start these players right away, but is that a good idea? Based on a player's peak quarterback rating, the answer appears to be no. Whether this is due to the pressure of the job breaking fragile young QBs or because the teams most likely to start a rookie passer are also most likely to have other problems is unclear, but either way indicates that teams should be wary about throwing young quarterbacks right into the fire.

Introduction

As mentioned in my last post, the quarterback is the focal point of every NFL offense; all offensive plays run through his hands. Unsurprisingly, teams place heavy emphasis on the selection and training of talent at the position. Promising (and even some not-so-promising) quarterbacks are hot commodities—since 1990 nearly 60% of first overall picks in the NFL draft have been QBs.

The pressure on these passers is intense, especially from teams which expect immediate production from their rookie signal-caller. Even teams who intend to let the new QB learn from the bench for his first year frequently find their plans changed by injuries or pressure from fans.

But is it good for young quarterbacks to get rushed into starting roles like this? There are certainly QBs that find success after starting as rookies - Peyton Manning comes to mind, and Russell Wilson and Robert Griffin III certainly seem to be in good shape. But for each success story there are plenty of high-profile failures.

Certainly for at least some of those quarterbacks there are other good reasons for why they didn't live up to expectations, but considering how important the position is the ratio of successes to failures seems frighteningly low. Of course, on this blog gut feeling isn't good enough; let's see what we can prove.

Data

Data once again come from Armchair Analysis, using the same queries as in the last post to compute seasonal QB ratings for every passer in the database.

It is important to note that while a truly heroic feat of data collection and organization, the Armchair Analysis database is not perfect. While inconsistencies appear to be minor (at worst) for most of the statistics, there seem to be somewhat larger issues with data on when players were drafted.

For instance, Carson Palmer is listed as being a rookie in 2004, but was actually drafted in 2003. The database also doesn't handle players taken in the supplemental draft very well. Overall, however, the data quality seems to be very good, and I'm confident that the results are not significantly biased by any typographical mistakes.

Results

In order to produce as unbiased a sample as possible, I restricted the investigation to quarterbacks who have at least four seasons, including their rookie season, in the database. Additionally, the quarterback must have thrown more than 150 passes in at least one of those seasons.

Determining a reliable measure for quarterback skill is a non-trivial task; ESPN's Total Quarterback Rating involves "several thousand lines of code" and the website I just linked implies that advanced computational techniques, such as machine vision, are involved. I (sadly) don't have the amount of time necessary to do something this complex, so I'll be sticking to the regular old-fashioned QB rating.

An additional roadblock comes from grading quarterbacks over their careers. As it turns out, a QB's passer rating in one season is a surprisingly poor predictor of their rating in the following season.

While it's clear that QB rating is an imperfect measure of a passer's skill, it still works reasonably well as an overall gauge of competence at the position. To try to avoid the year-to-year issues I'll only look at a quarterback's peak QB rating—their absolute best season.

Alright, that was a lot of explanation, now let's get to the good stuff. Figure 1 shows a histogram of the maximum QB ratings of the entire sample (in gray). The majority of signal-callers in the sample have peak QB ratings between ~75 and ~90, with an average peak QB rating of 83. But there's a fair amount of variance in the sample, from 55 (Mike McMahon, who managed to start seven games for the ill-fated 2005 Eagles) to 118 (2011 Aaron Rodgers, who actually earned a rating of 123 in the regular season; this data also includes his poor performance in the Packers' playoff loss).

Figure 1: Peak QB ratings of the sample.

I've broken this sample down in two ways. First, I've selected all quarterbacks who threw 150+ passes in their rookie year (purple histogram). I next select QBs who have met the passing criterion for at least 4 seasons (gold histogram). While not perfect, passers who stick around in the NFL for several seasons are going to be the best and most reliable quarterbacks, so length-of-tenure is a good proxy for the most skilled players.
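The sample cuts described in this section can be written down compactly. The sketch below is an interpretation, not the actual selection code: the input format is hypothetical, the peak is taken over qualifying seasons only, and the threshold is treated as "150 or more" even though the text varies between "more than 150" and "150+".

```python
MIN_ATTEMPTS = 150

def classify(seasons, rookie_year):
    """Classify one quarterback.

    seasons maps year -> (attempts, rating).
    Returns (in_sample, rookie_starter, long_tenured, peak_rating).
    """
    qualifying = {y: r for y, (att, r) in seasons.items() if att >= MIN_ATTEMPTS}
    # Needs 4+ seasons on record (rookie year included) and one qualifying season.
    in_sample = len(seasons) >= 4 and rookie_year in seasons and bool(qualifying)
    rookie_starter = in_sample and rookie_year in qualifying
    long_tenured = in_sample and len(qualifying) >= 4
    peak = max(qualifying.values()) if qualifying else None
    return in_sample, rookie_starter, long_tenured, peak
```

A quarterback can land in both sub-samples at once, which is why the two histograms are not independent.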

The results are notable—the 4+ year starters have a uniformly higher peak QB rating than the group who saw significant action as rookies. In fact, the veteran QBs are responsible for all seasons with a QB rating above ~90, while no passer is given 4+ years in the league as a starter without at least one rating above 70.

Not only are these two distributions notably different, they're significantly dissimilar. The standard test to see if two samples of data come from the same underlying distribution is the Kolmogorov-Smirnov test (usually abbreviated as the KS test). This test says that the two sub-samples are distinct with only an 8% chance of error. However, we know that the rookie starter and long-tenured QB distributions actually do come from the same parent distribution. In addition, there is clearly overlap between the two sub-distributions.

The net result of these facts is that there is an even smaller chance of error than the KS test would indicate. A full Monte Carlo simulation indicates that there is actually a 99.3% certainty that these two distributions are distinct.
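To make the idea concrete, here is a sketch of a two-sample KS statistic together with one plausible Monte Carlo check. The simulation design is an assumption on my part (the post doesn't describe it): it draws two sub-samples of the right sizes from the full list of peak ratings, allowing them to overlap, and counts how often their KS statistic is at least as large as the observed one.

```python
import random

def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:   # advance past every value <= x
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def monte_carlo_p(parent, n_a, n_b, observed, n_iter=10000, seed=0):
    """Fraction of random sub-sample pairs whose KS statistic reaches `observed`."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_iter):
        a = rng.sample(parent, n_a)       # drawn independently, so a and b may overlap
        b = rng.sample(parent, n_b)
        exceed += ks_statistic(a, b) >= observed
    return exceed / n_iter
```

One minus the returned fraction plays the role of the "certainty" figure, under the assumptions above.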

Discussion and Conclusions

This result, in its barest form, means that the conditions that result in a quarterback starting as a rookie are different from those that lead to success in the NFL. Note that it does not necessarily mean that a rookie quarterback will not have a long, productive career; there is some overlap between the two sub-distributions. Nor does it mean that the solution to the problem is to make sure all rookie QBs stay far away from the field; the teams that are in a position to redshirt their first-year passers are also the ones most likely to already be in better position to protect their investments when they do make it to the field—for instance the Packers with Aaron Rodgers or the Patriots with Ryan Mallett.

But despite these caveats, these findings are still quite interesting, and indicate that quarterbacks who see significant rookie action tend to have lower ceilings than signal-callers who are rested at the start of their careers. This result indicates that teams who are looking to use a high draft pick on a (hopefully) franchise quarterback should resist the urge to play him right away, and maybe consider upgrading their other positions of need first before drafting their QB of the future. So the next time your favorite team passes on a flashy gun-slinger in order to draft a boring left tackle, don't judge them too harshly.

Quarterback Rating I: Year-to-Year Progression

2013-08-19T07:45:00.000-05:00

Abstract

Using quarterback ratings I've charted out a QB's average improvement from his first season as the starter. On average a QB sees only a minor ~10-point rating boost in his second year, with his rating remaining flat (or lower) for the rest of his career. Additionally, very few (~20%) players will ever have a season with a QB rating more than 20 points higher than their first year. These results indicate that a quarterback's first season is a reliable indicator of their future success, and that passers who struggle in the early stages of their career are unlikely to show significant long-term improvement.

Introduction

As the guy responsible for handling the ball on every single offensive play, the quarterback is unambiguously the most important player on a team. So when a team drafts a new quarterback the pressure is extremely high - both on the player to perform to expectations and on the management to ensure they're getting a good return on their (significant!) investment.

In recent years QBs have been asked to step in and start as rookies with increasing frequency. Last year saw a record 5 rookie signal-callers taking the majority of their team's snaps. While this year's draft appears to have a definite lack of QBs ready to start immediately, it's a virtual certainty that a few desperate teams will roll the dice on their shiny new gunslingers.

With the importance of the quarterback position and proliferation of young, untested starters, it's critical for teams to accurately evaluate QBs, not only as college prospects but even while they're playing in the NFL. While there is clearly worth in exploring how quarterback talent is evaluated for the NFL draft, the sheer number of college teams, and the limited opportunities given for the best players to play against each other, make it very difficult to perform such an analysis without more advanced tools.

Fortunately, charting the progression of quarterbacks once they enter the NFL is also interesting, and somewhat easier due to the small number of teams and the high level of competition. A good manager is always watching how their players progress, and it's highly relevant to know whether a struggling QB is merely inexperienced or a hopeless cause. There are several ways to dig into this topic: for now I'll focus on computing the average year-to-year progression of NFL quarterbacks as a general barometer for how a QB should be expected to develop.

Data

As usual the data come from the Armchair Analysis database. I first queried the database for the identifying information for all QBs, then fed that into a query which returned all game stats for each QB. From there season totals were computed.

Finally, the seasonal QB rating for each quarterback was determined. Because the QB rating can be highly biased if a passer only has a small number of attempts in a given year, I only took ratings from seasons in which the QB threw at least 150 passes.
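For reference, the standard NFL passer rating can be computed from season totals as sketched below. The formula is the usual four-component one, with each component clamped between 0 and 2.375; the container format and the `season_ratings` helper are illustrative assumptions rather than the code used for this post.

```python
MIN_ATTEMPTS = 150

def passer_rating(att, cmp, yds, td, ints):
    """NFL passer rating from totals (att must be > 0)."""
    def clamp(v):
        return max(0.0, min(v, 2.375))
    a = clamp((cmp / att - 0.3) * 5)      # completion percentage
    b = clamp((yds / att - 3) * 0.25)     # yards per attempt
    c = clamp(td / att * 20)              # touchdown rate
    d = clamp(2.375 - ints / att * 25)    # interception rate
    return (a + b + c + d) / 6 * 100

def season_ratings(totals):
    """totals maps (qb, year) -> (att, cmp, yds, td, ints); keeps 150+ attempt seasons."""
    return {
        key: passer_rating(*stats)
        for key, stats in totals.items()
        if stats[0] >= MIN_ATTEMPTS
    }
```

The clamping is what makes small samples so noisy: a handful of attempts can pin several components at their floor or ceiling, hence the attempts cut.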

Results

A (relatively) simple way to track a signal-caller's improvement over time is to compare their QB rating from a given season to earlier seasons. An aggregate plot comparing a passer's QB rating from later seasons to their first 'full' season (full being a season where the QB attempted at least 150 passes) is shown in Figure 1. The data are shown as black points, while the averages (and standard errors) are shown in red.

Figure 1: QB rating improvement from first season as a function of years in the league. Red points show average improvements.

While there is significant scatter, it is clear that on average a QB only shows improvement between their first and second full seasons. After that, performance stabilizes until the 7th season or so, where it begins to decrease (although the data appears to show that the few QBs who make it to their 10th season are able to maintain their improved performance).

This performance boost, at only 5-10 rating points, is moderate at best, and indicates that a quarterback's first full season is a strong indicator of their future success. Of course, this is only an average and as such somewhat of an abstraction - clearly not all QBs will follow exactly this trend.

To gain more insight into the maximum potential improvement over a quarterback's career I've plotted a histogram of peak QB rating improvement (or minimum reduction, the sad reality for some passers) in Figure 2. It's clear from this figure that the majority of signal-callers never progress beyond a 20-point improvement in QB rating, even during their best seasons, with only 20% of all passers in the sample beating this threshold¹.

Figure 2: Histogram of peak QB performance compared to a player's first starting season.
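Both figures boil down to differences against a quarterback's first full season, which can be sketched as follows. The x-axis is assumed here to be calendar seasons elapsed since that first full season; the post doesn't say whether non-qualifying years are counted, so treat that as a guess.

```python
def rating_changes(ratings_by_year):
    """Rating change relative to the first qualifying season.

    ratings_by_year maps year -> QB rating, already restricted to seasons
    with at least 150 attempts. Returns {seasons_since_first: change}.
    """
    years = sorted(ratings_by_year)
    first = years[0]
    base = ratings_by_year[first]
    return {y - first: ratings_by_year[y] - base for y in years[1:]}

def peak_change(ratings_by_year):
    """Best later-season change (or least-bad drop); None with only one season."""
    changes = rating_changes(ratings_by_year)
    return max(changes.values()) if changes else None
```

Averaging `rating_changes` across quarterbacks at each offset gives a Figure 1 style curve, while a histogram of `peak_change` gives something like Figure 2.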

Discussion and Conclusions

Even at their very best, this analysis shows that most quarterbacks shouldn't be expected to show dramatic improvement at any point during their careers, and only moderate improvement from their initial starting season. This analysis indicates that even a rookie QB's ceiling can be estimated with reasonable certainty, and has clear ramifications for evaluating quarterbacks. For instance, this is bad news for Andrew Luck (first year QB rating of 76.5), Ryan Tannehill (76.1), Jake Locker (74.0), and Brandon Weeden (72.6), who are all unlikely to ever see a triple-digit rating but are tabbed as the starters heading into 2013.

These results also lend credence to the arguments of impatient fans, who expect to see immediate results from new QBs and have no patience for any 'adjustment period', 'learning curve', or any other excuse offered by a team for a young passer's poor play. I had always assumed these fans were merely short-sighted, unwilling to wait and see how a player would develop. But now it's much more difficult to dismiss their concerns so easily.

1: The two players in the sample with a 40+ point QB rating improvement? Alex Smith and Eli Manning.

Penalties I: Referee Bias

2013-08-05T20:44:00.000-05:00

Abstract

In addition to making the occasional blown call, multiple sources have noted that referees appear to have a subtle, pervasive, likely subconscious, home-team bias. Here I attempt to quantify that bias, using different categories of penalties to highlight any discrepancy between penalties that require no interpretation (and should not be subject to this sort of bias) and penalties that involve the judgement of the referees (and therefore would be prone to bias). I find that there is a small but statistically significant discrepancy between judgement-call penalties on the home and away teams, with the visitors getting flagged an average of ~0.1 more times per game. What is most striking about this result is not its statistical significance but how small it is, a testament to the (often overlooked) fact that NFL referees are generally quite good at their jobs.

Introduction

If you watch football for long enough, eventually you'll see a play that makes you uncontrollably angry, specifically at the refs. How could they have blown that call so badly? Were they even watching the play?

This outrage, however, usually fades fairly fast—you have some reluctant understanding that what's obvious to you from the super-slo-mo replay is not as crystal clear when seen at full speed, and most individual calls/non-calls have a small impact on the final score. (Of course, there are some notable exceptions).

Individual plays such as these are so infrequent that they are not well-suited to statistical analysis. However, it is also possible that referees can be biased by the location of the game, either because the refs are from the area or are subconsciously influenced by the cheering home crowd. The NFL mitigates the former issue by rotating crews between stadiums, but what about the latter?

Unfortunately, while some work has already been done on this very issue, actual numbers on any bias appear to be thin on the internet ground. General assertions from non-open-access sources¹ abound, as do people using studies of soccer(!) officiating to back up their claims about the NFL. I did run across an interesting article that attempted to quantify home/away bias in individual officiating crews, but it unfortunately suffers from a small (13 weeks) sample size and a lack of errors — is calling an average of 1.5 extra penalties on the away team a significant effect or have they just shown how noisy their data is? (The fact that the sum of each crew's 'bias' is close to zero is circumstantial evidence for the latter case).

Data

Once again my data come from the thorough folks at Armchair Analysis. In addition to providing data on individual penalties, they also aggregate the calls into one of several helpful categories. Using their categories as jumping off points, I lumped almost all penalties in the entire data set into one of four categories:

Judgement: Penalties like holding, pass interference, and illegal use of hands, for both offense and defense.
Timing: False starts, offsides, encroachment, and neutral zone infractions.
Positioning: All kinds of illegal blocking penalties (e.g. blocks in the back, crackback blocks, tripping).
Dumb: Taunting, roughing the passer, giving him the business, etc.

Results

I split up the penalty data into home and away bins, then computed the average number of penalties per game in each category. To get a sense of the uncertainties, I bootstrapped the data. These averages are shown in Figure 1.
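A minimal sketch of the bootstrap step, for readers who haven't seen it before. This is not the original analysis code: the function name `bootstrap_mean`, the number of resamples, and the per-game counts are all made up for illustration. The idea is to resample the games with replacement many times and use the scatter of the resampled averages as the uncertainty.

```python
import random
import statistics

def bootstrap_mean(values, n_resamples=2000, seed=0):
    """Return (mean, bootstrap standard error) for a list of per-game counts."""
    rng = random.Random(seed)
    n = len(values)
    resampled_means = []
    for _ in range(n_resamples):
        # Draw n games with replacement and record the mean of that resample.
        sample = rng.choices(values, k=n)
        resampled_means.append(statistics.fmean(sample))
    # The spread of the resampled means estimates the error on the mean.
    return statistics.fmean(values), statistics.stdev(resampled_means)

# Hypothetical per-game penalty counts for one category and one side.
away_judgement = [3, 2, 4, 1, 3, 2, 5, 2, 3, 3, 1, 4]
mean, err = bootstrap_mean(away_judgement)
print(f"{mean:.2f} +/- {err:.2f} penalties per game")
```

With thousands of real games instead of a dozen invented ones, the error shrinks roughly as one over the square root of the number of games.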

Figure 1: Average penalties per game in each of the four categories discussed in the Data section.

For both penalties relating to positioning (the illegal blocks) and dumb penalties there is statistically zero referee bias; both the home and away teams get flagged at the same rate (within the errors). This is not surprising, as these calls are fairly cut-and-dried, with little room for interpretation. Also not surprising is that the away team suffers more timing penalties (~0.2 more per game): despite also being generally black and white, things like false starts and offsides are the fouls most likely to be affected by crowd noise.

For judgement call penalties like holding or pass interference, however, there is a small but statistically significant excess of penalties for the away team, with the visitor receiving an average of 2.70±0.03 penalties while the home team only gets called 2.59±0.03 times per game. These fouls should not be significantly affected by crowd noise, and thus indicate that referees do indeed hold a slight bias in favor of the home team.
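As a rough check on "small but statistically significant", the two averages quoted above can be compared directly. The sketch below assumes the home and away errors are independent, so that they add in quadrature; that is my simplifying assumption, not something established in the analysis.

```python
import math

# Averages and bootstrapped errors quoted in the text.
away, away_err = 2.70, 0.03
home, home_err = 2.59, 0.03

diff = away - home
# Assumption: the two errors are independent, so they combine in quadrature.
diff_err = math.hypot(away_err, home_err)
print(f"excess = {diff:.2f} +/- {diff_err:.3f} ({diff / diff_err:.1f} sigma)")
```

Under that assumption the away-team excess is about 0.11 penalties per game, or roughly 2.6 times its uncertainty.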

Discussion and Conclusions

So it seems that NFL refs are indeed biased. But honestly, one tenth of a penalty per game is a pretty small bias. Since teams only play 8 away games during the regular season, this is less than one extra penalty per season, and since each team also plays 8 home games, things should average out over time. Even in the playoffs, where a #6 seed would have to play 3 away games to make it to the Super Bowl, this bias shouldn't play a large role. The real story here is how fair NFL officials are, even when calling fouls in front of 80,000 rabid, screaming, angry fans.

1: In an interview with Wired, one of this book's authors cites this sort of referee bias as the reason why the Seahawks lost Super Bowl XL. I find this frightening, as anyone who writes an entire book about statistics should know that you can't apply statistical trends to individual events. I assume (hope?) that he was just speaking off the cuff and was therefore not very thorough with his answer.

—Shout out to Sonographer's Cup winner Andrew "Lulu" Schaffrinna, without whom this post (and indeed, any future studies of penalties) would almost certainly never have happened.

Are Underdogs Winning the Super Bowl More Often than they Should?

2013-07-22T20:43:00.001-05:00

Introduction

In the 2004 Super Bowl the first-seeded New England Patriots beat the third-seeded Carolina Panthers by three points to win their second NFL championship. 2010 featured the top-ranked New Orleans Saints earning their franchise's first Super Bowl win.

In the 10 seasons since the NFL's last re-alignment (before the 2002 season) these are the only two times a #1 seed has won the big game. It seems pretty odd that the top seeds, teams which only have to win two home games to make it to the big game, are only batting .200.

There are obviously a lot of potential reasons for this discrepancy. One that tends to get mentioned frequently is the first-round bye given to the top two seeds. The logic goes that the week off, rather than helping a team rest up and prepare for the Divisional round, somehow hurts them, possibly by disrupting the natural rhythm of the week.
I've shown that on average the home team wins about 57% of the time during meaningful games of the regular season. If the bye week is the cause of this Super Bowl drought then it seems reasonable that we should find that the first and second seeds are winning their home playoff games at a lower frequency than expected.

Data

A list of the seeding of the teams in the last 10 Super Bowls is all that's necessary for this experiment, so I simply made the list by hand from Wikipedia, which has fairly comprehensive coverage of each year's playoffs.

Wikipedia also has a comprehensive page on the Monte Carlo method, but in short it works by repeatedly generating random realizations of the problem at hand and comparing the results of the randomized trials to the real data. Given enough runs, the Monte Carlo method should converge to a stable result, allowing us to see if the assumptions that went into the Monte Carlo simulation are valid statistical representations of reality.

The Monte Carlo algorithm was set up to predict the expected number of Super Bowl appearances for each seed, under the assumptions that home field advantage was a flat 57% and that the different rankings of the teams had no bearing on game outcomes. No additional advantage from the bye week was programmed into the model.
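A model like this can be sketched in a few lines of Python. This is a reconstruction under stated assumptions, not the original simulation: it assumes the 2002-era bracket (six seeds per conference, byes for #1 and #2, reseeding after the wild-card round, the better seed always hosting), a flat 57% home win probability, and a coin-flip Super Bowl. The function names are my own.

```python
import random

HOME_WIN_PROB = 0.57  # flat home-field advantage; seeds otherwise equal

def play(home, away, rng):
    """Return the winning seed; the host wins with probability HOME_WIN_PROB."""
    return home if rng.random() < HOME_WIN_PROB else away

def conference_champion(rng):
    """Simulate one six-seed conference bracket with reseeding."""
    # Wild-card round: #3 hosts #6 and #4 hosts #5; #1 and #2 have byes.
    survivors = sorted([play(3, 6, rng), play(4, 5, rng)])
    # Divisional round: #1 hosts the worst remaining seed, #2 hosts the other.
    finalists = sorted([play(1, survivors[1], rng), play(2, survivors[0], rng)])
    # Conference championship: the better (lower-numbered) seed hosts.
    return play(finalists[0], finalists[1], rng)

def simulate(n_seasons, seed=0):
    """Count Super Bowl appearances and wins per seed over n_seasons."""
    rng = random.Random(seed)
    appearances = {s: 0 for s in range(1, 7)}
    wins = {s: 0 for s in range(1, 7)}
    for _ in range(n_seasons):
        afc, nfc = conference_champion(rng), conference_champion(rng)
        appearances[afc] += 1
        appearances[nfc] += 1
        # Neutral site: treat the Super Bowl as a coin flip.
        wins[afc if rng.random() < 0.5 else nfc] += 1
    return appearances, wins

n = 100_000
appearances, wins = simulate(n)
for s in range(1, 7):
    # Scale to a 10-season span (both conferences combined).
    print(s, round(10 * appearances[s] / n, 1), round(10 * wins[s] / n, 1))
```

For the #1 seed the expectation can also be worked out by hand under these assumptions: two home wins at 57% each gives 0.57² ≈ 0.325 per conference per year, or about 6.5 appearances per ten seasons across both conferences, which the simulation should reproduce to within sampling noise.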

Results

The number of Super Bowl appearances for each seed (AFC and NFC seeds combined) is shown in Table 1. Note that even a 7% home-field advantage (57% versus an even 50%) results in ~1.5 more Super Bowl appearances per decade for the #1 seeds than if there were no home field advantage (with no home field advantage the #1 and #2 seeds each make it to 5 out of 10 Super Bowls, as would be expected).

Table 1: Playoff Model Predictions
Seed	Predicted # of Appearances	Actual # of Appearances	Predicted # of Wins	Actual # of Wins
1	6.5+/-2.09	8	3.2+/-1.48	2
2	5.6+/-2.00	5	2.8+/-1.42	3
3	2.5+/-1.48	2	1.2+/-1.01	1
4	2.2+/-1.39	3	1.1+/-1.01	2
5	1.9+/-1.29	1	0.9+/-0.91	1
6	1.7+/-1.24	1	0.8+/-0.88	1

The standard deviations of all the results are listed next to each predicted value; the relatively small sample of Super Bowls results in fairly large margins of error in the simulation.

Regardless, it's pretty clear that the extra bye week isn't hampering the first two seeds from getting to the championship. The #2 seed has made almost as many appearances as predicted, while the #1 seed is, if anything, reaching the Super Bowl more often than they should be.

Because I was interested, I also computed the number of times each seed wins the Super Bowl. For this calculation I made the additional assumption that there is no home field advantage in the Super Bowl, which seemed reasonable given that the game is held on neutral ground. Those results are also presented in Table 1.

Discussion and Conclusions

The errors are fairly large, but the overall match between the model and the data indicates that there is neither an extra advantage nor a disadvantage to having the bye (although there is tantalizing, but not quite significant, evidence that #1 seeds aren't winning as many Super Bowls as they should). Without a larger sample size, however, any firm conclusions would be premature.

Unfortunately, when it comes to the Super Bowl you only get one new data point a year, so it's going to be quite a while before the signal stands out from the noise. One interesting note to mull over while waiting for more data: in the first five of the post-realignment playoffs, a #1 seed reached the Super Bowl all five years. Since then, only three top seeds have made it to the big game, while the last three Super Bowl winners were all ranked 4th or lower.

Home Field Advantage II: The Cold Weather Edge

2013-07-08T20:59:00.000-05:00

Abstract

To investigate the effect that weather has on home field advantage, I've compared the average temperature difference between home and visiting teams over more than a decade's worth of games. I find that when the temperature differential is larger than 20° F the team coming from the colder city always has an advantage against the warmer-weather franchise compared to the overall home team win percentage, even when the cold-weather team is the visitor. This result persists even after the data are corrected for the effect of teams which have played against each other multiple times, and indicates that there may be some persistent advantage gained by teams which become acclimatized to poor playing conditions, although why this should be is unclear.

Introduction

A few posts ago I investigated the effect that distance has on home field advantage, and found that teams traveling East had a much more difficult time playing on the road than visiting franchises coming from the West (or traveling North/South). However, as I noted in that post, distance is but one of many possible components of home field advantage.

Because NFL teams are scattered all over the country, many games (especially toward the end of the season) happen between teams used to dramatically different climates. Along with distance, the temperature differential is extensively discussed in the lead-up to a big game. The most notable example of this trend is the coverage of the Tampa Bay Buccaneers' longstanding cold-weather futility. (This coverage, interestingly, largely ceased after the Bucs beat the Eagles in Philadelphia in the NFC championship game, only their second-ever win in temperatures below 40° Fahrenheit, en route to winning Super Bowl XXXVII.)

Of course, just because pundits and announcers like to talk about the weather doesn't mean it actually has any impact on the outcome. And the Buccaneers were a historically bad franchise for over a decade before their Super Bowl win. Let's dig in and find out exactly what (if any) impact the weather really has.

Data

While my other home-field advantage study used game results I personally downloaded from NFL.com, that data did not include any temperature information. The Armchair Analysis database, however, has plenty of information on game conditions. From this database I obtained game results as well as weather information for every regular season game between 2000 and 2011.

Results

Before digging into the temperature data I first computed the home team win percentage for the entire Armchair Analysis sample. Overall, the home team wins 56.9% of the time, only 1.1% less than for my NFL.com data. This consistency is very encouraging, and indicates that results obtained with one data set can be accurately compared with the other.

To integrate the temperature data into the win-loss results I first computed the average temperature for every stadium in the league for each week of the regular season (Figure 1). Because the sample size for a given week is fairly small (roughly 5 games per week per field) I included the temperatures for the weeks immediately before and after as well, which helped to smooth out the 'wrinkles' and should provide more accurate averages.
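The smoothing described above amounts to pooling each week's games with those from the neighboring weeks before averaging. Here is a small sketch of that idea; the tuple layout and the function name `smoothed_weekly_temps` are assumptions for illustration and not the Armchair Analysis schema.

```python
from collections import defaultdict
from statistics import fmean

def smoothed_weekly_temps(games):
    """Average temperature per (stadium, week), pooling weeks w-1, w, and w+1.

    `games` is an iterable of (stadium, week, temperature_F) tuples.
    """
    by_stadium_week = defaultdict(list)
    for stadium, week, temp in games:
        by_stadium_week[(stadium, week)].append(temp)

    averages = {}
    for stadium, week in by_stadium_week:
        pooled = []
        for w in (week - 1, week, week + 1):
            # .get avoids creating empty entries for weeks with no home game.
            pooled.extend(by_stadium_week.get((stadium, w), []))
        averages[(stadium, week)] = fmean(pooled)
    return averages

# Made-up temperatures, purely for illustration.
games = [("GB", 15, 30.0), ("GB", 16, 20.0), ("GB", 17, 10.0), ("MIA", 16, 78.0)]
averages = smoothed_weekly_temps(games)
print(averages[("GB", 15)], averages[("GB", 16)], averages[("MIA", 16)])
```

As written, a stadium only gets an average for weeks in which it hosted at least one game in the sample.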

Figure 1: Average home-field temperature for each team over the course of the regular season. Hotter temperatures are red, while colder temperatures are blue.

For teams playing in a dome I set the temperature at 72° F. For stadiums with a retractable roof I used the ambient temperature when the roof was open and 72° when the roof was closed.

Most of Figure 1 makes sense — Green Bay gets frighteningly cold in December and January, while all three Floridian teams play in fairly warm conditions. Kansas City is colder than I would have thought, however, and Pittsburgh has comparable weather to icy Buffalo. But overall it seems as though there is enough data in the sample to produce reasonable weekly averages.

With temperature averages established, I next determined the expected temperature differential between the home and away teams for every game in my sample. Note that in the following analysis I am not using the actual temperatures for each game but rather the averages for that week for the two teams. While somewhat more abstracted than using the real temperatures, sticking with the averages significantly simplifies things — if I used the specific game time temperatures for each game I would have to compare the expected temperatures for both the home and away teams to the actual conditions. Seeing how really extreme weather affects teams would be interesting, but that analysis is for another post.

Figure 2 shows the home team's win percentage as a function of the average temperature differential. The overall home team winning percentage is also shown, as are 1-sigma bootstrapped error bars.

Figure 2: Home team win percentage broken up by average temperature differential. The red line shows the home team's win percentage for the entire sample.

When the visiting team comes from a city whose temperature differs from the home city's by less than 20° there is essentially no change in home field advantage (although there is weak evidence that the away team does better when visiting cities with similar weather). However, there is a dramatic shift for temperature differentials larger than ±20°: when a warm-weather team travels to the frozen North, they are almost 10% less likely to win than on average, while the situation reverses completely when a team used to duking it out in the cold road-trips to more tropical climes.

Discussion and Conclusions

Before digging in to the provocative results in Figure 2, some caution is advised. While each bin has several hundred individual games, it is possible that a few specific matchups between divisional rivals could be biasing the results. For instance, the NFC North has two teams with some of the coldest weather in the league (Green Bay and Chicago) as well as two teams which play in domes (Detroit and Minnesota). Depending on how the League draws up the schedule this division could contribute up to four games a year in the most extreme temperature differential bins — exactly the ones which show a significant change in home field advantage.

So how do we know that the apparent trend with temperature isn't merely the result of the Packers and Bears beating up on the Lions and Vikings over the past decade (or Patriots-Dolphins, or Chiefs-Chargers, etc.)? Controlling for this potential source of bias is actually fairly simple: just give every distinct matchup the same weight in the computation.

Basically, for every combination of home and away teams present in a bin, I've computed the total home team winning percentage instead of treating each game as a separate event. These matchup winning percentages are added together in the same way that the original games were in Figure 2 to produce a corrected temperature differential histogram — Figure 3.
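To make the matchup correction concrete, here is a sketch of how one bin might be computed. The input format and function name are hypothetical; the point is only that each distinct (home, away) pairing contributes one winning percentage, no matter how many times the two teams met.

```python
from collections import defaultdict

def matchup_weighted_home_win_pct(games):
    """Home win fraction in which each (home, away) pairing counts once.

    `games` is an iterable of (home_team, away_team, home_won) tuples,
    all drawn from a single temperature-differential bin.
    """
    record = defaultdict(lambda: [0, 0])  # (home, away) -> [home wins, games]
    for home, away, home_won in games:
        record[(home, away)][0] += int(home_won)
        record[(home, away)][1] += 1
    # Average the per-matchup win fractions rather than the raw games.
    fractions = [won / played for won, played in record.values()]
    return sum(fractions) / len(fractions)

# Made-up results: one lopsided rivalry played four times, one single game.
games = [("GB", "DET", True)] * 4 + [("CHI", "TB", False)]
raw = sum(won for _, _, won in games) / len(games)
print(raw, matchup_weighted_home_win_pct(games))
```

In this toy example the raw home win fraction is 0.8, but it drops to 0.5 once the repeated matchup is counted only once.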

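Figure 3: Home team win percentage by average temperature differential, with every distinct matchup given equal weight.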

Despite all of my concern, Figure 3 shows only a slight reduction in the trends when teams with multiple matchups are taken into account. Now it's possible to evaluate these results with at least some confidence that they aren't being dominated by just a few teams.

And the results are certainly interesting — especially if you root for a cold weather team! Not only does playing up North make it tough on visitors, with the home team winning ~7% more games than for teams in moderate climate and nearly 65% of the time overall, but the advantages provided by a frigid home environment appear to persist even when traveling.

It's not too surprising to find that rough weather can be difficult for a team which isn't used to it, but I wouldn't have predicted that fair-weather franchises would have just as much trouble when hosting teams used to the cold. I couldn't say how — perhaps teams used to playing in unpleasant conditions simply become extra excited about games where they know they won't need to worry about wearing gloves and sleeves!

Field Position and Scoring Probabilities: Half of the Red Zone is a Dead Zone (for Touchdowns)

2013-06-24T22:07:00.001-05:00

Abstract

Any drive's scoring chances increase as the offense moves down the field, but exactly what impact an additional X yards gained provides is not generally known (or at least not commonly discussed). In this post I've charted out a team's scoring chances for a first-down situation at any point on the field. In addition to a dramatic increase in touchdown percentage for all drives that have a first down within 10 yards of the end zone, there is a leveling off in the fraction of drives ending in touchdowns right outside of this zone. While the root causes of these features are not made clear by this analysis, they may be due to the necessity for different offensive and defensive tactics near the end zone.

Introduction

As a team drives down the field, excitement naturally builds. Each first down brings them closer to the end zone and a touchdown. At least, it should. How much does each first down improve your chances of scoring, and are there any parts of the field where having a first down closer to the goal line doesn't help matters?

Data

To obtain the necessary data I queried my copy of the Armchair Analysis database for all plays in the first three quarters. I ignored the final period so as not to bias the results with desperation drives from teams attempting a late rally. I then used a Python script to find all first-down plays and the end result of the drive they occurred on.

This resulted in 63182 first downs over 17164 scoring drives. Roughly 60% of these plays were on touchdown drives, while the rest were on series that resulted in field goals (I completely ignored safeties, for the record). This uneven distribution is unsurprising, given that TD drives generally cover more of the field (and thus generate more first downs) than FG drives.

Results

A plot of how likely a drive is to end in points as a function of field position is shown in Figure 1. It shows the fraction of scoring drives that result from a first down at a given yard line, with the opponent's end zone denoted by zero. Errors were determined via bootstrapping, and due to the sheer number of samples in this data set they are small.
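The tallying behind a plot like this can be sketched as follows. This is an illustration rather than the script used for the post: the input format, the result labels, and the function name are assumptions, and it assumes first downs on non-scoring drives are included so the fractions can be read as probabilities.

```python
from collections import Counter, defaultdict

def scoring_fractions(first_downs):
    """Map yards-to-goal -> {"TD": frac, "FG": frac, "any": frac}.

    `first_downs` is an iterable of (yards_to_goal, drive_result) pairs, one
    per first-down play; drive_result is "TD", "FG", or anything else for a
    drive that ended without points.
    """
    counts = defaultdict(Counter)
    for yards_to_goal, result in first_downs:
        counts[yards_to_goal][result] += 1

    fractions = {}
    for yards, tally in counts.items():
        total = sum(tally.values())
        td, fg = tally["TD"] / total, tally["FG"] / total
        fractions[yards] = {"TD": td, "FG": fg, "any": td + fg}
    return fractions

# Made-up first downs, purely for illustration.
sample = [(1, "TD"), (1, "TD"), (1, "FG"), (1, "PUNT"), (35, "FG"), (35, "PUNT")]
fractions = scoring_fractions(sample)
print(fractions[1], fractions[35])
```

Errors on each yard line's fraction could then be estimated by bootstrapping the list of first downs.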

Figure 1: Probability that a drive ends with a score, plotted against the field position of a first down on that drive.

As expected, the likelihood of scoring any points increases monotonically (aside from a couple of bumps and wiggles most likely due to statistical fluctuations) from the offense's end zone to the other team's goal line. On a team's own side of the field the relationship is linear, with a field position boost of ten yards resulting in roughly a 10% increase in scoring probability.

Once you cross midfield, however, the odds of scoring take a distinct upturn. Looking at the data split into the different types of scores (red and blue points in Figure 1) shows that this uptick is the result of field goals, which makes some sense given that a team starting at the 50 only needs a couple of first downs in order to be in field goal range.

Inside the opponent's 30, the percentage of drives ending in field goals levels off because the offense is already within field goal range — getting additional yardage doesn't make you more able to attempt a field goal. The likelihood of ending the drive with a touchdown, however, continues to increase.

After a leveling off between 10 and 20 yards away from the opposing team's goal, the TD percentage rockets upwards for first-and-goal situations at the expense of field goals. Ultimately, a first-and-goal at the 1-yard line gives the offense an 85% chance of scoring a touchdown and an almost 95% chance of getting any points.

Discussion and Conclusions

It's somewhat surprising to see the dramatic increase in TD% when the offense is within the opponent's 10-yard line. This implies that there's something different about that last 10 yards — either it becomes significantly easier to score a touchdown (doubtful; I think the opposite is probably true), or teams are more likely to go for it on all four downs when they're so close to scoring. It's also possible that there's a psychological shift, providing a boost of adrenalin to the offense. A full investigation of these possible explanations is beyond the scope of this post, but might be worth revisiting in the future.

Of further note is the lack of improvement in a team's touchdown chances inside the red zone but outside the 10-yard line. This is in stark contrast to the dramatic ramp-up of TD% once a team reaches a first-and-goal scenario. While the TD% in this region stagnates, however, FG% increases correspondingly, leaving a smooth increase in the total scoring probability.

On its own, the leveling off of the touchdown percentage wouldn't be inconsistent with random statistical fluctuations, such as the apparent increased scatter in the total scoring percentage around the 50 yard line. But the consistency of the feature around the opponent's 10-yard line, along with the corresponding increase in the frequency of field goals, indicates that this phenomenon is real.

So it seems like there is indeed a bottleneck effect when a team gets ~15 yards away from a touchdown, likely due to the difficulty of getting a first down very close to the goal line. This bottleneck disappears once a team gets into a first-and-goal situation, possibly the result of a team's increased willingness to go for it on fourth and goal. So the next time your team has to settle for a field goal when they had first-and-10 from the 12, take a small comfort in knowing that they weren't in quite as good of a spot as it seemed.

--A huge shout out to Kenny Rudinger for noticing that my preliminary results for this post were obviously in error, allowing me to sort out the bugs in my analysis code *before* subjecting my boneheaded mistakes to public scrutiny.

Quantity over Quality in the NFL Draft

2013-06-10T21:11:00.000-05:00

Abstract

NFL teams live and die by the draft. A franchise which drafts well consistently can look forward to years of sustained success, but just a single year of bad evaluations can cripple a team for several seasons. Drafting will never be an exact science, and it's not obvious why some teams appear to be better at it. In this post I investigate what might give these teams their advantage, and find that while drafting better players is correlated with winning more games, simply acquiring more draft picks has a stronger effect on a team's success. This result indicates that teams should focus on obtaining as many selections as possible rather than staking their fortunes to a few highly rated prospects.

Introduction

One of the great things about the NFL is the level of parity. No matter how bad the previous year was, your favorite team is always 'just one year' away from turning it all around. Every year there seems to be one or two teams who dramatically improve their fortunes; look no further than the 2008 Miami Dolphins or 2012 Indianapolis Colts for examples.

Of course, these teams usually crash right back down to Earth (cf. the 2009 Dolphins). But some teams seem to be near the top of the pack year after year.

Quantitative evidence for the above statement comes from the postseason. While 28 different teams have made the playoffs at least once since 2006 (sorry Buffalo, Cleveland, Oakland, and St. Louis fans!), only 10 have made the playoffs in more than half of those seasons. If you want teams that have made 5+ out of 7, your sample drops to five — the Colts, Patriots, Steelers, Ravens, and Giants. So why are these five teams so consistently successful, while most of the rest of the NFL is so streaky?

One possibility is that these teams win so much because they draft better than the rest of the NFL. Teams constantly have to refresh their talent pool as players age; a team which is able to more accurately evaluate college talent should have a huge advantage over teams which can't.

But is that true? Is it even possible to draft well? Some evidence would argue that it is not — there have been plenty of high-profile draft busts in recent memory (e.g. Vernon Gholston), and undrafted stars like Arian Foster immediately tell you that good players are still falling through the cracks.

So the question becomes how to quantify drafting savvy, which is clearly a difficult thing to do. (If it wasn't, teams would have already figured out how to draft better!)

Data

I downloaded a comprehensive list of draft results between 1990 and 2011 from Pro Football Reference, which (among other things) lists year, round, team, and when the player left the league. This data isn't perfect, as there are players who spend time outside of football before re-entering the league, but those players are outliers who shouldn't affect the results very much.

Coupled with the draft data I have team win-loss records for each of the aforementioned seasons. These records were compiled from individual game scores downloaded from NFL.com. To make things a little easier I will strip out individual teams from the equation and aggregate all teams together, and then look only at how prior drafts affect current win-loss records over all franchises.

Results

If teams are truly bad at picking talent, then every pick would be essentially equivalent to rolling the dice. Now, we know that this isn't quite true, as otherwise you'd have many more first round busts and late-round diamonds. But what if it was?

If you assume that teams are totally incapable of evaluating talent, then the optimum strategy to build a winning team becomes clear: stockpile draft picks. In this scenario if you are drafting more warm bodies than other teams, by the laws of probability you will also acquire more talented players. Assuming you can separate the wheat from the chaff in training camp (a dubious assertion, I know, but let's not go down this rabbit hole now), you'll come out ahead in the long run.

Even if you make a weaker assumption about a front office's ability to diagnose talent in the draft — maybe that coaches and GMs lack the ability to discriminate between talent levels within a single round of the draft — the logic of grabbing as many picks as possible still holds. This is especially true given the low value teams seem to place on draft picks in future years. If you can give up your first round draft pick this year in exchange for a team's first round draft pick next year plus a second rounder, you essentially get an extra chance at winning the second round 'lottery' by agreeing to wait one more year before trying to get a good first-rounder.

It's simple enough to compute how additional draft picks impact win percentage. Figure 1 shows the Spearman correlation coefficient between the number of draft picks above the NFL average and win percentage. Just looking at the current year's draft isn't enough; Figure 1 shows several ranges of years — each point on the figure has a Y-to-X range of years. For example, the point at (6,2) shows that the players drafted between 2 and 6 years ago have a (relatively) strong effect on how well a team is currently doing. It's not the most obvious plot to look at, but it conveys a lot of information in a compact way.

Figure 1: Correlation between number of draft picks in prior seasons with win percentage. The X axis shows how far back in time we count draft picks, while the Y axis shows the minimum number of years before the current season a pick must be made to be counted. A higher Spearman coefficient indicates that surplus of draft picks in that range of years is more strongly correlated with win percentage.
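For anyone who wants to reproduce a single point on this figure, here is a sketch of the two ingredients: counting surplus picks within a window of years, and the Spearman coefficient itself (the Pearson correlation of the ranks). The function names, the inclusive window, and the input layout are my assumptions for illustration; a statistics library would also supply the significance level, which this sketch does not compute.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman coefficient: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def surplus_picks(picks_by_years_ago, min_years, max_years, league_avg_per_year):
    """Picks above league average made min_years..max_years ago (inclusive)."""
    window = range(min_years, max_years + 1)
    total = sum(picks_by_years_ago.get(y, 0) for y in window)
    return total - league_avg_per_year * len(window)

# Made-up example: a team with 8, 7, and 9 picks made 2, 3, and 4 years ago.
print(surplus_picks({2: 8, 3: 7, 4: 9}, 2, 4, 7))
print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))
```

Each point on the figure would then be `spearman` applied to every team-season's surplus for that window paired with its win percentage.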

Only correlations with greater than 95% significance are plotted. The strongest correlation is only 0.123, for players drafted between 2 and 10 years ago. I can further break down the data with the strongest correlation by round — Figure 2.

Figure 2: Round-by-round analysis of the strongest correlation in Figure 1. Axes are the same as Figure 1, but with draft round numbers instead of years.

First off, if you only look in the first 1-3 rounds, there isn't any significant correlation. This is probably a function of teams' general reluctance to deal draft picks in the early rounds, which leads to a smaller sample size and therefore a weaker confidence level. The next interesting thing is the sudden pickup in significance when Round 4 is included in the calculation. And going later than Round 4 doesn't help out your win percentage. So having extra picks in Round 4 (and likely earlier) does much more for you than having extra picks in the last few rounds.

Alright, so now we know that additional draft picks can boost your win percentage by a small amount, but only if you look over several years and focus on earlier draft picks. Now we need to test how this compares to a measure of drafting skill.

Estimating a team's drafting skill is much more difficult than just comparing win totals to the number of draft picks. While certainly not perfect, a decent proxy is the length of a player's tenure in the NFL; if, say, the Giants have drafted the same number of total players as the Cardinals over the last five years but have twice as many who are still in the league, logically it would seem that the Giants are doing a better job identifying talent.

In Figures 3 and 4 I've plotted the same metrics as in Figures 1 and 2, but looking at the number of drafted players still in the league instead of the raw draft numbers.

Figure 3: Same as Figure 1, but computing the correlations between the number of players still in the NFL and win percentage.

Figure 4: Same as Figure 2, but for the strongest correlation in Figure 3.

Of course, the number of players still in the league is also dependent on the total number of draft picks a team has. So we expect a correlation at least as strong as for the raw draft picks.

The first thing to notice is that many more ranges of years produce statistically significant correlations. Many of these correlations are larger than the strongest correlation from Figure 1, although the peak correlation is still in roughly the same location. Looking at this peak correlation by round, however, the largest correlations are not much larger than when looking at raw draft picks.

Discussion and Conclusions

Before really jumping into the detailed analysis, it's important to note that none of these correlations are very large — that is to say that at best historical drafting ability plays only a small role in determining how well a team will do in a given year. This is perhaps not hugely surprising, given that there are many other variables (injuries, suspensions, contract holdouts, varying strength-of-schedule) which affect a team's fortunes but have nothing to do with drafting.

Despite this, however, many of the correlations are statistically significant, which means they are very likely to be real. It's always important when looking at correlations to remember that even significant correlations do not necessarily imply causation. But in this case, when it's fairly clear that a team's ability to draft well should directly impact their on-field success, it seems reasonable to assume a causal link.

Let's first discuss the intriguing results in the round-by-round breakdown. It's clear that considering later rounds in the analysis doesn't significantly improve the correlation. The broadness of this result implies that it is not a statistical aberration, which then means that adding extra late-round picks doesn't significantly help your team — the logical conclusion here is that it makes the most sense to package up your 5th, 6th, and 7th round picks and grab extra 4th and earlier selections.

The main conclusion, however, has to be that drafting players who survive in the league isn't much better than simply drafting extra players. It's possible (probable?) that the number of second- and third-string players that stick around the league for a long time (so-called 'career backups') are biasing the results. The best way to test this hypothesis would be to construct some way of comparing player skill and add this into the analysis, but of course such a statistic (which would have to accurately compare such disparate positions as quarterback and defensive tackle) would not be simple to create.

Looking at the data on the raw draft picks indicates that there is indeed some advantage to be gained just by stockpiling draft choices. Given how teams appear to undervalue their draft picks in future years, a forward-thinking team should be able to trade away picks in the current draft in exchange for extra picks in the next year. Repeating this strategy over several years would (in theory) lead to a large surplus of picks.

The correlations are small, but every little advantage in the NFL matters. As long as teams are willing to give up many future-year and/or late-round picks in order to move up just a few spots in the first couple of rounds, there will always be opportunities for a patient team to gobble up the extra selections. Bill Belichick's Patriots (10 playoff appearances in the past 13 years) are well known for doing just that.

Home Field Advantage: Distance Doesn't Matter, But Time Zones Do

2013-05-29T21:10:00.001-05:00

Abstract

The existence of home-field advantage is well-known and not in dispute, but the magnitude of the home team's advantage is not often discussed. Additionally, it's reasonable to assume that while some of this effect is due to crowd noise or unconscious referee bias, there may also be a component that results from the distance the away team needs to travel. It turns out that while the absolute distance traveled doesn't affect home-field advantage, teams that have to cross time zones, specifically traveling East, do almost 10% worse than average.

Introduction

Whenever there's a big game, TV commentators always like to spend some of their pregame chitchat on the topic of home field advantage. When the visiting team has a false start or delay of game, the announcers are quick to blame crowd noise. Even after a team has locked up a playoff spot they still play hard until they've secured (or have no chance at) home-field for the playoffs. The topic comes up often enough that it's been the subject of at least one hour-long NFL Network production. One of the reasons the Super Bowl is held at a neutral location is to remove this advantage (and it's also why baseball and basketball organize their playoff series the way they do).

But what exactly is home field advantage? Is it some constant that applies for all visiting teams? Is it only applicable for those teams who play in loud stadiums (Seattle, Kansas City) or locations with hostile climates (Green Bay, Buffalo)? How strong is the effect?

Anyone who remembers the last several Super Bowls has good reason to be skeptical of the power of playing at home: 5 out of the last 8 Super Bowls have been won by the 4th seed or worse. In the same time period, the number one seed — the team with home field advantage — from either conference has only won once.

Obviously just looking at Super Bowl winners doesn't provide a very large sample size, and there can be other complicating factors — for instance in 2010 the Jets secured a wild-card spot with an 11-5 record, but had to go to Indianapolis because the 10-6 Colts won their division. So for this experiment we'll look at nearly two decades of regular season games, and probe the role that distance plays in the effect.

Data

I downloaded game information (teams and scores) for all regular season games between the 1995 and 2011 seasons from NFL.com using a Python web scraping script. Going further back in time might be useful, but it runs the risk of biasing your data if teams change the way they travel.

Determining the home time zone for each team is generally straightforward, except for the Cardinals. Since most of Arizona doesn't observe daylight saving time, during the summer months and part of fall the Cardinals are effectively on Pacific time. But in the winter, when the rest of the country falls back, Arizona lines up with Mountain time again. For simplicity's sake I used Week 8 of the regular season as a cutoff for determining which time zone the Cardinals were in, as that's usually around when the switch happens.
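The Week 8 rule is simple enough to write down directly. This is only a sketch: the function name is made up for illustration, and which side of the cutoff Week 8 itself falls on is an assumption, since the rule above only names the week.

```python
# Standard-time UTC offsets (in hours) for the two zones involved.
MOUNTAIN, PACIFIC = -7, -8

def cardinals_zone(week, cutoff_week=8):
    """Effective time zone offset for the Cardinals in a regular-season week.

    Assumption: weeks up to and including the cutoff count as Pacific,
    and later weeks count as Mountain.
    """
    return PACIFIC if week <= cutoff_week else MOUNTAIN
```

Every other team would just get a fixed offset for the whole season.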

Results

The first question to answer is whether home field advantage exists at all. Counting up the entire data set, the home team wins 58.0% of the time, with a 1-sigma bootstrapped error of 0.76%. That's pretty significant: if a team played all 16 of their games at home, on average they'd win about one extra game per season!
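For anyone who hasn't run a bootstrap before, here is roughly what that error estimate looks like. This is a minimal sketch, not the script behind the numbers above: the function name is hypothetical, and the sample at the bottom is a made-up list of 4,200 outcomes chosen to sit at 58%, not the real game data.

```python
import random

def bootstrap_win_pct(results, n_resamples=500, seed=0):
    """Return (win fraction, 1-sigma bootstrap error) for a list of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(results)
    point = sum(results) / n
    estimates = []
    for _ in range(n_resamples):
        # Draw n games with replacement and recompute the win fraction.
        resample = rng.choices(results, k=n)
        estimates.append(sum(resample) / n)
    mean = sum(estimates) / n_resamples
    variance = sum((e - mean) ** 2 for e in estimates) / (n_resamples - 1)
    return point, variance ** 0.5

# Made-up sample: 1 = home win, 0 = home loss (NOT the real data set).
fake_games = [1] * 2436 + [0] * 1764
pct, err = bootstrap_win_pct(fake_games)

# Expected wins over a 16-game season beyond an even 8-8 record.
extra_wins = (pct - 0.5) * 16
```

With a sample that size the bootstrap error should come out close to the simple binomial estimate, sqrt(p(1-p)/n), which is a handy sanity check.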

So home field advantage is clearly a real effect. But what controls the advantage? This result is league-wide and over many years, so we can discount stadium-specific effects.

A natural next step is to investigate the effect distance has on home-field advantage. Toward this end I computed the stadium-to-stadium distance between the two teams for every game. A plot of the win percentage of the home team as a function of the distance the away team had to go is shown in Figure 1. The error bars are 1-sigma bootstrapping errors, and the red line shows the average value of the whole dataset.
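One common way to get a stadium-to-stadium distance is the great-circle (haversine) formula on latitude and longitude. That this is the formula used here is an assumption, and the coordinates in the example are rough city values for New York and San Francisco rather than actual stadium locations.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine distance in miles between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

# Approximate city coordinates, for illustration only.
ny_to_sf = great_circle_miles(40.71, -74.01, 37.77, -122.42)
```

This ignores the actual route a team flies, but for a coast-to-coast trip the difference is small compared to the bin widths used here.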

Figure 1: Home team win % as a function of how far the away team had to travel. The red line shows the average home-field advantage.

The widths of the bins were chosen to keep the number of samples roughly constant in each bin, which is why the bins at small distances are narrower than the bins at large distances. While there is a general trend towards a larger advantage when the away team has to travel a long distance, it's very weak. Only the smallest-distance bin is inconsistent with the global average, and it's not off by very much.
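Equal-count binning like this is easy to do by sorting the games by distance and slicing the sorted list into equal chunks. The sketch below uses a hypothetical function name and a tiny made-up list of games just to show the mechanics.

```python
def win_pct_by_distance(games, n_bins):
    """Bin (distance, home_win) pairs into n_bins groups of near-equal size.

    Returns one (min_distance, max_distance, n_games, home_win_fraction)
    tuple per bin, ordered from the shortest trips to the longest.
    """
    if n_bins < 1 or len(games) < n_bins:
        raise ValueError("need at least one game per bin")
    ordered = sorted(games)
    n = len(ordered)
    rows = []
    for i in range(n_bins):
        chunk = ordered[i * n // n_bins:(i + 1) * n // n_bins]
        wins = sum(home_win for _, home_win in chunk)
        rows.append((chunk[0][0], chunk[-1][0], len(chunk), wins / len(chunk)))
    return rows

# Made-up games: (miles traveled by the away team, 1 if the home team won).
toy_games = [(10, 1), (20, 0), (30, 1), (40, 1), (50, 0), (60, 0)]
toy_rows = win_pct_by_distance(toy_games, 3)
```

Because each bin holds the same number of games, the error bars come out roughly the same size everywhere, at the cost of uneven bin widths.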

Pure distance isn't the only deleterious effect of traveling, however. Crossing time zones can mess with circadian rhythms, and an East Coast team has to cross three time zones to play one of the California or Washington teams.

Figure 2 shows the relationship between the number of time zones traveled and the home team's win percentage. First, when the home and away teams are both in the same time zone, the home-field advantage is lower than average: when the Dolphins travel up to Buffalo, the Bills don't have a full 58% home-field advantage.

Figure 2: Win % as a function of the number of time zones crossed by the away team. Lines are the same as in Figure 1.

From East to West the time zone offsets get increasingly negative (New York's time zone is -5 hours, while San Francisco's is -8), so an East Coast team traveling to a West Coast team would be on the left side of the plot.
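In code, that sign convention amounts to subtracting the away team's offset from the home team's. The names below are illustrative, and the convention itself is inferred from the description of the plot rather than taken from the original script.

```python
# Standard-time UTC offsets in hours.
UTC_OFFSET = {"Eastern": -5, "Central": -6, "Mountain": -7, "Pacific": -8}

def zones_crossed(away_zone, home_zone):
    """Signed number of time zones the away team crosses.

    Negative means the visitors traveled West, positive means East,
    and zero means both teams share a time zone.
    """
    return UTC_OFFSET[home_zone] - UTC_OFFSET[away_zone]
```

So an Eastern team visiting a Pacific team gets -3, and the return trip gets +3.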

When a team travels one or more time zones West, the effect of home-field advantage is roughly the same as the overall average of 58%. However, when teams go East the benefits to the home team are well above the average, hovering closer to 65%, and the gap over the average is significantly larger than the error bars.

Discussion and Conclusions

It seems pretty clear that while home field advantage is very real, it's more dependent on time zones than pure distance. What's really interesting is that the home team only gains an advantage when the visitors are traveling East; there is essentially no additional benefit (over the average home field advantage) to hosting a team which has come to the Pacific Coast from somewhere else in the country.

This finding has consequences for NFL scheduling. The league goes out of its way to ensure that when teams from one division play another, each team plays the same number of home and away games. This consideration is admirable, but it falls somewhat short of ideal given these results, because both the AFC and NFC West have a team in the Central time zone (Kansas City and St. Louis). So when an East Coast team hosts the Chiefs instead of another AFC West franchise, they lose out on the added home field advantage.

This discovery has serious ramifications for deciding where to place franchises as well. An NFL owner's job is to do everything in their power to win a Super Bowl. If you know that placing a team on the West Coast will put you at a disadvantage, isn't it your duty as an owner to make sure your team plays in the East? This is especially relevant given the current attempts to put a team in Los Angeles: any owner looking to buy into a new L.A. team is automatically going to be at a disadvantage.