Thursday, September 1, 2016

Introducing NFLWin: An Open Source Implementation of NFL Win Probability

tl;dr: I made a Python package to compute NFL Win Probability - given a specific game state, what are the odds the offensive team will go on to win the game? Code on GitHub, documentation on Read the Docs, or just 'pip install nflwin'.

One of the most common advanced statistics used by NFL analysts is Win Probability. Put simply, Win Probability (WP for short) is an estimate of the likelihood that, given a specific game state, one team will go on to win the game. For example, at the very start of a game between evenly matched opponents each team's WP will be very close to 50%, while a team up by 20 points with a minute left to go will have a WP of essentially 100%. Down, distance, field position, and other variables can also be added to the model in order to produce an extremely granular WP estimate.

While WP alone is a useful tool for condensing the myriad variables surrounding the game state into a single, easily interpretable number, it becomes even more useful when compared across plays. The difference in WP between two plays (also known as Win Probability Added, or WPA) provides a way of measuring how effective a given play was at helping your team win. Instead of grading a running back's performance based on rushing yards or yards-per-attempt, for instance, summing the WPA from each rushing attempt automatically produces a statistic which gives more weight to a 2-yard rush on a critical fourth-and-1 than to a 7-yard draw play on third-and-18.
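To make that bookkeeping concrete, here's a minimal sketch in pandas; the column names (and the toy WP numbers) are purely hypothetical stand-ins for whatever your WP model actually produces:

```python
import pandas as pd

# Toy play-by-play table: one row per rushing attempt, with the model's
# win probability before the snap and after the play (made-up numbers).
plays = pd.DataFrame({
    "rusher": ["RB_A", "RB_A", "RB_B"],
    "wp_before": [0.48, 0.61, 0.52],
    "wp_after": [0.55, 0.60, 0.51],
})

# WPA for a single play is just the change in win probability...
plays["wpa"] = plays["wp_after"] - plays["wp_before"]

# ...and a player's total WPA is the sum over all of his touches.
print(plays.groupby("rusher")["wpa"].sum())
```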

Despite its easy interpretability, which is relatively rare in the world of advanced statistics, WP is not a straightforward calculation like yards-per-rush or even QB rating. WP isn't based on a simple formula; rather, it requires one to build a detailed model based on historical data. This model can be quite complex, both in terms of the specific data used to construct it and in the choice of model itself. As a result, computing WP from scratch is not feasible for a large number of would-be analysts. That's why I built NFLWin.

NFLWin is a Python package designed to make estimating WP robust yet simple. It provides a simple interface for pipelining raw data through all the steps necessary to compute WP along with great documentation that covers installation and use. The code is fully open-source so anyone can inspect its guts or modify it to suit their purposes, and while it includes a WP model to make it easy for anyone to get going right away, NFLWin also includes utilities and instructions for rolling your own model if you so choose. 
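To give a flavor of what that interface looks like, here's a sketch of basic usage based on my memory of the quickstart; the exact input column names and conventions are the kind of thing you should double-check against the documentation:

```python
import pandas as pd
from nflwin.model import WPModel

# Load the default model that ships with the package.
model = WPModel.load_model()

# One row per game state to evaluate; the column names here follow the
# quickstart as I recall it -- verify them against the Read the Docs page.
plays = pd.DataFrame({
    "quarter": ["Q1"],
    "seconds_elapsed": [0.0],
    "offense_team": ["SEA"],
    "yardline": [-20.0],
    "down": [1],
    "yards_to_go": [10],
    "home_team": ["SEA"],
    "away_team": ["GB"],
    "curr_home_score": [0],
    "curr_away_score": [0],
})

# Returns one WP estimate per row, from the offense's perspective.
print(model.predict_wp(plays))
```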

NFLWin is far from the first effort to compute Win Probabilities for NFL plays. Brian Burke at Advanced NFL Analytics was one of the first to popularize WP in recent years, writing about the theory behind it as well as providing real-time WP charts for games. Others have picked up on this technique: Pro Football Reference (PFR) has their own model as well as an interactive WP calculator, and the technique is offered by multiple analytics startups.

So why create NFLWin? Well, to put it bluntly, while there are many other analysts using WP, they're not publishing their methodologies and algorithms or quantifying the quality of their results. This information is critical in order to allow others both to use WP themselves and to validate the correctness of the models. Brian Burke has never discussed the details of his WP model in any depth (and now that he's at ESPN, that situation is unlikely to improve any time soon), and analytics startups are (unsurprisingly) treating their models as trade secrets. PFR goes into more detail about their model, but it relies on an Estimated Points model that is not explained in sufficient detail to reproduce it.

Possibly the best description of a WP model comes from Dennis Lock and Dan Nettleton, who wrote an academic paper outlining their approach and results. Lock and Nettleton's paper provides information regarding the data source used to train the model, the type of model used, the software used to build the model, and some statistics indicating the quality of the model. It even includes a qualitative comparison with Brian Burke's WP estimates. This is far and away the most complete, transparent accounting of the guts of a WP model, and it is laudable. However, as often happens in academia, none of the code used to build and test their WP model is available for others to use; while in principle it would be possible for anyone to recreate their model to build on or validate their work, this would require rebuilding their entire pipeline from scratch, working only from dense academic prose.

"But Andrew", you may say, "What about the PFR online WP calculator you mentioned only two paragraphs ago? Surely we can just use that instead of having to create our own." Well, unfortunately there are two main problems with that approach:

  1. If you ever want to programmatically compute WP you'll need to write a web-scraping algorithm to do so. The end result will require the user to be online and, like most web scrapers, will be fairly brittle - if PFR changes their website your scraper has a good chance of breaking. Not optimal.
  2. There is something obviously wrong with the PFR calculator. Go to the calculator page and ask it to tell you the WP for a tie game with zero point spread, with 5:01 to go in the 4th quarter and the offense at first-and-goal from the 5. You'll see that their model gives the offense a 50% chance of winning the game. Now compute the WP for the same exact situation but with 5 minutes left to play - one less second than before. Suddenly the WP prediction has jumped to 76.69%, an increase of more than 25 percentage points just from having one fewer second on the clock!


While the first issue is unpleasant, the second is a huge problem. I don't know whether it's a buggy implementation or a bad underlying model, but this discontinuity makes no sense. If PFR posted its algorithms publicly it would be possible to diagnose the problem. If their code were on GitHub I could even patch it and contribute the fix back.

This lack of transparency is endemic in the field of sports analytics. By not publishing their methodologies and the code behind them, these analysts fail the reproducibility test as well as the readers who trust them to provide honest and unbiased stats. I get that controlling access to these algorithms can represent a competitive advantage, but frankly it's impossible to trust any analysis when there's no way to assess its accuracy or even verify that it's not flat-out wrong. How correct is Brian Burke's model for a given game state? Is the PFR model buggy just in this one case or is it pathologically incorrect? There's no way to tell.

NFLWin doesn't have that problem. Anyone can inspect the code to look for bugs, and accuracy measurements are built into the model. To be completely honest the default model in this initial release isn't particularly good - plotting the expected WP based on an aggregated validation set against that predicted by the model shows clear deviations from perfection (see below) - but if you want to use it you can see exactly how much you should trust the model, and it's now possible to quantify improvements made as time goes on and the model is iterated upon.
The default model in Version 1.0.0. Note the deviations from perfect predicted WP.
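For the curious, the idea behind that validation plot can be sketched in a few lines with scikit-learn's calibration_curve; this is an illustration of the technique, not NFLWin's actual validation code, and the data here are simulated:

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Simulated stand-ins for model output and game outcomes.
rng = np.random.default_rng(0)
predicted_wp = rng.uniform(0, 1, 10_000)
offense_won = rng.uniform(0, 1, 10_000) < predicted_wp

# Bin plays by predicted WP, then compare each bin's mean prediction to
# the fraction of those plays the offense actually went on to win.
actual, predicted = calibration_curve(offense_won, predicted_wp, n_bins=20)

# A perfectly calibrated model has actual == predicted in every bin;
# deviations from the y = x line are what the figure above shows.
for p, a in zip(predicted, actual):
    print(f"predicted {p:.2f} -> actual {a:.2f}")
```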

The OSS community has shown time and time again the value to be gained from open development - not only is there direct benefit to the public but having more eyes on the project leads to better code. By creating NFLWin I hope to not only empower others to produce robust, reliable WP estimates but also to use the knowledge of others to build a better tool than I could construct on my own.

So check NFLWin out. Read through the documentation. Install it and play around. Post an issue if something is missing or wrong. And, of course, contributions are welcome :). 

Wednesday, September 3, 2014

Isolating Player Movement by Eliminating Camera Motion: An Ongoing Project

Note: This post departs from the general format used on this blog. That's because the post is not about a specific analysis I've done but rather a demonstration of a tool I've been developing to make more detailed studies possible.

For those not counting at home, this is the 18th post on this blog. In the last 17 entries I've investigated a wide variety of topics, from whether or not it's a good idea to start rookie QBs, to home field advantage's dependence on time zone shifts, to crowd noise's impact on penalties. All of these analyses were performed using the standard statistics that the NFL keeps, chiefly play-by-play data from Armchair Analysis. These stats are very rich in content, and so it's not too surprising that I (among many, many others) have been so successful in exploiting them.

At this point, however, I believe that the returns from this kind of data are rapidly diminishing, and fewer and fewer cool new results will be forthcoming. The community has explored a significant fraction of the power of the existing data; most of the really interesting questions that can be satisfyingly answered with play-by-play data have already been addressed. That's not to say this resource is fully exhausted, but I strongly believe that smart people have been using these statistics for long enough now that new work will be increasingly difficult to find and perform and will require even greater care to ensure its validity.

That doesn't mean that there's nothing more to be done with advanced football analytics. In fact, I would argue that everything done up to now has only scratched the surface of what could be possible with better data.

I'm talking, of course, about player tracking systems.

It's taken many years of innovation, but technology has finally progressed enough to allow the positions of athletes to be monitored with high accuracy during a game or match. This capability opens up an entire new world of possibilities for improving our understanding of sports as it eliminates the reliance on simple statistics that can be easily tracked by humans. (Of course, if you like looking at the old statistics because they're easy to understand, don't worry: those will stick around, at least for the foreseeable future. In fact, they'll likely be improved, as player tracking systems can be used to automatically reconstruct all of the common statistics and thus eliminate transcription mistakes.)

While the NFL has generally taken a rather dim view of modern technology, in the last couple of years things have started to change – slowly. Here's an article from the league's website breathlessly describing how teams have finally started keeping their playbooks on computers – from just two years ago. This year the NFL has finally relaxed its rules prohibiting new tech on the sidelines, only to force teams to use Surface tablets which have been modified to prohibit anything except viewing photos of plays from the prior drive.

Given this spotty track record, I was pleasantly surprised (stunned, really) to read the news that the NFL is jumping on-board the player tracking train in a big way, outfitting 17 stadiums with the technology to read RFID chips mounted in the shoulder pads of the players. (For a less pleased – but quite amusing – take on the announcement, see here.) Once statisticians and commentators get used to the new data at their disposal I'm sure we'll see a bevy of interesting new stats during games and in write-ups.

I want to be clear that I see the NFL's unexpected embrace of player tracking as an undeniable good: obviously the stats viewers and fans are provided will only get better once the system is up and running. However, given the limited nature of the stats the league provides on their website I would be shocked if the raw data from this system was provided to fans. I imagine that instead we'll be drip-fed highly aggregated statistics of minimal use for deep analysis, similar to the exceedingly limited results provided to the public by the NBA from their similar system.

I'm not sure why major sports leagues are so stingy with their statistics. It's possible they just don't see the demand to justify adding that capability to their websites, although the cynic in me thinks they're worried that with free access to the data the general public will start putting their own analysts and pundits to shame.

Regardless of why this information tends to be rationed, it's pretty clear (to me, at least) that despite the NFL's new commitment to this technology there is still value in developing an open system for player tracking. It's also clear that the broadcast footage is woefully inadequate for this task, as it's far too zoomed-in to see what players are doing away from the ball.

Fortunately, despite some ridiculous objections, the NFL has recently decided to make All-22 game footage available to the public. For those of you who don't follow my hyperlinks, the All-22 footage refers to two very wide camera angles, designed to show all 22 (hence the name) players on the field rather than focusing on the football. One camera shoots from the sideline at midfield while the other one films from one of the endzones, both from a very high vantage point. If you want to do any kind of player tracking based on game video, this is the footage you want to use.

Thanks to the excellent nflvid module, I was able to download the All-22 footage for the 2013 Jets-Falcons game to use as a guinea pig. My first discovery was that, despite the name, the All-22 film does not show every player for the duration of each play. Rather, the All-22 cameras function much like the regular cameras you see during the broadcast, except from a wider angle. For running plays, it's not a terrible issue, as the play is fairly compact and action away from the ball is of marginal importance:
But for passing plays, especially longer ones, the camera tends to lose the other receivers (as well as the linemen and QB) after the catch, when it zooms in to follow the ball:
(Note that the quality in the actual video is significantly better than in these GIFs, as I didn't want this post to take hours to load so I compressed them by a fair amount.)

A proper system would cover the whole field at once, allowing the viewer to watch every player for the full duration of each play. Unfortunately this is clearly not the case for the All-22 footage (the endzone view is actually worse in this regard, for reference), which makes identifying and following players between video frames much more challenging.

So, rather than being able to use the raw All-22 footage for player tracking, first we need to remove the camera motion. This is necessary both to get absolute position shifts for players and to keep track of which player is which between frames. The gist of the technique is simple: find easily identifiable regions in each frame, and then compare their locations between frames to compute how the camera has moved (the homography, if you want to sound smart). Once you know the motion of the camera, you can correct the frame for it, effectively removing all camera motion.
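Here's a minimal sketch of that pipeline for a single pair of frames, using OpenCV; my actual program (linked below) does a lot more bookkeeping, so treat this as an illustration of the technique rather than a drop-in replacement:

```python
import cv2
import numpy as np

def stabilize(prev_frame, curr_frame):
    """Warp curr_frame into prev_frame's coordinates, removing camera motion."""
    orb = cv2.ORB_create(nfeatures=2000)

    # 1. Find easily identifiable regions (keypoints) in each frame.
    kp1, des1 = orb.detectAndCompute(prev_frame, None)
    kp2, des2 = orb.detectAndCompute(curr_frame, None)

    # 2. Match them between the two frames.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)

    src = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)

    # 3. Compute the homography; RANSAC discards bad matches, which is
    #    crucial here since the players themselves move between frames
    #    while the painted field markings stay put.
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

    # 4. Warp the current frame back into the previous frame's coordinates,
    #    effectively removing the camera motion between the two.
    h, w = prev_frame.shape[:2]
    return cv2.warpPerspective(curr_frame, H, (w, h))
```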

Football is actually really well-suited for this process, as there are so many lines and markers painted on the field (soccer, for instance, would be significantly more challenging). The program I wrote to do this is relatively straightforward, and can be found on GitHub. You can compare the original footage with the motion corrected video for a run:


and a pass:


I personally just prefer watching with the camera stationary – it's much more like actually being at the stadium, where your attention isn't controlled by the camera operator. Of course, it's not perfect; the NFL shield now moves, and you can tell from the pass play that the re-projection isn't exactly accurate on a few of the frames. The program also tends to have trouble when the ball is near the sidelines as well as at the end of most plays, when the camera has zoomed in really far. It isn't very fast (it takes about 10 minutes or so to process most plays) and can require several gigabytes of RAM since it currently stores all the motion-removed frames in memory as it processes the play.

Overall, however, it generally works fairly well, especially for a first attempt. From this point you can begin testing player identification and tracking algorithms for a variety of different plays, while continuing to iterate on the robustness of the camera motion program. Everything is open source, so if you're interested in contributing, or just want to try out the code for yourself, feel free to grab it and go nuts!

Sunday, March 23, 2014

Classifying WRs, TEs, and RBs by Where They Catch the Ball

Abstract
Principal Component Analysis (PCA) is a useful tool to simplify complex datasets. The results of the PCA can then be used either to reconstruct the original data or to classify it into different groups. In this post I apply PCA to reception data for a sample of 150+ NFL receivers. I find that PCA generally does a good job of discriminating between wide receivers, tight ends, and running backs. A few tight ends, however – generally ones known more for their use as receivers than blockers – have significant overlap with the wide receivers. This result indicates that PCA may be useful for determining how to designate players, Jimmy Graham for example, for the franchise tag.

Introduction

Alright, so I lied. Well, partially – I am very busy with job applications, but I've also been teaching myself some new machine learning techniques (mostly from this excellent textbook) and they're just so damn cool that it's been hard not to think of ways to apply them to NFL data.

One of these methods is called Principal Component Analysis (PCA for short), and it's designed to reduce a large, complex dataset down into its most important pieces. These pieces (the 'component' part of PCA) can be used as basis functions to reconstruct the original data with minimal information loss, providing a form of data compression. Or, the coefficients for a given component can be compared between all the observations in a dataset, and trends in these coefficients may be used to classify the data into groups.

One of the great things about PCA is that it relies on no assumptions about how the data are distributed. This means that PCA can be used on just about anything. Something that is especially appropriate for a PCA is the distribution of yardage gained by a player every time they touch the ball. Credit where credit is due: the analysis in this post is partly inspired by Brian Burke of Advanced NFL Stats, who looked at the distribution of yards gained (or lost) on rush attempts in an effort to distinguish between power running backs and smaller, faster RBs. Burke (largely visually) compared the raw yardage histograms, and found that there were only small differences between each type of back. Burke suggests using a gamma distribution to parameterize these gains, although given the distinct rush distribution for each player he shows it seems unlikely that every running back will be well-represented by such a (relatively) simple model (to his credit Burke himself is quite upfront about this). PCA allows us to produce accurate representations of such data without choosing a distribution a priori, which means we don't have to worry about limiting or biasing ourselves by such a decision.

For this post I'll apply PCA to reception statistics rather than rush attempts. One reason for my choice is that Burke's analysis (despite the limitations I mentioned earlier) is pretty thorough, and I prefer to break new ground when I can. The other (more interesting) reason is that while most rush attempts come from a single position group (RBs), the target for a pass attempt can be a WR, TE, or RB. So in addition to looking for differences between possession receivers and home run threats it's also possible to see how the different positions are utilized on passing plays.

Data and Model
I queried my copy of the Armchair Analysis database (which spans the 2000-2011 seasons) and grabbed the yardage gained from every reception for each player with 200+ catches in the database. (I impose this reception threshold to ensure that statistical noise doesn't dominate the data.) The final sample consists of 114 wide receivers, 37 tight ends, and 33 running backs. The reception distribution of the total dataset is shown in Figure 1.
Figure 1: Distribution of all receptions in the sample. It has a strong peak at a gain of around 7-10 yards, with a long tail showing big passing plays. 
I next computed the reception distribution for each player, then ran the PCA. The details of exactly how PCA works are beyond the scope of this blog, but I'll give a brief overview of the method here so that at least the general concept is (hopefully) clear.

First off, each player's reception distribution is normalized so receivers with more catches don't bias the analysis, and the mean yardage distribution for the whole dataset is subtracted. From this point the algorithm gets to work, finding the component that accounts for as much of the remaining variation in the data as possible. This process is repeated, with each successive component accounting for more and more of the fine details of the dataset. Eventually (when the number of components approaches the number of players in the sample) the PCA will perfectly reproduce the original data. Of course, that sort of exact duplication isn't the point of PCA; rather, since most of the variation is explained by the first components, the goal is to truncate the algorithm after only N components, where N is much smaller than the number of players in the dataset.
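In practice you don't have to implement any of this by hand; here's a sketch of the whole procedure with scikit-learn (my real script is linked below), using fake yardage histograms in place of the actual Armchair Analysis data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Fake per-player reception histograms standing in for the real data:
# one row per player, one column per yardage bin.
rng = np.random.default_rng(0)
yardage_hists = rng.poisson(5.0, size=(184, 60)).astype(float)

# Normalize each player's distribution so catch volume doesn't bias the
# analysis; PCA itself subtracts the mean distribution before fitting.
X = yardage_hists / yardage_hists.sum(axis=1, keepdims=True)

pca = PCA(n_components=15)
coeffs = pca.fit_transform(X)  # one row of component coefficients per player

# Reconstruct one player's distribution from only the first 3 components
# (plus the mean), as in Figure 3.
n_keep = 3
reconstruction = pca.mean_ + coeffs[0, :n_keep] @ pca.components_[:n_keep]
```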

The script I wrote to do this analysis can be found here. It's a fairly long program, but a large chunk of it is just to make the diagnostic plots to show how well the PCA worked – the meat of the PCA happens between lines 107-115.

Results
Figure 2: The first four components of the PCA as a function of reception yards. The first component is the average of all the players, while subsequent components have been computed by the PCA algorithm. Components beyond the second and third are very jagged, signs that they are fitting individual player variation rather than useful information.

I ran the PCA on the reception data out to N = 15, but a look at the first four components (Figure 2) indicates that after the first couple of iterations the PCA is mostly fitting differences in the reception distributions for individual players. I can (hopefully) prove this to you via Figure 3.
Figure 3: Sample PCA reconstruction for Anquan Boldin, showing both the original reception distribution (in black) and a reconstruction using the first three PCA components (red). The reconstruction generally does a good job of mimicking the data even with only three components.
In addition to providing the maximal reduction in variance, the PCA also provides a list of coefficients for each component. These coefficients can be used with the components to produce a reconstruction of the original data – in the case of Figure 3, for Anquan Boldin. You can see that just the first three PCA components are required to recover a fairly good representation of Boldin's catch distribution – consistent with what the shape of the components indicated in Figure 2.

Now that we have verified that the PCA is working as intended, we can get to the good stuff – using the PCA to differentiate between players. As I mentioned earlier the data contain WRs, TEs, and RBs. A plot of the coefficients of the first PCA component (PCA1), color-coded by player position, is shown in Figure 4.
Figure 4: The distribution of the first PCA coefficient. Note how WRs are cleanly separated from RBs, while TEs partially overlap with WRs. 
This figure is quite striking – running backs all cluster with (relatively) large coefficients, while nearly every wide receiver has negative values for PCA1. Tight ends tend to fall in the middle, although there is substantial overlap with the wideouts. What this means is that there is something inherently different about where each position grouping tends to catch the ball (and by extension, what routes they run). This is not terribly surprising, given that it's fairly easy to see it just by watching how players at the different positions move during a game.

Discussion and Conclusions
What is interesting, however, is the fact that tight ends and wide receivers aren't as cleanly separated from each other as they are from running backs. In fact, while TEs and WRs are clearly not drawn from the same distribution, there is definitely some overlap. This implies that some TEs are being used more like wideouts. Additional evidence for this hypothesis comes from looking at which tight ends are most and least 'wide receiver-like'. Table 1 lists the top and bottom five TEs, sorted by PCA1.

Table 1: Tight Ends with Extreme PCA1 Values

  Most WR-like                 Least WR-like
  Name             PCA1        Name            PCA1
  Owen Daniels    -3.8×10⁻²    Steve Heiden    6.8×10⁻²
  Antonio Gates   -2.9×10⁻²    Donald Lee      5.2×10⁻²
  Tony Gonzalez   -2.4×10⁻²    Bubba Franks    4.6×10⁻²
  Marcedes Lewis  -1.4×10⁻²    Eric Johnson    3.9×10⁻²
  Tony Scheffler  -7.8×10⁻³    Freddie Jones   3.7×10⁻²

The left side of Table 1 generally contains TEs, most notably Antonio Gates and Tony Gonzalez, who are able pass-catchers. The right-hand side, however, consists of players generally not known for their receiving ability. It seems prudent to reiterate here that I'm not claiming that PCA1 is a predictor of skill in any way; rather it merely indicates that some tight ends are being used more like wide receivers than others.

Aside from being a cool result on its own, it also provides a way to classify players based on a statistic that's directly comparable between positions. This is especially relevant right now, as New Orleans Saints TE Jimmy Graham attempts to be treated as a wideout for the purposes of contract negotiation. You can read up on the details for yourself, but the upshot is that if Graham can get himself classified as a WR he can earn himself an extra $5 million over what he would get as a TE. A lot of the discussion has centered around statistics, such as where Graham lines up before the snap or how many receptions he had last year, that aren't directly comparable between wideouts and tight ends.

Unfortunately my data isn't current enough to actually include Graham in this sample (ditto Rob Gronkowski, just FYI), but I would bet that he winds up in the same regime as Gates and Gonzalez. Regardless of whether my intuition is correct, however, PCA provides a way to directly compare players at different positions based only on very basic data, and therefore it could be a very useful tool for position disputes like these.


Monday, March 17, 2014

Hiatus

Hi everyone!

As some of you may know, I'm on the hunt for a new job – outside of academia. Unfortunately, sending out applications takes a lot of time, and despite my best efforts I can't keep doing football analytics and give the job search the attention it needs. So I'm putting PhD Football on hiatus while I figure out what I'm doing next. Once I get my work situation in order I'll get back to the blog – hopefully soon! (And if you happen to know of anyone hiring science PhDs I'd love to hear about it.)

Monday, March 3, 2014

First Down Probability

Abstract
In this post I compute the First Down Probability metric, which estimates how likely a drive is to produce at least one more first down from a given down and distance. I find overall first down conversion rates similar to prior studies in the literature, including the finding that third down rushing plays are significantly underutilized. Unlike previous studies, however, I break down these rushing plays by the position of the ballcarrier, and find that a significant portion of this discrepancy comes from rushes by the quarterback, likely from scrambles on broken passing plays. More puzzling is the fact that QB runs on first and second downs don't show this trend, a result that is difficult to convincingly explain.

Introduction
During the course of a football game a fan gets a lot of statistical information. These numbers – QB rating, a running back's average yards per carry, time of possession, etc – generally lack any kind of contextual information about how the game is actually going. At best these statistics are incomplete (showing a WR's average yards per catch after a 99-yard completion, for instance); at worst, they're downright misleading (That QB just had 5 completions in a row...but they were all screens for minimal yardage).

A better statistic is one that takes the game situation into account. For instance, a 5-yard completion should count for more on third and 4 than on third and 16. There are several such statistics already in existence, such as Football Outsiders' DVOA metric or Expected Points. These sorts of metrics generally depend on using historical play-by-play data to compute average outcomes for plays at any given down and distance. This approach is (unsurprisingly) more computationally complex, and often can appear opaque to the casual fan. Some of these stats, such as DVOA, are intricate enough that their creators have decided to keep the full details of their computation private.

A direct and (relatively) simple context-sensitive statistic is Brian Burke's First Down Probability, which I will abbreviate as FDP. That link has more details, but the core insight of this metric is that the average odds of converting the next first down in a series can be estimated for any given down and distance. With this information in hand, it's possible to evaluate the result of a play based on whether it improves or harms the offense's chance of eventually getting a first down.

In this post I'm going to compute the FDP for the plays in the Armchair Analysis database. One may ask why I would recompute this quantity when Burke has already done quite a good job of it. One reason is to ensure the reproducibility of results – while I trust Burke's analysis, everyone makes mistakes. A more basic reason is that while Burke produces a nice visualization of his computed FDP he doesn't provide his data in a tabular form, which makes using his FDP values difficult (at best). I can also extend the FDP calculation to all four downs (Burke only considers second and third downs in his post). Finally, I can (spoiler alert) start using FDP to generate new insights about how teams approach different down-and-distance situations.

Data
As I mentioned before, I'm using the Armchair Analysis database, which covers the 2000-2011 NFL seasons. I grabbed the play-by-play data for all regular season and playoff games, then filtered out plays for several reasons. Plays inside the two minute warnings were discarded because teams play differently in those situations; I removed plays when the game wasn't close (defined as one team being up by more than 16 points) for the same reason. I cut out all punts and field goals as well as penalties (although I keep the results of the penalties in the data: if a team runs for -5 yards on second down but then is the beneficiary of a 15-yard roughing the passer call on third down, the second down play would be considered as ultimately resulting in a first down for the purposes of this analysis). Finally, to avoid biasing the data based on field position I only include plays between the offense's own 10-yard line and the redzone. 

Ultimately this results in a dataset of 262,601 plays, split 56%-44% in favor of passes over runs. I bin these plays as a function of current down and yards to go, eliminating bins with fewer than 200 plays in my dataset. This cut ensures that there are no bins with conversion rates dominated by sampling error. The Python script I used to do this data querying and processing (as well as produce the plots in later sections) can be found here.
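The core tabulation is simple enough to sketch in a few lines of pandas; the column names here (and the `converted` flag, which marks whether the drive eventually gained another first down) are hypothetical stand-ins for what my script actually computes:

```python
import pandas as pd

def first_down_prob(plays: pd.DataFrame, min_plays: int = 200) -> pd.DataFrame:
    """Conversion rate for each (down, yards to go) bin."""
    grouped = plays.groupby(["down", "yards_to_go"])["converted"]
    table = grouped.agg(fdp="mean", n="size").reset_index()
    # Drop bins too sparse for the conversion rate to be trustworthy.
    return table[table["n"] >= min_plays]

# Toy example: 200 plays each of third-and-1 and third-and-10.
toy = pd.DataFrame({
    "down": [3] * 400,
    "yards_to_go": [1] * 200 + [10] * 200,
    "converted": [True] * 140 + [False] * 60 + [True] * 40 + [False] * 160,
})
print(first_down_prob(toy))
```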

Results
Figure 1: FDP as a function of down and distance. The colors denote different downs, while the line styles break down success if the next play in the drive is a run or a pass. In some cases the data for the individual types of plays does not cover the same range of yards to gain. This is due to the minimum play cutoff detailed in the Data section.
Figure 1 shows the raw results, split by down and distance. For the benefit of anyone looking to check my results or to build on them I have also tabulated these results in text files, which can be obtained from my GitHub repository. Feel free to use them as long as you explain where you got them from (and a link back here would be nice as well!). 

Anyway, the first thing to do is to check my results against what Burke obtained. It's a bit difficult to compare directly since I can only eyeball our plots, but in general my results seem to be fairly consistent with his. The data between downs look fairly similar, with a ~15% shift as you go from first and N to second and N, increasing to ~20% from second to third down. There's not much data on fourth down, but I see no reason why it wouldn't resemble the other downs for conversion attempts beyond 2 yards.

More interesting is what happens when you break the conversion percentages down by type of play. Note that when comparing the FDP of runs versus passes at a given down and distance, a higher conversion rate for e.g. a pass doesn't necessarily mean you should always throw the ball in that situation; rather, it implies that currently NFL teams are not playing at the Nash equilibrium. This means that NFL teams should call more passing plays in that situation than they currently do; as defenses adjust to this new reality, there should be more opportunities for successful rushing plays, and eventually the FDP of both types of plays will equalize. Burke has some more detailed discussion of this in his breakdown of first down probability for runs and passes (although he restricts his analysis to third downs).

So again we are treading on old ground, and again it makes sense to compare results. Here we find a bit of a discrepancy, with Burke's rushing FDP on third and short ~5% lower than mine. It's not clear why this would be, although it might be due to the fact that Burke's data only goes through the 2007 season or how he considers sacks (the Armchair Analysis database considers sacks to just be really crappy passes). Regardless, things appear similar enough to proceed.

It's clear that teams aren't passing enough on first and second downs with more than 5 yards to go. Considering teams are already passing a lot in those situations, especially in second and 10+, this would imply that even the occasional rush in such circumstances is too much.

In short yardage, however, things are reversed. On second and 3 or less teams are running less often than they 'should', although the difference is only about 7%. Third down is even more striking: whenever there are fewer than 9 yards to go the data indicate that teams should be running more. This is an even larger discrepancy than Burke finds, and is downright shocking given how unusual 7+ yard runs are under normal circumstances.

But there are two kinds of runs – designed runs and aborted passing plays. Burke considers the latter category to be rare enough to be inconsequential, but I wasn't so certain. So I modified my program to separate out rushes by the position of the ballcarrier – it can't tell if a QB rush was designed that way or if it was improvised, but it's better than nothing.
Figure 2: FDP, corrected for the influence of QB runs (the uncorrected rushing percentages are shown in gray to facilitate direct comparison). 

Figure 2 shows the result, and it turns out that without the QB involved a third down rush becomes a much worse proposition. Indeed, now teams should only be running more on third and 3 or less, consistent with what the data show for second down.

While teams are generally doing better at finding the equilibrium between passes and rushes with RBs, these results indicate that teams are letting their signal-callers run the ball far too infrequently. If you look at the conversion rates just for QB scrambles it's generally 10% or more higher than a rush from a running back in the same situation! Even more interesting is that this offset only applies on third down. On first and second down a QB scramble appears to have similar conversion rates as a regular rush.

Discussion and Conclusions
First of all, the fact that QB rushes are so underused compared to other types of plays is quite interesting. Given the fact that teams generally do not want their prize passers taking hits down the field, most of these successful conversions are likely due to scrambles on passing attempts. But given how high the conversion rate is perhaps coaches should consider running a few more QB draw plays, especially with all the mobile passers entering the league. 

But what's really weird is that QB rushes aren't more successful than the regular variety on earlier downs. A possible explanation is that defenses are more keyed toward stopping shorter-yardage plays on second down, whereas on third down they sit back and follow the WRs down the field. But in that case you would expect third down rushes to be equally successful regardless of the runner. I think it's more likely that on second down a QB under pressure isn't concerned with making the sticks, but rather simply looks to get out of trouble. On third down, however, the consequences of playing it safe are much more clear, which encourages passers to scramble for every last yard.

Of course, I'll be the first to admit that this is just speculation. A definitive analysis of this phenomenon would probably require deep analysis of individual quarterback scrambles, which is way beyond the scope of this work. But it is a cool result from a (relatively) simple metric, and illustrates how deep insights can be gleaned from just a little bit of intelligent digging.

Monday, February 17, 2014

What Positions Do Teams Value in the Draft?

Abstract
Where players are taken in the NFL draft depends not only on their raw skill and potential but also on the perceived value of the position they play and the overall supply of players at that position. While quarterback is clearly the most important position on the team, investigating where players at other positions get drafted may provide insights into how NFL teams evaluate the relative importance of those positions. In this post I run a couple of simple analyses of where players get drafted, breaking the data down by position grouping. I find slight variations in where players at different positions can expect to be taken, although more draft data and/or deeper analyses would be necessary to decisively show disparities between the positions.

Introduction
It's pretty clear that quarterback is the most important position on a football team, and their value is reflected in the draft – in the last decade 13 QBs have been taken in the top five picks. But exactly how much are QBs favored by GMs and coaches come May? And what about the perceived value of other NFL position groups?

Data
This one's fairly easy, as Armchair Analysis lists, in a single table, where each player was drafted as well as what position they play. I downsampled the data to include only players drafted since 2001 (up to 2011, the last year in the database), because the table contained only partial records before that year. You can find the script I used here.

There is also a limit to the granularity of the positions in the database. For instance, no distinction is made between any of the players on the offensive or defensive lines. This does limit how detailed I can make my analyses, and there may be significant differences in the valuations between positions on the O-line (for instance, a left tackle, protecting a passer's blind side, is likely to be more highly sought-after than a right tackle, although things are never that clear-cut).

Results
I first took a look at the raw data, plotting what fraction of picks go to each position grouping in Figure 1. To improve the signal-to-noise of the data (and just make things easier to visualize) I binned the data in groups of 10 picks. 
Figure 1: Percentage of players drafted at each position, as a function of draft position. Kickers added for scale.
While colorful, the stacked nature of this figure makes it somewhat difficult to parse. Figure 2 shows where players of each position get drafted, independent of any other position groups in the sample. While there are still a bunch of overlapping lines, it's now easy to see if and where teams prefer to draft players at each position.
Figure 2: Where players in each position grouping get picked. Most of the positions have flat distributions. The only notable exception to this trend is QBs, whose distribution is strongly peaked near the first few picks.

What stands out most is the large (and unsurprising) upturn in QB picks in the first bin. Otherwise, however, there don't appear to be any obvious trends. But the eye can deceive, so let's try to be just a bit more rigorous. Toward that end I computed the expectation value for each position. You can get more detail about the expectation value from Wikipedia, but in this case it's basically just the average place players at each position get drafted. You can't make a pretty graph with it, but you can see the results in Table 1.

Table 1: Draft position as a function of player position

  Position   Expected Draft Position
  QB         102 ± 8.1
  RB         126 ± 5.1
  WR         117 ± 4.4
  TE         131 ± 5.8
  OL         122 ± 4.0
  DL         114 ± 3.9
  LB         118 ± 4.2
  DB         119 ± 3.2
  K          164 ± 9.1
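Reproducing a number like these is straightforward; here's a sketch of the expectation value plus a bootstrapped uncertainty, using made-up draft slots in place of the real database:

```python
import numpy as np

# Hypothetical draft slots for every player at one position (fake data).
rng = np.random.default_rng(0)
picks = rng.integers(1, 255, size=120)

# The expectation value is just the mean overall pick number...
expected = picks.mean()

# ...and resampling the picks with replacement gives a bootstrapped
# standard deviation for that mean, like the +/- values in Table 1.
boot_means = [rng.choice(picks, size=picks.size, replace=True).mean()
              for _ in range(10_000)]
print(f"{expected:.0f} +/- {np.std(boot_means):.1f}")
```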

Table 1 starts off by confirming what we saw visually in the figures – quarterbacks are by far the most sought-after position, with an expected draft position 12 picks earlier than any other position's (although the bootstrapped standard deviations are just consistent with defensive linemen being equally valued). Wide receivers are drafted slightly earlier than running backs, a trend that's been picking up steam in recent years as teams realize that RBs tend to have short careers and therefore don't provide as much value for an early pick. Linemen are the highest-drafted of all defensive position groupings, probably driven by teams' desire to press the point of attack: the general wisdom is that QBs have an advantage against the secondary, due both to their skill and to rules restricting contact on WRs (although the Seattle Seahawks would beg to differ), and that generating pressure and sacks is the best way to defend the pass. More data would be required to definitively prove that these differences are real, however.

Discussion and Conclusions
While it's unsurprising that QBs are the hottest position in the draft, it's nice to see it confirmed with numbers. More interesting are the expectation values for the other positions, which while much closer to each other could be used as signifiers of broad NFL trends in how talent is evaluated between positions. The analyses presented here are pretty simplistic, but they are indicative of the potential power of draft data. 

Monday, February 3, 2014

Do Defenses Get Tired?

One of the hallmarks of good science is reproducibility – the ability for other researchers to repeat (and thus verify or disprove) your work. While I hope I have laid out enough details in each of my posts for anyone interested to check my analyses, I am happy to report that I will now be uploading my code for each post to GitHub. Check it out!

Abstract
One of NFL announcers' favorite statistics is the time of possession, which is usually discussed in the context of how tired the defense must be when they've been on the field for a long time. But do defenses actually get fatigued over the course of the game? To answer that question I used the raw number of plays a defense is on the field (rather than the less accurate time of possession) and computed the probability that the offense will score as a function of this number. Ultimately, even after 70+ plays there is no increase in the offense's point production – a clear indication that defensive players have plenty of endurance to make it through even the longest games.

Introduction
A common statistic to see quoted during a game is time of possession (lazily referred to as ToP in the rest of this post). Usually referenced between quarters or near the end of the game, commentators generally talk about ToP in the context of noting how long one team's defense has spent on the field. (Offenses generally have more flexibility in keeping their players fresh through skill package substitutions.) The not-very-subtle implication is that the defense is getting worn down by the amount of time they've been playing and will therefore be more likely to allow points.

This is, of course, largely bullshit. Since so much more time is spent between plays than actually ticks by while the football is in motion, ToP is really only a good indicator of how much standing around the teams are doing. Additionally, since the game clock stops for an incomplete pass (and pass-heavy offenses tend to pick up yards in chunks and have shorter drives as a result) ToP is naturally skewed towards favoring rushing offenses. If ToP were only collected during plays it might have some value, or better yet just strap some pedometers onto the players and figure out how much they're really running around on the field.

The idea at the core of ToP, however – that a defense spending more energy on the field may eventually show signs of fatigue and therefore allow more points – is not unreasonable. The current ToP statistic is just a terrible way of measuring it. This question is especially interesting because if defenses do get tired over the course of a game it would add more value to a strong rushing attack, a facet of the offense that has come under significant fire in recent years as being strictly inferior to the passing game.

While perhaps not quite as good as my earlier pedometer suggestion, the raw number of plays run should be a much better proxy than ToP for investigating whether defenses get tired. Comparing the results of drives as a function of the number of plays already run will therefore indicate whether defenses ever become fatigued enough to affect play.

Data
I started with all the play-by-play data in the Armchair Analysis database, and computed the beginning and end of each drive as well as whether any points were scored. By separating this data out between the home and away teams for each game I constructed a running tally of the number of plays run by the offense at the start of each drive.
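In pandas that running tally looks something like the sketch below, assuming a play-by-play table sorted in game order; the column names are placeholders for whatever your database actually uses:

```python
import pandas as pd

def plays_before_drive(pbp: pd.DataFrame) -> pd.DataFrame:
    """Tally of offensive plays already run at the start of each drive."""
    pbp = pbp.copy()
    # Number every snap an offense runs in a game: 0 for its first play,
    # 1 for its second, and so on.
    pbp["plays_so_far"] = pbp.groupby(["game_id", "offense_team"]).cumcount()
    # The tally at the start of a drive is the count on its first play.
    return pbp.groupby(["game_id", "offense_team", "drive_id"],
                       as_index=False)["plays_so_far"].first()

# Toy example: team A's third drive starts after it has run 3 plays.
toy = pd.DataFrame({
    "game_id": [1] * 6,
    "offense_team": ["A", "A", "A", "B", "A", "A"],
    "drive_id": [1, 1, 1, 2, 3, 3],
})
print(plays_before_drive(toy))
```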

Before getting into the results it's important to note that for this analysis the devil really is in the details. The data can be biased in many ways, some subtle and some not.  First and perhaps most obvious is that while all games can be expected to start in a similar way, a drive in the 4th quarter of a blowout is going to look much different than one in a close game. To avoid this problem I restricted my sample only to games where the final tally is within one score (8 points). I also throw out special teams plays, as I am most focused on how the defense plays as a unit (although note that on most teams at least some special teams players will see snaps on the offense or defense).

Another issue is penalties. Most infractions are only called after the play is over, and even though (if the penalty is accepted) the original play doesn't count for statistical purposes I still want to count it for this analysis. Some penalties, however, result in the refs immediately blowing the play dead (the most notable examples of this being false starts and encroachment). These penalties I strip out from the final play-counts. Occasionally a penalty occurs after the play is over (e.g. many unsportsmanlike conduct calls). A dead-ball foul should be purged from the data; unfortunately (as far as I can tell) there is no indication in the database whether a penalty is a dead-ball infraction or not, so I choose to leave all of these penalties in my sample. Fortunately these types of penalties are relatively infrequent, and therefore shouldn't significantly affect the results.

Lastly, drives near the end of halves create significant additional bias as well, since many of them are kneel-downs or result in unusual play-calling (Hail Mary passes, record-setting field goals, etc). I cut out the result of any drive that starts within the 2-minute warning of either half, although I include the plays run on those drives in the running totals of plays run during the game.

It is also worth noting that occasionally there are errors in the database, where the down sequence counter I use to determine the length of each drive is not reset between possessions. This issue is most obvious in the existence of some unusually long (20+ play) drives, although it likely affects shorter drives as well. Generally the incidence of these errors is very low (there are only ~10 of these very long drives in the entire sample, for instance), so I do not believe they will bias the results – especially not for shorter drives, where the sheer number of actual drives should drown out the few erroneous ones.

Results
Before diving into the full analysis, I think it's interesting to look at some raw numbers about NFL drives that aren't usually discussed. Take a look at the distribution of drive lengths in Figure 1, and the distribution of drives per game in Figure 2. The plurality of drives take 3 plays, which makes sense as these are 3-and-out possessions. The occurrence of drives longer than ~6 plays falls off roughly exponentially (a straight line on this semi-log plot) with a cutoff at 21 plays (the few drives above this threshold are likely all spurious results, as mentioned above). Note that, assuming a team would punt on any 4th down, the maximum number of plays an NFL drive could take would be 30.
Figure 1: Distribution of drive lengths. After about 5 plays the frequency of drive length decreases quickly, and very few drives take more than ~15 plays.

While Figure 1 has home and away drives lumped together, I've left them separate in Figure 2 – it's pretty clear that there's no significant difference in the number of drives per game between the home team and the visitors. The distributions are well fit by a Gaussian with a mean of almost exactly 10 drives and a standard deviation of a little less than two drives. This indicates that in a normal game a team will have fewer than 12 chances to score points – not a lot of opportunities! (It also implies that a team scoring 40+ points in a game is reaching the endzone on at least half of their possessions.)

Figure 2: Distribution of drives per game, for both home and away teams. There is very little difference between the home and away histograms. Solid lines show Gaussian fits to the data, which peak around 10 drives.
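Fits like the ones in Figure 2 take only a couple of lines; here's a sketch with scipy, using simulated drive counts in place of the real sample:

```python
import numpy as np
from scipy.stats import norm

# Simulated drives-per-game counts standing in for the real sample.
rng = np.random.default_rng(0)
drives_per_game = rng.normal(10, 1.8, size=3000).round()

# Maximum-likelihood Gaussian fit: returns the mean and standard deviation.
mu, sigma = norm.fit(drives_per_game)
print(f"mean = {mu:.1f} drives, std dev = {sigma:.1f}")
```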
With the basics out of the way, now let's delve into the good stuff. I have the number of plays already run by the offense at the start of every drive lined up with the result of that drive. From there it's fairly straightforward to calculate the fraction of drives that end in scores as a function of the number of plays that have been run, which is shown in Figure 3.

Figure 3: Fraction of drives resulting in scores as a function of the number of plays already run. No trend is observed.
The errors on Figure 3 come from simple counting statistics, and the bin widths are adaptively chosen to have similar errors. If defenses really did fatigue as they spent more time running around on the field, the percentage of drives ending with points should increase as a function of the number of plays, but there is no evidence for this trend. If you look at touchdowns or field goals individually the picture remains the same – even if an offense runs 70+ plays the defense doesn't budge an inch.
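For concreteness, here's how those counting-statistic error bars can be sketched; the numbers are illustrative only:

```python
import numpy as np

def score_fraction(k: int, n: int) -> tuple[float, float]:
    """Scoring fraction for a bin with n drives, k of which ended in scores."""
    p = k / n
    err = np.sqrt(p * (1 - p) / n)  # binomial (counting-statistics) error
    return p, err

# e.g. a bin with 1000 drives and 350 scores:
p, err = score_fraction(k=350, n=1000)
print(f"{p:.2f} +/- {err:.2f}")  # 0.35 +/- 0.02
```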

Discussion and Conclusions
It's pretty obvious from Figure 3 that defenses don't get fatigued during games. On a given drive the offense has a ~35% chance of scoring regardless of how much the defense has been on the field. If you look at how rushing averages change over the course of a game you reach the same essential conclusion, which is a good indication that my results are indeed accurate. While on the surface it seems totally reasonable that players would wear down as the game wears on, the number of plays a team runs per game is a well-known quantity, so it makes sense that players would have enough conditioning to make it well beyond even the longest of games. (It would be interesting to repeat this analysis for overtime games but my sample size is far too small.)

So what are the implications of this result? Well, for one it means that announcers should stop talking about how long the defense has been on the field over the course of a game! More importantly, it means that there's one less reason for teams to rely on running the ball – if a coach feels that throwing deep every play best suits the talent on his offense, he should feel free to do so without consideration for his defense.
