Explanation:
Let's start with an incredibly useful NFL metric called 'point differential'. It is great at showing how good each team has been, and even better for building models to predict future outcomes. For anyone who does not know, point differential is the total number of points a team has scored minus the total number of points scored against it. It is common to look at point differential at the game level, at the season level, or over a portion of a season. It is considered a solid baseline for predicting future outcomes because, despite its simplicity, it is quite good at predicting wins. So, when building a model, point differential is a good starting point.
However, point differential is not only useful for modeling as an input that helps predict wins. It is also useful as a target to be predicted. That is, a model built to predict point differential at the game level can actually be better at predicting wins than a model built to predict wins directly. Counterintuitive indeed! A few other metrics also make useful target variables, but point differential has the additional benefit that it perfectly corresponds with wins: if a team's point differential for a game is positive, then that team won. Therefore, if you can predict point differential, you can directly predict game outcomes. It is an incredibly useful metric.
So, what are the downsides of using point differential as a target variable? One is that when teams get an abnormal number of scoring opportunities (because of a shootout, overtime, or some other reason), the point differential can be artificially inflated or deflated. Another downside is that there are 8 units contributing to the overall point differential, and it would be helpful to separate the point differential into the contributions from each of these units. But wait, aren't there only 3 units: Offense, Defense, and Special Teams? This might be overkill, but I also want to separate special teams into the Kickoff Kicking unit, the Kickoff Receiving unit, the Punt Kicking unit, the Punt Receiving unit, the Field Goal/Extra Point Kicking unit, and the Field Goal/Extra Point Defending unit. There might be some correlations in talent/outcomes between these units, but I don't think we should make that assumption before checking through modeling. Okay, let's say there are 8 units. Why is it helpful to separate the point differential between them? Each unit has a separate ability, and that ability might be easier to predict or have less variance than the other units'. Imagine that an Offense gets down to the 10-yard line but then the Field Goal/Extra Point Kicking unit (FG/XP Kicking unit) misses the field goal. That missed field goal definitely doesn't tell us anything about the Offense. It also only tells us a little bit about the FG/XP Kicking unit, and it should probably be assumed that another miss in this situation is still unlikely. Therefore, properly apportioning the point differential between these two units would show that the Offense did better than the overall point differential suggests, and the FG/XP Kicking unit did worse.
 
 
So, let's start modifying point differential to find an even better metric to use as our target variable. We could try to address the difference in the number of scoring attempts by dividing the point differential by the number of plays. This is a step in the right direction, but we still can't distribute the contributions among the units. One big way to help with this distribution is to use 'Expected Points Added' from nflfastR. nflfastR gives each game state an expected value (called 'Expected Points'). Then, for each play, it takes the difference between the Expected Points of the game states before and after that play to calculate the amount of Expected Points added on that play. This allows us to ascribe the Expected Points added, or 'EPA', of each play to whichever units were on the field for that play. Thus we can apportion the overall point differential to each unit on a per-play basis! This also allows us to divide the point differential by the number of plays so that longer and shorter games are put on the same scale. This normalization makes sense, but it still has an issue that we need to address.
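As a rough sketch of what that per-play bookkeeping looks like (this is illustrative, not the code from the top of the page), nflfastR's play-by-play data already carries a per-play epa column from the possession team's perspective, so the naive per-play metric is a simple summary:

```r
library(dplyr)
library(nflfastR)

# Load play-by-play data; 2021 is just an illustrative season.
pbp <- load_pbp(2021)

# Each row carries nflfastR's per-play EPA from the possession team's
# perspective, so the naive per-play metric is a simple group summary.
epa_per_play <- pbp %>%
  filter(!is.na(epa), !is.na(posteam)) %>%
  group_by(posteam) %>%
  summarise(
    plays        = n(),
    epa_per_play = mean(epa),
    .groups      = "drop"
  )
```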
The issue is that some teams win with a lot of consistent, low-value plays, while other teams win with fewer explosive, high-value plays. Using EPA per play penalizes the more consistent teams, to the point that the winning team can end up with a lower EPA per play! Even if EPA per play is more predictive than other metrics, if we are trying to use it as a target variable to predict who won, it needs to accurately reflect who won.
So, we can attribute the points added to different units using EPA, but we need to normalize by a value that is the same for both teams in a given game. Instead of normalizing by the number of plays, we can normalize each unit by the total number of drives in the game. This metric, EPA per Drive or EPApD, adjusts for games that have a different number of scoring opportunities without penalizing different styles of gameplay. It also has the property that the sum of the EPApD of Team A's units will be greater than the sum of Team B's units if and only if Team A won. This meets all of the criteria that we were looking for!
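Here is a minimal sketch of the drive normalization, continuing from the data loaded above. It lumps all of a team's plays together for now (the per-unit split comes below), and it assumes that halving the distinct count of nflfastR's fixed_drive numbers gives each team's drive count, which naturally produces the 0.5 for an extra drive described in the next part:

```r
# Per-team drive counts: fixed_drive numbers every drive in a game, so half
# the distinct count is one team's drives (with 0.5 when one team got an
# extra drive).
drives <- pbp %>%
  filter(!is.na(fixed_drive)) %>%
  group_by(game_id) %>%
  summarise(team_drives = n_distinct(fixed_drive) / 2, .groups = "drop")

# First-pass EPA per Drive for each team-game, before splitting by unit.
team_epapd <- pbp %>%
  filter(!is.na(epa), !is.na(posteam)) %>%
  group_by(game_id, posteam) %>%
  summarise(total_epa = sum(epa), .groups = "drop") %>%
  left_join(drives, by = "game_id") %>%
  mutate(epapd = total_epa / team_drives)
```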
 
 
So we have found the type of metric that we want to use, but how exactly do we calculate it? For each game, we can attribute the EPA of each play to the appropriate unit and keep running totals for each unit. We can then check that the sum of all the units equals the point differential. Then, for each half, we count the number of drives, adding 0.5 if one team got an extra drive. Finally, to get the metric, we take each unit's total EPA in a game and divide it by the total number of drives in that game.
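A rough sketch of that bookkeeping, continuing from the data above. The mapping from play_type to unit is a simplified assumption of mine (the real attribution has many edge cases), and the reconciliation check uses result, nflfastR's home-minus-away final margin:

```r
# Simplified unit labels from the possession team's perspective; fakes,
# blocks, and other edge cases are ignored here, and the kickoff label
# assumes nflfastR lists the receiving team as the possession team.
label_unit <- function(play_type) {
  case_when(
    play_type %in% c("pass", "run", "qb_kneel", "qb_spike") ~ "Offense",
    play_type == "kickoff"                                  ~ "Kickoff Receiving",
    play_type == "punt"                                     ~ "Punt Kicking",
    play_type %in% c("field_goal", "extra_point")           ~ "FG/XP Kicking",
    TRUE                                                    ~ NA_character_
  )
}

unit_totals <- pbp %>%
  filter(!is.na(epa), !is.na(posteam)) %>%
  mutate(unit = label_unit(play_type)) %>%
  group_by(game_id, posteam, unit) %>%
  summarise(unit_epa = sum(epa), .groups = "drop")

# Reconciliation check from the home team's perspective: the signed sum of
# the units' EPA should equal `result`, the home-minus-away point margin.
games <- pbp %>% distinct(game_id, home_team, result)

check <- unit_totals %>%
  left_join(games, by = "game_id") %>%
  group_by(game_id, result) %>%
  summarise(
    home_epa = sum(if_else(posteam == home_team, unit_epa, -unit_epa)),
    .groups  = "drop"
  )
```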
Now, the way that we're treating drives might feel a little weird. If one team gets an extra drive, why does it make sense to divide both teams' EPA totals by the same number? Other than being necessary to keep EPApD perfectly aligned with who won, it takes into account that the number of drives each team has, and whether or not one team gets an extra drive, is a result of how the teams played. It is something that they can control and should be captured in our metric. It might also feel weird to divide the contribution of the Punt Kicking unit by the total number of drives instead of the number of punts. Once again, this is necessary to make the sum of a team's unit EPApDs greater than 0 only when the team wins, but it also reflects how often the team used that unit. If the game only had one punt, the team chose, or was forced, to use that unit less than normal, and the metric should reflect that.
To be clear, if Team A is playing Team B, the EPApD for Team A's Offense will equal the negative of the EPApD for Team B's Defense, the EPApD for Team B's Punt Kicking unit will equal the negative of the EPApD for Team A's Punt Receiving unit, and so on. To be consistent when doing the calculations, I calculate the EPApDs for the Home team's units for each game and then derive the values for the Away team's units from them.
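A sketch of that mirroring step, assuming a home_units table built from the totals above with one row per game and unit (columns game_id, away_team, unit, epapd):

```r
# Each away unit's value is the negative of the opposing home unit's value.
mirror <- c(
  "Offense"           = "Defense",
  "Defense"           = "Offense",
  "Kickoff Kicking"   = "Kickoff Receiving",
  "Kickoff Receiving" = "Kickoff Kicking",
  "Punt Kicking"      = "Punt Receiving",
  "Punt Receiving"    = "Punt Kicking",
  "FG/XP Kicking"     = "FG/XP Defending",
  "FG/XP Defending"   = "FG/XP Kicking"
)

away_units <- home_units %>%
  mutate(
    team  = away_team,
    unit  = unname(mirror[unit]),
    epapd = -epapd
  )
```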
 
 
So, let's total up the EPA attributed to each unit using R in the code from the top of the page and get the totals for every game since 2011. Before normalizing the total EPA for each unit by the number of drives, let's make sure that their sum actually equals the point differential for each game. Immediately we have an issue. The totals are completely different from the point differentials!
Looking into the differences, I found that there is too much EPA on every single drive. Is something wrong with nflfastR's EPA? I realized nothing is wrong, but they did make a choice. They chose to have touchdowns, field goals, and safeties be worth their nominal amounts when calculating the Expected Points for each game state. This makes a lot of sense and is useful in most other situations, but it causes a problem in ours. The difficulty is that, since the Expected Points of the game state before a kickoff is non-zero, no unit is being credited with the Expected Points that appear between the scoring play and the ensuing kickoff; they appear out of nowhere. What this means is that the value of, let's say, a field goal isn't actually 3, because the next team gets the ball and starts in an advantageous position (unless the game is over). Therefore, the Expected Points of the next kickoff needs to be subtracted from the value of that field goal in order to make the total EPA equal the point differential. This adjustment must be made for every scoring drive.
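To put an illustrative number on it: if the game state just before a kickoff is worth, say, 0.6 Expected Points to the receiving team, then a field goal's net contribution to the kicking team's units is 3 - 0.6 = 2.4, and the receiving team's next drive is measured starting from that 0.6 baseline. With that adjustment, those Expected Points no longer appear out of nowhere and the per-game totals can sum to the point differential.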
After making the adjustments for scoring drives and accounting for many surprising circumstances (like the few times Tom Brady punted on third down), I found that there were still some discrepancies in a few games. I was able to find the specific plays that were causing issues and realized that the problem originated with the values I was getting from nflfastR, so I filed bug reports for each issue on their GitHub and they quickly fixed them. Now, the sum of the EPA attributed to all the units finally equals the point differential in every game. Normalizing each unit's total by the number of drives per team in the game gives us our new metric: EPApD.
This metric correctly attributes the point differential to each unit, such that the sum of a team's units is positive if and only if that team wins, while also controlling for the number of scoring attempts in different games. Remember, this is a descriptive metric that is best used as a target variable for models that try to predict individual games. It is also useful for seeing how much each unit contributed to a win or loss at the individual game level.
By using this metric as a target variable for predicting NFL game outcomes, the contributions and variance of each unit can be predicted separately and then simply added together to get a prediction for the distribution of the point differential, and therefore a probability for who will win.
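As a minimal sketch of that aggregation step (the per-unit means and variances below are made-up placeholder numbers, not model output, and treating the units as independent is an assumption):

```r
# Predicted home-minus-away EPApD for each of the 8 unit matchups
# (placeholder values for illustration only).
unit_mean <- c(off = 0.45, def = 0.10, ko_kick = 0.00, ko_rec = 0.02,
               punt_kick = -0.03, punt_rec = 0.01, fg_kick = 0.05, fg_def = 0.00)
unit_var  <- c(off = 0.60, def = 0.55, ko_kick = 0.02, ko_rec = 0.02,
               punt_kick = 0.05, punt_rec = 0.05, fg_kick = 0.08, fg_def = 0.08)

# Under a normal approximation, the net margin is the sum of the unit
# predictions, and the win probability is the chance that sum is positive.
net_mean <- sum(unit_mean)
net_sd   <- sqrt(sum(unit_var))   # assumes the unit errors are independent
win_prob <- 1 - pnorm(0, mean = net_mean, sd = net_sd)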
 
 
In order to visualize this metric, I will take each team's average across the 2021 regular season. Once I take this average, I lose the metric's usefulness as a target variable, but gain a descriptive view of what happened in the season. This is similar to looking at point differential across a whole season.
A reminder: although this average might have some predictive value, it is primarily descriptive and does not account for strength of schedule.
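A sketch of that season-level summary, assuming a team_unit_epapd table with one row per game, team, and unit (columns season, season_type, team, unit, epapd):

```r
# Average each team's per-game unit EPApD over the 2021 regular season.
season_avg <- team_unit_epapd %>%
  filter(season == 2021, season_type == "REG") %>%
  group_by(team, unit) %>%
  summarise(avg_epapd = mean(epapd), .groups = "drop")
```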
[Figure: EPApD for each Unit, 2021 Regular Season]
Since Offense and Defense are the most stable and influential, I can shift the values on the Offense/Defense graph by the total special teams contribution in order to show each team's overall EPA per Drive:
[Figure: EPApD for Offense and Defense, Adjusted for Special Teams, 2021 Regular Season]