Evaluating pitchers is tough. Most of the traditional batting stats are based on discrete, context-free events that mostly reflect the offensive player’s own skill: home runs, hits, stolen bases, batting eye, etc. There are still stats like RBI and R that depend on the players around the batter, but generally the quick and dirty things traditionally looked at for a batter were his triple crown stats: AVG, HR, and RBI. Nowadays you’re more likely to see a hitter’s slash line (AVG/OBP/SLG), OPS, or if you’re lucky wOBA, all of which are pretty much context-neutral.
For pitchers it’s much more difficult to tease out context-neutral stats. For a long time the primary stats used to evaluate a pitcher were Wins, ERA, and Strikeouts. I don’t even need to explain how useless Wins are as a stat to evaluate a pitcher’s performance. Much like RBI it’s a narrative stat rather than a particularly quantitative one, and the RBI stat is far less ambiguous. A win depends heavily on the pitcher’s offense scoring enough runs to put the team ahead, as well as on the bullpen not regularly imploding behind him. ERA is a little better: it’s a rate stat that quantifies the number of earned runs scored off the pitcher per 27 outs (nine innings). But it also has issues. One problem is the bullpen issue mentioned above: if you leave a guy on base when you’re pulled from the game and Jeff Samardzija gives up a HR on the first pitch, you’re still dinged for the guy on the basepaths. An even bigger problem is defense. You don’t want to burn a pitcher for pitching in front of a team that’s a bunch of statues. Similarly, a pitcher who plays in front of a team of Ozzie Smiths is obviously going to look a lot better. The fact that pitching is so intertwined with defense makes it harder to tease out some sort of context-free metric for how good a pitcher might be.
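To make the ERA arithmetic concrete, here is a minimal Python sketch of the "earned runs per 27 outs" calculation. The function name and the sample pitching line are made up for illustration.

```python
def era(earned_runs: int, outs_recorded: int) -> float:
    """Earned runs allowed per 27 outs (nine innings)."""
    if outs_recorded <= 0:
        raise ValueError("outs_recorded must be positive")
    return 27 * earned_runs / outs_recorded


# Hypothetical season: 70 earned runs over 200 innings (600 outs).
print(f"{era(70, 600):.2f}")  # 3.15
```

Note that nothing in this calculation knows whether a run scored because of the pitcher, the reliever who inherited his runner, or the fielders behind him, which is exactly the problem described above.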
One big breakthrough in evaluating pitchers came when Voros McCracken introduced Defense Independent Pitching Statistics (DIPS), that is, metrics that completely strip fielding out of the equation. He found that pitchers generally have no control over what happens to balls in play: in almost all cases pitchers’ defense-independent stats such as strikeouts, walks, and home runs tended to be much more stable than their BABIP (batting average on balls in play). There have been more modifications and clarifications to this theory (which we’ll talk about in future stats posts), but overall it provided a new framework for evaluating pitchers.
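For reference, BABIP is usually computed as hits that stayed in the park divided by at-bats that ended with a ball in play. Here is a small Python sketch of that common definition. The function name and sample numbers are hypothetical, and some sources treat sacrifice flies slightly differently.

```python
def babip(hits: int, home_runs: int, at_bats: int,
          strikeouts: int, sac_flies: int = 0) -> float:
    """Batting average on balls in play: (H - HR) / (AB - K - HR + SF)."""
    balls_in_play = at_bats - strikeouts - home_runs + sac_flies
    if balls_in_play <= 0:
        raise ValueError("no balls in play")
    return (hits - home_runs) / balls_in_play


# Hypothetical pitcher: 180 H, 20 HR, 750 AB against, 200 K, 5 SF.
print(f"{babip(180, 20, 750, 200, 5):.3f}")  # 0.299
```

Strikeouts and home runs are removed because no fielder gets a chance at them, which is why they land on the defense-independent side of the ledger.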
Anyway, here’s the formula:
FIP = (13*HR + 3*(BB + HBP - IBB) - 2*K)/IP + C
C is a constant that rescales FIP so it’s on the same scale as ERA, much like we do with wOBA (i.e. the league-average FIP is the same as the league-average ERA). For general purposes you can think of it as roughly 3.2, though it is usually recomputed for each season.
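Here is a minimal Python sketch of the formula and of how the constant could be backed out of league totals. The class and function names are my own, the league totals are invented round numbers, and IP is assumed to be a true decimal (6 1/3 innings is 6.333, not the box-score "6.1").

```python
from dataclasses import dataclass


@dataclass
class PitchingLine:
    hr: int     # home runs allowed
    bb: int     # walks (including intentional)
    hbp: int    # hit batters
    ibb: int    # intentional walks
    k: int      # strikeouts
    ip: float   # innings pitched as a true decimal


def fip_core(p: PitchingLine) -> float:
    """The FIP formula without the + C term."""
    return (13 * p.hr + 3 * (p.bb + p.hbp - p.ibb) - 2 * p.k) / p.ip


def fip_constant(league: PitchingLine, league_era: float) -> float:
    """Pick C so that league-wide FIP equals league ERA."""
    return league_era - fip_core(league)


def fip(p: PitchingLine, c: float = 3.2) -> float:
    return fip_core(p) + c


# Hypothetical pitcher, using the rule-of-thumb constant of 3.2.
pitcher = PitchingLine(hr=20, bb=50, hbp=5, ibb=3, k=200, ip=200.0)
print(f"{fip(pitcher):.2f}")  # 3.28

# Hypothetical league totals with a league ERA of 4.00.
league = PitchingLine(hr=5000, bb=15000, hbp=1700, ibb=1000, k=38000, ip=43000.0)
c = fip_constant(league, league_era=4.00)
print(f"{c:.2f}")  # 3.16
```

Because C is defined as league ERA minus the league's core value, plugging the league totals back into fip() with that C returns the league ERA by construction.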
Where do these numbers come from? It’s the same idea as in wOBA: the weights on the events are derived from the average run value of each event. And in fact you could even say that FIP does include balls in play because of the + C factor at the end. By scaling it to the league ERA, you’re basically saying that FIP evaluates a pitcher on the skills he has control of, relative to facing an average offense with an average defense behind him. Aside from neutralizing the context, one of the advantages of FIP is that it is a better indicator of future performance than ERA. Colin Wyers did a study a few years ago that looked at FIP (as well as a few other DIPS-type stats that we may look at here) as a predictor of future ERA and found that it does roughly a 20% better job than ERA alone.
FIP does have its flaws, which other stats have sought to overcome. One that has always jumped out at me anecdotally is that it ignores things like ground-ball (GB) rate, which a pitcher also has some control over. Some other systems such as tRA and Baseball Prospectus’s SIERA factor in batted ball types. Another problem is that HRs can obviously have a huge impact on the FIP formula but are relatively rare events, so some bad luck on fly balls leaving the park can distort a pitcher’s FIP. xFIP (for expected FIP) tries to improve on FIP by normalizing pitchers’ HR/FB (home runs per fly ball) rates, though on average it’s not much better a predictor than FIP, because if you look at the entire population of pitchers, xFIP should be about the same as FIP.
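The xFIP idea can be sketched in Python by swapping a pitcher's actual home runs for the number he would be expected to allow at a league-average HR/FB rate. The function name is hypothetical, and the 10% league rate and 3.2 constant are placeholder assumptions that would need to come from real league data for the season in question.

```python
def xfip(fly_balls: int, bb: int, hbp: int, ibb: int, k: int, ip: float,
         league_hr_per_fb: float = 0.10, c: float = 3.2) -> float:
    """FIP with actual HR replaced by fly balls times the league HR/FB rate."""
    expected_hr = fly_balls * league_hr_per_fb
    return (13 * expected_hr + 3 * (bb + hbp - ibb) - 2 * k) / ip + c


# Hypothetical pitcher who allowed 220 fly balls in 200 innings.
print(f"{xfip(fly_balls=220, bb=50, hbp=5, ibb=3, k=200, ip=200.0):.2f}")  # 3.41
```

A pitcher who gave up only 15 homers on those 220 fly balls would have a FIP well below this number, and xFIP is effectively betting that the gap is mostly luck.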
FIP is used in Fangraphs’s calculation of Wins Above Replacement, which we will discuss in the future. One debate in the saberist world is which pitching stat one should actually use to value a pitcher’s performance. Fangraphs uses FIP because, as mentioned, it neutralizes defense and offense faced. However Rally, the creator of the now-proprietary CHONE projections, used ERA when creating his historical Wins Above Replacement database, and it is also used at Baseball-Reference. The main question here is whether you prefer FIP, which is more of a predictive stat, i.e. what should have happened, or ERA, which is a narrative stat, i.e. what did happen. The Fangraphs version is referred to as fWAR and the Rally/Baseball-Reference version as rWAR.
Credit where credit is due
FIP was originally created by Tom Tango, based on McCracken’s DIPS theory.
This FIP primer at 3-D Baseball by The Book Blog regular Kincaid was a good reference, and it was the one that pointed out that balls in play are taken into account via our rescaling to ERA.