Prediction and Explanation in Baseball

Casey’s article yesterday about Mike Zunino and calling up prospects reminded me of an important distinction in sports, and in baseball specifically. Predicting any player’s future with much precision is hard, and minor leaguers are harder still. But to predict a player’s future value to his team, we first have to sort out which stats are inherently predictive and which are not.

I wrote a similar piece about soccer, if you like boots and pitches and upper nineties and all that, but here’s the baseball version. The concepts of statistical explanation and statistical prediction are often confused with each other. Let’s take batting average as an example. On this site, we don’t believe in focusing too much on batting average as an analytic tool. There are three main reasons for this: one comes from a prediction argument, one from an explanation argument, and one from a combination of both.

At the player level, consider some work done by one Russell Carleton. Looking at a large set of players who had recorded at least 2,000 at bats, Carleton found that it took about 910 ABs to “adequately predict” the batting averages for players in their next 910 ABs*. This concept is often referred to as “stabilization rate,” the rate at which statistics are able to predict themselves.

So essentially, players’ batting averages required a season-and-a-half to stabilize, and that makes it hard to confidently use players’ batting averages to forecast their future performances. Players’ isolated slugging percentages (ISO), for comparison’s sake, required just 160 ABs to stabilize. That’s fast! OBP and SLG also required less than a season’s worth of plate appearances, though they weren’t as quick as ISO. Batting average just takes too long in the prediction department.
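For the curious, here is a rough sketch of the idea behind stabilization, using simulated hitters rather than real ones. It is not Carleton’s actual procedure. The talent spread (a .265 average with a .020 standard deviation), the player count, and the function names are all assumptions made up for illustration. Each fake player gets two independent samples of at bats, and we check how well the first sample’s batting average correlates with the second’s.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def split_half_correlation(n_ab, n_players=500, seed=0):
    """Correlate batting average over two independent samples of n_ab at bats.

    Every number here is simulated; the talent distribution is an
    assumption for illustration, not an estimate from real data.
    """
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_players):
        talent = rng.gauss(0.265, 0.020)  # hypothetical "true" batting average
        hits_1 = sum(rng.random() < talent for _ in range(n_ab))
        hits_2 = sum(rng.random() < talent for _ in range(n_ab))
        first.append(hits_1 / n_ab)
        second.append(hits_2 / n_ab)
    return pearson(first, second)

if __name__ == "__main__":
    for n_ab in (100, 300, 910):
        print(n_ab, round(split_half_correlation(n_ab), 2))
```

In this toy setup the correlation between samples climbs as the number of at bats grows, which is the intuition behind stabilization: the more hit-or-miss luck a stat carries per at bat, the more at bats it needs before it can predict itself.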

That summarizes prediction’s argument against using players’ batting averages, but what about the explanation argument? For the explanation argument, first consider that it is an offense’s job to score runs. Accumulating hits and other stats is cool, but at the end of the day, it’s all about how many runs your team scored and how they did it. Can batting average adequately explain teams’ run scoring abilities? No, not relative to OBP, SLG and wOBA.

Using data from just the first half of 2012, batting average was only able to explain about 52% of the variation in teams’ run scoring. OBP explained 61%, SLG 78%, and wOBA a whopping 84%. In other words, using a team’s wOBA we can explain that particular team’s run scoring with far more precision; we can better account for randomness and variation, to put it nerdily. Everything seems to explain run scoring better than batting average can.
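If those percentages are read as R-squared values from a simple one-variable regression (my assumption; the exact method isn’t spelled out here), the calculation looks something like the sketch below. The function name and the example numbers are made up for illustration and are not the 2012 team data.

```python
def r_squared(xs, ys):
    """Share of the variation in ys explained by a straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual variation left over after the fit vs. total variation in ys.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

if __name__ == "__main__":
    # Hypothetical team stat values and run totals, NOT real data.
    stat = [0.300, 0.310, 0.320, 0.330, 0.340]
    runs = [330, 360, 350, 390, 400]
    print(round(r_squared(stat, runs), 2))
```

Run once per stat against the same run totals, a calculation like this gives a side-by-side comparison of how much of run scoring each stat accounts for.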

We’ve assessed prediction and explanation a little now, so why don’t we combine them? Now I’m going to look at how well teams’ first-half numbers in 2012 were able to explain and predict their run scoring numbers during the second half**. Here’s that chart:

[Chart: Stat vs. Runs Explained]

Even team speed, as measured by stolen bases, was better than batting average at forecasting run scoring. While this data came from splitting up just one season, it reiterates what larger data sets have shown with more certainty. If run scoring is an offense’s goal, as it should be, then it’s better explained and predicted by looking at things other than batting average. The combination of on-base skills, power and speed not only explains a team’s run scoring ability more precisely but also forecasts its run scoring potential more precisely.

*Close enough for government work. Technically it wasn’t the next 910 ABs, but the goal here was determining predictive properties for a statistic, and I believe that goal was attained.

**The author understands that there are confounding factors due to trades and second-half team lineup decisions. The author believes his point remains intact.