Predicting the MLB Draft: Understanding the Model
I recently wrote about projecting the 2020 MLB Draft, providing the full methodology and data used for the model, its limitations, and ways to improve it. The model was inspired by the NFL Draft modeling done by ESPN, and its goal is twofold: to find plus expected value (+EV) bets in the market, and to understand why teams take players where they do in the draft.
The formula was pick = rank + age + position + school + FV, fit as a Bayesian logit model on FanGraphs data from 2015-2019. But what is under the hood of the model? Using the randomForest package in R, we can see the importance of each variable by looking at the percent increase in mean squared error when that variable is scrambled (the higher the value, the more important the variable):
|Variable||% Increase MSE|
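To make the percent-increase-in-MSE idea concrete, here is a minimal Python sketch of permutation importance: shuffle one variable, re-score the model, and report how much the mean squared error grows. It is a simplified stand-in, not the randomForest package's own procedure, and the toy rows, the `noise` column, the `predict` function, and every name below are hypothetical.

```python
import random


def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def pct_increase_mse(predict, rows, target, column, seed=0):
    """Percent increase in MSE after shuffling one column across rows."""
    rng = random.Random(seed)
    baseline = mse(target, [predict(row) for row in rows])

    # Shuffle the chosen column so it no longer lines up with the target.
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    permuted_rows = [dict(row, **{column: value}) for row, value in zip(rows, shuffled)]

    permuted = mse(target, [predict(row) for row in permuted_rows])
    return 100.0 * (permuted - baseline) / baseline


# Hypothetical toy data: the pick depends on rank, while "noise" is unrelated.
rows = [{"rank": r, "noise": r % 2} for r in range(1, 11)]
target = [1.5 * row["rank"] + (1 if i % 2 else -1) for i, row in enumerate(rows)]


def predict(row):
    """Stand-in model (an assumption): uses rank only and ignores noise."""
    return 1.5 * row["rank"]


rank_importance = pct_increase_mse(predict, rows, target, "rank")
noise_importance = pct_increase_mse(predict, rows, target, "noise")
print("rank:", rank_importance, "noise:", noise_importance)
```

A variable the model never uses scores zero in this sketch, which is the intuition behind treating a low or negative score as "no better than a random variable."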
In this simple model, whether the prospect is a high school or college player is the most important variable, followed by where they rank on FanGraphs' The Board. The position a player plays is the least important variable and is not very predictive of where the player is taken in the draft. In this simple model each position is defined separately, so what if positions were less granular, with a first baseman or third baseman listed instead as a corner infielder?
Doing this for all sets of positions, the new model (model 1) has the following positions: RHP, LHP, C, CIF (corner infield), Middle (shortstop, second base, center field), and Outfield. Rerunning the random forest to get the variable importance, we get the following table comparing the two models:
|Variable||Simple Model||Model 1|
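The regrouping behind model 1 amounts to a lookup from raw position to a coarser group. A rough sketch follows; the group names come from the list above, while the raw position labels (`1B`, `LF`, and so on), the treatment of a generic `OF` as Outfield, and the function name are assumptions made for illustration.

```python
# Raw position -> coarser group used in model 1.
# The raw labels on the left are assumed spellings, not taken from the data.
POSITION_GROUPS = {
    "RHP": "RHP",
    "LHP": "LHP",
    "C": "C",
    "1B": "CIF",
    "3B": "CIF",
    "SS": "Middle",
    "2B": "Middle",
    "CF": "Middle",
    "LF": "Outfield",
    "RF": "Outfield",
    "OF": "Outfield",  # assumption: an unspecified outfielder is a corner outfielder
}


def group_position(raw_position):
    """Map a raw position label to its model 1 group."""
    key = raw_position.strip().upper()
    if key not in POSITION_GROUPS:
        # Fail loudly rather than silently inventing a group.
        raise ValueError(f"unknown position: {raw_position!r}")
    return POSITION_GROUPS[key]


print(group_position("1B"), group_position("ss"), group_position("CF"))
```

Raising on an unknown label is a deliberate choice here: a typo in the position column would otherwise create a seventh group without anyone noticing.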
The ordering of the variables is the same, but the importance of position becomes positive, meaning it is now more important than a random variable. In other words, the new, less granular position is an important predictor in the model and should be used in place of the raw position. One way to check this is by looking at the accuracy of the model. For this we will use the 2015-2018 data to train the model and then look at the 2019 draft, again using the randomForest package.
Because of the model's limitations, the data will be restricted to the first 50 picks. The percent of variance explained by the simple model was 59.78, and in model 1 that increased to 61.43. Looking at the entire draft, the simple model explains 19.05 percent of the variance and model 1 explains 20.27 percent. Note that this covers only players ranked on The Board. The model does much better within the first 50 picks than over the entire draft, and part of this could again be explained by the signability issues of high school players. In the 2019 draft, Matthew Allan ranked 20th on FanGraphs and ultimately went 89th overall. According to the simple model, he had a 10.40 percent chance of falling that far, and model 1 put the probability at 10.59 percent.
In 2020, signability could become a factor with Nick Bitsko, for instance. FanGraphs writes regarding the high school pitcher, “… so teams’ opinions of Bitsko are going to be driven by what they saw last year… That, combined with the possibility that some teams won’t be comfortable taking a player they’ve barely seen, makes Bitsko’s stock pretty volatile.” He will be a player to watch in the draft. One area where this model falls short is predicting high school players, whose draft position is harder to capture in a model this simple. Adding a dummy variable for signability issues, built from scouting reports and something teams would have in their own data, would likely improve the model.
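The holdout check described above can be sketched in a few lines of Python. This assumes that "percent of variance explained" means 100 × (1 − MSE / variance of the actual pick), which is the usual pseudo R-squared reported for a regression forest; the field names `year` and `pick` and both helper names are hypothetical.

```python
def split_by_year(rows, test_year, max_pick=None):
    """Train on earlier drafts, test on one draft, optionally capping the pick."""
    if max_pick is not None:
        rows = [row for row in rows if row["pick"] <= max_pick]
    train = [row for row in rows if row["year"] < test_year]
    test = [row for row in rows if row["year"] == test_year]
    return train, test


def percent_variance_explained(actual, predicted):
    """100 * (1 - MSE / variance of actual); assumed definition, see lead-in."""
    n = len(actual)
    mean_actual = sum(actual) / n
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    variance = sum((a - mean_actual) ** 2 for a in actual) / n
    return 100.0 * (1.0 - mse / variance)


# Hypothetical toy rows, only to show the shape of the split.
rows = [
    {"year": 2017, "pick": 3},
    {"year": 2018, "pick": 60},
    {"year": 2018, "pick": 12},
    {"year": 2019, "pick": 20},
    {"year": 2019, "pick": 89},
]
train, test = split_by_year(rows, test_year=2019, max_pick=50)
print(len(train), len(test))

# Toy predictions for four made-up picks.
score = percent_variance_explained([1, 2, 3, 4], [2, 2, 3, 3])
print(round(score, 2))
```

Under this definition a model that always predicts the average pick scores 0, and a perfect model scores 100, so the reported figures can be read on that scale.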
Going back to the first pick, let’s again look at the top five potential picks under the simple model and model 1, knowing that model 1 performs better because it uses the less granular position groups rather than the raw positions.
|Player||Position||Simple Model||Model 1|
In the simple model, Hancock is the fifth most likely to go 1.1. In model 1, however, he is the sixth most likely, with Arkansas outfielder Heston Kjerstad moving from number seven in the simple model (2.35 percent) to number five (1.98 percent). It seems unlikely that anyone other than Torkelson or Martin goes first overall.
Going forward, given this examination of the simple model, the draft model that will be used is model 1. Raw position was not a predictor of where a player was taken, which suggests the draft does not treat positions that granularly. Grouping positions improved the model, as the amount of variance explained increased. After the 2020 MLB Draft is complete, I will review how the model performed and whether there was any difference in draft strategy, the theory being that general managers will be more risk averse this year and take more college players than in years past.