January 22, 2006
Improving the Park Model
One suggestion for improving the Probabilistic Model of Range was to use just visiting players to construct the park model. The reason is that an everyday player can skew the values associated with a park. A really good fielder, accounting for almost half the model, makes everyone else look worse than they are. The same is true of a very poor fielder, who makes everyone else look better.
My resistance to this idea was two-fold.
- I didn't want to throw out perfectly good data, especially with small sample sizes.
- I couldn't come up with a good way to smooth the data. With only half the data, there are going to be rare events that the visiting team alone doesn't cover.
I decided to try to solve the smoothing problem today. I use the original park model (the first table in this post) to augment the data when the visiting team numbers are missing or sparse. What I want is for the visiting team model to dominate. For a given set of parameters, if the number of balls in play against the visiting team is greater than or equal to the number of balls in play by the home team, then I just use the visiting team model. Otherwise, I use a model weighted like this:
- 2.0*VisitBallsInPlay/AllBallsInPlay for the visiting team model
- 1.0-(2.0*VisitBallsInPlay/AllBallsInPlay) for the overall park model.
Let's say there were 100 balls in play for a particular set of parameters. If 60 of those came against visiting teams, I just use the visiting team model. But if 40 of those came against the visiting team, I weight the model 80% visiting team, 20% overall park model.
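The weighting above can be sketched in a few lines of Python. This is only an illustration, not the code behind the tables: the function names are made up, and it assumes that each model yields a probability of an out for a given set of parameters and that the two probabilities are blended linearly. Capping the weight at 1.0 is the same as the "visiting is at least half of all balls in play" rule.

```python
def visiting_weight(visit_bip: int, all_bip: int) -> float:
    """Weight given to the visiting team model for one set of parameters.

    2.0 * visit / all reaches 1.0 exactly when the visiting team accounts
    for half of all balls in play, so capping at 1.0 reproduces the rule
    "use only the visiting team model when visiting >= home".
    """
    if all_bip <= 0:
        raise ValueError("need at least one ball in play")
    return min(1.0, 2.0 * visit_bip / all_bip)


def smoothed_out_probability(p_visit: float, p_park: float,
                             visit_bip: int, all_bip: int) -> float:
    """Blend the two models (assumed here to be out probabilities)."""
    w = visiting_weight(visit_bip, all_bip)
    if w == 0.0:
        # No visiting team data at all: fall back to the overall park model.
        return p_park
    return w * p_visit + (1.0 - w) * p_park


# The two cases from the example: 60 of 100 and 40 of 100 balls in play.
print(visiting_weight(60, 100))  # visiting model only
print(visiting_weight(40, 100))  # 80% visiting, 20% overall park
```

The out probabilities passed to `smoothed_out_probability` are placeholders; any per-parameter quantity the two models share could be blended the same way.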
The following table should be compared to the first table of this post. That's the model used for smoothing.
Probabilistic Model of Range, 2005. Model Includes Parks, Smoothed Visiting Team Fielding
Team | InPlay | Actual Outs | Predicted Outs | DER | Predicted DER | Difference |
Astros | 4204 | 2963 | 2845.45 | 0.705 | 0.677 | 0.02796 |
Athletics | 4286 | 3064 | 2944.68 | 0.715 | 0.687 | 0.02784 |
White Sox | 4457 | 3175 | 3052.08 | 0.712 | 0.685 | 0.02758 |
Phillies | 4211 | 2962 | 2846.84 | 0.703 | 0.676 | 0.02735 |
Indians | 4385 | 3108 | 2988.57 | 0.709 | 0.682 | 0.02724 |
Cardinals | 4414 | 3101 | 2991.45 | 0.703 | 0.678 | 0.02482 |
Braves | 4559 | 3162 | 3059.99 | 0.694 | 0.671 | 0.02238 |
Blue Jays | 4511 | 3156 | 3058.15 | 0.700 | 0.678 | 0.02169 |
Twins | 4545 | 3193 | 3094.64 | 0.703 | 0.681 | 0.02164 |
Angels | 4383 | 3070 | 2987.00 | 0.700 | 0.681 | 0.01894 |
Giants | 4520 | 3152 | 3070.46 | 0.697 | 0.679 | 0.01804 |
Orioles | 4377 | 3032 | 2953.85 | 0.693 | 0.675 | 0.01786 |
Red Sox | 4575 | 3127 | 3053.44 | 0.683 | 0.667 | 0.01608 |
Pirates | 4467 | 3095 | 3023.38 | 0.693 | 0.677 | 0.01603 |
Mariners | 4546 | 3184 | 3111.16 | 0.700 | 0.684 | 0.01602 |
Devil Rays | 4560 | 3112 | 3044.55 | 0.682 | 0.668 | 0.01479 |
Diamondbacks | 4571 | 3118 | 3062.57 | 0.682 | 0.670 | 0.01213 |
Brewers | 4252 | 2960 | 2908.65 | 0.696 | 0.684 | 0.01208 |
Tigers | 4527 | 3152 | 3099.48 | 0.696 | 0.685 | 0.01160 |
Cubs | 4117 | 2871 | 2825.48 | 0.697 | 0.686 | 0.01106 |
Rangers | 4697 | 3200 | 3152.10 | 0.681 | 0.671 | 0.01020 |
Dodgers | 4392 | 3073 | 3031.40 | 0.700 | 0.690 | 0.00947 |
Rockies | 4537 | 3043 | 3008.62 | 0.671 | 0.663 | 0.00758 |
Mets | 4424 | 3094 | 3061.94 | 0.699 | 0.692 | 0.00725 |
Padres | 4423 | 3051 | 3043.61 | 0.690 | 0.688 | 0.00167 |
Yankees | 4483 | 3087 | 3085.86 | 0.689 | 0.688 | 0.00025 |
Marlins | 4367 | 2965 | 2965.36 | 0.679 | 0.679 | -0.00008 |
Nationals | 4538 | 3161 | 3167.85 | 0.697 | 0.698 | -0.00151 |
Royals | 4611 | 3068 | 3099.55 | 0.665 | 0.672 | -0.00684 |
Reds | 4650 | 3148 | 3191.15 | 0.677 | 0.686 | -0.00928 |
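For readers checking the arithmetic, the table's columns appear to be related as in the short Python sketch below. This is inferred from the numbers in the table rather than stated anywhere in the post, and the function name is made up: DER is actual outs divided by balls in play, predicted DER is predicted outs divided by balls in play, and the difference is taken before rounding.

```python
def der_columns(in_play: int, actual_outs: int, predicted_outs: float):
    """Return (DER, predicted DER, difference) for one team's row."""
    der = actual_outs / in_play
    predicted_der = predicted_outs / in_play
    return der, predicted_der, der - predicted_der


# Astros row: 4204 in play, 2963 actual outs, 2845.45 predicted outs.
der, predicted_der, diff = der_columns(4204, 2963, 2845.45)
print(round(der, 3), round(predicted_der, 3), round(diff, 5))
```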
The first thing that strikes me is that the Yankees move up. I didn't expect that. One reason readers suggested a visiting team model was that fielders with poor range like Jeter and Williams would bring down the average and would end up being rated higher than they should be. Yet the Yankees get better with a model dominated by the opposition!
Let me suggest that the original model measured something this model doesn't: a player against himself as he ages. So the original model is comparing the 2005 Bernie Williams vs. the 2002, 2003 and 2004 Williams. My guess is his range is going down as he ages. The same with Jeter. So instead of pulling the averages down, their younger selves were pulling the averages up.
Even with that, I don't see a big difference between the models. Does anyone believe that one is really superior to the other?
I haven't been involved with any of the park discussions, but it sounds like you adjust for ballpark on the vector level (or some very small level like that). I suggest you regress your factors to the mean. There's a simple analysis: take the three-year figures and regress them against the fourth year to see how predictive they are. I'm guessing you won't find many that are terribly predictive. Personally, I would drop the ones with a correlation below .2 or .1 and regress all others to the mean by (1-r).
You could also do this for all players vs. just the visiting players to see which data set is more predictive.
My own bias would be to forgo the park adjustment altogether unless you find some vectors in some parks with a relatively high correlation from year to year (say, over .5).
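A rough Python sketch of the regression idea in the comment above might look like the following. Everything here is an assumption made for illustration: the function names, the use of Pearson correlation between three-year factors and fourth-year factors, and the reading of "drop" as falling back to the league mean (no park adjustment). The 0.2 cutoff is the one mentioned in the comment.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


def regress_factor(factor, mean, r, min_r=0.2):
    """Pull a park factor toward the mean by (1 - r).

    Factors whose year-to-year correlation falls below min_r are dropped,
    which is interpreted here as using the mean (no adjustment).
    """
    if r < min_r:
        return mean
    return mean + r * (factor - mean)


# Hypothetical numbers: a factor of 1.10 against a mean of 1.00.
print(regress_factor(1.10, 1.00, r=0.5))  # moved halfway to the mean
print(regress_factor(1.10, 1.00, r=0.1))  # below the cutoff, so dropped
```

Running `pearson_r` once on factors built from all players and once on factors built from visiting players only would give the comparison of predictive value that the comment suggests.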