Baseball Musings
January 22, 2006
Improving the Park Model

One suggestion for improving the Probabilistic Model of Range was to use just visiting players to construct the park model. The reason for this is that an everyday player can skew the values associated with a park. A really good fielder, accounting for almost half the data behind the model at his position, makes everyone else look worse than they are. The same is true of a very poor fielder, who makes everyone else look better.

My resistance to this idea was two-fold.

  1. I didn't want to throw out perfectly good data, especially with small sample sizes.
  2. I couldn't come up with a good way to smooth the data. With only half the data to look at, there are going to be rare events that aren't covered by the visiting teams alone.

I decided to try to solve the smoothing problem today. I use the original park model (the first table in this post) to augment the data when the visiting team numbers are missing or sparse. What I want is for the visiting team model to dominate. For a given set of parameters, if the number of balls in play against the visiting team is greater than or equal to the number of balls in play against the home team, then I just use the visiting team model. Otherwise, I use a model weighted like this:

  • 2.0*VisitBallsInPlay/AllBallsInPlay for the visiting team model
  • 1.0-(2.0*VisitBallsInPlay/AllBallsInPlay) for the overall park model.

Let's say there were 100 balls in play for a particular set of parameters. If 60 of those came against visiting teams, I just use the visiting team model. But if 40 of those came against the visiting team, I weight the model 80% visiting team, 20% overall park model.
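As a rough sketch of the weighting rule above, here is how the blend could be computed for one set of parameters. The function names and the idea of blending two out probabilities (`p_visit`, `p_park`) are assumptions made for illustration, not the actual code behind the model.

```python
def visiting_weight(visit_bip: int, all_bip: int) -> float:
    """Weight on the visiting-team model for one set of parameters.

    visit_bip: balls in play against visiting teams (hypothetical name)
    all_bip:   all balls in play, visiting plus home (hypothetical name)
    """
    if all_bip <= 0:
        raise ValueError("need at least one ball in play")
    # Visiting >= home is the same as 2 * visiting / all >= 1,
    # so capping the ratio at 1.0 covers the "just use visiting" case.
    return min(1.0, 2.0 * visit_bip / all_bip)


def blended_out_probability(visit_bip: int, all_bip: int,
                            p_visit: float, p_park: float) -> float:
    """Blend the visiting-team and overall park estimates (assumed form)."""
    if visit_bip == 0:
        # No visiting-team data at all: fall back to the overall park model.
        return p_park
    w = visiting_weight(visit_bip, all_bip)
    return w * p_visit + (1.0 - w) * p_park


# The two cases from the example in the text.
w_60 = visiting_weight(60, 100)   # visiting model only
w_40 = visiting_weight(40, 100)   # 80% visiting, 20% overall park
```

Capping the ratio at 1.0 is just a compact way of writing the two-branch rule; it gives the same weights as checking visiting against home balls in play first.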

The following table should be compared to the first table of this post. That's the model used for smoothing.

Probabilistic Model of Range, 2005. Model Includes Parks, Smoothed Visiting Team Fielding
Team          In Play  Actual Outs  Predicted Outs    DER  Predicted DER  Difference
Astros           4204         2963         2845.45  0.705          0.677     0.02796
Athletics        4286         3064         2944.68  0.715          0.687     0.02784
White Sox        4457         3175         3052.08  0.712          0.685     0.02758
Phillies         4211         2962         2846.84  0.703          0.676     0.02735
Indians          4385         3108         2988.57  0.709          0.682     0.02724
Cardinals        4414         3101         2991.45  0.703          0.678     0.02482
Braves           4559         3162         3059.99  0.694          0.671     0.02238
Blue Jays        4511         3156         3058.15  0.700          0.678     0.02169
Twins            4545         3193         3094.64  0.703          0.681     0.02164
Angels           4383         3070         2987.00  0.700          0.681     0.01894
Giants           4520         3152         3070.46  0.697          0.679     0.01804
Orioles          4377         3032         2953.85  0.693          0.675     0.01786
Red Sox          4575         3127         3053.44  0.683          0.667     0.01608
Pirates          4467         3095         3023.38  0.693          0.677     0.01603
Mariners         4546         3184         3111.16  0.700          0.684     0.01602
Devil Rays       4560         3112         3044.55  0.682          0.668     0.01479
Diamondbacks     4571         3118         3062.57  0.682          0.670     0.01213
Brewers          4252         2960         2908.65  0.696          0.684     0.01208
Tigers           4527         3152         3099.48  0.696          0.685     0.01160
Cubs             4117         2871         2825.48  0.697          0.686     0.01106
Rangers          4697         3200         3152.10  0.681          0.671     0.01020
Dodgers          4392         3073         3031.40  0.700          0.690     0.00947
Rockies          4537         3043         3008.62  0.671          0.663     0.00758
Mets             4424         3094         3061.94  0.699          0.692     0.00725
Padres           4423         3051         3043.61  0.690          0.688     0.00167
Yankees          4483         3087         3085.86  0.689          0.688     0.00025
Marlins          4367         2965         2965.36  0.679          0.679    -0.00008
Nationals        4538         3161         3167.85  0.697          0.698    -0.00151
Royals           4611         3068         3099.55  0.665          0.672    -0.00684
Reds             4650         3148         3191.15  0.677          0.686    -0.00928
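
For anyone who wants to check the columns, here is a small sketch of how they appear to relate. It assumes DER is outs divided by balls in play and that the Difference column is actual DER minus predicted DER, taken before rounding to three places; that reading matches the Astros row, but the function name is only illustrative.

```python
def der(outs: float, in_play: int) -> float:
    """Defensive efficiency: share of balls in play turned into outs."""
    return outs / in_play


# Astros row from the table: balls in play, actual outs, predicted outs.
in_play, actual_outs, predicted_outs = 4204, 2963, 2845.45

actual_der = der(actual_outs, in_play)
predicted_der = der(predicted_outs, in_play)

# Assumed definition: difference of the unrounded DER values.
difference = actual_der - predicted_der

print(f"{actual_der:.3f} {predicted_der:.3f} {difference:.5f}")
```

Subtracting the rounded values (0.705 and 0.677) would give 0.028 rather than the 0.02796 in the table, which is why the sketch works from the unrounded figures.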

The first thing that strikes me is that the Yankees move up. I didn't expect that. One reason readers suggested a visiting team model was that fielders with poor range like Jeter and Williams would bring down the average and would end up being rated higher than they should be. Yet the Yankees get better with a model dominated by the opposition!

Let me suggest that the original model measured something this model doesn't: a player against himself as he ages. The original model was comparing the 2005 Bernie Williams with the 2002, 2003 and 2004 Williams. My guess is his range is going down as he ages. The same goes for Jeter. So instead of pulling the averages down, their younger selves were pulling the averages up.

Even with that, I don't see a big difference between the models. Does anyone believe that one is really superior to the other?


Comments

I haven't been involved with any of the park discussions, but it sounds like you adjust for ballpark on the vector level (or some very small level like that). I suggest you regress your factors to the mean. There's a simple analysis: take the three-year figures and regress them against the fourth year to see how predictive they are. I'm guessing you won't find many that are terribly predictive. Personally, I would drop the ones with a correlation below .2 or .1 and regress all others to the mean by (1-r).

You could also do this for all players vs. just the visiting players to see which data set is more predictive.

My own bias would be to forgo the park adjustment altogether unless you find some vectors in some parks with a relatively high correlation from year to year (say, over .5).

Posted by: studes at January 23, 2006 11:06 AM