Baseball Musings: Improving the Park Model

January 22, 2006

Improving the Park Model

One suggestion for improving the park model for the Probabilistic Model of Range was to use just visiting players to construct it. The reason is that an everyday player can skew the values associated with a park. A really good fielder, accounting for almost half the model, makes everyone else look worse than they are. Likewise, a very poor fielder makes everyone else look better.

My resistance to this idea was two-fold.

I didn't want to throw out perfectly good data, especially with small sample sizes.
I couldn't come up with a good way to smooth the data. With only half the data, there are going to be rare events that the visiting teams alone don't cover.

I decided to try to solve the smoothing problem today. I use the original park model (the first table in this post) to augment the data when the visiting team numbers are missing or sparse. What I want is for the visiting team model to dominate. For a given set of parameters, if the number of balls in play against the visiting team is greater than or equal to the number of balls in play by the home team, then I just use the visiting team model. Otherwise, I use a model weighted like this:

2.0*VisitBallsInPlay/AllBallsInPlay for the visiting team model
1.0-(2.0*VisitBallsInPlay/AllBallsInPlay) for the overall park model.

Let's say there were 100 balls in play for a particular set of parameters. If 60 of those came against visiting teams, I just use the visiting team model. But if 40 of those came against the visiting team, I weight the model 80% visiting team, 20% overall park model.
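
The weighting rule above can be sketched in a few lines of Python. This is only an illustration, not the code behind the model: the function names are made up, it assumes the quantity being blended is each model's out probability for a given set of parameters, and the fallback to the overall park model when there are no balls in play at all is an added assumption.

```python
def visiting_weight(visit_bip: int, all_bip: int) -> float:
    """Weight given to the visiting team model for one set of parameters."""
    if all_bip == 0:
        # Assumption (not from the post): with no data, lean on the park model.
        return 0.0
    # 2 * visit / all reaches 1.0 exactly when visiting balls in play equal
    # the home balls in play, so capping at 1.0 covers the
    # "just use the visiting team model" case.
    return min(1.0, 2.0 * visit_bip / all_bip)


def blended_out_probability(p_visit: float, p_park: float,
                            visit_bip: int, all_bip: int) -> float:
    """Blend the visiting team model with the overall park model."""
    w = visiting_weight(visit_bip, all_bip)
    return w * p_visit + (1.0 - w) * p_park


# The two cases from the example: 60 of 100 and 40 of 100 balls in play.
print(visiting_weight(60, 100))  # visiting model only
print(visiting_weight(40, 100))  # 80% visiting, 20% overall park
```

Capping the weight at 1.0 folds the "greater than or equal" case and the weighted case into one expression, since the two rules agree at the point where the visiting and home counts are equal.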

The following table should be compared to the first table of this post, which shows the original park model used for smoothing.

Probabilistic Model of Range, 2005. Model Includes Parks, Smoothed Visiting Team Fielding
Team	InPlay	Actual Outs	Predicted Outs	DER	Predicted DER	Difference
Astros	4204	2963	2845.45	0.705	0.677	0.02796
Athletics	4286	3064	2944.68	0.715	0.687	0.02784
White Sox	4457	3175	3052.08	0.712	0.685	0.02758
Phillies	4211	2962	2846.84	0.703	0.676	0.02735
Indians	4385	3108	2988.57	0.709	0.682	0.02724
Cardinals	4414	3101	2991.45	0.703	0.678	0.02482
Braves	4559	3162	3059.99	0.694	0.671	0.02238
Blue Jays	4511	3156	3058.15	0.700	0.678	0.02169
Twins	4545	3193	3094.64	0.703	0.681	0.02164
Angels	4383	3070	2987.00	0.700	0.681	0.01894
Giants	4520	3152	3070.46	0.697	0.679	0.01804
Orioles	4377	3032	2953.85	0.693	0.675	0.01786
Red Sox	4575	3127	3053.44	0.683	0.667	0.01608
Pirates	4467	3095	3023.38	0.693	0.677	0.01603
Mariners	4546	3184	3111.16	0.700	0.684	0.01602
Devil Rays	4560	3112	3044.55	0.682	0.668	0.01479
Diamondbacks	4571	3118	3062.57	0.682	0.670	0.01213
Brewers	4252	2960	2908.65	0.696	0.684	0.01208
Tigers	4527	3152	3099.48	0.696	0.685	0.01160
Cubs	4117	2871	2825.48	0.697	0.686	0.01106
Rangers	4697	3200	3152.10	0.681	0.671	0.01020
Dodgers	4392	3073	3031.40	0.700	0.690	0.00947
Rockies	4537	3043	3008.62	0.671	0.663	0.00758
Mets	4424	3094	3061.94	0.699	0.692	0.00725
Padres	4423	3051	3043.61	0.690	0.688	0.00167
Yankees	4483	3087	3085.86	0.689	0.688	0.00025
Marlins	4367	2965	2965.36	0.679	0.679	-0.00008
Nationals	4538	3161	3167.85	0.697	0.698	-0.00151
Royals	4611	3068	3099.55	0.665	0.672	-0.00684
Reds	4650	3148	3191.15	0.677	0.686	-0.00928
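
For readers who want to check the columns, here is a small Python sketch of how they appear to relate. It assumes DER is outs divided by balls in play and that Difference is DER minus Predicted DER taken before rounding, which is consistent with the numbers in the table; the function name is hypothetical.

```python
def der_columns(in_play: int, actual_outs: float, predicted_outs: float):
    """Return (DER, Predicted DER, Difference) for one team row."""
    der = actual_outs / in_play
    predicted_der = predicted_outs / in_play
    return der, predicted_der, der - predicted_der


# Astros row from the table: InPlay 4204, Actual Outs 2963, Predicted Outs 2845.45
der, predicted_der, difference = der_columns(4204, 2963, 2845.45)
print(f"{der:.3f} {predicted_der:.3f} {difference:.5f}")
```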

The first thing that strikes me is that the Yankees move up. I didn't expect that. One reason readers suggested a visiting team model was that fielders with poor range like Jeter and Williams would bring down the average and would end up being rated higher than they should be. Yet the Yankees get better with a model dominated by the opposition!

Let me suggest that the original model measured something this model doesn't: a player against himself as he ages. The original model compares the 2005 Bernie Williams with the 2002, 2003 and 2004 Williams. My guess is his range is going down as he ages. The same with Jeter. So instead of pulling the averages down, their younger selves were pulling the averages up.

Even with that, I don't see a big difference between the models. Does anyone believe that one is really superior to the other?

Posted by David Pinto at 01:11 PM | Probabilistic Model of Range

Comments

I haven't been involved with any of the park discussions, but it sounds like you adjust for ballpark on the vector level (or some very small level like that). I suggest you regress your factors to the mean. There's a simple analysis: take the three-year figures and regress them against the fourth year to see how predictive they are. I'm guessing you won't find many that are terribly predictive. Personally, I would drop the ones with a correlation below .2 or .1 and regress all others to the mean by (1-r).
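
A rough Python sketch of that suggestion follows. It is only one reading of it: the function names, the idea that a dropped factor falls back to the mean, and the default cutoff of 0.2 are assumptions for illustration. "Regress to the mean by (1-r)" is taken here to mean moving a factor a fraction (1-r) of the way toward the mean.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def regress_to_mean(factor, mean, r, min_r=0.2):
    """Shrink a park factor toward the mean by (1 - r).

    Factors whose year-to-year correlation is below min_r are dropped,
    which is treated here as using the mean with no park adjustment.
    """
    if r < min_r:
        return mean
    return mean + r * (factor - mean)


# Hypothetical numbers: a factor of 1.10 against a mean of 1.00.
print(regress_to_mean(1.10, 1.00, r=0.5))  # moves halfway back to the mean
print(regress_to_mean(1.10, 1.00, r=0.1))  # below the cutoff, so dropped
```

Here `pearson_r` would be fed the three-year figures and the matching fourth-year figures for a given vector to get r.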

You could also do this for all players vs. just the visiting players to see which data set is more predictive.

My own bias would be to forgo the park adjustment altogether unless you find some vectors in some parks with a relatively high correlation from year to year (say, over .5).

Posted by: studes at January 23, 2006 11:06 AM