The Value of Sample Size
*I wrote this article for a class called "Sports and Society." Yes, that is the name of the class. I know, but I really like sports. I take them seriously. I think this stuff is interesting. I hope you do too. Enjoy. I hope.*
5,600 Minutes: How Do You Measure a Player in a Year? | FanGraphs Baseball
I think a good place to start before analyzing anything related to sports is to think about sample size. The article above uses baseball statistics to examine how the usefulness of data changes as the size of the data set increases. I will come back and pick out interesting pieces from the article, but simply acknowledging the importance of sample size is a big step toward making the most of any statistic you might have.
In conversation, we consistently use anecdotes as tools for explaining our analysis of the world. Since as individuals we are limited in how much of the world we can sense, personal experiences are normally the most natural way for us to form meaningful impressions of it. Very often these impressions are personal shortcuts for developing expectations. The idea that we might use a sample size that can be counted on one hand to draw conclusions about the world is troubling, especially when you consider how small your personal sample is compared to the rest of the world.
A common non-sports example is going to a restaurant. If your first experience at a highly regarded restaurant is poor, you are unlikely to go back, regardless of how highly regarded the restaurant might be. A common sports example is drawing conclusions from the outcome of a single sporting event. It is very common for fans to rely solely on the most recent experience when determining the "greatest play ever" or the "best player in the league." We are all slaves to the moment and fall in love with the most recent example of greatness. However, this is a very poor way of analyzing anything, let alone sports.
An individual game does mean something. It means whatever the outcome of the game was. The Packers won the Super Bowl. UConn won the NCAA Tournament. Auburn won the BCS Championship. Does winning the championship mean they are the best teams in their respective leagues? No. Are any of them the best teams in their respective leagues? Maybe. This is one mental exercise for thinking about sample size. Individual data points have meaning on a micro level, but taking a single game to mean anything more than who won and who lost is fruitless. One shot is not enough information to say whether or not a player is a good shooter; neither are two, three, or four shots. You likely need hundreds of shots before a shooting percentage becomes more than a description of how a player performed in the recent past.
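To put rough numbers on the shooting example, here is a minimal sketch in Python. It assumes every shot is an independent attempt with the same chance of going in, and it uses the normal approximation to the binomial, which is itself shaky for very small samples. The function name `shooting_pct_margin` and the 45% shooter are made up for illustration.

```python
import math

def shooting_pct_margin(true_pct, shots, z=1.96):
    """Approximate 95% margin of error for an observed shooting percentage.

    Assumes independent shots with a constant make probability (a
    simplification) and the normal approximation to the binomial.
    """
    standard_error = math.sqrt(true_pct * (1 - true_pct) / shots)
    return z * standard_error

if __name__ == "__main__":
    # Hypothetical 45% shooter observed over different numbers of shots.
    for shots in (4, 40, 400):
        margin = shooting_pct_margin(0.45, shots)
        print(f"{shots:>4} shots: 45% plus or minus {margin:.1%}")
```

Under these assumptions, four shots leave an uncertainty of nearly 50 percentage points in either direction, while 400 shots narrow it to about 5. The margin shrinks with the square root of the number of shots, so cutting the uncertainty in half takes four times as many attempts.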
Analysis of the past and projections for the future require delving through years of data points. In the article above, the author writes about the point at which baseball statistics become valuable tools of analysis. The theme behind his reasoning is that reliability increases with sample size. Yearly statistics, let alone individual games, can be too fickle (unreliable from year to year) to draw conclusions from. If statistics are not reliable, they are not very useful for predicting the future. The author explains his methodology:
What we’re talking about here is a concept known in social science research as measure reliability. It’s the idea that if I took the same measure over and over again, I’d get (roughly) the same answer each time. This shouldn’t be confused with measure validity, which is whether or not the measure I’m using is actually measuring what I think it does. I might ask 25 people to tell me what color the sky is, and they might all say “green with orange polka dots.” The measure is very reliable, but not very valid. In statistics, the way to increase reliability of a measure is to have more observations in the data set. If I took a player’s on-base percentage for his first five at-bats in a season, and then his next five, and then his next five, and so on, those numbers are going to fluctuate all over the place. But if I do it in 200 at-bat sequences, the numbers will be more stable. I’ll hopefully get (roughly) the same number each time I take a sample of 200 at-bats. The question I ask is when does that number become stable enough that we say that it’s OK to make inferences about a group of players?
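The fluctuation described in the quote can be seen in a small simulation. This is only a sketch under my own assumptions, not the author's method: it invents a player with a fixed .330 on-base percentage, treats every at-bat as an independent coin flip with that probability, and measures how much the observed number bounces around across many samples of 5 at-bats versus 200. The function name `obp_spread` is hypothetical.

```python
import random
import statistics

def obp_spread(true_obp, sample_size, n_samples, seed=0):
    """Standard deviation of observed OBP across repeated samples.

    Each at-bat is simulated as an independent trial that reaches base
    with probability true_obp (an assumption made for illustration).
    """
    rng = random.Random(seed)
    observed = []
    for _ in range(n_samples):
        times_on_base = sum(rng.random() < true_obp for _ in range(sample_size))
        observed.append(times_on_base / sample_size)
    return statistics.pstdev(observed)

if __name__ == "__main__":
    for size in (5, 200):
        spread = obp_spread(0.330, size, n_samples=1000)
        print(f"samples of {size:>3} at-bats: spread of about {spread:.3f}")
```

In this toy setup the 5 at-bat samples swing by roughly 200 points of on-base percentage, while the 200 at-bat samples stay within a few dozen points of each other, which is the sense in which the larger sample is more reliable.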
The article goes on to state, based on his calculations, the point at which popular and unpopular baseball statistics stabilize and become useful tools of analysis. My goal is not to preach about which statistics are good and bad or how they should be used. My goal is to emphasize that statistics can be extremely helpful tools; however, they are only valuable if used correctly. If they are used incorrectly, statistics become at best worthless and at worst detrimental to the discussion. In class, when anecdotes are told to prove points, we do not learn much. We learn what happened in one situation at one time. We do not learn what to expect in the general case. When we read papers in which several stories are told to prove the author's point, we should be skeptical, if only because we know that the author is basing his conclusions on limited observations.