This article was originally published in Data Science Central
The ‘Bell curve’, or the ‘Gaussian bell curve’, is one of the fundamental concepts on which much of statistical analysis is based. From the social sciences to astronomy to financial services, most applications of statistics in the real world rely on the assumption that the data being analysed is distributed in the shape of the bell curve.
In the last article we discussed the usefulness of the bell curve: it helps us simplify distributions and reason about them with simple rules, and its symmetry and consistency make it ideal for making predictions.
In this article, we will discuss how these same qualities of the bell curve that make it so tempting and useful can also be a curse.
Does all information follow the Bell Curve?
There are many examples of normal (or approximately normal) distributions around us, and the underlying statistical concepts have been empirically tested and verified countless times.
Certain quantities in physics are normally distributed, such as the velocities of molecules in an ideal gas. In biology, the logarithm of various measurements, such as the thickness of tree bark or the size of a mammal’s claws, tends to have a normal distribution. In finance, changes in the logarithm of certain quantities, such as exchange rates and price indices, are assumed to be normal, though this assumption is hotly contested by some. Bell curve grading assigns relative grades based on a normal distribution of scores.
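The log-normal pattern mentioned above is easy to see in a simulation. The sketch below (a hypothetical example using NumPy and SciPy, not drawn from any real dataset) generates right-skewed log-normal data and shows that taking the logarithm removes the skew:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)

# Simulate a positive, right-skewed quantity (think bark thickness)
# as a log-normal variable: the exponential of a normal variate.
raw = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

# The raw data is heavily right-skewed (skewness well above zero)...
print(f"skewness of raw data:  {skew(raw):.2f}")

# ...but its logarithm is, by construction, normally distributed,
# so its skewness is close to zero.
print(f"skewness of log(data): {skew(np.log(raw)):.2f}")
```

This is why analysts often log-transform such variables before applying methods that assume normality.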
As Dr. Taleb says in his book, The Black Swan, we can make good use of the Gaussian approach (i.e. the bell curve) for variables for which there is a rational reason for the largest observation not to be too far from the average. If there is a kind of gravity pulling numbers down, or if there are physical limitations preventing very large observations (say, the length of a cat’s tail), we end up in Mediocristan.
Mediocristan is a term coined by Dr. Taleb to denote situations where the Gaussian approach (normal, binomial, Poisson, etc.) works well.
The Curse of the Bell Curve
The curse of the bell curve, however, comes from the fact that we often apply it in situations that bear no resemblance to a normal distribution. Many real-life phenomena do not follow the bell curve, and yet we assume a normal distribution simply because the simplicity of the bell curve is so tempting. Let us examine some glaring examples here.
Most real-world data does not exhibit a normal distribution; a normal distribution is more the exception than the rule. Real-world data shows extreme variations (high and low) far more frequently than the bell curve predicts. Even data that seems to be normally distributed may only appear so because our observation period is not long enough.
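A quick simulation makes this concrete. The sketch below (a hypothetical example using NumPy; the Student-t distribution stands in for generic heavy-tailed real-world data) compares how often a Gaussian sample and a heavy-tailed sample stray more than four standard deviations from their mean:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# A Gaussian sample versus a heavy-tailed Student-t sample
# (3 degrees of freedom gives noticeably fat tails).
gaussian = rng.standard_normal(n)
heavy = rng.standard_t(df=3, size=n)

def extreme_count(x, k=4):
    """Count observations more than k sample standard deviations from the mean."""
    return int(np.sum(np.abs(x - x.mean()) > k * x.std()))

# The bell curve predicts only a handful of such events in 100,000 draws;
# the heavy-tailed sample produces many times more.
print("extreme events, Gaussian:    ", extreme_count(gaussian))
print("extreme events, heavy-tailed:", extreme_count(heavy))
```

If an analyst fitted a bell curve to the heavy-tailed series, every one of those extra extreme events would look like an impossibly rare surprise.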
This is an important lesson for any analyst dealing with real-world data. Always check the data for normality, and always look for a rational explanation of why the data should be normal. Only if you are satisfied on both counts should you assume a normal distribution. Even then, proceed with caution.
The bell curve is a highly seductive concept: once it gets into your mind, it is hard to get past it. Hence, be careful about its use.
The bell curve has many uses and should not be discarded entirely, but it must be used judiciously, or the consequences can be disastrous.