When we build statistical models, it is important to check the sanity of the data, because there is always a risk that the data do not match expectations. Such problems can bias the results, so it is best to detect them early in the study. One such anomaly is the outlier. An outlier is, by definition, an improbable value, or a value that lies some distance from the general trend of the existing data. When outliers are extreme observations in either the left or the right tail, they can undermine the assumptions the statistician has made about the behaviour of the given population. Note that not all extreme observations are outliers, and not all outliers are extreme observations. Data quality is essential for the trustworthiness of the final analysis.

The SAS procedure PROC UNIVARIATE is a sophisticated tool whose statistical output has been built up over time. It gives more extensive output for data diagnostics and anomaly detection than PROC MEANS and PROC SUMMARY provide by default. For example, the default output of PROC MEANS gives the count, mean, standard deviation, minimum and maximum of every numeric variable. This gives a rough idea of the skewness of the data and hints at potential outliers. To confirm whether outliers are actually present, one can run PROC UNIVARIATE, which produces more specific output such as box-and-whisker plots, percentiles and extreme values. These let us dig further into a potential outlier and confirm, based on facts, whether it really is one.
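The two-step workflow described here can be sketched as follows. This is a minimal illustration, not a prescribed method: it assumes the `sashelp.cars` sample data set that ships with SAS and picks its `msrp` variable purely as an example.

```sas
/* Step 1: quick overview. With no statistics listed, PROC MEANS prints
   N, mean, standard deviation, minimum and maximum. */
proc means data=sashelp.cars;
   var msrp;   /* example variable; substitute your own */
run;

/* Step 2: fuller diagnostics. The default PROC UNIVARIATE output adds
   moments, quantiles and the extreme observations. */
proc univariate data=sashelp.cars;
   var msrp;
run;
```

A maximum that sits far above the mean in the first step is only a hint; the quantiles and extreme observations from the second step show how isolated that value really is.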


The number of extreme observations may vary from variable to variable, but as a quick and dirty way to spot dirty data, the Extreme Observations table is still quite effective. The NEXTRVAL= option does something similar by listing the extreme values. What both lack is context: they do not show where the extreme values sit relative to the rest of the distribution. This is why the Quantiles table is the most useful.
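As a rough sketch, the extreme observations, extreme values and quantiles can be requested together so that they can be read side by side. The data set (`sashelp.cars`), the variables (`msrp`, `model`) and the counts of 10 and 5 are assumptions chosen for illustration.

```sas
/* Restrict the printed output to the three tables discussed in the text. */
ods select ExtremeObs ExtremeValues Quantiles;

/* NEXTROBS= sets how many lowest and highest observations are listed
   (the default is 5). NEXTRVAL= additionally requests a table of the
   lowest and highest distinct values. */
proc univariate data=sashelp.cars nextrobs=10 nextrval=5;
   var msrp;
   id model;   /* label each extreme observation with the car model */
run;
```

Comparing the largest listed values with the 99% and 95% rows of the Quantiles table supplies the context that the two extreme-value tables lack on their own.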

Similarly, the box-and-whisker plot, normal probability plot and stem-and-leaf plot give a visual indication in SAS of whether an outlier exists. The box plot is a graphical representation of the five-number summary of a variable and is based on its quartiles. The edges of the rectangular box correspond to the lower and upper quartiles, the line inside the box is the median, and the plus sign marks the mean. We can visually compare the lengths of the whiskers: if one is clearly longer than the other, the distribution may be skewed. The stem-and-leaf plot is used to visualize the overall distribution of a variable. The Extreme Observations table is a list of the five lowest and five highest values of the variable. The interquartile range is the difference between the upper and lower quartiles; it measures the spread of a data set and is robust to extreme observations.
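The sketch below requests these plots and then uses the interquartile range to flag candidate outliers. The 1.5 × IQR fence is a widely used convention rather than something this article prescribes, and the data set, variable and output names are hypothetical choices for the example.

```sas
/* PLOT requests a stem-and-leaf plot (or a horizontal histogram),
   a box plot and a normal probability plot. The OUTPUT statement
   saves the quartiles and the interquartile range. */
proc univariate data=sashelp.cars plot;
   var msrp;
   output out=work.msrp_stats q1=q1 q3=q3 qrange=iqr;
run;

/* Flag values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR.
   The 1.5 multiplier is an assumed convention; adjust as needed. */
data work.msrp_flagged;
   if _n_ = 1 then set work.msrp_stats;   /* read the one-row statistics once */
   set sashelp.cars(keep=make model msrp);
   is_outlier = (msrp < q1 - 1.5*iqr) or (msrp > q3 + 1.5*iqr);
run;
```

Because the fences are built from quartiles rather than the mean and standard deviation, a handful of very large values does not move them much, which is the robustness property mentioned above.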

Based on the outliers detected, several treatments are used in industry to prepare the data for analysis and modelling. One is capping, which uses the percentiles output: depending on the business case, every value above the 99th or 95th percentile, or below the 5th or 1st percentile, is replaced, typically by the percentile value itself (some practitioners substitute the mean instead).
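A minimal sketch of percentile-based capping follows, in the common variant where values beyond the chosen percentiles are set to the percentile value itself. The 1st and 99th percentile cut-offs, the `sashelp.cars` data set and the output names are assumptions for illustration; the right cut-offs depend on the business case.

```sas
/* Save the 1st and 99th percentiles as variables p1 and p99. */
proc univariate data=sashelp.cars noprint;
   var msrp;
   output out=work.msrp_caps pctlpts=1 99 pctlpre=p;
run;

/* Cap each value to the range [p1, p99]; missing values stay missing. */
data work.cars_capped;
   if _n_ = 1 then set work.msrp_caps;   /* read the one-row percentiles once */
   set sashelp.cars;
   if not missing(msrp) then msrp_capped = min(max(msrp, p1), p99);
   drop p1 p99;
run;
```

Writing the result to a new variable (`msrp_capped`) keeps the original values available, so the effect of capping on later models can be compared against the untreated data.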

