One of the first things an analyst will want to do on receiving data is explore it. PROC MEANS is a SAS procedure that produces the elementary, yet most wanted, statistics with minimal code: N (number of non-missing records), Mean (the average), Std Dev (standard deviation), Min (minimum) and Max (maximum).
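As a minimal sketch of the default behaviour, the step below runs PROC MEANS with no statistic keywords. It assumes the SASHELP.CLASS sample dataset that ships with SAS (with numeric variables HEIGHT and WEIGHT); any dataset with numeric variables would do.

```sas
/* Default statistics (N, Mean, Std Dev, Minimum, Maximum)
   for two numeric variables of the SASHELP.CLASS sample data */
proc means data=sashelp.class;
    var height weight;
run;
```

If the VAR statement is left out, PROC MEANS analyses every numeric variable in the dataset.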

If we need a statistic other than the defaults, we can add its keyword as an option on the PROC MEANS statement. For example, to see how skewed the data is, we can add SKEW. Note that once any statistic keyword is listed, only the listed statistics are printed, so the defaults have to be named explicitly if they are still wanted.
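A short sketch of requesting skewness alongside a few of the usual statistics, again assuming the SASHELP.CLASS sample dataset:

```sas
/* Listing keywords replaces the default set, so N, MEAN and STD
   are named explicitly next to SKEW */
proc means data=sashelp.class n mean std skew;
    var height weight;
run;
```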

NMISS and N are two interesting statistics in PROC MEANS. NMISS counts the missing values and N the non-missing ones, so from them we can compute a new variable holding the ratio of missing to non-missing values. This gives us an idea of whether we should try imputing the missing values or whether the variable is not usable at all.
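One possible way to compute that ratio is sketched below. The dataset and variable names (MISS_COUNTS, MISS_RATIO, N_OBS, N_MISSING, MISSING_RATIO) are made up for illustration, and SASHELP.CLASS is assumed as input; since that sample has no missing values, the ratio there will simply be 0.

```sas
/* Write N and NMISS for HEIGHT to a dataset instead of printing them */
proc means data=sashelp.class noprint;
    var height;
    output out=miss_counts n=n_obs nmiss=n_missing;
run;

/* Derive the ratio of missing to non-missing values */
data miss_ratio;
    set miss_counts;
    /* Guard against division by zero when every value is missing */
    if n_obs > 0 then missing_ratio = n_missing / n_obs;
run;

proc print data=miss_ratio;
run;
```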

Another feature that I like about PROC MEANS is that, apart from generating statistics for individual variables, it also lets us generate statistics for continuous variables within the levels of a categorical variable, for example: PROC MEANS MEAN; BY GENDER; VAR WEIGHT HEIGHT; RUN;

The only additional overhead in the above case is that the data must already be sorted by the BY variable. This overhead can be avoided: the CLASS statement can be used instead of BY to generate statistics by a categorical variable, and it does not require the data to be sorted by the CLASS variable.
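The two approaches can be compared side by side with a sketch like the following. It assumes the SASHELP.CLASS sample dataset, where the categorical variable is called SEX rather than GENDER; CLASS_SORTED is an illustrative name.

```sas
/* BY-group processing: the input must be sorted by the BY variable first */
proc sort data=sashelp.class out=class_sorted;
    by sex;
run;

proc means data=class_sorted mean;
    by sex;
    var weight height;
run;

/* CLASS processing: no prior sort is needed */
proc means data=sashelp.class mean;
    class sex;
    var weight height;
run;
```

BY prints a separate table for each group, while CLASS puts all the groups in a single table.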

Different ways in which the PROC MEANS output can be made usable:

  1. By default, the output of PROC MEANS is printed to the output window. So that it does not go unused, we can use ODS (Output Delivery System) to save the output as HTML, Excel or CSV and then manipulate it there. This is the first, easy option.
  2. There is another, slightly more sophisticated option: saving the output of PROC MEANS as a SAS dataset with the OUTPUT OUT = dataset statement, so that it can be used later on.

How useful is this? Not very. Why? When no statistics are named on the OUTPUT statement, the dataset holds one row per default statistic (identified by the _STAT_ column) for every group, so the statistics are stacked in rows rather than laid out as named columns.
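The first two options might look like the sketch below. The file name means_report.csv and the dataset name MEANS_DEFAULT are placeholders, and SASHELP.CLASS is assumed as the input.

```sas
/* Option 1: route the printed report to a CSV file through ODS */
ods csv file="means_report.csv";
proc means data=sashelp.class;
    var height weight;
run;
ods csv close;

/* Option 2: save the default statistics as a SAS dataset */
proc means data=sashelp.class noprint;
    var height weight;
    output out=means_default;
run;

/* Inspect the layout: _TYPE_, _FREQ_, _STAT_ and one column per variable */
proc print data=means_default;
run;
```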

  3. There is a third way to extract the output from PROC MEANS. It gives the output in a very usable form, but it requires a little more code. As everywhere else, more flexibility comes with more effort.

Here, each requested statistic for each variable is written to its own named column, for example by putting MEAN(WEIGHT)=MEAN_WEIGHT on the OUTPUT statement. That way, we can merge the statistics back onto the original file and use them.
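A sketch of this third approach, including the merge back onto the original rows, could look like this. It assumes SASHELP.CLASS grouped by SEX; the names SEX_STATS, CLASS_SORTED, CLASS_WITH_STATS, MEAN_HEIGHT and MAX_WEIGHT are chosen only for illustration.

```sas
/* One named column per requested statistic; NWAY keeps only the
   rows for each level of SEX and drops the overall summary row */
proc means data=sashelp.class noprint nway;
    class sex;
    var height weight;
    output out=sex_stats (drop=_type_ _freq_)
        mean(height)=mean_height
        max(weight)=max_weight;
run;

/* The original data must be sorted by SEX before the merge */
proc sort data=sashelp.class out=class_sorted;
    by sex;
run;

/* Attach the group-level statistics to every original row */
data class_with_stats;
    merge class_sorted sex_stats;
    by sex;
run;
```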

If we have to generate multiple statistics for multiple variables, naming each one explicitly as in the third method can be tedious. To overcome this, there is a shortcut: AUTONAME.

We can simply write OUTPUT OUT = TEST N= MIN= MAX= / AUTONAME; so that each generated statistic is named in the format VariableName_Statistic, e.g. Product_Price_Mean when MEAN= is requested.
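As a small illustration, the step below applies AUTONAME to the SASHELP.CLASS sample dataset (an assumption; the original example refers to a Product_Price variable from a dataset not shown here).

```sas
/* AUTONAME builds each output column name from the variable name
   and the statistic keyword */
proc means data=sashelp.class noprint;
    var height weight;
    output out=test n= min= max= / autoname;
run;

/* List the generated columns */
proc print data=test;
run;
```

The printed dataset should contain columns with names like Height_Min and Weight_Max, next to the automatic _TYPE_ and _FREQ_ columns.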
