For an analyst, writing computationally efficient code is extremely important. Efficient code takes less time to run, which in turn reduces the running time of statistical models and reports. It is therefore worth learning to code efficiently, including through new or non-conventional coding techniques.

One way to achieve this is vectorization. This technique can improve the performance of code by reducing its execution time. In this article, I will talk about using vectorization techniques to optimize code performance in the SAS programming language.

Vectorization is a way of writing a program so that an operation is applied to a whole vector or matrix of values at once. A program that is not vectorized (a scalar implementation) operates on the elements one by one until it has processed all of them, whereas vectorized code expresses the operation on the entire set of elements in a single statement. In a matrix language, unvectorized code therefore usually takes substantially longer to run than the vectorized equivalent.
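
As a minimal sketch of the difference, the following SAS IML step adds two small vectors first element by element and then in a single statement. The vectors u and w and their values are made up purely for illustration.

```sas
proc iml;
u = {1 2 3 4};
w = {10 20 30 40};

/* Scalar style: visit each position in turn */
s = j(1, ncol(u), 0);          /* preallocate a row vector of zeros */
do k = 1 to ncol(u);
   s[k] = u[k] + w[k];
end;

/* Vectorized style: one statement for the whole vector */
t = u + w;

print s t;                     /* both hold the same sums */
quit;
```

Both versions give the same result; the vectorized one simply leaves the element-by-element work to the language itself.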

In order to use vectorization techniques in SAS, one has to learn the SAS Interactive Matrix Language (SAS IML). Besides offering a wide range of matrix operations on a dataset, SAS IML also gives you the flexibility to create your own modules and reuse them in subsequent code wherever necessary.
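
To give a feel for what a user-defined module looks like, here is a small sketch. The module name Binarize, its arguments and the test matrix m are hypothetical and chosen only for illustration.

```sas
proc iml;
/* Hypothetical function module: returns 1 where x >= cutoff, 0 elsewhere */
start Binarize(x, cutoff);
   return ( x >= cutoff );     /* a comparison yields a 0/1 matrix */
finish;

m = {0.2 0.7,
     0.9 0.4};
r = Binarize(m, 0.5);          /* call the module like a built-in function */
print r;
quit;
```

Once defined, such a module can be called anywhere later in the same PROC IML session.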

Let us see how vectorization can be used to write optimized code in SAS IML, starting with a simple problem.

Problem 1: We have a dataset with 1000 rows and 100 columns. The values in the dataset are random values ranging from 0 to 1. We want to replace each value with zero where it is less than 0.5 and with one where it is greater than or equal to 0.5.

One way of doing this is to use loops to replace the values. The following code does so with two DO loops, checking each value in the dataset one by one to see whether it is less than 0.5 or greater than or equal to 0.5.

Using Loop:

proc iml;
x = uniform(repeat(1,1000,100)); /* Dataset of 1000 rows and 100 columns */
do i = 1 to 1000;
   do j = 1 to 100;
      if x[i,j] < 0.5 then x[i,j] = 0;
      else if x[i,j] >= 0.5 then x[i,j] = 1;
   end;
end;
quit;

Let us learn how to vectorize this code. We will do that by indexing the values in the dataset, using the LOC function in IML. LOC takes a matrix and returns the positions of its nonzero elements, so applying it to a comparison such as x<.5 returns the positions where that condition is true. In the vectorized program, one LOC call finds the positions in the dataset where the values are less than 0.5 and the other finds the positions where the values are greater than or equal to 0.5. Each set of positions is then assigned 0 or 1 according to the condition it satisfies. Here both indexing and assignment are done in a single statement for all the observations, not one by one as with the DO loop.
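
To see what LOC returns, a small sketch on a 2 x 2 matrix may help. The matrix values are made up for illustration. The guard on an empty result and the one-line alternative at the end are additions of mine, not part of the program discussed in this article.

```sas
proc iml;
x = {0.2 0.7,
     0.9 0.4};

idx = loc(x < 0.5);      /* row vector of positions, counted row by row */
print idx;               /* positions 1 and 4 for this matrix */

/* LOC returns an empty matrix when nothing matches, and using an
   empty matrix as a subscript raises an error, so guard the assignment */
if ncol(idx) > 0 then x[idx] = 0;

/* Alternative: a comparison already yields a 0/1 matrix */
y = {0.2 0.7,
     0.9 0.4};
y = (y >= 0.5);

print x y;
quit;
```

With 100,000 random values both conditions are practically certain to match something, so the guard matters mostly when the same pattern is reused on small or filtered data.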

Using Vectorization:

proc iml;
x = uniform(repeat(1,1000,100)); /* Dataset of 1000 rows and 100 columns */
x[loc(x<.5)]=0;
x[loc(x>=.5)]=1;
quit;

You can see the difference in the time taken by these two programs in the log below. With the DO loops the step takes 0.31 seconds, whereas with vectorization it takes only 0.04 seconds, a reduction in run time of about 87%.

Log of the run: [log screenshot not reproduced]
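
To reproduce this kind of comparison, one option is to time both approaches inside a single PROC IML step with the TIME function, which returns the time of day in seconds. This is only a sketch: the variable names x0, v, tLoop and tVec are my own, and the timings you get will depend on your machine and will not match the figures quoted above.

```sas
proc iml;
x0 = uniform(repeat(1, 1000, 100));   /* same data for both runs */

/* Loop version */
x = x0;
t0 = time();
do i = 1 to nrow(x);
   do j = 1 to ncol(x);
      if x[i,j] < 0.5 then x[i,j] = 0;
      else x[i,j] = 1;
   end;
end;
tLoop = time() - t0;                  /* elapsed seconds for the loops */

/* Vectorized version */
v = x0;
t0 = time();
v[loc(v < 0.5)] = 0;
v[loc(v >= 0.5)] = 1;
tVec = time() - t0;                   /* elapsed seconds for the LOC version */

same = all(x = v);                    /* 1 if both results are identical */
print tLoop tVec same;
quit;
```

Timing both versions on the same copy of the data also makes it easy to confirm that they produce the same result.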

Problem 2: Let us move to a slightly more complicated vectorization technique. The problem here is to generate the terms of the following series:

$Y_n = a + b\,Y_{n-1}$, for a given N, where a and b are arbitrary constants.

You can take any values of a, b, N and Y0 to create the series, except that the vectorized formula used in this article divides by (1-b) and so requires b to be different from 1.

Using Loop:

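proc iml;
a={5};
b={.5};
N={10};
Y0=2;
do i=1 to N;
   if i=1 then Yn=Y0;        /* the first term is Y0 */
   else Yn=(a+(b#Yn));       /* each later term is built from the previous one */
   Series=Series||Yn;        /* append the new term to the row vector */
end;
print Series;
quit;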

Using Vectorization:

proc iml;
a={5};
b={.5};
N={10};
Y0=2;
x=0:(N-1); /* Exponents 0 to N-1; for N=10 these are the values 0 to 9 */
Series=a#(1-b##x)/(1-b)+(b##x)#Y0;
print Series;
quit;

When these two programs are run for a large series, you can see a significant difference in their run times.

The vectorized form is derived as follows:

1-  According to the series, the first term is $Y_0$. The second term is therefore

$Y_1 = a + b\,Y_0$

Third term:

$Y_2 = a + b\,Y_1 = a + b\,(a + b\,Y_0) = a + ab + b^2 Y_0 = a(1+b) + b^2 Y_0$

Fourth term:

$Y_3 = a + b\,Y_2 = a + b\,(a(1+b) + b^2 Y_0) = a(1+b+b^2) + b^3 Y_0$

In general, for any N:

$Y_N = a(1 + b + b^2 + \dots + b^{N-1}) + b^N Y_0 = a\,\frac{1-b^N}{1-b} + b^N Y_0$

(Note: $1 + b + b^2 + \dots + b^{N-1}$ is a geometric progression with N terms, so its sum is $(1-b^N)/(1-b)$ when $b \neq 1$.)

2-  We create a vector X={0 1 2 3 … N-1}. In the above expression for YN, we put the vector X in place of N, so that all the terms in the series are calculated by one expression as X takes the values from 0 to N-1. Note the contrast with the DO loop, where each term can be calculated only after the previous term is known. In the vectorized code, every term depends only on its own exponent in X, so the whole series is produced by a single matrix expression, which makes it computationally more efficient.
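
A quick way to gain confidence in the closed form is to compute the series both ways and compare. The sketch below does that; the names loopSeries, vecSeries and maxDiff are my own, and unlike the loop shown earlier it fills a preallocated vector instead of appending inside the loop.

```sas
proc iml;
a = 5;  b = 0.5;  Y0 = 2;  N = 10;

/* Recurrence: fill a preallocated row vector term by term */
loopSeries = j(1, N, Y0);               /* every entry starts as Y0 */
do i = 2 to N;
   loopSeries[i] = a + b # loopSeries[i-1];
end;

/* Closed form for exponents 0, 1, ..., N-1 at once (requires b ^= 1) */
x = 0:(N-1);
vecSeries = a # (1 - b##x) / (1 - b) + (b##x) # Y0;

/* Largest absolute gap between the two results; expected to be
   zero or at the level of floating-point rounding */
maxDiff = max(abs(loopSeries - vecSeries));
print maxDiff;
quit;
```

If maxDiff is not negligible, the closed form and the recurrence disagree, which usually points to an off-by-one in the exponent vector.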

In similar ways, many complex programs written with loops can be converted to vectorized form to improve performance. This is especially useful when the model or report is run frequently. Moreover, vectorization makes the code compact and easy to maintain. Interested readers who want to test what they have learnt can try out a simple problem: you have a dataset in the form of an N x N matrix (N rows and N columns). Extract its diagonal elements first with a loop and then with a vectorized technique, and compare the run times.

About the Author:

Biswajit Pani holds a master's degree in Financial Economics, specialising in Applied Econometrics, from the Madras School of Economics.

He has over three years of experience working in the IT and Analytics industries, across diverse areas including business operations, marketing, loyalty and risk management.

He has worked both as a programmer and as a Statistical Analyst and is currently employed with FORD Motors Credit Company.

He is currently exploring matrix programming using SAS IML.

 
