By: Kafeel Basha, Jigsaw Academy Faculty

ANOVA

SAS® (originally short for Statistical Analysis System) is a widely used analytics software suite that has also become increasingly popular in academia in recent years. SAS offers a one-stop solution from data management to advanced analytics, and its procedure-based language lets users focus less on programming details and more on analysis and modeling. With the SAS System, you can access data from many kinds of sources, perform data management, carry out statistical analysis, and then present your findings in a variety of reports and graphs, all within a single software environment. SAS/STAT software lets you analyze data from a variety of sources, including clinical trials, marketing databases, health surveys, customer preference studies, and stock market research, and it provides statistical techniques for applications across many industries.

In a series of blog posts over the next couple of weeks, we are going to discuss the concepts of ANOVA and the Chi-square test using SAS.

In this first part we look at some of the SAS procedures used for ANOVA and Chi-square analysis. We then explain ANOVA and work through several examples showing how it is used to test whether the means of three or more independent groups differ significantly, both when the groups have equal sample sizes (balanced designs) and when they do not (unbalanced designs).

Procedures involved in analysing ANOVA & Chi square using SAS

PROC FREQ:

PROC FREQ produces frequency distribution tables. By default the rows of a table are ordered by the values of the variable; adding the ORDER=FREQ option arranges them in frequency order, from the highest frequency to the lowest.

Syntax:

PROC FREQ <options>;

TABLES variables </ options>;

Run;

where variables are the names of one or more variables whose tables are desired. The variables may be numeric or character.
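As a minimal sketch, the step below requests frequency order explicitly with the ORDER=FREQ option. It assumes SASHELP.CLASS, a sample data set that ships with SAS; the data set is used only for illustration and is not part of this article's examples.

```sas
/* Frequency table of Sex, listed from the most frequent value to the least frequent */
proc freq data=sashelp.class order=freq;
    tables Sex;
run;
```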

PROC ANOVA:

PROC ANOVA is designed for balanced ANOVA designs, that is, designs with the same number of observations in every cell.

Syntax:

PROC ANOVA <options>;

CLASS variables </ option>;

MODEL dependents=effects </ options>;

run;

PROC GLM:

PROC GLM handles general ANOVA and regression designs and is the appropriate tool when the design is unbalanced.

Syntax:

PROC GLM <options>;

CLASS variable-list;

MODEL dependents= independents / options;

run;

Note: The dependent variable must be numeric.

Brief introduction to Analysis of variance (ANOVA):

Analysis of variance (ANOVA) is a collection of statistical models, and their associated estimation procedures, used to analyze differences between group means by comparing the variation among groups with the variation within groups. In the ANOVA setting, the observed variance in a particular variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether or not the means of several groups are equal, and therefore generalizes the t-test to more than two groups. Testing all groups at once avoids the inflated chance of a Type I error that comes from running many pairwise t-tests.

Assumptions: It is assumed that subjects are randomly assigned to one of 3 or more groups and that the data within each group are normally distributed with equal variances across groups. Sample sizes between groups do not have to be equal, but large differences in sample sizes for the groups may affect the outcome of some multiple comparisons tests.
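One way to check the equal-variance assumption is Levene's test, which PROC GLM offers for one-way models through the HOVTEST option of the MEANS statement. The sketch below assumes the sample data set SASHELP.CARS, with Origin as the grouping variable and MPG_City as the response; both are chosen only for illustration.

```sas
/* One-way ANOVA of city mileage by region of origin, with Levene's test for equal variances */
proc glm data=sashelp.cars;
    class Origin;
    model MPG_City = Origin;
    means Origin / hovtest=levene;
run;
quit;
```

A small p-value in the Levene table would suggest that the group variances differ.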

Two sources of variance:

Between group variance: Differences between group means

Within group variance: Differences among people within the same group

Null Hypothesis: The group means are the same, that is, there is no significant difference among the group means.
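In symbols, the one-way test compares the two sources of variance through an F ratio. The notation below is introduced here for illustration and does not come from the original text: k groups, n_i observations in group i, N observations in total, group means \bar{y}_i, and grand mean \bar{y}.

```latex
\begin{aligned}
SS_{\text{between}} &= \sum_{i=1}^{k} n_i\,(\bar{y}_i - \bar{y})^2, \qquad
SS_{\text{within}} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2, \\
F &= \frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)} .
\end{aligned}
```

Under the null hypothesis and the assumptions above, F follows an F distribution with k-1 and N-k degrees of freedom, so a large F (a small p-value) is evidence against equal group means.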

There are two useful classifications of ANOVA:

1. One way classification.

2. Two way classification.

Each classification can be either balanced or unbalanced.

One Way Classification.

Balanced Design:

The one-way analysis of variance (ANOVA) is used to determine the significant differences between the means of three or more independent distinct groups.

The one-way ANOVA compares the means between the groups you are interested in and determines whether any of those means are significantly different from each other.

Unbalanced Design:

The one-way analysis of variance (ANOVA) is also used to determine whether there are significant differences between the means of three or more independent groups when the groups have unequal sample sizes.

Now let’s take a look at some examples:

Problem 1: Balanced One Way

Three brands were compared to see whether they differ significantly. At the 0.05 level of significance, what is the conclusion?

Solution using SAS.

Null Hypothesis: There is no significant difference between brands.

Alt. Hypothesis: There is a significant difference between brands.

We first arrange the data with one observation per row, a brand column followed by a value column. (Sorted in the Excel sheet ANOVA1.)

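SAS Code:

Import the data using either a DATA step with an INFILE statement or PROC IMPORT. The two steps below are alternatives; run only one of them.

/* Option 1: DATA step. Replace "path" with the location of the CSV file. */

Data Anova1;

Infile "path" dsd missover firstobs=2;

Input Brands $ values;

Run;

/* Option 2: PROC IMPORT */

Proc import datafile="path"

out=Anova1

dbms=csv replace;

run;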

proc anova data=Anova1;

class Brands;

model values=Brands;

run;

The first table in the output is the Analysis of Variance table. The most important line in this table is the “Model” line. At the right of this line is the p-value for the overall ANOVA test, listed as “Pr > F”; here p = 0.0675. This tests the overall model to determine whether the mean differs between brands. Since the p-value is greater than 0.05, we fail to reject the null hypothesis: there is no evidence of a significant difference between the brand means.

Note: PROC GLM will produce essentially the same results as PROC ANOVA, with the addition of a few more options. For example, you can include an OUTPUT statement to save the residuals, which can then be examined.
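The note above can be sketched as follows, reusing the Anova1 data set from Problem 1. The output data set name Anova1_resid and the variable names predicted and residual are hypothetical names chosen for this illustration.

```sas
/* Refit the one-way model with PROC GLM and save fitted values and residuals */
proc glm data=Anova1;
    class Brands;
    model values = Brands;
    output out=Anova1_resid p=predicted r=residual;
run;
quit;

/* Examine the residuals; the NORMAL option requests tests of normality */
proc univariate data=Anova1_resid normal;
    var residual;
run;
```

Residuals that look roughly normal support the normality assumption behind the F test.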

Problem 2: Unbalanced One Way

The number of claims processed each day was recorded for employees in four different companies. At the 0.05 level of significance, what is the conclusion?

Solution using SAS.

Null Hypothesis: There is no significant difference between the claims processed by the companies.

Alt. Hypothesis: There is a significant difference between the claims processed by the companies.

We first arrange the data with one observation per row, a company column followed by a claim column. (Sorted in the Excel sheet ANOVA2.)

Since this is an unbalanced one-way ANOVA, we use the GLM procedure.

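Import the data using either a DATA step with an INFILE statement or PROC IMPORT. The two steps below are alternatives; run only one of them.

/* Option 1: DATA step. Replace "path" with the location of the CSV file. */

Data Anova2;

Infile "path" dsd missover firstobs=2;

Input Company $ Claim;

Run;

/* Option 2: PROC IMPORT */

Proc import datafile="path"

out=Anova2

dbms=csv replace;

run;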

proc glm data=Anova2;

class Company;

model Claim=Company;

run;

The first table in the output is the Analysis of Variance table. The most important line in this table is the “Model” line. At the right of this line is the p-value for the overall ANOVA test, listed as “Pr > F”; here p = 0.2636. This tests the overall model to determine whether the mean number of claims processed differs between companies. Since the p-value is greater than 0.05, we fail to reject the null hypothesis: there is no evidence of a statistically significant difference in the claims processed by the companies.
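Because the overall test is not significant here, no follow-up comparisons are needed. Had it been significant, one common next step would be pairwise comparisons of the company means. A minimal sketch, assuming the same Anova2 data set:

```sas
/* Pairwise comparisons of least-squares means with a Tukey-type adjustment */
proc glm data=Anova2;
    class Company;
    model Claim = Company;
    lsmeans Company / pdiff adjust=tukey;
run;
quit;
```

With unequal group sizes, ADJUST=TUKEY applies the Tukey-Kramer form of the adjustment.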

Two way classification: Without interaction

Balanced Design:

Two-way analysis of variance (ANOVA) is an extension of the one-way ANOVA that examines the influence of two different categorical independent variables on one dependent variable.

Unbalanced Design:

In the unbalanced case, two-way analysis of variance (ANOVA) examines the influence of two different categorical independent variables on one dependent variable when the cell sample sizes are unequal.

Problem 3: Balanced ANOVA without interaction.

Perform a two-way classification ANOVA for the given data, with treatment and land as the categorical factors. At the 0.05 level of significance, what is the conclusion?

Solution using SAS.

Null Hypothesis: There is no significant difference between the column means or between the row means.

Alt. Hypothesis: There is a significant difference between the column means or between the row means.

We first arrange the data with one observation per row: a treatment column, a land column, and a value column. (Sorted in the Excel sheet ANOVA3.)

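Import the data using either a DATA step with an INFILE statement or PROC IMPORT. The two steps below are alternatives; run only one of them.

/* Option 1: DATA step. Replace "path" with the location of the CSV file. */

Data Anova3;

Infile "path" dsd missover firstobs=2;

Input Treatment $ Land $ values;

Run;

/* Option 2: PROC IMPORT. The output data set must be Anova3 to match the PROC GLM step. */

Proc import datafile="path"

out=Anova3

dbms=csv replace;

run;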

proc glm data=Anova3;

class Land Treatment;

model Values=Treatment Land;

run;

The first table in the output is the Analysis of Variance table. The most important line in this table is the “Model” line. At the right of this line is the p-value for the overall ANOVA test, listed as “Pr > F”; here p = 0.3827. Since the p-value is greater than the significance level, we fail to reject the null hypothesis: there is no evidence of a significant difference between the column means or between the row means.

Problem 4: Balanced ANOVA with interaction.

The general manager of a company believes that sales are affected by age in two different cities. The data appear in the DATA step below. What can be concluded at the 0.05 significance level?

Solution using SAS.

Age: H0: Sales are the same across all age categories

Ha: Sales are not the same across all age categories

City: H0: Sales are the same across cities

Ha: Sales are not the same across cities

Interaction: H0: Sales are not affected by the interaction of age and city

Ha: Sales are affected by the interaction of age and city
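These hypotheses can also be written in terms of a two-way model with interaction. The notation is introduced here for illustration and assumes the usual sum-to-zero constraints on the effects: \alpha_i is the effect of age category i, \beta_j the effect of city j, (\alpha\beta)_{ij} their interaction, and \varepsilon_{ijk} the random error for the k-th observation in a cell.

```latex
\begin{aligned}
y_{ijk} &= \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}, \\
H_0^{\text{Age}} &: \alpha_1 = \alpha_2 = \alpha_3 = 0, \\
H_0^{\text{City}} &: \beta_1 = \beta_2 = 0, \\
H_0^{\text{Interaction}} &: (\alpha\beta)_{ij} = 0 \ \text{for all } i, j .
\end{aligned}
```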

We have sorted the table data as follows.

data interaction;

input Age $ City $ @;

do i = 1 to 4;

input y @;

output;

end;

datalines;

20-40 A 383 408 369 375

20-40 B 415 398 386 391

40-60 A 289 356 305 305

40-60 B 300 322 310 293

60abv A 250 263 259 246

60abv B 237 280 279 282

;

proc print;

run;

proc glm data=interaction;

class Age City;

model y=Age City Age*City;

run;

Note: Age*City is included in the MODEL statement because we are performing ANOVA with interaction.

Each data line holds the four observations for one combination of Age and City. The trailing @ holds the input line so that the DO loop can read the four values one at a time into the variable y, writing one observation for each value.
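To confirm that the design is balanced, the cell counts can be tabulated with PROC FREQ. A minimal sketch using the interaction data set created above:

```sas
/* Count the observations in each Age-by-City cell; a balanced design shows the same count everywhere */
proc freq data=interaction;
    tables Age*City / norow nocol nopercent;
run;
```

Each of the six cells should contain 4 observations.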

At a significance level of 5%, the Age*City interaction is not significant (F=0.96, p=0.4013). This indicates that the effect of Age does not depend on the level of City and vice versa. Therefore, the tests for the individual effects are valid, showing a significant Age effect (F=101.58, p<0.0001) but no significant City effect (F=0.90, p=0.3541). The value 301.041667 is the City sum of squares, not its F statistic. Because the design is balanced, the Type I and Type III sums of squares are identical.

The Analysis of Variance table therefore shows that Age affects sales: its p-value is less than the significance level, so we reject the null hypothesis for the Age factor. We fail to reject the null hypotheses for City and for the Age*City interaction.

Problem 5: Unbalanced ANOVA with interaction.

Perform a two-way ANOVA for the given unbalanced data. At the 0.05 level of significance, what is the conclusion?

Solution using SAS.

Factor A: H0: The mean response is the same across the column categories (A1, A2)

Ha: The mean response is not the same across the column categories (A1, A2)

Factor B: H0: The mean response is the same across the row categories (B1, B2)

Ha: The mean response is not the same across the row categories (B1, B2)

Interaction: H0: The response is not affected by the interaction of row and column.

Ha: The response is affected by the interaction of row and column.

We have sorted the table data as follows.

data Unbal;

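infile datalines missover; /* the last data line has only one value */

input B $ A $ @; /* each data line lists the B level first, then the A level */

do i=1 to 2;

input y @;

if not missing(y) then output; /* skip the empty slot in the B2 A2 cell */

end;

drop i;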

datalines;

B1 A1 12 14

B1 A2 20 18

B2 A1 11 9

B2 A2 17

;

proc print;

run;

proc glm data=Unbal;

class A B;

model y=A B A*B;

run;

Note: A*B is included in the MODEL statement because we are performing ANOVA with interaction.

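Each data line holds the observations for one combination of the two factors. The trailing @ holds the input line so that the DO loop can read up to two values per line into the variable y. The B2 A2 cell contains a single observation, which is what makes the design unbalanced.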

The overall F test is significant (F=15.29, p=0.0253), indicating strong evidence that the means for the four different A*B cells are different. You can further analyze this difference by examining the individual tests for each effect.

Type I and Type III sums of squares are typically not equal when the data are unbalanced; Type III sums of squares are preferred in testing effects in unbalanced cases because they test a function of the underlying parameters that is independent of the number of observations per treatment combination.
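PROC GLM prints both the Type I and Type III tables by default. As a sketch, the SS1 and SS3 options of the MODEL statement request them explicitly, and an LSMEANS statement gives means that are adjusted for the unequal cell sizes:

```sas
/* Request Type I and Type III sums of squares and least-squares means for the unbalanced design */
proc glm data=Unbal;
    class A B;
    model y = A B A*B / ss1 ss3;
    lsmeans A B;
run;
quit;
```

Comparing the two tables side by side shows how the Type I results depend on the order of the effects in the MODEL statement, while the Type III results do not.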

At a significance level of 5%, the A*B interaction is not significant (F=0.20, p=0.6850). This indicates that the effect of A does not depend on the level of B and vice versa. Therefore, the tests for the individual effects are valid, showing a significant A effect (F=33.80, p=0.0101) but no significant B effect (F=5.00, p=0.1114).

Watch out for part 2 of this article coming soon, where we go on to explain Chi square test using SAS.
