
Case Study : The Pygmalion Effect, in SAS - Two Way ANOVA


Case Study: The Pygmalion Effect
There are 10 army companies, each with 3 platoons. In each company, one platoon is randomly picked for the Pygmalion treatment and the other platoons serve as controls. The leader of the treated platoon is told that his platoon scored highly on tests indicating that they are superior. After basic training, the platoons are given a weapons test.

The response variable is each platoon's average score on the basic weapons test.
There are two factors: company, with 10 levels (Company 1 ~ 10), and treatment, with 2 levels (Pygmalion, control). Therefore, there are 19 (=9+1+(9*1)) explanatory variables: 9 company indicators, 1 treatment indicator, and 9 interaction terms.
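
The count of 19 can be checked by listing the indicator columns. This is only a small sketch in Python; the column names and the choice of baseline levels (company 10 and control) are assumptions made for illustration.

```python
from itertools import product

companies = [f"comp{k}" for k in range(1, 11)]  # 10 levels
treatments = ["pyg", "control"]                  # 2 levels

# Assumed baselines: the last level of each factor gets no indicator.
company_dummies = companies[:-1]      # 9 indicators
treatment_dummies = treatments[:-1]   # 1 indicator

# One interaction column per (treatment indicator, company indicator) pair.
interaction_dummies = [f"{t}*{c}" for t, c in product(treatment_dummies, company_dummies)]

columns = treatment_dummies + company_dummies + interaction_dummies
print(len(columns))
```

The intercept is not counted here, so the model has 20 parameters in total.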

The data are shown below.

Reference : Ramsey and Schafer (1997) The Statistical Sleuth, Duxbury Press, p. 365. The Pygmalion effect.


1. Two Way ANOVA with Interaction (Interaction Model & Hypothesis)  
The model with interaction :
$Y_{i}=\beta_{0}+ \beta_{1}I_{pyg,i}+ \beta_{2}I_{comp1,i}+...+\beta_{10}I_{comp9,i}$
                                        $+\beta_{11}I_{pyg,i}\cdot I_{comp1,i}+...+\beta_{19}I_{pyg,i}\cdot I_{comp9,i}+e_{i}$

$H_{0}:\beta_{11}=\beta_{12}=...=\beta_{19}=0$ (all possible interactions)
$H_{1}$ : at least one of $\beta_{11},...,\beta_{19}$ is not equal to 0

1.1 SAS Code and Result
 
We can fit the two-way ANOVA with proc glm. There are two factors, company and treatment, and we are also interested in their interaction, so our model should be score=company | treatment, where | expands to both main effects plus their interaction. Another way to specify the same model is score=company treatment company*treatment.
 

After running the SAS code, the overall ANOVA results are as follows.

For the interaction (company*treatment), the F value is 0.67 with a high P-value (0.7221).
Because the P-value is high, we fail to reject the null hypothesis: there is no evidence that any of the interaction terms differs from 0. Therefore, our next model is an additive model, which excludes the interaction terms.


2. Two Way ANOVA w/o Interaction. (Additive Model & Hypothesis)
The model w/o interaction:
$Y_{i}=\beta_{0}+ \beta_{1}I_{pyg,i}+ \beta_{2}I_{comp1,i}+...+\beta_{10}I_{comp9,i}+e_{i}$

2.1 The First Question - SAS Result and Conclusion 
Our first question is whether there is a difference in mean score between the Pygmalion and control groups.
$H_{0}:\beta_{1}=0$ 
$H_{1} : \beta_{1}$ is not equal to 0
Since our interest is in a difference between the Pygmalion and control groups, we need to look at the treatment source. We have a small P-value (0.0119), so we reject the null hypothesis: there is evidence of a difference in mean score between them.
Note that, under the null hypothesis, the F value follows F(# betas being tested, # degrees of freedom of SSE) = F(1,18).

However, in the diagnostics panel for score, we can see a pattern of decreasing variance. Even though we have a small P-value, the evidence for our conclusion is weak.



2.2 The Second Question - SAS Result and Conclusion
Our second question is whether there are differences among the companies.
$H_{0}:\beta_{2}=\beta_{3}=...=\beta_{10}=0$ 
$H_{1}$ : at least one of $\beta_{2},...,\beta_{10}$ is not equal to 0
 
Since our interest is in a difference among the 10 companies, we need to look at the company source. We have a high P-value (0.1484), so we fail to reject the null hypothesis: there is no evidence of a difference among companies.
Note that F(9,18) = F(# betas being tested, which are beta 2~10, # d/f of SSE).

Combining the conclusions of the two questions, our final model is $Y_{i}=\beta_{0}+\beta_{1}I_{pyg,i}+e_{i}$


3. Model Checking
In the diagnostics panel, we can see whether our assumptions are satisfied.
There are no outliers, and normality looks OK. However, we have a concern about decreasing variance. Another assumption is independence of the observations, which we accept on the grounds that the platoons were picked at random and were not interacting.


Two-Way ANOVA


[1] Two Way Classification or Two-Way Analysis of Variance
This is another special case of the GLM (General Linear Model). In the GLM, the response variable is continuous and the explanatory variables are categorical or continuous. For more on the GLM, see the General Linear Model post. The Two-Way ANOVA has two factors, each with at least two levels. The main question is whether the treatment variable has an effect or not.

What is a factor? A factor is a categorical predictor variable consisting of different class levels, such as various types of treatment. For example, if you want to predict your grade in a course, there are several possible factors, such as the number of hours you study, the number of assignments, or whether the student is female or male.

[2] Assumptions
The samples must be independent and selected at random.
The equal variance assumption and the normal error assumption should also be satisfied.



[3] Model and the Expected Values Example  
Consider the model for a two-way analysis of variance with two levels of each factor (a 2x2 classification): $Y_{i}=\beta_{0}+\beta_{1}I_{factor 1,i}+ \beta_{2}I_{factor 2,i}+\beta_{3}I_{factor 1,i}I_{factor 2,i}+e_{i}$, where $I_{factor 1,i}$ is 1 if the ith observation is in the first group of factor 1 and is 0 otherwise, and $I_{factor 2,i}$ is defined in the same way for factor 2.

The expected values of $Y_{i}$ for the 4 group means are as follows.
1) i : 1st level of factor 1 and 1st level of factor 2: $E[Y_{i}]=\beta_{0}+\beta_{1}+\beta_{2}+\beta_{3}$
2) i : 1st level of factor 1 and 2nd level of factor 2: $E[Y_{i}]=\beta_{0}+\beta_{1}$
3) i : 2nd level of factor 1 and 1st level of factor 2: $E[Y_{i}]=\beta_{0}+\beta_{2}$
4) i : 2nd level of factor 1 and 2nd level of factor 2: $E[Y_{i}]=\beta_{0}$
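
The four cell means can be reproduced with a few lines of Python. This is a minimal sketch: the function name and the beta values are hypothetical and chosen only to illustrate how the indicators switch the coefficients on and off.

```python
def cell_mean(beta, in_factor1_level1, in_factor2_level1):
    """E[Y] for one cell of the 2x2 model with interaction."""
    b0, b1, b2, b3 = beta
    i1 = 1 if in_factor1_level1 else 0  # indicator for 1st level of factor 1
    i2 = 1 if in_factor2_level1 else 0  # indicator for 1st level of factor 2
    return b0 + b1 * i1 + b2 * i2 + b3 * i1 * i2

beta = (10.0, 2.0, 3.0, 0.5)  # hypothetical coefficients

# Keys are (level of factor 1, level of factor 2).
means = {(f1, f2): cell_mean(beta, f1 == 1, f2 == 1)
         for f1 in (1, 2) for f2 in (1, 2)}

# The interaction coefficient is the difference of differences of cell means.
interaction = (means[(1, 1)] - means[(1, 2)]) - (means[(2, 1)] - means[(2, 2)])
print(means, interaction)
```

When $\beta_{3}=0$ the difference of differences is 0, which is exactly the additive model.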

[4] The Full Model & The Reduced Model and Test Statistics 
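The full model is the model with all explanatory variables, whereas the reduced model (for example, the additive model) is the model without the variables whose coefficients you are testing.

The test statistic is as follows, where $SStrt$ is the sum of squares explained by a model and $MSE_{full}$ is the mean squared error of the full model.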
$F_{obs}= \frac{(SStrt_{full}-SStrt_{reduced}) / \#of\beta's \ being \ tested }{MSE_{full}}$
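
As a rough sketch, the statistic can be computed in Python as below. The function and argument names are hypothetical, the input numbers are made up for illustration, and the code assumes that $SStrt$ means the sum of squares explained by the model.

```python
def extra_ss_f(ss_model_full, ss_model_reduced, n_betas_tested,
               sse_full, df_error_full):
    """F statistic comparing a full model with a reduced model."""
    # Mean squared error of the full model.
    mse_full = sse_full / df_error_full
    # Extra sum of squares explained by the tested terms, per tested beta.
    extra_per_beta = (ss_model_full - ss_model_reduced) / n_betas_tested
    return extra_per_beta / mse_full

# Hypothetical sums of squares, not taken from the case studies.
f_obs = extra_ss_f(ss_model_full=500.0, ss_model_reduced=410.0,
                   n_betas_tested=9, sse_full=180.0, df_error_full=9)
print(f_obs)
```

Because the total sum of squares is the same for both models, the numerator could equally be written as $SSE_{reduced}-SSE_{full}$.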

Case Study : The Spock Conspiracy Trial, in SAS - by using proc glm & the Bonferroni Correction


Case Study: The Spock Conspiracy Trial in SAS
The main question is whether there is a difference among the 6 other judges.
We will use a multiple linear regression model with 5 dummy predictor variables, one for each judge except a baseline judge. (One-way ANOVA)
Reference: http://www.inside-r.org/node/159733 


[1] The Hypothesis and Model 
 $H_{0}: \mu_A = \mu_B= ...=\mu_E = \mu_F$
 $H_{1}$ : At least one judge is different from the others.
 Model : $Y_{i}=\beta_0 + \beta_{1} \cdot I_{A,i} + \beta_{2} \cdot I_{B,i} + ... + \beta_{5} \cdot I_{E,i} + e_{i}$, where each $I$ is an indicator variable and judge F is the baseline.
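
With an intercept in the model, six judges need only five indicators. The Python sketch below shows this coding; treating judge F as the baseline is an assumption made here, and the function names and coefficient values are hypothetical.

```python
JUDGES = ["A", "B", "C", "D", "E", "F"]
BASELINE = "F"  # assumed reference level; it gets no indicator of its own

def indicators(judge):
    """Return the 5 dummy variables (A..E) for one observation."""
    if judge not in JUDGES:
        raise ValueError(f"unknown judge: {judge}")
    return [1 if judge == j else 0 for j in JUDGES if j != BASELINE]

def mean_for(judge, beta0, betas):
    """Model mean: beta0 plus the coefficient of the judge's indicator."""
    return beta0 + sum(b * x for b, x in zip(betas, indicators(judge)))

# Hypothetical coefficients, only to show how the coding works.
beta0, betas = 30.0, [1.0, 2.0, 3.0, 4.0, 5.0]
print(indicators("A"), indicators("F"), mean_for("C", beta0, betas))
```

Under this coding, $\beta_0$ is the mean of the baseline judge and each other beta is a difference from that baseline, so the null hypothesis of equal means is the same as all five betas being 0.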


[2] SAS Code
Please refer to the previous post for how to read the data with the infile statement. The data were divided into two groups, 'SPOCKS' and 'OTHERS', with an if statement. This data set is named 'all', and we will use it here.

We read the 'all' data set and exclude the 'SPOCK' group with the ne operator, since our purpose is to find a difference among the 6 other judges. We then create the indicator variables. This data set is named 'otherjudges'.

We use proc glm to get the one-way ANOVA result with the Bonferroni correction.

* cldiff option reference: https://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_glm_sect018.htm
* pdiff option reference: http://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_glm_sect016.htm


[3.1] SAS Result - Overall ANOVA
Since there are 6 judges, the judge effect has 5 degrees of freedom. The F value (the ratio of the between-judge mean square to the error mean square) is 1.22. The P-value is 0.3239, which is higher than 0.05, so there is no evidence of a difference among the means of the 6 judges (we fail to reject the null hypothesis). We don't need a post-hoc pairwise comparison, but let's look at the LSmeans result with the Bonferroni adjustment anyway.


[3.2] SAS Result - LSMeans and Difference Matrix
The least squares means are shown below. From the ANOVA result above, there is no evidence of a difference among those means.
In the difference matrix below, the null hypothesis is that the ith LS mean is equal to the jth LS mean. Since all p-values in the matrix are higher than 0.05, we fail to reject the null hypothesis.

[3.3] SAS Result - Difference Matrix with the Bonferroni Correction
The difference matrix is read the same way as above. The null hypothesis is that the ith LS mean is equal to the jth LS mean. With the adjustment for multiple comparisons, all the p-values are 1, which is higher than 0.05. Therefore, there is no evidence of a difference in means among the 6 judges.
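
The Bonferroni adjustment multiplies each raw p-value by the number of comparisons and caps the result at 1. Here is a minimal Python sketch; the raw p-values are hypothetical and are not the ones from the SAS output.

```python
from math import comb

def bonferroni(p_values, n_comparisons):
    """Bonferroni-adjusted p-values: min(1, p * number of comparisons)."""
    return [min(1.0, p * n_comparisons) for p in p_values]

n_judges = 6
m = comb(n_judges, 2)  # number of pairwise comparisons among 6 judges

raw = [0.08, 0.20, 0.65]  # hypothetical unadjusted p-values
adjusted = bonferroni(raw, m)
print(m, adjusted)
```

With 6 judges there are 15 pairwise comparisons, so any raw p-value of at least 1/15 (about 0.067) is adjusted to 1.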
 

The General Linear Model (GLM)


[1] What's the General Linear Model (GLM)?
In the GLM, there is a continuous response variable and one or more categorical or continuous explanatory variables. The response Y is linear in the betas. In SAS, we use proc glm or proc mixed; in R, we use lm(). For more information, please refer to the reference below.


One-way ANOVA is a special case of the general linear model. ANOVA, ANCOVA, linear regression, and mixed models are GLMs as well. In the GLM, we can estimate beta by least squares or by best linear unbiased prediction. Don't confuse the General Linear Model with the Generalized Linear Model. Let's look at the hypothesis and assumptions of the GLM and at what one-way ANOVA is.

[2] GLM: Hypothesis and Assumptions
The null hypothesis is $\beta_{1}=\beta_{2}=...=\beta_{p}=0$
The GLM assumptions are that the errors are normally distributed with mean 0 and constant variance, and that they are uncorrelated.
 
We can estimate beta with the least squares estimates.
$\widehat{\beta}=(X^TX)^{-1}X^TY$
- The columns of X should be linearly independent, and $X^TX$ should be invertible.  
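
The least squares formula can be sketched in plain Python by solving the normal equations $X^TX\widehat{\beta}=X^TY$. The helper names and the tiny data set below are hypothetical, and a real analysis would use proc glm or lm() rather than hand-written linear algebra.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular: columns of X are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def least_squares(X, y):
    """Least squares estimate from the normal equations (X^T X) beta = X^T y."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(a * b for a, b in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)

# Hypothetical data: an intercept column plus one treatment indicator.
X = [[1, 1], [1, 1], [1, 0], [1, 0]]
y = [12.0, 14.0, 9.0, 11.0]
beta_hat = least_squares(X, y)
print(beta_hat)
```

In this example the intercept estimate is the mean of the two control observations and the second coefficient is the difference between the two group means. If two columns of X are identical, the solver raises an error, which is the invertibility condition in practice.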


[3] One Way ANOVA : Assumptions, Table
In one-way ANOVA, the response variable is continuous, and there is one categorical factor with at least 2 levels. The group variances are equal or at least fairly similar, and the errors are normally distributed. The absence of outliers is not assumed, and balanced groups are not required. The one-way ANOVA table is as follows.
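
For $k$ groups and $n$ observations in total, the standard layout of the table is sketched here. The symbols $SSB$, $SSW$, $MSB$ and $MSW$ (between-group and within-group sums of squares and mean squares) are notation introduced for this sketch and may differ from the labels in the SAS output.

$$\begin{array}{lcccc}
\text{Source} & \text{df} & \text{SS} & \text{MS} & F \\
\hline
\text{Between groups} & k-1 & SSB & MSB = SSB/(k-1) & MSB/MSW \\
\text{Within groups (error)} & n-k & SSW & MSW = SSW/(n-k) & \\
\text{Total} & n-1 & SSB+SSW & &
\end{array}$$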

The P-value comes from the observed F statistic, and this statistic is the ratio of the between-group mean square to the within-group mean square. Therefore, the main idea is that if the variation between groups is large relative to the variation within groups, there is evidence that the means are different.
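
The ratio can be computed directly from grouped data. The following Python sketch uses a hypothetical function name and made-up observations, and it returns only the F statistic and its degrees of freedom, not the P-value.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group SS: spread of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group SS: spread of the observations around their own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within, k - 1, n - k

# Hypothetical data: three groups of three observations each.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [6.0, 7.0, 8.0]]
f_value, df_between, df_within = one_way_anova(groups)
print(f_value, df_between, df_within)
```

The groups do not have to be the same size, which matches the point that balanced groups are not required.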