
Case Study: SLR with 1 dummy variable in R

 
Case Study: The Spock Conspiracy Trial in R
The main question: is there evidence that women were underrepresented on the venires of Spock's judge compared with the other judges?
 
1. Prepare Data
If you don't have the data set, just follow the procedure below: install the package and load it. Our case study data set is called "case0502".
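A minimal sketch of this step, assuming the data live in the Sleuth3 package (the package name and column names are assumptions; check which Sleuth package you have installed):

```r
# Install the Sleuth3 package once, then load it; it contains case0502,
# the percentages of women on venires of Spock's judge and six others
# install.packages("Sleuth3")
library(Sleuth3)
str(case0502)   # two columns: Percent (percent women) and Judge (a factor)
```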
 
2. Make a Boxplot
Our goal is to compare the mean percent of women between Spock's judge and the other judges (1~6). First of all, we can make a boxplot displaying the five-number summary: minimum, first quartile, median, third quartile, and maximum.  
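For example, a sketch of the boxplot (again assuming the Sleuth3 data set and its Percent/Judge column names):

```r
# Side-by-side boxplots of percent women by judge; each box displays the
# five-number summary (min, Q1, median, Q3, max), plus any outliers
library(Sleuth3)
boxplot(Percent ~ Judge, data = case0502,
        xlab = "Judge", ylab = "Percent of women on venire")
```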
 
3. Conduct the SLR 
First, we divide the judges into two groups, Spock's and the others, by using the ifelse function, since we want to compare the means of two groups. Then we conduct the simple linear regression by using the lm function. The response variable is "percent" and the explanatory variable is "twogroup", which is a dummy variable.  
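A sketch of these two steps; the factor level label "Spock's" is an assumption about how Sleuth3 codes the judge variable, so check `levels(case0502$Judge)` first:

```r
# Collapse the seven judges into two groups, then fit the SLR with the
# dummy variable as the explanatory variable
library(Sleuth3)
twogroup <- ifelse(case0502$Judge == "Spock's", "Spock", "Other")
fit <- lm(Percent ~ twogroup, data = case0502)
summary(fit)   # the slope estimate is the difference between group means
```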
 
4. Conclusion  
If we use anova( ), we get the one-way ANOVA table above. The main idea is that when the between-groups sum of squares (twogroup Sum Sq = 1600.6) is large relative to the within-groups sum of squares (Residuals Sum Sq = 49.79), there is evidence that the means are different. Since we have a very small p-value (1.03e-06), there is strong evidence of a mean difference between the two groups.  
Remark) We will reach the same conclusion if we conduct the SLR. For more information, click! 
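A sketch of producing that table (same assumed Sleuth3 coding as above); the F test here is equivalent to the t test on the dummy-variable slope:

```r
# One-way ANOVA table for the two-group comparison
library(Sleuth3)
twogroup <- ifelse(case0502$Judge == "Spock's", "Spock", "Other")
anova(lm(Percent ~ twogroup, data = case0502))
```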
 
 
 

The Least Squares Estimators are Unbiased

From the previous posting, the least squares estimators are
 1) Estimate of $\beta_{1}$ : $b_{1}=\frac{\sum(X_{i}-\bar{X})(Y_{i}-\bar{Y})} {\sum(X_{i}-\bar{X})^2} = \frac{\sum(X_{i}-\bar{X})Y_{i}}{\sum (X_{i}-\bar{X})^2}=\frac{\sum X_{i}Y_{i}-n\bar{X}\bar{Y}}{\sum X_{i}^2-n\bar{X}^2}$

 2) Estimate of $\beta_{0}$: $b_{0}=\bar{Y}-b_{1}\bar{X}$
 
They are unbiased!! Why? $\Rightarrow$ We need to show that $E[b_{1}]=\beta_{1}$ and $E[b_{0}]=\beta_{0}$
Proof
First of all, there are three things we're going to use.
$k_{i}= \frac{X_{i}-\bar{X}}{\sum(X_{i}-\bar{X})^2}=\frac{X_{i}-\bar{X}}{S_{XX}}$, where $\sum k_{i}= 0, \ \sum k_{i}X_{i}=1, \ \sum k_{i}^2=\frac{1}{S_{XX}}$    
(1) $\sum k_{i}=0 \Rightarrow \sum k_{i}= \frac{\sum(X_{i}-\bar{X})}{S_{XX}}= \frac{\sum X_{i}-n\bar{X}}{S_{XX}}= \frac{n\bar{X}-n\bar{X}}{S_{XX}}=0$
(2) $\sum k_{i}X_{i}=1 \Rightarrow \sum k_{i}X_{i}= \frac{\sum( X_{i}-\bar{X})X_{i}}{S_{XX}}=\frac{\sum(X_{i}-\bar{X})(X_{i}-\bar{X})}{S_{XX}}= \frac{S_{XX}}{S_{XX}}=1$ $(\because \sum(X_{i}-\bar{X})\bar{X}=0)$
(3) $\sum k_{i}^2= \frac{1}{S_{XX}} \Rightarrow \sum k_{i}^2= \sum \left( \frac{X_{i}-\bar{X}}{S_{XX}} \right)^2=\frac{1}{S_{XX}^2}\sum (X_{i}-\bar{X})^2=\frac{S_{XX}}{S_{XX}^2}=\frac{1}{S_{XX}}$ 
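These three identities are easy to check numerically; here is a quick sketch with made-up X values:

```r
# Numeric check of the three k_i identities on arbitrary X values
X   <- c(2, 5, 7, 11, 13)
Sxx <- sum((X - mean(X))^2)
k   <- (X - mean(X)) / Sxx
stopifnot(all.equal(sum(k), 0),          # (1) sum k_i   = 0
          all.equal(sum(k * X), 1),      # (2) sum k_i X_i = 1
          all.equal(sum(k^2), 1 / Sxx))  # (3) sum k_i^2 = 1/Sxx
```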
 
Now, by using these results, we can finally show that the least squares estimators are unbiased!!
Proof
1) $E[b_{1}]= \beta_{1}$  
   $\Rightarrow$ First note that $b_{1}=\sum k_{i}Y_{i}$, so $E[b_{1}]=E[\sum k_{i}Y_{i}]= \sum k_{i}E[Y_{i}]$ since each $k_{i}$ is a constant! And $Y_{i}=\beta_{0}+\beta_{1}X_{i}+\varepsilon _{i}$ with $E[\varepsilon_{i}]=0$, so
                   $E[b_{1}]=\sum k_{i}(\beta_{0}+\beta_{1}X_{i})=\beta_{0} \sum k_{i}+ \beta_{1} \sum k_{i}X_{i}=\beta_{1}$ $(\because \sum k_{i}=0, \ \sum k_{i}X_{i}=1)$

2) $E[b_{0}]=\beta_{0}$
  $\Rightarrow E[b_{0}]=E[\bar{Y}-b_{1}\bar{X}]=E[\bar{Y}]-\bar{X}E[b_{1}]=(\beta_{0}+\beta_{1}\bar{X})-\bar{X}\beta_{1}=\beta_{0}$
        Notice that $Y_{i}$ is a random variable; therefore $\bar{Y}$ and $\sum k_{i}Y_{i}$ are also random!!
        However, $\bar{X}$ is NOT random; it is a constant!
        *$E(aY)=a E(Y)$, where $a$ is a constant!
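Unbiasedness can also be illustrated by simulation: averaging the estimates over many simulated data sets should land close to the true parameters. A sketch with made-up values $\beta_0 = 2$, $\beta_1 = 0.5$:

```r
# Monte Carlo illustration of unbiasedness: simulate many data sets from
# the SLR model and average the least squares estimates
set.seed(1)
beta0 <- 2; beta1 <- 0.5; X <- 1:20
est <- replicate(2000, {
  Y <- beta0 + beta1 * X + rnorm(20, sd = 1)  # Y_i = b0 + b1 X_i + eps_i
  coef(lm(Y ~ X))                             # returns c(b0, b1)
})
rowMeans(est)   # both averages should be close to (2, 0.5)
```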

 




Estimation for the SLR - the Least Squares Method

*How to Estimate Parameter $\beta_{0}, \beta_{1}$ ? $\Rightarrow$ by using the Least Squares Method!!


Reference: http://en.wikipedia.org/wiki/File:Linear_regression.svg

In the graph above, the blue dots are the (X, Y) values from a given data set, and our purpose is to find the red line, which is linear. In order to find this line, we need to figure out the slope and intercept values that minimize the error.

Wait!! Why can we estimate the intercept and slope? Because they are constants!
For more information - click!

[1] Find the line that gives the minimum overall squared errors
Sum of Squared Errors $\mathsf {Q= \sum_{i=1}^n \epsilon _{i}^2 = \sum (Y_{i}-\hat{Y_{i}})^2 = \sum (Y_{i}-\beta_{0}-\beta_{1}X_{i})^2}$

Proof

$\frac{\partial Q}{\partial b_{0}} = 2 \sum_{i=1}^{n}(Y_{i}-b_{0}-b_{1}X_{i})(-1)= 0$
               $\Rightarrow \sum Y_{i}-n b_{0}-b_{1} \sum X_{i}=0 \Rightarrow b_{0}=\frac{\sum Y_{i}}{n}-b_{1} \frac{\sum X_{i}}{n}$
               $\therefore  \bar{Y}-b_{1}\bar{X}=b_{0}$, where $ b_{0},b_{1}$ are unknown! 

 $\frac{\partial Q}{\partial b_{1}}= 2 \sum(Y_{i}-b_{0}-b_{1}X_{i})(-X_{i})=0 $
                 $\Rightarrow \sum X_{i}Y_{i}-b_{0} \sum X_{i}- b_{1} \sum X_{i}^2 =0 $
                 $\Rightarrow \sum X_{i}Y_{i}-(\bar{Y}-b_{1}\bar{X})\sum X_{i}-b_{1} \sum X_{i}^2=0$
                 $\Rightarrow \sum X_{i}Y_{i}-\bar{Y} \sum X_{i}=b_{1}\left(\sum X_{i}^2-\bar{X} \sum X_{i}\right)$
                 $\therefore b_{1}=\frac{\sum X_{i}Y_{i}-\bar{Y} \sum X_{i}}{\sum X_{i}^2 - \bar{X} \sum X_{i}}= \frac{\sum (X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sum (X_{i}-\bar{X})^2}$


                Remark) The slope $b_{1}$ is a linear combination of the $Y_{i}$'s, with constant coefficients built from the $X_{i}$'s
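The closed-form estimates above should agree with what R's lm function returns; a quick sketch on made-up data:

```r
# Check the closed-form least squares estimates against lm() on toy data
X <- c(1, 2, 4, 5, 8); Y <- c(3, 5, 6, 9, 12)
b1 <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
b0 <- mean(Y) - b1 * mean(X)
fit <- lm(Y ~ X)
stopifnot(all.equal(unname(coef(fit)), c(b0, b1)))  # same (b0, b1)
```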

 

[2] Properties of the Least Squares Fitted Line

(1) $\sum_{i=1}^n e_{i}=0$
     Proof $\Rightarrow \sum (Y_{i}-\hat{Y}_{i})= \sum (Y_{i}-b_{0}-b_{1}X_{i})$
                                         $=\sum Y_{i}-n(\bar{Y}-b_{1}\bar{X}) - b_{1} \sum X_{i}$
                                         $=n \bar{Y} - n \bar{Y} + n b_{1}\bar{X}-b_{1}n \bar{X}=0$
(2) $\sum_{i=1}^n \hat{Y_{i}}= \sum_{i=1}^n Y_{i}$
     Proof $\Rightarrow \sum_{i=1}^n \hat{Y_{i}}= \sum(Y_{i}-e_{i})= \sum Y_{i} - \sum e_{i} = \sum_{i=1}^n Y_{i}$ $(\because \sum e_{i}=0)$
 
(3) $\sum_{i=1}^n X_{i}e_{i}=0$
     Proof $\Rightarrow \sum X_{i}e_{i}= \sum(X_{i}-\bar{X})e_{i}$ $(\because \sum e_{i}=0)$
                                 $= \sum (X_{i}-\bar{X})\left[(Y_{i}-\bar{Y})-b_{1}(X_{i}-\bar{X})\right]$ $(\because Y_{i}-b_{0}-b_{1}X_{i}=(Y_{i}-\bar{Y})-b_{1}(X_{i}-\bar{X}))$
                                 $= S_{XY}-b_{1}S_{XX}$  $\because b_{1}= \frac{S_{XY}}{S_{XX}} \Rightarrow \therefore S_{XY}-\frac{S_{XY}}{S_{XX}}\cdot S_{XX}=0$ 

(4) $\sum_{i=1}^n \hat{Y}_{i}e_{i}=0$ 
     Proof $\Rightarrow \sum(b_{0}+b_{1}X_{i})e_{i}=b_{0}\sum e_{i}+ b_{1}\sum X_{i}e_{i}=0$   $\because \sum e_{i}=0,\ \sum X_{i}e_{i}=0$  
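All four properties of the fitted line can be verified numerically; a sketch on made-up data:

```r
# Numeric check of the four properties of the least squares fitted line
X <- c(1, 2, 4, 5, 8); Y <- c(3, 5, 6, 9, 12)
fit  <- lm(Y ~ X)
e    <- residuals(fit)   # e_i = Y_i - Yhat_i
Yhat <- fitted(fit)
stopifnot(all.equal(sum(e), 0),            # (1) residuals sum to zero
          all.equal(sum(Yhat), sum(Y)),    # (2) fitted values sum to sum(Y)
          all.equal(sum(X * e), 0),        # (3) X is orthogonal to e
          all.equal(sum(Yhat * e), 0))     # (4) Yhat is orthogonal to e
```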
  


  

1. Simple Linear Regression (SLR)'s Equation and Assumptions


[1] What Is a Regression Model?
It describes a *statistical relationship between X and Y, where X is called the explanatory, independent, or predictor variable, and Y is called the response or dependent variable.

 *Wait!! What’s the statistical relationship?
- In mathematics, y=f(x) is called a functional relationship, where f(x) is some EXACT function. Therefore, you can draw its graph, and there is no need to predict the Y values as long as you can figure out the equation of the function.
- In statistics, however, there exists an error term (ε), such as measurement error, whose values we don't know. Therefore, the statistical relationship would be explained by y=f(x) + ε. 


[2] Simple Linear Regression (SLR) Equation
Here, ‘simple’ means there is one predictor only!  

Given an observational data set, suppose we want to predict Y values based on X, the explanatory variable; then we need to build up a statistical-relationship equation using these variables. The equation should contain a slope, an intercept, and also an error term, as this is a statistical relationship. 
Therefore, the equation is $Y_{i}= \beta_{0}+\beta_{1}X_{i}+ \varepsilon _{i}$

In this equation, we know the Y and X values from the data set, but we don’t know what β0 (intercept), β1 (slope), and the εi’s are. We can estimate the intercept and slope from the analysis since they are constants. However, Y and ε are random, which means that once we know their means and variances (together with the normality assumption below), we will also know their distributions!!
 
Let's find out the error term's distribution first!
[3] The SLR Assumptions: The Error terms’ Mean and Variance.
There are three SLR assumptions regarding the error term:
(1) E[εi] = 0
(2) Var[εi] = σ2 (constant variance)
(3) Cov[εi, εj] = 0 for i ≠ j — the error terms are uncorrelated. 
With the additional normality assumption, the error terms’ distribution will be εi ~ N(0, σ2)


The Y is also random value as the error term is a random. Therefore, we can find Y's mean and variance as well.  

[4] The Y’s Mean and Variance 
(1) E[Yi]= β0+ β1Xi
Proof: E[Yi] = E[β0+ β1Xi + εi] = E[β0] + E[β1Xi] + E[εi] = β0 + β1Xi
         By the assumption above, E[εi] = 0, and β0, β1, and Xi are constants.

(2) Var[Yi]= σ2
Proof: Var[Yi] = Var[β0+ β1Xi + εi] = Var[εi] = σ2
        By the assumption above, Var[εi] = σ2, and β0 + β1Xi is a constant.

(3) Cov[Yi, Yj] = 0 for i ≠ j
Proof: Cov[β0+ β1Xi + εi, β0+ β1Xj + εj] = Cov[β0, β0] + Cov[β0, β1Xj] +…
         (expanding each term)… + Cov[εi, εj] = 0. Since Cov[constant, random] = 0,
         only Cov[εi, εj] is left, and its value is 0 by assumption.
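The mean and variance results for Y can be illustrated by simulation at a single fixed x, using made-up parameter values (β0 = 2, β1 = 0.5, σ = 3):

```r
# Simulate Y = beta0 + beta1*x + eps many times at a fixed x and check
# that the sample mean and variance match E[Y] and Var[Y]
set.seed(1)
beta0 <- 2; beta1 <- 0.5; x <- 4; sigma <- 3
Y <- beta0 + beta1 * x + rnorm(1e5, mean = 0, sd = sigma)
mean(Y)   # should be close to beta0 + beta1*x = 4
var(Y)    # should be close to sigma^2 = 9
```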


Remark!! A statistical relationship between X and Y does NOT necessarily mean that X causes Y, since these X and Y come from observational data.