Showing posts with label The Generalized Linear Model. Show all posts

Case Study : 2X2 Contingency Table in SAS

Case Study : Framingham Heart study 
In 1948, 5209 healthy men and women aged 30-60 in Massachusetts were recruited and followed to examine risk factors for cardiovascular disease (CVD). After 10 years, researchers checked whether each participant had developed CVD. The main question is whether high cholesterol is associated with an increased risk of CVD.
 
Reference: https://www.framinghamheartstudy.org/, https://en.wikipedia.org/wiki/Framingham_Heart_Study  
 
 
[1] Data Assumption and SAS Code for reading the data. 
The total sample size (n) is 1329 men. Cholesterol was measured in 1948, and after ten years researchers checked whether cardiovascular disease (CVD) was present or not. Therefore, each observation can be classified in 2 ways: cholesterol status (H or L) and CVD status (present or absent).
 
Recall, our assumption is that the total sample size (n=1329) is fixed!! And cholesterol status and CVD status are categorical random variables, each with 2 levels.
 
data fram;
  input chol $ cvd $ count;
  datalines;
low present 51
low absent 992
high present 41
high absent 245
;
run;
 

 
This SAS code produces the data table below.  
 

 
[2] Hypothesis
Recall that two variables A and B are independent if and only if P(AB)=P(A)xP(B). Therefore, the main idea is to compare the joint distribution with the marginal distributions, where  
 
The Joint distribution:
The probability that an observation falls into row i, column j, for i, j = 1, 2 $= P(C=i, D=j)= \pi_{ij}$
 
The Marginal distribution:
The probability an observation falls into row i $= P(C=i)= \pi_{i \cdot}$
The probability an observation falls into column j $= P(D=j)= \pi_{ \cdot j}$
 
$H_0$ : $\pi_{ij}=\pi_{i\cdot}\pi_{\cdot j}$ There is no relationship between Cholesterol and CVD.
$H_1$ : $\pi_{ij} \neq \pi_{i\cdot}\pi_{\cdot j}$  
 
 
 [3] SAS Code - 2x2 Contingency Table  
proc freq data=fram;
 weight count;
 table chol*cvd / chisq;
run;
  
 
 
[4] Conclusion
Under the null hypothesis, with large samples, our test statistic approximately follows a chi-squared distribution with (I-1)(J-1) degrees of freedom.
 
From the SAS result, the chi-square statistic is 31.0818 with 1 degree of freedom, and the P-value is less than 0.0001, so we reject the null hypothesis. Therefore, we have strong evidence that cholesterol and CVD are not independent, which means CVD status depends on cholesterol level.
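The reported statistic can be verified by hand. The following is a minimal Python sketch (not part of the original SAS analysis) that rebuilds the expected counts under independence and sums the chi-square contributions:

```python
# Observed counts from the Framingham 2x2 table.
counts = {("low", "present"): 51, ("low", "absent"): 992,
          ("high", "present"): 41, ("high", "absent"): 245}
n = sum(counts.values())  # 1329, the fixed grand total

# Marginal (row and column) totals.
row = {r: sum(v for (ri, ci), v in counts.items() if ri == r) for r in ("low", "high")}
col = {c: sum(v for (ri, ci), v in counts.items() if ci == c) for c in ("present", "absent")}

# X^2 = sum of (observed - expected)^2 / expected, with expected = row*col/n.
x2 = sum((v - row[r] * col[c] / n) ** 2 / (row[r] * col[c] / n)
         for (r, c), v in counts.items())
print(round(x2, 4))  # 31.0818, matching the SAS output
```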

Intro to Log-linear Model - IxJ Contingency Table

The main idea is to measure the discrepancy between the expected counts and the observed counts. If the discrepancy is small, the two are close to each other.

Modeling the cell counts of a contingency table is a common use of the log-linear model, which is an example of a generalized linear model.

[1] IxJ Contingency Table
There is a row factor with I levels and a column factor with J levels.

 
[2] Notation and Hypothesis
2.1 The Joint distribution of A and B
The probability that an observation falls into row i, column j, for i=1,...,I, j=1,...,J $= P(A=i, B=j)= \pi_{ij}$

2.2 The Marginal distribution of A and B
The probability an observation falls into row i $= P(A=i)= \pi_{i \cdot}$
The probability an observation falls into column j $= P(B=j)= \pi_{ \cdot j}$

Our purpose is to assess the relationship between A and B (whether A and B are independent or not) based on the joint distribution and the marginal distributions.
Recall that two variables A and B are independent if and only if P(AB)=P(A)xP(B)

Therefore our hypothesis is...  
$H_0$ : $\pi_{ij}=\pi_{i\cdot}\pi_{\cdot j}$ There is NO relationship between A and B.
$H_1$ : $\pi_{ij} \neq \pi_{i\cdot}\pi_{\cdot j}$  


[3] Assumption
Our assumption is that the overall (grand) total, n, is fixed, and that the data are counts.


[4] Test Statistic
Under the null hypothesis, we estimate the expected count, $\mu_{ij}$ for the (i,j)th cell.
$\hat{\mu}_{ij}= n \cdot \hat{\pi}_{i \cdot} \cdot \hat{\pi}_{\cdot j}=n \left ( \frac{y_{i \cdot}}{n} \right )\left ( \frac{y_{\cdot j}}{n} \right )= \frac{y_{i \cdot} y_{\cdot j}}{n}$

Test statistic is  $X^2= \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{\left ( y_{ij}-\hat{\mu}_{ij} \right )^2}{\hat{\mu}_{ij}}$,
where, under $H_0$ with large samples, $X^2 \sim \chi^2_{df}$ approximately, with df=(I-1)(J-1)

[5] Conclusion
The chi-square test compares the observed test statistic to a chi-squared critical value.
We reject the null hypothesis if the observed statistic is greater than the critical value, which can be found in a chi-squared table.

In order to look up a critical value, we need the degrees of freedom, the significance level, and the tail of the test (the chi-squared GOF test is upper-tailed).

Case Study : Poisson Regression Model

Case Study : Mating Success of Elephants


Joyce Poole studied a population of African elephants in Amboseli National Park for 8 years. The data contain the number of successful matings (0 or more) and the ages (between 27 and 52 years at the start of the study) of 41 male elephants. The main question is: what is the relationship between mating success and age?


Reference: Ramsey, F.L. and Schafer, D.W. (2002). The Statistical Sleuth: A Course in Methods of Data Analysis (2nd ed), Duxbury. // Data from J.H. Poole, "Mate Guarding, Reproductive Success and Female Choice in African Elephants", Animal Behavior 37(1989): 842-49 In R: http://finzi.psych.upenn.edu/library/Sleuth3/html/case2201.html

[1] Data and Model
The predictor variable is AGE, between 27 and 52 years, and the outcome is the number of successful matings (0 or more). Note that E(Y)= $\mu$

As the outcome consists of small counts, it will not follow a normal distribution at each age. The outcome is also neither binary nor binomial (because there is no fixed number of trials with a definite upper limit). Therefore, the underlying probability distribution of the response is Poisson.

Model: $g(\mu_i)=\log(\mu_i)= \beta_0+\beta_1 \cdot AGE_i$,
where $\mu_i$ is the mean number of matings for an elephant of age $AGE_i$.
Then, $\mu_i=\exp\left \{ \beta_0+\beta_1 \cdot AGE_i \right \}$. If age increases by one unit, then $\mu$ changes by a factor of $\exp(\beta_1)$.

[2] SAS Code
Proc genmod can be used for any generalized linear model. 

proc genmod;
  model matings = age / dist=poisson;
run;

[3] SAS Result


3.1 Fitted model
$\log(\hat{\mu})= - 1.5820 + 0.0687 \cdot AGE$
We have a small P-value ( < 0.001) based on the Wald test, so there is strong evidence that the mean number of successful matings depends on age. Therefore, for every 1-year increase in age, the mean number of successful matings increases by a factor of exp(0.0687)=1.071 (~7%)   
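As a sanity check on this interpretation, the multiplicative effect of one extra year can be computed by hand from the reported coefficients:

```python
import math

# Fitted Poisson model from proc genmod: log(mu) = -1.5820 + 0.0687*AGE.
b0, b1 = -1.5820, 0.0687

def mu(age):
    """Estimated mean number of matings at a given age."""
    return math.exp(b0 + b1 * age)

# The ratio of means one year apart equals exp(b1), regardless of age.
factor = mu(28) / mu(27)
print(round(factor, 3))  # 1.071 -> about a 7% increase per year
```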


3.2 Fitted Model vs Saturated Model
$H_0$ : Fitted model fits as well as saturated model.
$H_1$ : Saturated model (which uses indicator variables for each value of Age) fits better. 
Test statistic (deviance) : 51.0116
Distribution of the test statistic : chi-squared with n-p = 41-2 = 39 degrees of freedom (n = sample size, p = number of parameters, here $\beta_0, \beta_1$)
P-value (calculated in R) : 0.0943
Therefore, there is only weak evidence against the null hypothesis, so the fitted model fits about as well as the saturated model.
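The P-value from R can be reproduced without R. This sketch integrates the chi-squared density numerically (plain Python, no scipy needed):

```python
import math

# P-value = P(chi2 with 39 df > 51.0116), by Simpson's rule integration.
def chi2_pdf(t, k):
    # Density of the chi-squared distribution with k degrees of freedom.
    return t ** (k / 2 - 1) * math.exp(-t / 2) / (2 ** (k / 2) * math.gamma(k / 2))

def chi2_sf(x, k, upper=800.0, steps=20000):
    # Composite Simpson's rule on [x, upper]; the tail beyond `upper` is negligible.
    h = (upper - x) / steps
    s = chi2_pdf(x, k) + chi2_pdf(upper, k)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * chi2_pdf(x + i * h, k)
    return s * h / 3

p = chi2_sf(51.0116, 39)
print(round(p, 4))  # ~0.0943, agreeing with the value from R
```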

3.3 Dispersion parameter
In order to assess the equality of mean and variance, we can check Deviance/DF = 51.0116/39 and Pearson Chi-Square/DF = 45.1360/39. Both values are close to 1, so the Poisson model appears appropriate.  

Intro to Poisson Regression Model


[1] When do we use Poisson Model?
The Poisson model is an example of a generalized linear model which is useful for counts of rare events.

If outcomes are counts and small numbers, they will not have a normal distribution, so we cannot apply linear regression. If outcomes are not binary, or there is no fixed number of trials, we cannot apply logistic regression either.


[2] Model
Recall)  Y ~ Poisson($\mu$), then the probability mass function is $P(Y=y)=\frac{\mu^y e^{-\mu}}{y!}$, y=0,1,...
And E(Y)=$\mu$, and Var(Y)=$\mu$.

As the Poisson regression model is an example of a generalized linear model, we need a link function g($\cdot$) in order to make a linear relationship between E(Y) and the X's.
The Poisson link function is $g(\mu)=\log(\mu)$, so the model for E(Y) is a linear function in the parameters, $g(E(Y))= \beta_0+\beta_1 X_1+...+ \beta_pX_p=\mathbb{X}\beta$
Therefore, the Poisson regression model is also called a "log-linear" model.

So, if one of the explanatory variables $x_j$ increases by one unit, holding the other variables constant, $\mu$ changes by a factor of exp($\beta_j$).


[3] Estimation of Model Parameters
So, how can we estimate model parameters? - We can use Maximum Likelihood Estimation.
Likelihood function: $L=\prod_{i=1}^{n}\frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!}$
The full log-likelihood function : $\sum_{i=1}^{n} \left \{ y_i \cdot \log(\mu_i)-\mu_i - \log(y_i!)\right \}$

Note that, in SAS, the log-likelihood is $\sum_{i=1}^{n} \left \{ y_i \cdot \log(\mu_i)-\mu_i \right \}$, in which the constant term ($- \log(y_i!)$) is dropped, so this log-likelihood is larger than the full log-likelihood.

So, SAS or R will automatically find the parameter estimates.
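For intuition about what the software is doing, here is a minimal Newton-Raphson fit (equivalent to Fisher scoring under the canonical log link) of a one-predictor Poisson model. The tiny data set is made up purely for illustration:

```python
import math

# Newton-Raphson for Poisson regression log(mu_i) = b0 + b1*x_i.
# Hypothetical toy data (NOT from the elephant study).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 0.0, 2.0, 4.0, 5.0, 9.0]

b0, b1 = math.log(sum(y) / len(y)), 0.0   # start at the null (intercept-only) model
for _ in range(25):
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    # Score vector U = X'(y - mu)
    u0 = sum(yi - mi for yi, mi in zip(y, mu))
    u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
    # Fisher information I = X' diag(mu) X (2x2, inverted in closed form)
    i00, i01 = sum(mu), sum(xi * mi for xi, mi in zip(x, mu))
    i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
    det = i00 * i11 - i01 * i01
    b0 += (i11 * u0 - i01 * u1) / det      # Newton step: beta += I^{-1} U
    b1 += (i00 * u1 - i01 * u0) / det

# At the MLE the score equations hold, e.g. sum(y) = sum(mu_hat).
mu = [math.exp(b0 + b1 * xi) for xi in x]
print(abs(sum(y) - sum(mu)) < 1e-8)  # True
```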


[4] Model Assessment
Note that this has similar procedure as in Binomial Logistic Regression.

4.1 Checking for a linear relationship : plot $\log(y_i)$ (or the log of the group mean counts) versus the X's  

4.2 Correct form (whether $\beta's$ are zero or not!): Check Wald and LRT tests
Wald Procedure
Hypothesis : $H_{0}:\beta_{j}=0$, which means $X_{j}$ has no effect on the log-mean!!, vs $H_{1}:\beta_{j}\neq 0$
Test Statistics : $Z_{obs}=\frac{\hat{\beta_{j}}}{se(\hat{\beta_{j}})}$
                        where $\hat{\beta_{j}}$ is a maximum likelihood estimate.
Note that if the sample is large enough, MLEs are normally distributed, so that under $H_{0}$ our test statistic, $Z_{obs}$, is an observation from an approximate Normal(0,1) distribution!!
95% Confidence interval : $\hat{\beta_{j}}\pm 1.96 \cdot se(\hat{\beta_{j}})$

LRT (Likelihood Ratio Test)
Likelihood Ratio : $\frac{L_{R}}{L_{F}}$, where $L_{R}$ is reduced model, and $L_{F}$ is full model of same data.
Hypothesis :
$H_{0}: \beta_{1}=...=\beta_{k}=0$ The reduced model is appropriate; it fits the data as well as the full model.
$H_{1}:$ at least one of $\beta_{1},...,\beta_{k}$ is nonzero
Test Statistics : $G^2=-2 \log L_{R}-(-2\log L_{F})=-2 \log \frac{L_R}{L_F}$
Note that, under the null hypothesis, $G^2$ is an observation from a chi-squared distribution with k degrees of freedom for large n, where k is the number of parameters dropped in the reduced model.

4.3 Adequate Fit: Deviance GOF test
$H_0$ : The fitted model is adequate vs $H_1$ : The saturated model is required. 
A smaller deviance implies the fitted model is good enough for the data.
Wait! What is the Deviance?
Deviance = $-2 \log \frac{L_F}{L_S}= -2 (\log L_F-\log L_S)=2(\log L_S-\log L_F)$,
where $L_F$ and $L_S$ are the likelihoods of the fitted and saturated models.
Recall) For the Poisson model, the likelihood is $L=\prod_{i=1}^{n}\frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!}$ (n = total sample size)

4.4 Outliers: Look at the residuals, both deviance and Pearson residuals.
If all residuals are less than 2 in absolute value, we consider that there are no outliers.
In SAS, use reschi for Pearson residuals and resdev for deviance residuals in the output statement of proc genmod.  


[5] Common Problem (Dispersion Parameter) 
In the Poisson model, we expect the variance to be equal to the mean, $\mu$. However, sometimes the variance is larger than the mean (overdispersion). 

If $\frac{deviance}{df}$ and $\frac{\text{Pearson } \chi^2}{df}$ are close to 1, then the Poisson model is appropriate. If not, we need to add an extra dispersion parameter to the model to account for the extra variability.  

Case Study: Binomial Logistic Regression

Case Study : Krunnit Islands off Finland.

In a study of the Krunnit Islands archipelago, researchers presented results of extensive bird surveys taken over four decades. They visited each island several times, cataloguing species. If a species was found on a specific island in 1949, it was considered to be at risk of extinction for the next survey of the island in 1959. If it was not found in 1959, it was counted as an “extinction”, even though it might reappear later.

The main question is whether large or small area preserves protect species better!  

Reference : Ramsey, F.L. and Schafer, D.W. (2013). The Statistical Sleuth: A Course in Methods of Data Analysis (3rd ed), Cengage Learning. (Data from Vaisanen and Jarvinen, “Dynamics of Protected Bird Communities in a Finnish Archipelago,” Journal of Animal Ecology 46 (1977): 891-908.)


[1] Data and Model
The data contain counts of bird species for 18 Krunnit Islands off Finland: 1) AREA (area of the island, $x_i$ ), 2) ATRISK (number of species on each island in 1949, $m_i$ ), and 3) EXTINCT (number of species no longer found on the island in 1959, $y_i$ ).

$\pi_i$ is the probability of extinction on each island, assuming species survive independently.
Then, $Y_i\sim Binomial (m_i,\pi_i)$

Observed response proportion is $\hat{\pi_i}=\frac{y_i}{m_i}$ = EXTINCT / ATRISK
Observed logit : $\log(\frac{\hat{\pi}_{S,i}}{1-\hat{\pi}_{S,i}})$, where S means "Saturated" 
Model : $\log(\frac{\pi_{S,i}}{1-\pi_{S,i}})= \beta_0+\beta_1 \cdot AREA_i$


[2] Initial Assessment
If we plot the observed logits versus Area, it does not seem to be a linear relationship. An assumption for a linear relationship between logit and explanatory variable is violated.
 
proc sgplot;
  scatter y=logitpi x=area;
run;

Therefore, it is better to look at log(Area) instead of just Area.
Our model :  $\log(\frac{\pi_{S,i}}{1-\pi_{S,i}})= \beta_0+\beta_1 \cdot \log(AREA_i)$

proc sgplot;
  scatter y=logitpi x=logarea;
run;



[3] SAS Result
 
proc logistic plots(only)=effect;
  model nextinct/nspecies = logarea / scale=none;
run;
* the scale=none option requests the Pearson and deviance GOF tests;
 


The number of observations is 18, and the sum of frequencies is 632 (=$\sum_{i=1}^{18}m_i$ )
Fitted model: $logit(\hat{\pi})= - 1.196-0.297 \log(AREA)$

Testing Global Null Hypothesis : BETA=0 (is log Area significant?) 
- The likelihood ratio statistic is 33.2765, computed as 578.013 - 544.736, and the degrees of freedom is 1 as we are testing only $\beta_1$.
- In conclusion, we have a small P-value (< 0.0001), so we have strong evidence that the coefficient of log(AREA) is not zero, which means log(AREA) is useful for predicting the logit.

[4] Interpretation of $\beta_1$
Our model is  $\log(\frac{\pi_{S,i}}{1-\pi_{S,i}})= \beta_0+\beta_1 \cdot \log(AREA_i)$
$\frac{\pi}{1-\pi}= e^{\beta_0} \cdot e^{\beta_1 \log(x)}=e^{\beta_0}x^{\beta_1}$

Therefore, changing x by a factor of h changes the odds by a factor of $h^{\beta_1}$. For example, if we double the island area, the odds change by a factor of $2^{-0.2971}$ = 0.81, which means the odds of extinction on the larger island are 81% of the odds of extinction on an island half its size.  
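Checking the doubling computation directly:

```python
# Doubling the island area (h = 2) changes the odds by h^beta1,
# with beta1 = -0.2971 from the fitted model.
factor = 2 ** (-0.2971)
print(round(factor, 2))  # 0.81
```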
 
[5] Estimate the Probability of Extinction
In Binomial Logistic Regression, we can estimate the probability of extinction for a species on the specific island based on the fitted model (M); $logit(\hat{\pi}_{M,i})= - 1.196-0.297 \log(AREA)$
 
For example, the area of Ulkokrunni island (i=1) is 185.5 $km^2$.
Then, $logit(\hat{\pi}_{M,i})= -1.196 - 0.297 \cdot \log(185.5)=-2.747$ (for the calculation, use the natural log)
Then, $\hat{\pi}_{M,i}= \frac{e^{-2.747}}{1+e^{-2.747}}=0.06$ 
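The same computation in Python, using the rounded coefficients (SAS's full-precision estimates may differ slightly in the last digit):

```python
import math

# Estimated extinction probability for Ulkokrunni (area = 185.5 km^2),
# from the fitted model logit(pi) = -1.196 - 0.297*log(AREA).
b0, b1 = -1.196, -0.297
logit = b0 + b1 * math.log(185.5)              # natural log
pi = math.exp(logit) / (1 + math.exp(logit))   # invert the logit
print(round(logit, 3), round(pi, 2))  # -2.747 0.06
```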

[6] Deviance Goodness of Fit Test (Drop-in-Deviance Test)  
In order to compare between reduced fitted model (reduced, R) vs saturated model (full, F), we can use Drop-in-Deviance Test. 

$H_0$: Our fitted model fits data as well as saturated model. 
$H_1$: Saturated model $logit(\pi)= \beta_0+ \beta_1 I_1+...+ \beta_{n-1}I_{n-1}$ is better. 
Test statistic: Deviance = $-2 \log \frac{L_R}{L_F}=12.06$ with n-(p+1) = 18 - (1+1) = 16 df, where p is the number of slope parameters 
Conclusion: As we have a high P-value (0.7397), the fitted model is good enough; the saturated model is not needed.    

The Binomial Logistic Regression


Binomial logistic regression model is an example of Generalized Linear Model.

[1] Assumptions
Observations are independent, and the sample size is large enough for valid inference (tests and confidence intervals), as the generalized linear model uses MLE (maximum likelihood estimation) to estimate the parameter coefficients.

There is a linear relationship between observed logit and quantitative explanatory variables.

[2] Model
The response variable is a binomial count of the number of "successes": Y ~ Binomial(m, $\pi$)
$P(Y=y)=\binom{m}{y}\pi^y(1-\pi)^{m-y}$, y=0,1,...,m

If we consider $\frac{y}{m}$ as a proportion of success out of m independent Bernoulli trials then,
 - The mean: $E(\frac{y}{m})=\pi$
 - The variance: $Var(\frac{y}{m})= \frac{\pi(1-\pi)}{m}$  
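These two moments can be checked by exact enumeration of the binomial pmf; m = 10 and pi = 0.3 below are arbitrary illustration values:

```python
import math

# Exact mean and variance of the sample proportion y/m for Binomial(m, p).
m, p = 10, 0.3
pmf = [math.comb(m, y) * p**y * (1 - p) ** (m - y) for y in range(m + 1)]
mean = sum((y / m) * pmf[y] for y in range(m + 1))
var = sum((y / m - mean) ** 2 * pmf[y] for y in range(m + 1))
print(round(mean, 6), round(var, 6))  # 0.3 and p*(1-p)/m = 0.021
```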

Model: $\log (\frac{\pi}{1-\pi})= f(\mathbb{X}; \beta)$ where $f(\mathbb{X}; \beta)$ is a linear function of the $\beta$'s.


[3] Models and Deviance & Global Likelihood Ratio Test 
In Binomial Logistic regression, we can do more tests for model adequacy than in Binary Logistic Regression.

There are three types of models (saturated, fitted, and null) describing the data, and we choose the most adequate among them through tests.

3.1 Saturated Model (or Full Model)
It is an exact model describing the data so that predictor variables are indicator variables for each level of X!
$logit (\hat{\pi})= \hat{\alpha_{0}}+ \hat{\alpha_{1}}I_1+...+ \hat{\alpha}_{n-1}I_{n-1}$

3.2 Fitted Model (or Reduced Model)
The model with only a few predictor variables.
$logit (\hat{\pi})= \hat{\beta_{0}}+ \hat{\beta_{1}}X$

3.3 Null model
The model with intercept only!
$logit (\hat{\pi})= \hat{\gamma_{0}}$

In order to compare the Saturated and Fitted models, we use the Deviance Test!
In order to compare the Fitted and Null models, we use the Global Likelihood Ratio Test!

3.4 Deviance Test - Saturated vs. Fitted
$H_0$ : The fitted model is adequate vs $H_1$ : The saturated model is required. 
A smaller deviance implies the fitted model is good enough for the data.

Wait! What is the Deviance?
Deviance = $-2 \log \frac{L_F}{L_S}= -2 (\log L_F-\log L_S)=2(\log L_S-\log L_F)$,
where $L_F$ and $L_S$ are the likelihoods of the fitted and saturated models.
Recall) The likelihood is $L=\prod_{i=1}^{n}\binom{m_i}{y_i}\pi^{y_i}_i (1-\pi_i)^{m_i-y_i}$ ($m_i$ = fixed number of trials, n = total sample size)

3.5 Global Likelihood Ratio Test - Fitted vs. Null
$H_0$ : The Null model is enough vs $H_1$ : The Fitted model is required.  


[4] Other Model Fit Statistics - AIC & BIC
There are two popular fit statistics, AIC and SC, for comparing models with the same response and the same data.

4.1 Akaike's Information Criterion (AIC)
     AIC = -2 log L + 2(p+1),
              where 2(p+1) is a penalty and p+1 : # parameters including $\beta_0$

4.2 Schwarz's (Bayesian Information) Criterion (SC)
     SC = -2 log L + (p+1) log N,
            where (p+1) log N is a penalty, p is the number of explanatory variables, and N is the sample size
     SC applies a stronger penalty for model complexity than AIC.

If the difference in AICs is > 10, then one model fits better than the other. If the difference is < 2, the models can be considered equivalent. Overall, smaller is better!
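Both formulas are easy to compute by hand. The values of -2 log L, p, and N below are hypothetical illustration values, not output from a specific fit in this post:

```python
import math

# AIC and SC from -2 log L (hypothetical illustration values).
neg2logL = 51.256   # -2 log L of the fitted model
p = 2               # number of explanatory variables
N = 45              # sample size
aic = neg2logL + 2 * (p + 1)             # penalty 2(p+1)
sc = neg2logL + (p + 1) * math.log(N)    # penalty (p+1) log N, stronger for N > e^2
print(round(aic, 3), round(sc, 3))  # 57.256 62.676
```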

Case Study : The Binary Logistic Regression in SAS

 
Case Study : The Donner Party in SAS  
 
The Donner and Reed families (87 people), travelling by covered wagon, got stuck in a snowstorm in the Sierra Nevada in October 1846. By the time they were rescued the following April, 40 members had died from starvation and harsh conditions. The researchers attempted to investigate whether females were better able to withstand these conditions than males, i.e., whether the odds of survival differed between males and females at any given age.
 
Reference: Grayson, D.K., 1990, "Donner Party deaths: A demographic assessment," Journal of Anthropological Research, 46, 223-42, and Ramsey, F.L. and Schafer, D.W., 2002, The Statistical Sleuth, 2nd Ed, Duxbury Press, p. 580.
 
[1] Data and Model
Response $Y_{i}$ : binary variable; survived/died. (Note: not continuous and normal.)
Predictors $X_{i}$ : age and sex of the ith pioneer. 
Odds in favor of success is $\frac{\pi}{1-\pi}$ & Log Odds: $\log \frac{\pi}{1-\pi}$
Model :  $\log \frac{\pi}{1-\pi}= \beta_{0} + \beta_{1}AGE_{i} + \beta_{2}SEX_{i}$, i=1,...,45 (binary logistic regression)
 
Note that we cannot predict survival ($\pi$=1) or death ($\pi$=0) of a pioneer, but we can estimate $\pi_{i}$, the probability of survival, odds of survival and log-odds of survival based on the predictors!
 
[2] SAS Code & Result - proc logistic
The default is alphanumeric order, so if we run SAS with the default, Die comes before Survived; therefore, $\pi$=P(Die)!! In order to reverse the order, we can use the DESCENDING option in the proc statement, which makes $\pi$=P(Survived) 

In a generalized linear model, we estimate the parameter coefficients by maximum likelihood, which is calculated by the Fisher scoring algorithm in SAS.
 
2.1 Default Code
 

Note that the predictor variable SEX is a categorical variable. The class statement creates female = 1, and male = -1.
 
2.2 Default  Result


2.3 SAS Result Interpretation 
* Model Equation
$\log(\frac{\hat{\pi}}{1-\hat{\pi}})=-2.43+0.078 \cdot AGE_{i}-0.80\cdot SEX_{i}$


* Wald Procedures
Hypothesis : $H_{0}: \beta_j=0$ vs $H_{1}: \beta_j\neq 0$
Test Statistics: $Z_{obs}=\frac{\hat{\beta_j}}{se(\hat{\beta_j})}$ ~ approx. Normal(0,1) distribution
95% Confidence Interval : $\hat{\beta_j}\pm1.96 \cdot se(\hat{\beta_j})$

Does the Age predictor variable have effect on log-odds?
Hypothesis: $H_{0}: \beta_{AGE}=0$ vs $H_{1}: \beta_{AGE}\neq 0$
Test statistic :  $(\frac{-0.078}{0.0373})^2=4.3988$ ~ $\chi^2_{1}$ (Why? Because if $Z$ ~ N(0,1) then $Z^2$ ~ $\chi^2_{1}$)  
P-value : 0.036 which means we have moderate evidence that AGE has an effect on survival.
95% Confidence Interval : $-0.078 \pm 1.96 \cdot 0.0373 = (-0.15, -0.0055)$
CI for Odds Ratio : $(e^{-0.15}, e^{-0.0055})=(0.86, 0.995)$
Interpretation for the odds ratio: for the same sex, the odds ratio for a 1-year increase in age is between 0.86 and 0.995.

Note that if a predictor variable ($X_1$) increases by 1 unit, holding all other predictor variables constant, the odds that Y=1 change by a multiplicative factor of $e^{\beta_1}$.
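The Wald interval arithmetic above can be reproduced directly (rounded inputs, so the endpoints may differ from SAS's full-precision output in the last digit):

```python
import math

# Wald 95% CI for beta_AGE and the corresponding odds-ratio interval,
# using the rounded estimate and standard error quoted above.
beta_hat, se = -0.078, 0.0373
lo, hi = beta_hat - 1.96 * se, beta_hat + 1.96 * se
print(round(lo, 4), round(hi, 4))                       # CI for beta
print(round(math.exp(lo), 2), round(math.exp(hi), 3))   # CI for the odds ratio
```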


* Likelihood Ratio Test
Hypothesis : $H_{0}: \beta_1= \beta_2=0$ vs $H_{1}$ : the fitted model is better.
Test statistic: $G^2=-2 \log \frac{L_R}{L_F}=(-2 \log L_R)-(-2 \log L_F)=61.827 - 51.256=10.57$
Interpretation : As we have small P-value (0.0051), we have strong evidence that fitted model is better.
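For df = 2 the chi-squared upper-tail probability has the closed form exp(-x/2), so this P-value can be checked by hand:

```python
import math

# LRT statistic and P-value; for 2 df, P(chi2_2 > x) = exp(-x/2).
g2 = 61.827 - 51.256          # (-2 log L_R) - (-2 log L_F)
p = math.exp(-g2 / 2)         # df = 2 (testing beta_1 and beta_2)
print(round(g2, 3), round(p, 4))  # 10.571 0.0051
```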
 
Reference: https://onlinecourses.science.psu.edu/stat504/node/159 


Remark 1) SAS Code - DESCENDING Options
If we put descending option, we will get $\pi$ =P(Survived)

 
Remark 2) SAS Code - CLASS Options
For a categorical predictor defined by the class statement, the default effect coding is 1/-1. In order to use an indicator variable instead, such as $I_{Female}=1$ if female and 0 if male, we can add the param=ref option to the class statement.


[3] Model with age, sex and their interaction!
Q) Is the coefficient for the age-sex interaction statistically significantly different from 0?

A) We can use a Likelihood ratio test!

First, SAS code is following below. 
proc logistic descending;
  title 'Model with age sex and their interaction';
  class sex / param=ref;
  model status = age sex age*sex;
run;

Then, the SAS result is following below.

The likelihood ratio test with $H_0$: the coefficient of the interaction term ($\beta_3$) = 0.
The deviance for the model with the interaction is 47.346, whereas for the model without the interaction it is 51.256. So the test statistic is 51.256 - 47.346 = 3.91, which is compared to a chi-squared distribution with 1 degree of freedom (as we are testing only $\beta_3$).
From the chi-squared table, the P-value is between 0.025 and 0.05.

Therefore, there is moderate evidence that the coefficient of the interaction is not 0.
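The table lookup can be sharpened: for 1 df, the chi-squared upper-tail probability is erfc(sqrt(x/2)):

```python
import math

# Exact P-value for the interaction LRT: P(chi2_1 > x) = erfc(sqrt(x/2)).
g2 = 51.256 - 47.346                 # = 3.91
p = math.erfc(math.sqrt(g2 / 2))
print(round(p, 3))  # 0.048, between 0.025 and 0.05 as read from the table
```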

Q) What is the estimated odds ratio for a female that is 10 years older than another female?
A) Referring to the coefficient estimates, the estimated odds ratio is
$\exp\left \{ -0.0325\cdot10-0.1616\cdot 10 \right \}=0.14$  

The Binary Logistic Regression


Binary logistic regression model is an example of Generalized Linear Model.

[1] Assumptions
Observations are independent, and the sample size is large enough for valid inference (tests and confidence intervals), as the generalized linear model uses MLE (maximum likelihood estimation) to estimate the parameter coefficients. Also, the underlying probability model for the response is Bernoulli.

The correct form of model is linear relationship between logits and explanatory variables.

As this is an example of a generalized linear model, there are fewer assumptions to check than in linear regression. (Note that the general linear model uses the least squares method to estimate parameter coefficients.)
- We don't need to check for outliers since the response variable is either 0 or 1. 
- There are no residual plots, as the raw residuals are not meaningful here. 
- The variance is not constant. (See below.) 


[2] Model   
$Y_{i}|X_{i}$ = 1 if the response is in the category of interest, 0 otherwise. (Binary!)
$Y_{i}|X_{i}$ is distributed as Bernoulli ($\pi_{i}$), where $\pi_{i}$ is the probability of success.

Then $E[Y_{i}|X_{i}]=\pi_{i}$, and $Var[Y_{i}|X_{i}]=\pi_{i} \cdot (1-\pi_{i})$
Therefore, the variance is not constant, as the probability varies. 

Logistic Regression Model : $log(\frac{\pi}{1-\pi})= \beta_{0}+\beta_{1}X_{1}+...+\beta_{p}X_{p}$

Logistic Function : $\pi=\frac{\exp(\beta_{0}+\beta_{1}X_{1}+...+\beta_{p}X_{p})}{1+\exp(\beta_{0}+\beta_{1}X_{1}+...+\beta_{p}X_{p})}$,
                                where $\beta_{0}+\beta_{1}X_{1}+...+\beta_{p}X_{p}\in (-\infty, \infty )$, so that $\pi \in (0,1)$
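A tiny numeric check that the logistic function maps any linear predictor into (0, 1) and exactly inverts the logit:

```python
import math

def logistic(eta):
    """Map a linear predictor eta in (-inf, inf) to a probability in (0, 1)."""
    return math.exp(eta) / (1 + math.exp(eta))

def logit(pi):
    """Inverse of the logistic function: log-odds of pi."""
    return math.log(pi / (1 - pi))

for eta in (-10, -1, 0, 1, 10):
    pi = logistic(eta)
    assert 0 < pi < 1                      # always a valid probability
    assert abs(logit(pi) - eta) < 1e-9     # round-trip recovers eta
print(logistic(0))  # 0.5
```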


[3] Estimation Method of Parameter Coefficients - by MLE
Data : $Y_{i}$ = 1 if the response is in the category of interest, 0 otherwise.
Model : $P(Y_{i}=y_{i})=\pi_{i}^{y_{i}}\cdot (1-\pi_{i})^{1-y_{i}}$.

Joint density: $P(Y_{1}=y_{1},...,Y_{n}=y_{n})=\prod_{i=1}^{n}\pi_{i}^{y_{i}}\cdot (1-\pi_{i})^{1-y_{i}}$
           where  $\pi_{i}=\frac{\exp(\beta_{0}+\beta_{1}X_{1}+...+\beta_{p}X_{p})}{1+\exp(\beta_{0}+\beta_{1}X_{1}+...+\beta_{p}X_{p})}$ and $1-\pi_{i}= \frac{1}{1+\exp(\beta_{0}+\beta_{1}X_{1}+...+\beta_{p}X_{p})}$
Note that we assume n observations are independent!

So if we think of the joint density as a function of $\beta$, then 
 - Likelihood Function will be $L(\beta_{0},...,\beta_{p})= \prod_{i=1}^{n}\pi_{i}(\beta)^{y_{i}}(1-\pi_{i}(\beta))^{1-y_{i}}$
Finally, we maximize the log-likelihood: $(\hat{\beta_{0}},...,\hat{\beta_{p}})=\arg \max \left \{ \log L(\beta_{0},...,\beta_{p}) \right \}$
However, there is no closed-form solution, so we obtain one using the Newton-Raphson or Fisher scoring algorithm. Because the parameter coefficients are estimated this way, one of our assumptions must be a large enough sample!
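A minimal Newton-Raphson sketch for a one-predictor binary logistic model, on made-up data, showing how the algorithm solves the score equations:

```python
import math

# Newton-Raphson for logit(pi_i) = b0 + b1*x_i on hypothetical binary data.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0]

b0, b1 = 0.0, 0.0
for _ in range(25):
    pi = [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x]
    w = [pij * (1 - pij) for pij in pi]                    # Bernoulli variances
    # Score vector U = X'(y - pi)
    u0 = sum(yi - pij for yi, pij in zip(y, pi))
    u1 = sum(xi * (yi - pij) for xi, yi, pij in zip(x, y, pi))
    # Fisher information I = X' diag(w) X (2x2, inverted in closed form)
    i00, i01 = sum(w), sum(xi * wi for xi, wi in zip(x, w))
    i11 = sum(xi * xi * wi for xi, wi in zip(x, w))
    det = i00 * i11 - i01 * i01
    b0 += (i11 * u0 - i01 * u1) / det                      # beta += I^{-1} U
    b1 += (i00 * u1 - i01 * u0) / det

# At the MLE the score equations are satisfied: sum(y) = sum(pi_hat).
pi = [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x]
print(abs(sum(y) - sum(pi)) < 1e-8)  # True
```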


[4] Wald Procedure
How can we know whether an explanatory variable has an effect on the log-odds or not?
We can use Wald procedure to test whether $\beta's$ are zero or not!!

Hypothesis : $H_{0}:\beta_{j}=0$ which means $X_{j}$ has no effect on log-odds!!, $H_{1}:\beta_{j}\neq 0$
Test Statistics : $Z_{obs}=\frac{\hat{\beta_{j}}}{se(\hat{\beta_{j}})}$
                        where $\hat{\beta_{j}}$ is a maximum likelihood estimate.
Note that if the sample is large enough, MLEs are normally distributed, so that under $H_{0}$ our test statistic, $Z_{obs}$, is an observation from an approximate Normal(0,1) distribution!!

95% Confidence interval : $\hat{\beta_{j}}\pm 1.96 \cdot se(\hat{\beta_{j}})$


[5] Likelihood Ratio Test
In order to compare which model is appropriate, we can use likelihood ratio test.
Likelihood Ratio : $\frac{L_{R}}{L_{F}}$, where $L_{R}$ is reduced model, and $L_{F}$ is full model of same data.

Hypothesis :
$H_{0}: \beta_{1}=...=\beta_{k}=0$ The reduced model is appropriate; it fits the data as well as the full model.
$H_{1}:$ at least one of $\beta_{1},...,\beta_{k}$ is nonzero

Test Statistics : $G^2=-2 \log L_{R}-(-2\log L_{F})=-2 \log \frac{L_R}{L_F}$
Note that, under the null hypothesis, $G^2$ is an observation from a chi-squared distribution with k degrees of freedom for large n, where k is the number of parameters dropped in the reduced model.


Remark!! Both the Wald test and the likelihood ratio test assess whether a $\beta$ parameter is zero or not. But the two procedures are different, so their reference distributions are also different!! In case they do not agree, we use the likelihood ratio test since it is more reliable. 



The Generalized Linear Model


[1] What's the Generalized Linear Model?
"The generalized linear model generalizes linear regression by allowing the linear model to be related to the response variable via a link function"
 - Reference : https://en.wikipedia.org/wiki/Generalized_linear_model

In other words, we want a model in which E(Y) is a linear function in the parameters, so the $\beta$'s enter linearly: $g(E(Y))=\beta_{0}+\beta_{1}X_{1}+...+\beta_{p}X_{p}=\mathbb{X}\beta$
Here, g( ) is a link function which generalizes linear regression!

Note that in the general linear model, we can estimate the betas using least squares. However, in the generalized linear model, we use maximum likelihood (*MLE) or Bayesian methods to estimate the betas.

*Remark) Large-Sample Properties of the MLE 
If the sample size is large enough and the model is correct, 
MLEs are asymptotically unbiased, normally distributed, and have minimum variance. 
The MLE is computed by the Newton-Raphson algorithm or Fisher scoring.   


[2] What's the Link Function?

Please refer to the reference to find the link function information.
- Reference : http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/logistic-regression/link-function/

[3] Examples of The Generalized Linear Model

Binary Logistic Regression
 - Logistic Regression Model : $log( \frac{\pi}{1-\pi})= \beta_{0}+ \beta_{1}X_{1}+...+ \beta_{p}X_{p}$
 - Logistic Function : $\pi = \frac{\exp (\beta_{0}+\beta_{1}X_{1}+...+\beta_{p}X_{p})}{1+\exp (\beta_{0}+\beta_{1}X_{1}+...+\beta_{p}X_{p})}$
 where $\pi$ is a probability of success.

- in SAS code: proc logistic

 Poisson Regression

 Gamma Regression