Stat n Math : Case Study: Binomial Logistic Regression

Case Study : Krunnit Islands off Finland.

In a study of the Krunnit Islands archipelago, researchers presented results of extensive bird surveys taken over four decades. They visited each island several times, cataloguing species. If a species was found on a specific island in 1949, it was considered to be at risk of extinction for the next survey of the island in 1959. If it was not found in 1959, it was counted as an “extinction”, even though it might reappear later.

The main question is whether large or small preserves protect species better.

Reference : Ramsey, F.L. and Schafer, D.W. (2013). The Statistical Sleuth: A Course in Methods of Data Analysis (3rd ed), Cengage Learning. (Data from Vaisanen and Jarvinen, “Dynamics of Protected Bird Communities in a Finnish Archipelago,” Journal of Animal Ecology 46 (1977): 891-908.)

[1] Data and Model
The data are counts of bird species for 18 Krunnit Islands off Finland. The data contain 1) AREA (area of the island, $x_i$ ), 2) ATRISK (number of species on each island in 1949, $m_i$ ), and 3) EXTINCT (number of species no longer found on each island in 1959, $y_i$ ).

$\pi_i$ is the probability that a species on island $i$ goes extinct, assuming species survive independently of one another.
Then, $Y_i\sim Binomial (m_i,\pi_i)$

Observed response proportion: $\hat{\pi}_{S,i}=\frac{y_i}{m_i}$ = EXTINCT / ATRISK
Observed logit: $\log(\frac{\hat{\pi}_{S,i}}{1-\hat{\pi}_{S,i}})$, where S means "Saturated" (each island has its own proportion)
Model: $\log(\frac{\pi_i}{1-\pi_i})= \beta_0+\beta_1 \cdot AREA_i$
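The observed proportion and observed logit defined above can be computed directly. The following is a minimal Python sketch; the function name `empirical_logit` and the counts passed to it are hypothetical illustrations, not values from the Krunnit data.

```python
import math

def empirical_logit(y, m):
    """Return (observed proportion, observed logit) for y extinctions among m species at risk."""
    # The logit is undefined when the proportion is exactly 0 or 1.
    if not 0 < y < m:
        raise ValueError("observed logit is undefined when y = 0 or y = m")
    p_hat = y / m
    return p_hat, math.log(p_hat / (1 - p_hat))

# Hypothetical counts for illustration only (not taken from the data set).
p_hat, logit_hat = empirical_logit(y=5, m=50)
print(p_hat, logit_hat)
```

An island with no extinctions (y = 0) has no finite observed logit, which is worth keeping in mind when reading the scatter plots of observed logits.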

[2] Initial Assessment
If we plot the observed logits versus Area, the relationship does not look linear. The assumption of a linear relationship between the logit and the explanatory variable is therefore violated.

proc sgplot;
  scatter y=logitpi x=area;
run;

Therefore, it is better to use log(Area) instead of Area itself.
Our model: $\log(\frac{\pi_i}{1-\pi_i})= \beta_0+\beta_1 \cdot \log(AREA_i)$

proc sgplot;
  scatter y=logitpi x=logarea;
run;

[3] SAS Result

proc logistic plot(only)=effect;
  model nextinct/nspecies=logarea / scale=none;
run;

* The scale=none option displays the Pearson and Deviance goodness-of-fit (GOF) tests.

The number of observations is 18, and the sum of frequencies is 632 (=$\sum_{i=1}^{18}m_i$ )
Fitted model: $logit(\hat{\pi})= - 1.196-0.297 \log(AREA)$

Testing Global Null Hypothesis: BETA=0 (Likelihood Ratio Test) (Is log(Area) significant?)
- The likelihood ratio statistic is 33.2765, the drop in -2 log L from 578.013 (intercept only) to 544.736 (intercept and log(AREA)). It has 1 degree of freedom because we are testing only $\beta_1$.
- In conclusion, the P-value is small (< 0.0001), so we have strong evidence that the coefficient of log(AREA) is not zero, which means log(AREA) is useful for predicting the logit.
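The likelihood ratio arithmetic above can be checked with a short Python sketch. It assumes that 578.013 and 544.736 are the -2 log L values for the intercept-only model and the model with log(AREA), respectively. The variable names are illustrative. For a chi-square distribution with 1 degree of freedom, the upper tail probability equals `erfc(sqrt(x/2))`, so no statistics package is needed.

```python
import math

# -2 log L values quoted in the text (assumed: intercept-only vs. fitted model).
neg2logl_null = 578.013
neg2logl_fitted = 544.736

# Likelihood ratio statistic: drop in -2 log L when log(AREA) is added.
lr_stat = neg2logl_null - neg2logl_fitted

# Chi-square upper tail with 1 df: P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(lr_stat / 2))

print(round(lr_stat, 3), p_value)
```

The resulting P-value is far below 0.0001, consistent with the SAS output.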

[4] Interpretation of $\beta_1$
Our model is $\log(\frac{\pi_i}{1-\pi_i})= \beta_0+\beta_1 \cdot \log(AREA_i)$, so with $x$ = AREA the odds are
$\frac{\pi}{1-\pi}= e^{\beta_0} \cdot e^{\beta_1 \log(x)}=e^{\beta_0}x^{\beta_1}$

Therefore, changing x by a factor of h changes the odds by a factor of $h^{\beta_1}$. For example, if we double the island area, the odds change by a factor of $2^{-0.2971}$ = 0.81, which means the odds of extinction on the larger island are 81% of the odds of extinction on an island half its size.
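The multiplicative effect on the odds can be reproduced in a few lines of Python. This is a minimal sketch that uses the estimated slope of -0.2971 from the fitted model. The helper name `odds_multiplier` is hypothetical.

```python
beta1 = -0.2971  # estimated coefficient of log(AREA)

def odds_multiplier(h, beta1):
    """Factor by which the odds change when area is multiplied by h."""
    return h ** beta1

# Doubling the area multiplies the odds of extinction by about 0.81.
print(round(odds_multiplier(2, beta1), 2))

# Halving the area has the reciprocal effect.
print(odds_multiplier(0.5, beta1))
```

Because the effect is multiplicative, halving the area gives exactly the reciprocal of the doubling factor.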

[5] Estimate the Probability of Extinction
In Binomial Logistic Regression, we can estimate the probability of extinction for a species on a specific island from the fitted model (M): $logit(\hat{\pi}_{M,i})= - 1.196-0.297 \log(AREA_i)$

For example, the area of Ulkokrunni island (i=1) is 185.5 $km^2$
Then, $logit(\hat{\pi}_{M,1})= -1.196 - 0.297 \cdot \log(185.5)=-2.747$ (log is the natural logarithm, ln)
The estimated odds are $e^{-1.196} \cdot AREA^{-0.297}=e^{-2.747}=0.064$, so the estimated probability is $\hat{\pi}_{M,1}= \frac{e^{-2.747}}{1+e^{-2.747}}=0.06$
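As a check on this calculation, the following Python sketch converts the fitted logit into a probability with the inverse logit, $\pi = e^{\eta}/(1+e^{\eta})$. The coefficients and the area are the values quoted above. The function name `extinction_probability` is hypothetical.

```python
import math

b0, b1 = -1.196, -0.297  # fitted intercept and slope for log(AREA)

def extinction_probability(area, b0=b0, b1=b1):
    """Return (fitted logit, estimated extinction probability) for a given area."""
    eta = b0 + b1 * math.log(area)   # natural logarithm
    odds = math.exp(eta)             # equals exp(b0) * area ** b1
    return eta, odds / (1 + odds)    # inverse logit

eta, pi_hat = extinction_probability(185.5)
print(round(eta, 3), round(pi_hat, 2))
```

Note that $e^{\beta_0} \cdot AREA^{\beta_1}$ on its own gives the odds (about 0.064 here). Dividing by one plus the odds gives the probability, which rounds to 0.06.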

[6] Deviance Goodness of Fit Test (Drop-in-Deviance Test)
To compare the fitted model (reduced, R) with the saturated model (full, F), we can use the Drop-in-Deviance Test.

$H_0$: The fitted model fits the data as well as the saturated model.
$H_1$: The saturated model $logit(\pi)= \beta_0+ \beta_1 I_1+...+ \beta_{n-1}I_{n-1}$ fits better.
Test Statistic: Deviance = $-2 \log \frac{L_R}{L_F}=12.06$ with n-(p+1) = 18 - (1 + 1) = 16 d/f, where p is the number of slope coefficients in the fitted model
Conclusion: Since the P-value is large (0.7397), there is no evidence that the fitted model fits worse than the saturated model, so the fitted model is adequate.
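The degrees of freedom and the P-value of the deviance test can be verified with the Python sketch below. It relies on the closed form of the chi-square upper tail for an even number of degrees of freedom, $P(X > x) = e^{-x/2}\sum_{k=0}^{df/2-1}\frac{(x/2)^k}{k!}$, which avoids any external library. The helper name `chi2_sf_even_df` is hypothetical.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper tail probability of a chi-square variable with an even number of df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2
    term, total = 1.0, 1.0           # k = 0 term
    for k in range(1, df // 2):      # k = 1, ..., df/2 - 1
        term *= half / k
        total += term
    return math.exp(-half) * total

n_islands = 18          # number of binomial observations
p = 1                   # slope coefficients in the fitted model: log(AREA)
df = n_islands - (p + 1)

deviance = 12.06        # deviance reported in the text
p_value = chi2_sf_even_df(deviance, df)
print(df, round(p_value, 2))
```

With the rounded deviance of 12.06 the P-value comes out at about 0.74, in line with the 0.7397 reported by SAS.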
