Statistical Tests for Categorical Data (part I)

Preface

Open Rstudio to do the practicals. Note that tasks with * are optional.

R Packages

In this practical, a number of R packages are used. The packages used (with versions that were used to generate the solutions) are:

memisc (version: 0.99.27.3)
MASS (version: 7.3.54)

R version 4.1.0 (2021-05-18)

One sample tests

Task 1

Explore the pre-loaded data set iris.

What is the proportion of each Species group?
We want to investigate whether the proportion of the group setosa is different from 0.5. Define the null and alternative hypothesis.
Perform the test and save the results in an object called test_res_cat. What is the conclusion?
Calculate the test statistic manually and confirm the result with the R function.
Investigate whether the proportion of the group setosa is smaller than 0.5.

Use the R function percent(…) from the memisc package to obtain the percentages.

Rule of thumb: \(150*0.3333\) and \(150*(1-0.3333)\) are large so we can use the normal approximation.

Use the function prop.test() to investigate whether the sample proportion is different from a particular value.

The prop.test() function presents the X-squared test statistic. To calculate the test statistic manually, use the connection between the X-squared and normal distribution.

Check the argument alternative of the prop.test(…) function.

Solution 1

library(memisc)

## Loading required package: lattice

## Loading required package: MASS

## 
## Attaching package: 'memisc'

## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts

## The following object is masked from 'package:base':
## 
##     as.array

percent(iris$Species)

##     setosa versicolor  virginica          N 
##   33.33333   33.33333   33.33333  150.00000

# obtain the number of subjects in each group
table(iris$Species)

## 
##     setosa versicolor  virginica 
##         50         50         50

The proportion of each species is (50/150=) 0.33.

Null hypothesis: the probability of the group setosa is equal to 0.5.
Alternative hypothesis: the probability of the group setosa is different from 0.5.

Rule of thumb: \(150∗0.3333\) and \(150∗(1−0.3333)\) are large so we can use the normal approximation.

# Without continuity correction
test_res_cat <- prop.test(x = 50, n = 150, p = 0.5, alternative = "two.sided", correct = FALSE)
test_res_cat

## 
##  1-sample proportions test without continuity correction
## 
## data:  50 out of 150, null probability 0.5
## X-squared = 16.667, df = 1, p-value = 4.456e-05
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.2628876 0.4121024
## sample estimates:
##         p 
## 0.3333333

# With continuity correction
test_res_cat <- prop.test(x = 50, n = 150, p = 0.5, alternative = "two.sided", correct = TRUE)
test_res_cat

## 
##  1-sample proportions test with continuity correction
## 
## data:  50 out of 150, null probability 0.5
## X-squared = 16.007, df = 1, p-value = 6.312e-05
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.2598209 0.4155318
## sample estimates:
##         p 
## 0.3333333

If we assume that the significant level is equal to 0.05, the output indicates that we can reject the null hypothesis that the proportion sample of the setona group is equal to 0.5.

The test statistic (without continuity correction) is: \(z=\frac{p-\pi_0}{\sqrt{\frac{\pi_0(1-\pi_0)}{n}}}\)

The test statistic (with continuity correction) is: \(z=\frac{p-\pi_0 + c}{\sqrt{\frac{\pi_0(1-\pi_0)}{n}}}\),
where

\(c = -\frac{1}{2n}\) if \(p > \pi_0\)
\(c = \frac{1}{2n}\) if \(p < \pi_0\)
\(c = 0\) if \(\mid p - \pi_0 \mid < \frac{1}{2n}\)

and

\(p = 0.33\)
\(\pi_0 = 0.5\)
\(n = 150\)

Connection between normal and X-squared distribution: If \(X_1, X_2, \dots, X_k\) are independent random variables from the standard normal distribution, then the random variable \(Q = X^2_1+X^2_2+ \dots +X^2_k\) follows the \(\chi^2\) distribution with \(k\) degrees of freedom. The prop.test() function presents the X-squared test statistic. Therefore, we take \(z^2\).

# Without continuity correction
# Calculate the test statistic manually
z <- (0.33333 - 0.5) / sqrt((0.5*0.5)/(150))
z^2

## [1] 16.66733

# With continuity correction
# Calculate the test statistic manually
z <- (0.33333 - 0.5 + 1/300) / sqrt((0.5*0.5)/(150))
z^2

## [1] 16.00732

prop.test(x = 50, n = 150, p = 0.5, alternative = "less")

## 
##  1-sample proportions test with continuity correction
## 
## data:  50 out of 150, null probability 0.5
## X-squared = 16.007, df = 1, p-value = 3.156e-05
## alternative hypothesis: true p is less than 0.5
## 95 percent confidence interval:
##  0.0000000 0.4025292
## sample estimates:
##         p 
## 0.3333333

Task 2

Explore the pre-loaded data set iris.

Investigate whether the proportion of the group setosa is different from 0.5 using the exact test.
Investigate whether the proportion of the group setosa is smaller than 0.5 using the exact test. Obtain the p-value manually and confirm the result with the function.

Use the function biom.test(…) to investigate whether the sample proportion is different from a particular value using the exact test.

The binomial distribution model deals with finding the probability of success of an event which has only two possible outcomes in a series of experiments. For any possible outcome of the binomial we obtain the corresponding probability. We can calculate the p-value (without using the biom.test() R function) by considering the probability of seeing an outcome as, or more, extreme.

We can calculate the p-value (without using the biom.test() R function) using the pbinom() or sum(dbinom()) R functions.

Solution 2

binom.test(x = 50, n = 150, p = 0.5, alternative = "two.sided")

## 
##  Exact binomial test
## 
## data:  50 and 150
## number of successes = 50, number of trials = 150, p-value = 5.448e-05
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.2585564 0.4148430
## sample estimates:
## probability of success 
##              0.3333333

For a one-tailed test, \(H_1:\pi<\pi_0\) \(p-value = \Pr(X = 0) + \dots + Pr(X = k) = \sum_{i=0}^{k}Pr(X = i) = \sum_{i=0}^{k} ({n \atop i}) p^i (1-p)^{n-i}\),

where \(k=50\), \(n=150\) and \(p=0.5\):

pbinom(q = 50, size = 150, prob = 0.5)

## [1] 2.723767e-05

# or

sum(dbinom(x = 1:50, size = 150, prob = 0.5))

## [1] 2.723767e-05

test_res3 <- binom.test(x = 50, n = 150, p = 0.5, alternative = "less")
test_res3$p.value

## [1] 2.723767e-05

Two sample tests

Task 1

Let’s assume that we want to investigate whether there is a difference in the probability of lung cancer between males and females. We have:

number of females: 100
number of males: 70
number of females with lung cancer: 50
number of males with lung cancer: 55

Define the null and alternative hypothesis.
Perform the test. What is the conclusion?
Calculate the test statistic manually.

Rule of thumb: \(100*50/100\), \(100*(1-50/100)\), \(70*55/70\) and \(70*(1-55/70)\) are large so we can use the normal approximation.

Use the R function prop.test() to investigate whether the proportions of the two groups are different.

To calculate the test statistic manually, use the connection between the X-squared and normal distribution.

Solution 1

Null hypothesis: the probability of lung cancer in males is equal to females.
Alternative hypothesis: the probability of lung cancer in males is different from females.

Rule of thumb: \(100*50/100\), \(100*(1-50/100)\), \(70*55/70\) and \(70*(1-55/70)\) are large so we can use the normal approximation.

prop.test(x = c(50, 55), n = c(100, 70), alternative = "two.sided")

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(50, 55) out of c(100, 70)
## X-squared = 13.049, df = 1, p-value = 0.0003034
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.4351281 -0.1363005
## sample estimates:
##    prop 1    prop 2 
## 0.5000000 0.7857143

If we assume that the significant level is equal to 0.05, the output indicates that we can reject the null hypothesis that the sample probabilities are the same.

Test statistic (with continuity correction) of pooled version:

\(z = \frac{(p_1-p_2) + \frac{F}{2}\big(\frac{1}{n_1} + \frac{1}{n_2}\big)}{\sqrt{ p(1-p) \big(\frac{1}{n_1} + \frac{1}{n_2} \big)}}\),
where

\(F = -1\) if \(p_1 > p_2\)
\(F = 1\) if \(p_1 < p_2\)
\(p = \frac{n_1p_1 + n_2p_2}{n_1 + n_2}\)

p1 <- 50/100
p2 <- 55/70
n1 <- 100
n2 <- 70
p <- (n1*p1 + n2*p2)/(n1 + n2)

z = ((p1 - p2) + (1/2)*(1/n1 + 1/n2)) / (sqrt( p*(1-p)*(1/n1 + 1/n2)  )   ) 
z^2

## [1] 13.04926

Task 2

Explore the pre-loaded data set survey (from the package MASS).

We want to investigate whether the variables Sex and Clap are independent. Define the null and alternative hypothesis.
Perform the test. What is the conclusion?
Calculate the test statistic manually.

Use the R function chisq.test() to investigate whether two categorical variables are independent.

Solution 2

Null hypothesis: there is no association between the variables sex and clap.
Alternative hypothesis: there is an association between the variables sex and clap.

chisq.test(table(survey$Sex, survey$Clap))

## 
##  Pearson's Chi-squared test
## 
## data:  table(survey$Sex, survey$Clap)
## X-squared = 0.25373, df = 2, p-value = 0.8809

If we assume that the significant level is equal to 0.05, the output indicates that we cannot reject the null hypothesis that the variables sex and clap are independent.

The test statistic (without continuity correction) is:

\(X2 = \sum_{i=1}^K \frac{(O_i-E_i)^2}{E_i}\),
where \(K\) are the contingency table cells, \(O\) is the observed value and \(E\) the expected value.

# Calculate the test statistic manually
# Observe the total number of observations for the rows and columns:
addmargins(table(survey$Sex, survey$Clap))

##         
##          Left Neither Right Sum
##   Female   21      24    73 118
##   Male     18      25    74 117
##   Sum      39      49   147 235

# Obtain the observed and expected data:
Obs <- c(table(survey$Sex, survey$Clap))
Exp <- c(118*39/235, 117*39/235, 118*49/235, 117*49/235, 118*147/235, 117*147/235)

test_statistic <- sum((Obs - Exp)^2/Exp)
test_statistic

## [1] 0.2537294

Task 3

Explore the pre-loaded data set survey (from the package MASS).

Investigate whether the variables Sex and M.I are independent using the exact test.

Use the R function fisher.test() to investigate whether two categorical variables are independent using the exact test.

Solution 3

fisher.test(table(survey$Sex, survey$M.I))

## 
##  Fisher's Exact Test for Count Data
## 
## data:  table(survey$Sex, survey$M.I)
## p-value = 0.7678
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.4770082 1.6534163
## sample estimates:
## odds ratio 
##  0.8893934

Task 4*

Explore the pre-loaded data set survey (from the package MASS).

Investigate whether the variables Sex and M.I are independent for the Clap equal to Left group.

We have small sample size, so it might be a better idea to use the exact test.

Use the R function fisher.test() to investigate whether two categorical variables are independent using the exact test.

Solution 4*

# Select only the left group from the Clap variable
survey_leftClap <- survey[survey$Clap == "Left", ]

fisher.test(table(survey_leftClap$Sex, survey_leftClap$M.I))

## 
##  Fisher's Exact Test for Count Data
## 
## data:  table(survey_leftClap$Sex, survey_leftClap$M.I)
## p-value = 0.1527
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.04958942 1.48553847
## sample estimates:
## odds ratio 
##  0.2899661