Preface

Open RStudio to do the practicals. Tasks marked with * are optional.

R version 4.3.1 (2023-06-16 ucrt)

One sample tests

Task 1

Explore the pre-loaded data set ToothGrowth.

  • Use plots to investigate the tooth length.
  • We would like to test whether the mean tooth length is different from 20. Define the null and alternative hypothesis.
  • Perform the test and save the result in an object called test_res. What is the conclusion?
  • Extract the test statistic.
  • Calculate the test statistic manually and confirm the results with the R function.
  • Calculate the confidence interval manually and confirm the results with the R function.
  • Calculate the p-value manually and confirm the results with the R function.
  • Provide the definition of the p-value.
  • Obtain the same results using a linear regression model.

Use the R functions hist(…) / boxplot(…) to visually explore the distribution of a numeric vector.

Since the data look normally distributed and we have 60 subjects, we can use the t-test. Use the R function t.test() to investigate whether the sample mean is different from a particular value.

To extract a particular result from an object use the $ symbol.
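For example (a small sketch using the same test as in the solution), names() lists the components that can be extracted with the $ symbol:

```r
# names() shows the components of a test object that can be
# extracted with $:
res <- t.test(x = ToothGrowth$len, mu = 20)
names(res)   # includes "statistic", "p.value", "conf.int", ...
res$p.value
```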

The one-sample t-test is a special case of the regression model. Keep in mind that the p-value of a linear regression model with only the intercept investigates whether the coefficient is equal to 0. Here we want to investigate whether the mean is 20.

Solution 1

head(ToothGrowth)
hist(ToothGrowth$len)

boxplot(ToothGrowth$len)

Null hypothesis: The mean tooth length is equal to 20.
Alternative hypothesis: The mean tooth length is different from 20.

test_res <- t.test(x = ToothGrowth$len, mu = 20)
test_res
## 
##  One Sample t-test
## 
## data:  ToothGrowth$len
## t = -1.2017, df = 59, p-value = 0.2343
## alternative hypothesis: true mean is not equal to 20
## 95 percent confidence interval:
##  16.83731 20.78936
## sample estimates:
## mean of x 
##  18.81333

If we assume that the significance level is equal to 0.05, the output indicates that we cannot reject the null hypothesis that the mean tooth length is 20.

test_res$statistic
##         t 
## -1.201661

The test statistic of a one-sample t-test is \(\frac{\bar{x} - \mu_0}{sd(x)/\sqrt{n}}\)

# Calculate the test statistic manually
test_statistic <- (mean(ToothGrowth$len) - 20)/(sd(ToothGrowth$len)/sqrt(length(ToothGrowth$len)))
test_statistic
## [1] -1.201661

The confidence interval is:

[\(\bar{x} - critical \ value_{\alpha/2} \frac{sd(x)}{\sqrt{n}}\), \(\bar{x} + critical \ value_{\alpha/2} \frac{sd(x)}{\sqrt{n}}\)]

The sampling distribution of the mean is assumed to follow the Student’s t-distribution. We will use this distribution to obtain the critical values and the p-value.

# Calculate the confidence interval manually
# lower limit:
mean(ToothGrowth$len) + qt(0.025, df = length(ToothGrowth$len) - 1) * (sd(ToothGrowth$len)/sqrt(length(ToothGrowth$len)))
## [1] 16.83731
# or 
mean(ToothGrowth$len) - qt(0.025, df = length(ToothGrowth$len) - 1, lower.tail = FALSE) * (sd(ToothGrowth$len)/sqrt(length(ToothGrowth$len)))
## [1] 16.83731
# upper limit:
mean(ToothGrowth$len) + qt(0.025, df = length(ToothGrowth$len) - 1, lower.tail = FALSE) * (sd(ToothGrowth$len)/sqrt(length(ToothGrowth$len)))
## [1] 20.78936
# Calculate the p-value manually
p_value = 2 * pt(test_statistic, df = length(ToothGrowth$len) - 1)
p_value
## [1] 0.2342964

Definition of p-value: If the null hypothesis is true, then there is a 23% chance of obtaining a difference between the sample mean and 20 at least as extreme as the one observed.
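This definition can be illustrated with a small simulation (a sketch, not part of the original task): generate many samples under the null hypothesis and count how often the test statistic is at least as extreme as the observed one.

```r
# Simulate samples under H0 (mean 20) and compute the t statistic each
# time; the proportion at least as extreme as the observed statistic
# approximates the two-sided p-value.
set.seed(1)
n <- length(ToothGrowth$len)
t_obs <- (mean(ToothGrowth$len) - 20)/(sd(ToothGrowth$len)/sqrt(n))
t_sim <- replicate(10000, {
  x <- rnorm(n, mean = 20, sd = sd(ToothGrowth$len))
  (mean(x) - 20)/(sd(x)/sqrt(n))
})
mean(abs(t_sim) >= abs(t_obs))  # close to the p-value 0.234
```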

The one-sample t-test is a special case of the regression model. Keep in mind that the p-value of a linear regression model with only the intercept investigates whether the coefficient is equal to 0. Here we want to investigate whether the mean is equal to 20. Therefore, we assume our new outcome to be \(y-20\).

# Subtract the value 20:
y <-  ToothGrowth$len - 20

# Fit the linear regression model with only intercept:
summary(lm(y ~ 1))
## 
## Call:
## lm(formula = y ~ 1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6133  -5.7383   0.4367   6.4617  15.0867 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -1.1867     0.9875  -1.202    0.234
## 
## Residual standard error: 7.649 on 59 degrees of freedom

The p-value is 0.234, the df is 59 and the t-value is -1.202. These results were also observed in the t-test:

t.test(x = ToothGrowth$len, mu = 20)
## 
##  One Sample t-test
## 
## data:  ToothGrowth$len
## t = -1.2017, df = 59, p-value = 0.2343
## alternative hypothesis: true mean is not equal to 20
## 95 percent confidence interval:
##  16.83731 20.78936
## sample estimates:
## mean of x 
##  18.81333

The coefficient in the linear regression model can be obtained as the sample mean minus 20 (18.81333 - 20 = -1.18667).

Task 2

Explore the pre-loaded data set ToothGrowth.

  • We want to investigate whether the median tooth length of the supplement type VC is different from 20. Define the null and alternative hypothesis.
  • Perform the test and save the results in an object called test_res_b. What is the conclusion?
  • Calculate the test statistic manually.

Use the R function wilcox.test() to investigate whether the sample median is different from a particular value. It is recommended to use the exact p-value calculation when there are no ties, and to use the continuity correction when an exact calculation is not possible.
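Whether ties are present can be checked beforehand, for example (a sketch):

```r
# Ties in the absolute differences from 20 make the exact p-value
# unavailable; check for them before choosing exact = TRUE:
x <- ToothGrowth$len[ToothGrowth$supp == "VC"]
any(duplicated(abs(x - 20)))  # TRUE: ties are present
```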

Solution 2

Null hypothesis: The median tooth length of the supplement type VC is equal to 20.
Alternative hypothesis: The median tooth length of the supplement type VC is different from 20.

test_res_b <- wilcox.test(x = ToothGrowth$len[ToothGrowth$supp == "VC"], 
            mu = 20, exact = FALSE)

test_res_b
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  ToothGrowth$len[ToothGrowth$supp == "VC"]
## V = 140, p-value = 0.05842
## alternative hypothesis: true location is not equal to 20
# Because there are ties in the data, the exact calculation of the p-value could not be used and the normal approximation was used instead. The continuity correction was applied (by default).

If we assume that the significance level is equal to 0.05, the output indicates that we cannot reject the null hypothesis that the median tooth length of the VC group is 20. Keep in mind that the p-value is very close to the significance level.


The test statistic is the sum of the ranks of the absolute differences, taken over the observations where the difference is positive (or, alternatively, negative):

# Calculate the test statistic manually
x <- ToothGrowth$len[ToothGrowth$supp == "VC"]
y <- 20

# Calculate the ranks of the absolute differences:
res <- rank(abs(x - y))

# Calculate the sum of the ranks where the difference was positive:
sum(res[(x - y) > 0])
## [1] 140
# Calculate the sum of the ranks where the difference was negative:
sum(res[(x - y) < 0])
## [1] 325
# which is equal to:
test_res_b$statistic
##   V 
## 140
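As a sanity check (an aside, assuming no differences are exactly zero), the positive and negative rank sums must add up to \(n(n+1)/2\):

```r
# The two rank sums partition the total sum of the ranks 1..n,
# so they must add up to n*(n+1)/2 (n = number of non-zero differences).
x <- ToothGrowth$len[ToothGrowth$supp == "VC"]
res <- rank(abs(x - 20))
n <- sum(x != 20)
sum(res[(x - 20) > 0]) + sum(res[(x - 20) < 0])  # 140 + 325 = 465
n * (n + 1)/2                                    # 465
```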

Task 3*

Explore the pre-loaded data set ToothGrowth.

  • We want to investigate whether the median tooth length of dose equal to 2 is larger than 20. Define the null and alternative hypothesis. What is the conclusion?
  • Calculate the test statistic manually and compare it with the test statistic from the R function.

This is a one-tailed test.

Use the R function wilcox.test() to investigate whether the sample median is different from a particular value. It is recommended to use the exact p-value calculation when there are no ties, and to use the continuity correction when an exact calculation is not possible.

In the R function wilcox.test() check the argument alternative.

Solution 3*

Null hypothesis: The median tooth length of dose equal to 2 is equal to 20.
Alternative hypothesis: The median tooth length of dose equal to 2 is larger than 20.

wilcox.test(x = ToothGrowth$len[ToothGrowth$dose == "2"], 
            mu = 20, exact = FALSE, alternative = c("greater"))
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  ToothGrowth$len[ToothGrowth$dose == "2"]
## V = 208.5, p-value = 5.972e-05
## alternative hypothesis: true location is greater than 20

If we assume that the significance level is equal to 0.05, the output indicates that we can reject the null hypothesis that the median tooth length of dose equal to 2 is 20 (or less).

# Calculate the test statistic manually
x <- ToothGrowth$len[ToothGrowth$dose == "2"]
mu <- 20

# Calculate the rank of the absolute difference:
res <- rank(abs(x - mu))

# Calculate the sum of the ranks where the difference was positive:
sum(res[(x - mu) > 0])
## [1] 208.5
# Calculate the sum of the ranks where the difference was negative:
sum(res[(x - mu) < 0])
## [1] 1.5

Two sample tests

Task 1

Explore the pre-loaded data set iris.

  • Use plots to investigate the Sepal.Length per species.
  • We would like to investigate whether the mean sepal length is different between setosa and versicolor. Define the null and alternative hypothesis.
  • Perform the test and save the result in an object called test_res2. What is the conclusion?
  • Extract the test statistic.
  • Calculate the test statistic and the p-value manually.
  • Calculate the confidence interval manually and confirm the results with the R function.

Use the R functions boxplot(…) / hist(…) to visually explore the distribution of a numeric vector per group.

Since the data look normally distributed and we have 50 subjects per group, we can use the t-test. Use the R function t.test() to investigate whether the mean values of two sample groups are different.

Note that the samples are independent.

To extract a particular result from an object use the $ symbol.

Solution 1

head(iris)
boxplot(iris$Sepal.Length ~ iris$Species)

par(mfrow = c(1, 3))
hist(iris$Sepal.Length[iris$Species == "setosa"], 
     main = "setosa", xlab = "Length")
hist(iris$Sepal.Length[iris$Species == "versicolor"], 
     main = "versicolor", xlab = "Length")
hist(iris$Sepal.Length[iris$Species == "virginica"], 
     main = "virginica", xlab = "Length")

Null hypothesis: The mean sepal length is equal for setosa and versicolor.
Alternative hypothesis: The mean sepal length is different for setosa and versicolor.

test_res2 <- t.test(x = iris$Sepal.Length[iris$Species == "setosa"], 
                    y = iris$Sepal.Length[iris$Species == "versicolor"])

test_res2
## 
##  Welch Two Sample t-test
## 
## data:  iris$Sepal.Length[iris$Species == "setosa"] and iris$Sepal.Length[iris$Species == "versicolor"]
## t = -10.521, df = 86.538, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.1057074 -0.7542926
## sample estimates:
## mean of x mean of y 
##     5.006     5.936

If we assume that the significance level is equal to 0.05, the output indicates that we can reject the null hypothesis that the mean sepal lengths of setosa and versicolor are the same.

Let’s calculate everything also manually.

We first test the homogeneity of the variances:

\(F\) statistic: \(\frac{highest \ variance}{lowest \ variance}\).
Moreover, the degrees of freedom are: \(n_1 - 1\) and \(n_2 - 1.\)

# Calculate the test statistic manually
x <- iris$Sepal.Length[iris$Species == "setosa"]
y <- iris$Sepal.Length[iris$Species == "versicolor"]

# Test whether the variances are equal
# Test statistic:
F_statistic = var(y)/var(x)
# P-value:
pf(F_statistic, df1 = length(x) - 1, df2 = length(y) - 1, lower.tail = FALSE)
## [1] 0.004328594
# We reject the hypothesis that the variances are equal.
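The manual F test can be cross-checked with the built-in var.test(); note that, by default, it reports a two-sided p-value, roughly twice the one-sided value computed above.

```r
# Built-in F test for equality of variances (two-sided by default):
x <- iris$Sepal.Length[iris$Species == "setosa"]
y <- iris$Sepal.Length[iris$Species == "versicolor"]
var.test(y, x)
```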

The test statistic is: \(t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{sd^2(x_1)}{n_1} + \frac{sd^2(x_2)}{n_2}}}\),
where

  • Sample mean of group 1, 2: \(\bar{x}_1\), \(\bar{x}_2\)
  • Standard deviation of group 1, 2: \(sd(x_1)\), \(sd(x_2)\)
  • Number of subjects in group 1, 2: \(n_1\), \(n_2\)

\(\mu_1\) and \(\mu_2\) are 0 since we want to investigate whether the difference is 0.

The degrees of freedom are: \(df = \frac{\bigg[\frac{sd^2(x_1)}{n_1} + \frac{sd^2(x_2)}{n_2} \bigg]^2}{ \frac{[sd^2(x_1)/n_1]^2}{n_1 - 1} + \frac{[sd^2(x_2)/n_2]^2}{n_2 - 1} }\)

# Test statistic:
Test_statistic <- (mean(x) - mean(y))/sqrt(var(x)/length(x) + var(y)/length(y))
Test_statistic
## [1] -10.52099
# Degrees of freedom:
degrees_fr <- (var(x)/length(x) + var(y)/length(y))^2/( (var(x)/length(x))^2/(length(x) - 1) + (var(y)/length(y))^2/(length(y) - 1) )
# P-value:
2 * pt(Test_statistic, df = degrees_fr, lower.tail = TRUE)
## [1] 3.746743e-17
test_res2$statistic
##         t 
## -10.52099
test_res2$p.value
## [1] 3.746743e-17

The confidence interval is:

\(\bigg[(\bar{x}_1-\bar{x}_2) - critical \ value_{\alpha/2} \sqrt{\frac{sd^2(x_1)}{n_1} + \frac{sd^2(x_2)}{n_2}}\), \((\bar{x}_1-\bar{x}_2) + critical \ value_{\alpha/2} \sqrt{\frac{sd^2(x_1)}{n_1} + \frac{sd^2(x_2)}{n_2}}\bigg]\)

The sampling distribution of the mean is assumed to follow the Student’s t-distribution. We will use this distribution to obtain the critical values.

# Calculate the confidence interval manually
# lower limit:
(mean(x) - mean(y)) + qt(0.025, df = degrees_fr) * sqrt(var(x)/length(x) + var(y)/length(y))
## [1] -1.105707
# or
(mean(x) - mean(y)) - qt(0.025, df = degrees_fr, lower.tail = FALSE) * sqrt(var(x)/length(x) + var(y)/length(y))
## [1] -1.105707
# upper limit:
(mean(x) - mean(y)) + qt(0.025, df = degrees_fr, lower.tail = FALSE) * sqrt(var(x)/length(x) + var(y)/length(y))
## [1] -0.7542926

Task 2

Explore the pre-loaded data set iris.

  • Use plots to investigate the Petal.Width per species.
  • We want to investigate whether the petal width values are different between setosa and versicolor. Define the null and alternative hypothesis.
  • Perform the test and save the result in an object called test_res2b. What is the conclusion?
  • Extract the test statistic and p-value.
Note that the samples are independent.

The variables do not look normally distributed. Therefore, we will use a non-parametric test. Use the R function wilcox.test() to investigate whether the two populations have the same distribution. It is recommended to use the exact p-value calculation when there are no ties, and to use the continuity correction when an exact calculation is not possible.

Solution 2

boxplot(iris$Petal.Width ~ iris$Species)

par(mfrow = c(1, 3))
hist(iris$Petal.Width[iris$Species == "setosa"], 
     main = "setosa", xlab = "Width")
hist(iris$Petal.Width[iris$Species == "versicolor"], 
     main = "versicolor", xlab = "Width")
hist(iris$Petal.Width[iris$Species == "virginica"], 
     main = "virginica", xlab = "Width")

Null hypothesis: The distributions of petal width are equal for setosa and versicolor.
Alternative hypothesis: The distributions of petal width are different for setosa and versicolor.

The distributions of the petal width for setosa and versicolor do not look normal. We will, therefore, use a non-parametric test.

test_res2b <- wilcox.test(x = iris$Petal.Width[iris$Species == "setosa"], 
                          y = iris$Petal.Width[iris$Species == "versicolor"],
                          exact = FALSE)
test_res2b
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  iris$Petal.Width[iris$Species == "setosa"] and iris$Petal.Width[iris$Species == "versicolor"]
## W = 0, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

If we assume that the significance level is equal to 0.05, the output indicates that we can reject the null hypothesis that the distributions of the (petal) width for setosa and versicolor are the same.

test_res2b$statistic
## W 
## 0
test_res2b$p.value
## [1] 2.284669e-18

Task 3

Explore the pre-loaded data set airquality.

  • Use plots to investigate the variable Wind per month group.
  • Test whether Wind is different between months 5 and 8. Let’s assume that the measurements are paired.
  • Obtain the test statistic manually.
Note that the samples are dependent.

The variables do not look normally distributed. Therefore, we will use a non-parametric test. Use the R function wilcox.test() to investigate whether the median values are different. Check the argument paired in the R function.

Solution 3

# First we will explore the data with plots
head(airquality)
boxplot(airquality$Wind ~ airquality$Month)

par(mfrow = c(1, 2))
hist(airquality$Wind[airquality$Month == "5"], 
     main = "Month 5", xlab = "Wind")
hist(airquality$Wind[airquality$Month == "8"], 
     main = "Month 8", xlab = "Wind")

We notice that the distribution of the variable of interest is skewed (at least in one of the groups) and might not follow a normal distribution. Moreover, the sample size is small. We decide, therefore, to use a non-parametric test.

wilcox.test(x = airquality$Wind[airquality$Month == "5"], 
            y = airquality$Wind[airquality$Month == "8"], 
            paired = TRUE, exact = FALSE)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  airquality$Wind[airquality$Month == "5"] and airquality$Wind[airquality$Month == "8"]
## V = 359, p-value = 0.00952
## alternative hypothesis: true location shift is not equal to 0
# Alternatively:
wilcox.test(Wind ~ Month, 
            data = airquality[airquality$Month == "5" | airquality$Month == "8",],
            paired = TRUE, exact = FALSE)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  Wind by Month
## V = 359, p-value = 0.00952
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(Wind ~ Month, data = airquality,
            subset = Month %in% c(5, 8),
            paired = TRUE, exact = FALSE)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  Wind by Month
## V = 359, p-value = 0.00952
## alternative hypothesis: true location shift is not equal to 0

If we assume that the significance level is equal to 0.05, the output indicates that we can reject the null hypothesis that the median wind speed in month 5 is equal to that in month 8.

# Calculate the test statistic manually
x <- airquality$Wind[airquality$Month == "5"]
y <- airquality$Wind[airquality$Month == "8"]

# Create a data frame including the variables, the difference and the absolute difference:
mat <- data.frame(x, y, diff = c(x-y), abs_diff = abs(x-y))

# Remove the pairs with a zero difference:
mat <- mat[mat$diff != 0, ]

# Add the ranks of the absolute differences as an extra column:
mat <- data.frame(mat, ranks = rank(abs(mat$x-mat$y)))

# Calculate the sum of the ranks where the difference was positive:
sum(mat$ranks[mat$diff > 0])
## [1] 359
# Calculate the sum of the ranks where the difference was negative:
sum(mat$ranks[mat$diff < 0])
## [1] 106

Task 4*

Explore the pre-loaded data set airquality.

  • We want to investigate whether month 5 has lower temperature than month 8. Let’s assume that the measurements are paired. Define the null and alternative hypothesis.
  • What is the conclusion?
Note that this is a one-tailed test.
To perform a one-sided test, the argument alternative needs to be set to either “less” or “greater”.

Solution 4*

# First we will explore the data with plots
boxplot(airquality$Temp ~ airquality$Month)

We notice that the distribution of the variable of interest is skewed (at least in one of the groups) and might not follow a normal distribution. Moreover, the sample size is small. We decide, therefore, to use a non-parametric test.

Null hypothesis: The median temperature of month 5 is equal to the median temperature of month 8.
Alternative hypothesis: The median temperature of month 5 is lower than the median temperature of month 8.

wilcox.test(x = airquality$Temp[airquality$Month == "5"], 
            y = airquality$Temp[airquality$Month == "8"], 
            alternative = 'less', paired = TRUE, exact = FALSE)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  airquality$Temp[airquality$Month == "5"] and airquality$Temp[airquality$Month == "8"]
## V = 1, p-value = 6.733e-07
## alternative hypothesis: true location shift is less than 0
# Alternatively...
# When using the formula specification, x refers to the first level of the grouping factor and y to the second.
wilcox.test(Temp ~ Month, 
            data = airquality[airquality$Month == "5" | airquality$Month == "8",],
            alternative = 'less', paired = TRUE, exact = FALSE)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  Temp by Month
## V = 1, p-value = 6.733e-07
## alternative hypothesis: true location shift is less than 0

If we assume that the significance level is equal to 0.05, we can reject the null hypothesis in favour of the alternative that the median temperature of month 5 is lower than that of month 8.

Task 5*

Explore the pre-loaded data set iris.

  • We would like to investigate whether the mean sepal length is different between setosa and versicolor. Perform a t-test and a linear regression.
  • Compare the results of the two approaches.
Since the data look normally distributed and we have 50 subjects per group, we can use the t-test. Use the R function t.test() to investigate whether the mean values of two sample groups are different.
The two-sample t-test is a special case of the regression model. For example, if we assume a regression model including only one variable, the group indicator, the null hypothesis is that the slope coefficient is equal to 0.

Solution 5*

# Create a new data set where the first variable indicates the length and the second variable the group (setosa or versicolor):
var1 = iris$Sepal.Length[iris$Species == "setosa"]
var2 = iris$Sepal.Length[iris$Species == "versicolor"]

dat <- data.frame(outcome = c(var1, var2),
                  group = c(rep(0, length(var1)),
                            rep(1, length(var2))))

# Fit a linear regression model:
summary(lm(outcome ~ group, data = dat))
## 
## Call:
## lm(formula = outcome ~ group, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0360 -0.3135 -0.0060  0.2715  1.0640 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.00600    0.06250   80.09   <2e-16 ***
## group        0.93000    0.08839   10.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.442 on 98 degrees of freedom
## Multiple R-squared:  0.5304, Adjusted R-squared:  0.5256 
## F-statistic: 110.7 on 1 and 98 DF,  p-value: < 2.2e-16

The p-value is <0.0001 and the t-value is 10.52. The same results can be observed from the t-test:

var1 = iris$Sepal.Length[iris$Species == "setosa"]
var2 = iris$Sepal.Length[iris$Species == "versicolor"]

t.test(var2, var1)
## 
##  Welch Two Sample t-test
## 
## data:  var2 and var1
## t = 10.521, df = 86.538, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.7542926 1.1057074
## sample estimates:
## mean of x mean of y 
##     5.936     5.006

The slope coefficient in the linear regression model can be obtained as the mean difference (5.936 - 5.006 = 0.93).

We can also investigate the confidence intervals:

# Obtain the confidence interval from the t-test:
t.test(var2, var1)$conf.int
## [1] 0.7542926 1.1057074
## attr(,"conf.level")
## [1] 0.95
# Obtain the confidence interval from the linear regression:
confint(lm(outcome ~ group, data = dat))
##                 2.5 %   97.5 %
## (Intercept) 4.8819618 5.130038
## group       0.7545835 1.105417
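The two intervals are very similar but not identical: lm() assumes equal variances, whereas t.test() uses the Welch correction by default. With var.equal = TRUE the t-test reproduces the lm() interval (a quick check):

```r
var1 <- iris$Sepal.Length[iris$Species == "setosa"]
var2 <- iris$Sepal.Length[iris$Species == "versicolor"]
# Pooled-variance t-test: its confidence interval matches the
# lm() interval for the group coefficient:
t.test(var2, var1, var.equal = TRUE)$conf.int
```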

M sample tests

Task 1

Explore the pre-loaded data set airquality.

  • Use plots to investigate the distribution of temperature per month group.
  • We want to investigate whether the distribution of temperature is different per month group. Define the null and alternative hypothesis.
  • What is the conclusion?

Use the R functions boxplot(…) / hist(…) to visually explore the distribution of a numeric vector per group.

We do not seem to have normally distributed data. Moreover, the sample size is small. Therefore, we can use the non-parametric Kruskal-Wallis test. Use the R function kruskal.test() to investigate whether the samples originate from the same distribution.

Solution 1

# First we will explore the data with plots
boxplot(airquality$Temp ~ airquality$Month)

par(mfrow = c(2, 3))
hist(airquality$Temp[airquality$Month == "5"], 
     main = "Month 5", xlab = "Temperature")
hist(airquality$Temp[airquality$Month == "6"], 
     main = "Month 6", xlab = "Temperature")
hist(airquality$Temp[airquality$Month == "7"], 
     main = "Month 7", xlab = "Temperature")
hist(airquality$Temp[airquality$Month == "8"], 
     main = "Month 8", xlab = "Temperature")
hist(airquality$Temp[airquality$Month == "9"], 
     main = "Month 9", xlab = "Temperature")

We notice that the distribution of the variable of interest is skewed (at least in one of the groups) and might not follow a normal distribution. Moreover, the sample size is small. We decide, therefore, to use a non-parametric test.

Null hypothesis: The distribution of temperature is the same in each month group.
Alternative hypothesis: The distribution of temperature differs between month groups.

kruskal.test(Temp ~ Month, data = airquality)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Temp by Month
## Kruskal-Wallis chi-squared = 73.328, df = 4, p-value = 4.496e-15

The output shows that, if the significance level is 0.05, we can reject the null hypothesis that the groups have the same distribution.
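Following the manual-calculation pattern of the earlier tasks, the Kruskal-Wallis statistic can also be computed by hand (a sketch; the standard tie correction is needed to match kruskal.test() exactly):

```r
# Manual Kruskal-Wallis statistic with tie correction:
r  <- rank(airquality$Temp)             # ranks over all observations
N  <- length(r)
Rj <- tapply(r, airquality$Month, sum)  # rank sum per month
nj <- tapply(r, airquality$Month, length)
H  <- 12/(N * (N + 1)) * sum(Rj^2/nj) - 3 * (N + 1)
ties <- table(airquality$Temp)
H / (1 - sum(ties^3 - ties)/(N^3 - N))  # matches the chi-squared above
```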

Task 2

  • Simulate 3 numeric vectors of length 80 with the following details:
    • variable 1: from the normal distribution with mean 20 and standard deviation 10
    • variable 2: from the normal distribution with mean 40 and standard deviation 6
    • variable 3: from the normal distribution with mean 15 and standard deviation 20
  • Investigate whether the distributions of the above created variables are different.

We have normally distributed data and a large sample size. Use the R function aov() to investigate whether the samples originate from the same distribution.

Solution 2

# Generate data
set.seed(123)

BMI1 <- rnorm(80, 20, 10)
BMI2 <- rnorm(80, 40, 6)
BMI3 <- rnorm(80, 15, 20)

# First we will explore the data with plots
boxplot(BMI1, BMI2, BMI3)

par(mfrow = c(1, 3))
hist(BMI1)
hist(BMI2)
hist(BMI3)

The distributions look normal and the sample size is large. We decide, therefore, to use an ANOVA test.

BMI <- c(BMI1, BMI2, BMI3)
group <- c(rep(1, 80), rep(2, 80), rep(3, 80))
summary(aov(BMI ~ as.factor(group)))
##                   Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(group)   2  26720   13360   83.18 <2e-16 ***
## Residuals        237  38066     161                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output shows that, if the significance level is 0.05, we can reject the null hypothesis that the mean values of the groups are equal.
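As with the earlier tasks, the F statistic can be verified manually (a sketch, reusing BMI and group from above): it is the ratio of the between-group and within-group mean squares.

```r
# Manual one-way ANOVA F statistic (BMI and group as created above):
groups <- split(BMI, group)
N <- length(BMI)
k <- length(groups)
ss_between <- sum(sapply(groups, function(g) length(g) * (mean(g) - mean(BMI))^2))
ss_within  <- sum(sapply(groups, function(g) sum((g - mean(g))^2)))
(ss_between/(k - 1)) / (ss_within/(N - k))  # matches the F value above
```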

Correlation tests

Task 1

Explore the pre-loaded data set iris.

  • Use plots to investigate the association between Sepal.Length and Sepal.Width.
  • Obtain the Pearson correlation coefficient for the association of Sepal.Length and Sepal.Width.
  • Perform the Pearson correlation test for the variables Sepal.Length and Sepal.Width.

Use the R function plot(…) to create a scatterplot.

Use the R function cor(…) to obtain the correlation coefficient.

Use the R function cor.test(…) to perform the correlation test.

Solution 1

plot(x = iris$Sepal.Length, y = iris$Sepal.Width)

cor(x = iris$Sepal.Length, y = iris$Sepal.Width)
## [1] -0.1175698
cor.test(x = iris$Sepal.Length, y = iris$Sepal.Width)
## 
##  Pearson's product-moment correlation
## 
## data:  iris$Sepal.Length and iris$Sepal.Width
## t = -1.4403, df = 148, p-value = 0.1519
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.27269325  0.04351158
## sample estimates:
##        cor 
## -0.1175698
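Following the manual-calculation pattern of the earlier sections, the test statistic of the Pearson correlation test can be computed as \(t = r \sqrt{n-2} / \sqrt{1-r^2}\) (a sketch):

```r
# Manual computation of the correlation test statistic and p-value:
r <- cor(iris$Sepal.Length, iris$Sepal.Width)
n <- nrow(iris)
t_stat <- r * sqrt(n - 2)/sqrt(1 - r^2)
t_stat                            # matches t = -1.4403 above
2 * pt(-abs(t_stat), df = n - 2)  # matches the p-value 0.1519
```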

Task 2*

Explore the pre-loaded data set iris.

  • Use plots to investigate the association between Sepal.Length and Sepal.Width in the setosa group.
  • Obtain the Spearman correlation coefficient for the association of Sepal.Length and Sepal.Width in the setosa group.
  • Perform the Spearman correlation test for the variables Sepal.Length and Sepal.Width in the setosa group.

Use the R function cor(…) to obtain the correlation coefficient. Check the argument method.

Use the R function cor.test(…) to perform the correlation test. Check the argument method.

Solution 2*

# Create new vectors with the desired variables
var1 <- iris$Sepal.Length[iris$Species == "setosa"]
var2 <- iris$Sepal.Width[iris$Species == "setosa"]

plot(x = var1, y = var2)

cor(x = var1, y = var2, method = "spearman")
## [1] 0.7553375
cor.test(x = var1, y = var2, method = "spearman")
## Warning in cor.test.default(x = var1, y = var2, method = "spearman"): Cannot compute exact p-value
## with ties
## 
##  Spearman's rank correlation rho
## 
## data:  var1 and var2
## S = 5095.1, p-value = 2.317e-10
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.7553375

Task 3*

Explore the pre-loaded data set iris.

  • Perform the Pearson correlation test for the variables Sepal.Length and Sepal.Width.
  • Perform a linear regression assuming the standardized values of Sepal.Length and Sepal.Width.
  • Compare the slope estimate of the linear regression and the correlation coefficient.
  • Compare the p-value of the slope estimate of the linear regression with the p-value of the correlation test.

Use the R function cor.test(…) to perform the correlation test.

Use the R function lm(…) to perform a linear regression.

The slope of the linear regression equals the correlation coefficient if the two variables have identical standard deviations (e.g. when both are standardized).

Solution 3*

# Pearson correlation test
cor.test(x = iris$Sepal.Length, y = iris$Sepal.Width)
## 
##  Pearson's product-moment correlation
## 
## data:  iris$Sepal.Length and iris$Sepal.Width
## t = -1.4403, df = 148, p-value = 0.1519
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.27269325  0.04351158
## sample estimates:
##        cor 
## -0.1175698
# Linear regression
# We standardize the variables:
y1 <- (iris$Sepal.Length - mean(iris$Sepal.Length))/sd(iris$Sepal.Length)
y2 <- (iris$Sepal.Width - mean(iris$Sepal.Width))/sd(iris$Sepal.Width) 

# We fit a linear regression model:
summary(lm(y1 ~ y2))
## 
## Call:
## lm(formula = y1 ~ y2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8793 -0.7648 -0.1352  0.6738  2.6840 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.759e-16  8.136e-02    0.00    1.000
## y2          -1.176e-01  8.163e-02   -1.44    0.152
## 
## Residual standard error: 0.9964 on 148 degrees of freedom
## Multiple R-squared:  0.01382,    Adjusted R-squared:  0.007159 
## F-statistic: 2.074 on 1 and 148 DF,  p-value: 0.1519
# Alternatively
summary(lm(scale(iris$Sepal.Length) ~ scale(iris$Sepal.Width)))
## 
## Call:
## lm(formula = scale(iris$Sepal.Length) ~ scale(iris$Sepal.Width))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8793 -0.7648 -0.1352  0.6738  2.6840 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)             -3.759e-16  8.136e-02    0.00    1.000
## scale(iris$Sepal.Width) -1.176e-01  8.163e-02   -1.44    0.152
## 
## Residual standard error: 0.9964 on 148 degrees of freedom
## Multiple R-squared:  0.01382,    Adjusted R-squared:  0.007159 
## F-statistic: 2.074 on 1 and 148 DF,  p-value: 0.1519

The correlation coefficient and the slope coefficient are both -0.118.
The p-value from the correlation test and the p-value of the slope coefficient are both 0.152.
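As an aside, the same relation holds without standardizing: the slope equals \(r \cdot sd(y)/sd(x)\) (a quick check):

```r
# Slope of the unstandardized regression vs. the correlation coefficient:
fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)
r <- cor(iris$Sepal.Width, iris$Sepal.Length)
unname(coef(fit)["Sepal.Width"])
r * sd(iris$Sepal.Length)/sd(iris$Sepal.Width)  # same value
```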

 

© Eleni-Rosalina Andrinopoulou