Independent Study to consolidate this week

Introduction to statistical models: Single linear regression

Set up

If you have just opened RStudio you will want to load the tidyverse package

library(tidyverse)

Exercises

💻 Effect of anxiety status and sporting performance. The data in sprint.txt are from an investigation of the effect of anxiety status and sporting performance. A group of 40 100m sprinters undertook a psychometric test to measure their anxiety shortly before competing. The data are their anxiety scores and the 100m times achieved. What you do conclude from these data?

Answer - don’t look until you have tried!

# this example is designed to emphasise the importance of plotting your data first
sprint <- read_table("data-raw/sprint.txt")
# Anxiety is discrete but ranges from 16 to 402 meaning the gap between possible measures is small and 
# the variable could be treated as continuous if needed. Time is a continuous measure that has decimal places and which we would expect to follow a normal distribution 

# explore with a plot
ggplot(sprint, aes(x = anxiety, y = time) ) +
  geom_point()

Answer - don’t look until you have tried!

# A scatterplot of the data clearly reveals that these data are not linear. There is a good relationship between the two variables but since it is not linear, single linear regression is not appropriate.

💻 Juvenile hormone in stag beetles. The concentration of juvenile hormone in stag beetles is known to influence mandible growth. Groups of stag beetles were injected with different concentrations of juvenile hormone (arbitrary units) and their average mandible size (mm) determined. The experimenters planned to analyse their data with regression. The data are in stag.txt

Answer - don’t look until you have tried!

# read the data in and check the structure
stag <- read_table("data-raw/stag.txt")
str(stag)

# jh is discrete but ordered and has been chosen by the experimenter - it is the explanatory variable.  
# the response is mandible size which has decimal places and is something we would expect to be 
# normally distributed. So far, common sense suggests the assumptions of regression are met.

Answer - don’t look until you have tried!

# exploratory plot
ggplot(stag, aes(x = jh, y = mand)) +
  geom_point()

Answer - don’t look until you have tried!

# looks linear-ish on the scatter
# regression still seems appropriate
# we will check the other assumptions after we have run the lm

Answer - don’t look until you have tried!

# build the statistical model
mod <- lm(data = stag, mand ~ jh)

# examine it
summary(mod)
# mand = 0.032*jh + 0.419
# the slope of the line is significantly different from zero / the jh explains a significant amount of the variation in mand (ANOVA: F = 16.63; d.f. = 1,14; p = 0.00113).
# the intercept is 0.419 and differs significantly from zero

Answer - don’t look until you have tried!

# checking the assumption
plot(mod, which = 1)

Answer - don’t look until you have tried!

# we're looking for the variance in the residuals to be equal along the x axis.
# with a small data set there is some apparent heterogeneity but it doesn't look too.
# 
hist(mod$residuals)

Answer - don’t look until you have tried!

# We have some skew which again might be partly a result of a small sample size.
shapiro.test(mod$residuals) # the also test not sig diff from normal

# On balance the use of regression is probably justifiable but it is borderline
# but ideally the experiment would be better if multiple individuals were measure at
# each of the chosen juvenile hormone levels.

Answer - don’t look until you have tried!

# a better plot
ggplot(stag, aes(x = jh, y = mand) ) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  scale_x_continuous(name = "Juvenile hormone (arbitrary units)",
                     expand = c(0, 0),
                     limits = c(0, 32)) +
  scale_y_continuous(name = "Mandible size (mm)",
                     expand = c(0, 0),
                     limits = c(0, 2)) +
  theme_classic()