Independent Study to consolidate this week

Two-sample tests

Set up

If you have just opened RStudio you will want to load the tidyverse package

library(tidyverse)

Exercises

💻 Plant Biotech. Some plant biotechnologists are trying to increase the quantity of omega 3 fatty acids in Cannabis sativa. They have developed a genetically modified line using genes from Linum usitatissimum (linseed). They grow 50 wild type and fifty modified plants to maturity, collect the seeds and determine the amount of omega 3 fatty acids. The data are in csativa.txt. Do you think their modification has been successful?

Answer - don’t look until you have tried!

csativa  <-  read_table("data-raw/csativa.txt")
str(csativa)

# First realise that this is a two sample test. You have two 
# independent samples - there are a total of 100 different plants 
# and the values in one group have no relationship to the values 
# in the other.

Answer - don’t look until you have tried!

# create a rough plot of the data  
ggplot(data = csativa, aes(x = plant, y = omega)) +
  geom_violin()

Answer - don’t look until you have tried!

# note the modified plants seem to have lower omega!

Answer - don’t look until you have tried!

# create a summary of the data
csativa_summary <- csativa |>
  group_by(plant) |>
  summarise(mean = mean(omega),
            std = sd(omega),
            n = length(omega),
            se = std/sqrt(n))

Answer - don’t look until you have tried!

# The data seem to be continuous so it is likely that a parametric 
# test will be fine.
# we will check the other assumptions after we have run the lm

# build the statistical model
mod <- lm(data = csativa, omega ~ plant)


# examine it
summary(mod)
# So there is a significant difference but you need to make sure 
# you know the direction!
# Wild plants have a significantly higher omega 3 content 
# (mean +/- s.e =  56.41 +/- 1.11) than modified plants 
# (49.46 +/- 0.82)(t = 5.03; d.f. = 98; p < 0.0001).

Answer - don’t look until you have tried!

# let's check the assumptions
plot(mod, which = 1)

Answer - don’t look until you have tried!

# we're looking for the variance in the residuals to be the 
# same in both groups.
# This looks OK. Maybe a bit higher in the wild plants
# (with the higher mean)
 
hist(mod$residuals)

Answer - don’t look until you have tried!

shapiro.test(mod$residuals)
# On balance the use of lm() is probably justifiable  The variance 
# isn't quite equal and the histogram looks a bit off normal but 
# the normality test is NS and the effect (in the figure) is clear.

Answer - don’t look until you have tried!

# A figure 
fig1 <- ggplot() +
  geom_point(data = csativa, aes(x = plant, y = omega),
             position = position_jitter(width = 0.1, height = 0),
             colour = "gray50") +
  geom_errorbar(data = csativa_summary, 
                aes(x = plant, ymin = mean - se, ymax = mean + se),
                width = 0.3) +
  geom_errorbar(data = csativa_summary, 
                aes(x = plant, ymin = mean, ymax = mean),
                width = 0.2) +
  scale_x_discrete(name = "Plant type", labels = c("GMO", "WT")) +
  scale_y_continuous(name = "Amount of Omega 3 (units)",
                     expand = c(0, 0),
                     limits = c(0, 90)) +
    annotate("segment", x = 1, xend = 2, 
           y = 80, yend = 80,
           colour = "black") +
  annotate("text", x = 1.5,  y = 85, 
           label = expression(italic(p)~"< 0.001")) +
  theme_classic()

# save figure to figures/csativa.png
ggsave("figures/csativa.png",
       plot = fig1,
       width = 3.5,
       height = 3.5,
       units = "in",
       dpi = 300)