Independent Study to consolidate this week

Association: Correlation and Contingency

Set up

If you have just opened RStudio you will want to load the tidyverse package

library(tidyverse)

Exercises

💻 The slug Arion ater has three major colour forms, black, chocolate brown and red. Sampling in population X revealed 27 black, 17 brown and 9 red individuals, whereas in population Y the corresponding numbers were 39, 10 and 21. Create an appropriate data structure and test whether the proportion of black, brown and red slugs differs between the two populations.

Answer - don’t look until you have tried!

#  names for the rows and columns
vars <- list(pop = c("x","y"), 
             colour = c("black", "brown", "red"))

# matrix of the data with named columns and rows
slugs <- matrix(c(27, 39, 17, 10, 9, 21), 
                nrow = 2, dimnames = vars)
slugs
# gives me
#    colour
# pop black brown red
#   x    27    17   9
#   y    39    10  21

# you may need to try a couple of times to get the numbers in 
# the right places

Answer - don’t look until you have tried!

chisq.test(slugs)
# X-squared = 6.5726, df = 2, p-value = 0.03739

# p < 0.05 so we reject the null hypothesis i.e., the proportions of the  
# colour forms are significantly different in the two populations 
# (mostly as a result of differences in the brown and red classes - 
# look at the differences between observed and expected values for 
# the three colour forms in the table above).
chisq.test(slugs)$expected
#       colour
# pop    black    brown      red
#   x 28.43902 11.63415 12.92683
#   y 37.56098 15.36585 17.07317

💻 The raw, untabulated data are in slugs.txt. Perform the test on these data.

Answer - don’t look until you have tried!

# import the data
slugs <- read_table("data-raw/slugs.txt")

# put it into a table
slugtab <- table(slugs$colour, slugs$pop)

# carry out the test
chisq.test(slugtab)

💻 The data in marks.csv give the marks for ten students in two subjects: Data Analysis and Biology. We have previously asked if there is difference in marks between between the subjects and analysed with a paired-sample test. However, we might instead ask if there is a correlation between subjects. Neither of these approaches is better - which you use depends on the question you are asking. Carry out a correlation on these data and create a figure suitable for a report.

Answer - don’t look until you have tried!

# import the data
marks <- read_csv("data-raw/marks.csv")

# we need to pivot the data to wide
marks_wide <- marks |> 
  pivot_wider(values_from = mark, 
              names_from = subject, 
              id_cols = student)
# rough plot
ggplot(data = marks_wide,
       aes(x = Biology, y = DataAnalysis)) +
  geom_point()

Answer - don’t look until you have tried!

# This is a small data sets and it is difficult to make
# judgements about the assumptions of a parametric correlation
# therefor we wil do a Spearman's rank correlation
cor.test(data = marks_wide, ~ Biology + DataAnalysis,
         method = "spearman")

# There was a significant positive correlation (r = 0.96) between
# Biology and Data Analysis marks (Spearmans’s rank correlation: 
# S = 6.03; n = 10; p-value < 0.0001).


ggplot(data = marks_wide,
       aes(x = Biology, y = DataAnalysis))  +
  geom_point() +
  scale_y_continuous(name = "Biology mark",
                     expand = c(0, 0), limits = c(0, 100)) +
  scale_x_continuous(name = "Data Analysis mark",
                     expand = c(0, 0), limits = c(0, 100)) +
  annotate("text", x = 15,  y = 80, 
           label = expression(italic(r)~"= 0.96; "~italic(p)~"< 0.0001")) +
  theme_classic()