Independent Study to consolidate this week
Association: Correlation and Contingency
Set up
If you have just opened RStudio you will want to load the tidyverse package:
library(tidyverse)
Exercises
- 💻 The slug Arion ater has three major colour forms, black, chocolate brown and red. Sampling in population X revealed 27 black, 17 brown and 9 red individuals, whereas in population Y the corresponding numbers were 39, 10 and 21. Create an appropriate data structure and test whether the proportion of black, brown and red slugs differs between the two populations.
Answer - don’t look until you have tried!
# names for the rows and columns
vars <- list(pop = c("x", "y"),
             colour = c("black", "brown", "red"))
# matrix of the data with named columns and rows
slugs <- matrix(c(27, 39, 17, 10, 9, 21),
                nrow = 2, dimnames = vars)
slugs
# gives me
#    colour
# pop black brown red
#   x    27    17   9
#   y    39    10  21
# you may need to try a couple of times to get the numbers in
# the right places
Answer - don’t look until you have tried!
chisq.test(slugs)
# X-squared = 6.5726, df = 2, p-value = 0.03739
# p < 0.05 so we reject the null hypothesis, i.e., the proportions of the
# colour forms differ significantly between the two populations
# (mostly as a result of differences in the brown and red classes -
# compare the observed counts above with the expected values below
# for the three colour forms).
chisq.test(slugs)$expected
#    colour
# pop    black    brown      red
#   x 28.43902 11.63415 12.92683
#   y 37.56098 15.36585 17.07317
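To see which cells drive the result, one option is to look at the Pearson residuals, (observed - expected) / sqrt(expected), which chisq.test() returns in its residuals element. The sketch below rebuilds the slugs matrix from the counts in the exercise so that it runs on its own. Reading the squared residuals as each cell's contribution is an informal guide, not a formal post-hoc test.

```r
# counts from the exercise, rebuilt so this block runs on its own
vars <- list(pop = c("x", "y"),
             colour = c("black", "brown", "red"))
slugs <- matrix(c(27, 39, 17, 10, 9, 21),
                nrow = 2, dimnames = vars)

result <- chisq.test(slugs)

# Pearson residuals: (observed - expected) / sqrt(expected)
result$residuals

# squaring the residuals gives each cell's contribution to X-squared
contrib <- result$residuals^2
round(contrib, 2)

# the contributions add up to the X-squared statistic
sum(contrib)
```

The cells with the largest contributions are the ones where observed and expected counts differ most relative to the expected count.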
- 💻 The raw, untabulated data are in slugs.txt. Perform the test on these data.
Answer - don’t look until you have tried!
# import the data
slugs <- read_table("data-raw/slugs.txt")
# put it into a table
slugtab <- table(slugs$colour, slugs$pop)
# carry out the test
chisq.test(slugtab)
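The step from raw to tabulated data can be illustrated without the file. This is a sketch only: it rebuilds one-row-per-slug data from the counts given in the first exercise and assumes the columns are called pop and colour, as in the answer above. The actual layout of slugs.txt may differ.

```r
# one row per slug, rebuilt from the counts in the first exercise
# (the column names pop and colour are assumed to match slugs.txt)
raw <- data.frame(
  pop    = rep(c("x", "y"), times = c(53, 70)),
  colour = c(rep(c("black", "brown", "red"), times = c(27, 17, 9)),
             rep(c("black", "brown", "red"), times = c(39, 10, 21)))
)

# table() counts the rows in each combination of the two variables
slugtab <- table(raw$colour, raw$pop)
slugtab

# the orientation of the table (colour by pop, or pop by colour)
# does not change the test statistic
chisq.test(slugtab)$statistic
chisq.test(t(slugtab))$statistic
```

Because the table holds the same counts as the matrix built by hand, the test should agree with the result from the first exercise.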
- 💻 The data in marks.csv give the marks for ten students in two subjects: Data Analysis and Biology. We have previously asked if there is a difference in marks between the subjects and analysed the data with a paired-sample test. However, we might instead ask if there is a correlation between the subjects. Neither of these approaches is better - which you use depends on the question you are asking. Carry out a correlation on these data and create a figure suitable for a report.
Answer - don’t look until you have tried!
# import the data
marks <- read_csv("data-raw/marks.csv")
# we need to pivot the data to wide
marks_wide <- marks |>
  pivot_wider(values_from = mark,
              names_from = subject,
              id_cols = student)
# rough plot
ggplot(data = marks_wide,
       aes(x = Biology, y = DataAnalysis)) +
  geom_point()
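The pivot is needed because a correlation compares two columns, one per subject, with one row per student. The sketch below shows the reshaping on a tiny data frame. The column names student, subject and mark follow the answer above, but the students and marks are invented for illustration and are not the values in marks.csv.

```r
library(tidyr)

# invented long-format data: one row per student per subject
long <- data.frame(
  student = rep(c("A", "B", "C"), times = 2),
  subject = rep(c("Biology", "DataAnalysis"), each = 3),
  mark    = c(60, 72, 45, 58, 80, 50)
)

# one row per student, one column per subject
wide <- pivot_wider(long,
                    id_cols = student,
                    names_from = subject,
                    values_from = mark)
wide
```

Each value of subject becomes a column name, and the marks fill those columns, so Biology and DataAnalysis can be used directly as the x and y variables.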
Answer - don’t look until you have tried!
# This is a small data set and it is difficult to make
# judgements about the assumptions of a parametric correlation,
# therefore we will do a Spearman's rank correlation
cor.test(data = marks_wide, ~ Biology + DataAnalysis,
         method = "spearman")
# There was a significant positive correlation (r = 0.96) between
# Biology and Data Analysis marks (Spearman's rank correlation:
# S = 6.03; n = 10; p-value < 0.0001).
# axis names must match the aesthetics: Biology is on x,
# DataAnalysis is on y
ggplot(data = marks_wide,
       aes(x = Biology, y = DataAnalysis)) +
  geom_point() +
  scale_x_continuous(name = "Biology mark",
                     expand = c(0, 0), limits = c(0, 100)) +
  scale_y_continuous(name = "Data Analysis mark",
                     expand = c(0, 0), limits = c(0, 100)) +
  annotate("text", x = 15, y = 80,
           label = expression(italic(r)~"= 0.96; "~italic(p)~"< 0.0001")) +
  theme_classic()
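The Spearman result reported above can be sanity-checked by hand. With n pairs, the coefficient and the statistic S are related by 1 - 6S / (n^3 - n), and when there are no ties S is the sum of squared differences between the two sets of ranks. The sketch below applies the relation to the reported S and n, then shows on made-up numbers (not the marks data) that Spearman's coefficient is the Pearson correlation of the ranks.

```r
# values reported in the answer above
S <- 6.03
n <- 10
rho_from_S <- 1 - 6 * S / (n^3 - n)
round(rho_from_S, 2)

# made-up illustrative values, not the marks data
x <- c(2, 5, 1, 9, 7)
y <- c(3, 6, 2, 8, 10)

# Spearman's coefficient is the Pearson correlation of the ranks
cor(x, y, method = "spearman")
cor(rank(x), rank(y))

# without ties, S is the sum of squared rank differences
sum((rank(x) - rank(y))^2)
cor.test(x, y, method = "spearman")$statistic
```

A non-integer S, as in the marks result, suggests tied marks, since a sum of squared differences between untied ranks is always a whole number.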