ggplot(height, aes(x = younger, y = older) ) +
geom_point()
Workshop
Association: Correlation and Contingency
Introduction
Introduction
Session overview
In this session you will
Philosophy
Workshops are not a test. It is expected that you often don’t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips
- don’t worry about making mistakes
- don’t let what you can not do interfere with what you can do
- discussing code with your neighbours will help
- look things up in the independent study material
- look things up in your own code from earlier
- there are no stupid questions
These four symbols are used at the beginning of each instruction so you know where to carry out the instruction.
Something you need to do on your computer. It may be opening programs or documents or locating a file.
Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.
Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.
A question for you to think about and answer. Record your answers in your script for future reference.
Getting started
Start RStudio from the Start menu.
Make an RStudio project for this workshop by clicking on the drop-down menu on top right where it says Project: (None)
and choosing New Project, then New Directory, then New Project. Navigate to the data-analysis-in-r-2
folder and name the RStudio Project week-6
.
Make new folders called data-raw
and figures
. You can do this on the Files Pane by clicking New Folder and typing into the box that appears.
Make a two new scripts called correlation.R
and contingency-chi-squared-tests.R
to carry out the rest of the work.
Add a comments to each script such as: # Correlation
and # Contingency Chi-squared tests
and load the tidyverse
(Wickham et al. 2019) package in each
Exercises
Pearson’s Correlation
The data given in height.txt are the heights of eleven sibling pairs.
Save a copy of height.txt to your data-raw
folder and import it.
Did you know you can also read a file directly from the internet instead of saving it first?
height <- read_table("https://3mmarand.github.io/R4BABS/r4babs2/week-6/data-raw/data-raw/height.txt")
Exploring
What type of variables are older
and sister
? What are the implications for the test?
Do a quick plot of the data. We don’t have a causal relationship here so either variable can go on the x-axis.
Remembering that one of the assumptions for parametric correlation is that any correlation should be linear, what do you conclude from the plot?
Applying, interpreting and reporting
We will do a parametric correlation in any case.
We can carry out a Pearson’s product moment correlation with:
cor.test(data = height, ~ older + younger, method = "pearson")
Pearson's product-moment correlation
data: older and younger
t = 2.0157, df = 9, p-value = 0.07464
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.06336505 0.86739285
sample estimates:
cor
0.5577091
Notice that were are not using the response ~ explanatory
form here because this is not a causal relationship.
Pearson is the default correlation smethod, therefore we could omit method = "pearson"
.
What do you conclude from the test?
Illustrating
Create a better figure for our data using:
fig1 <- ggplot(height, aes(x = younger, y = older)) +
geom_point() +
scale_x_continuous(name = "Younger sibling height (cm)",
limits = c(120, 190),
expand = c(0, 0)) +
scale_y_continuous(name = "Older sibling height (cm)",
limits = c(120, 190),
expand = c(0, 0)) +
theme_classic()
fig1
Use ggsave()
to save your figure to file in your figures
folder.
Effect of sample size on correlation
Now we will explore the effect of sample size on the value of the correlation coefficient and its significance.
Create a dataset with twice the number of observations:
height2 <- rbind(height, height)
Make sure you view the resulting dataframe. Each pair of values will appear twice.
Now repeat the correlation with height2
What do you conclude? What does this tell you about the sensitivity of correlation to sample size?
Spearman’s rank Correlation
Since our sibling dataset is so small we might very reasonably have chosen to do a non-parametric correlation. The same function is used but we specify a different value for the method
argument.
Carry out a Spearman’s rank correlation:
cor.test(data = height, ~ older + younger, method = "spearman")
Spearman's rank correlation rho
data: older and younger
S = 109.74, p-value = 0.1163
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.5011722
What do you conclude?
Contingency chi-squared test
A human geneticist found that in a sample of 477 blood group O people 65 had peptic ulcers whereas in a sample of 387 blood group A people 31 had peptic ulcers.
Draw a 2 x 2 table of these data (on a piece of paper).
What is your null hypothesis and what type of test is required?
Make a vector obs
that holds the 4 observed numbers. For the moment, don’t worry about what order they are in.
For a contingency chi squared test, the inbuilt chi-squared test can be used in a straightforward way. However, we need to structure our data as a 2 x 2 table rather than as a 1 x 4 vector. A 2 x 2 table can be created with the matrix()
function. We can also name the rows and columns which helps us interpret the results.
To create a list containing two elements which are vectors for the two groups in each variable we do:
# list of two elements
# the two variables are whether someone has an ulcer or not and whether
# they are blood group O or A
vars <- list(ulcer = c("yes","no"), blood = c("O", "A"))
vars
$ulcer
[1] "yes" "no"
$blood
[1] "O" "A"
Now we can create the matrix from our vector of numbers obs
and use our list vars
to give the column and row names:
ulcers <- matrix(obs, nrow = 2, dimnames = vars)
ulcers
blood
ulcer O A
yes 65 31
no 412 356
Check the content of ulcers
and recreate if the numbers are not in the correct place (i.e., do not match your table)
Run a contingency chi-squared with:
chisq.test(ulcers, correct = FALSE)
Pearson's Chi-squared test
data: ulcers
X-squared = 6.824, df = 1, p-value = 0.008994
# you should look up the command in the manual to see what correct = FALSE does
To help us discover what is the direction of any deviation from the null hypothesis it is helpful to see what the expected values were. These are accessible in the $expected
variable in the output value of the chisq.test()
method (See the manual!).
View the expected values with:
chisq.test(ulcers, correct = FALSE)$expected
blood
ulcer O A
yes 53 43
no 424 344
What do you conclude about the association between ABO blood group and peptic ulcers?
Blood group and ulcers- alternative data format.
The data we were given was already tabulated. There are raw data in blood_ulcers.txt. Examine this file to understand the format
Save a copy of blood_ulcers.txt to your data-raw
folder
Read the data in to R and check the structure.
We can tabulate the data and assign it using the table()
command:
ulctab <- table(blood_ulcers$blood, blood_ulcers$ulcer)
# examine the result
ulctab
no yes
A 356 31
O 412 65
We need to give both variables to cross tabulate.
Now carry out the contingency chi-squared like this:
chisq.test(ulctab, correct = FALSE)
Pearson's Chi-squared test
data: ulctab
X-squared = 6.824, df = 1, p-value = 0.008994
Congratulations on making it to the end of the stage 1 Data Analysis in R teaching!
Independent study following the workshop
The Code file
This contains all the code needed in the workshop even where it is not visible on the webpage.
The workshop.qmd
file is the file I use to compile the practical. Qmd stands for Quarto markdown. It allows code and ordinary text to be interweaved to produce well-formatted reports including webpages. View the Qmd in Browser. Coding and thinking answers are marked with #---CODING ANSWER---
and #---THINKING ANSWER---
Pages made with R (R Core Team 2023), Quarto (Allaire et al. 2022), knitr
(Xie 2022), kableExtra
(Zhu 2021)