16  Association: Correlation and Contingency

Incomplete

You are reading a work in progress. This page is a dumping ground for ideas. Sections may be missing or contain only a list of points to include.

16.1 Overview

Both methods in this chapter apply when there are no explanatory variables.

16.1.1 Correlation

  • two continuous variables

  • neither is an explanatory variable, i.e., we do not assume a causal relationship between the two variables

  • you want to know if there is a correlation between the two variables

  • the correlation coefficient is a measure of the strength and direction of the linear association between two variables and ranges from -1 to 1

  • A correlation coefficient of 1 indicates a perfect positive association between the variables meaning the highest scores in one variable are associated with the highest scores in the other.

  • A correlation coefficient of -1 indicates a perfect negative association between the variables meaning the highest scores in one variable are associated with the lowest scores in the other.

TODO Add a figure of the different correlation coefficients

  • we estimate the population correlation coefficient, \(\rho\) (pronounced rho), with the sample correlation coefficient, \(r\), and test whether it is significantly different from zero because a correlation coefficient of 0 means no association.

  • Pearson’s correlation coefficient is a parametric measure which assumes the variables are normally distributed and that any correlation is linear.

  • Spearman’s rank correlation coefficient is a non-parametric measure calculated on the ranks of the values which does not assume the variables are normally distributed

  • we use cor.test() in R.

16.1.2 Contingency Chi-squared

  • two categorical variables

  • neither is an explanatory variable, i.e., we do not assume a causal relationship between the two variables

  • we count the number of observations in each category of each variable

  • you want to know if there is an association between the two variables

  • another way of describing this is that we test whether the proportion of observations falling into each category of one variable is the same for each category of the other variable.

  • we use a chi-squared test to test whether the observed counts are significantly different from the expected counts if there was no association between the variables.

16.1.3 Reporting

  1. the significance of effect - whether the association is significantly different from zero

  2. the direction of effect

    • for correlation, whether \(r\) is positive or negative
    • for contingency chi-squared, which categories have higher counts than expected
  3. the magnitude of effect

We do not put a line of best fit on a scatter plot accompanying a correlation because such a line implies a causal relationship.

We will explore all of these ideas with examples.

16.2 🎬 Your turn!

If you want to code along you will need to start a new RStudio project, add a data-raw folder and open a new script. You will also need to load the tidyverse package (Wickham et al. 2019).
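
That is, at the top of your script:

library(tidyverse)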

16.3 Correlation

High quality images of the internal structure of wheat seeds were taken with a soft X-ray technique. Seven measurements were determined from the images:

  • Area
  • Perimeter
  • Compactness
  • Kernel length
  • Kernel width
  • Asymmetry coefficient
  • Length of kernel groove

Researchers were interested in the correlation between compactness and kernel width.

The data are in seeds-dataset.xlsx.

16.3.1 Import and explore

The data are in an Excel file so we will need the readxl (Wickham and Bryan 2023) package.

Load the package:
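
library(readxl)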

Find out the names of the sheets in the Excel file:

excel_sheets("data-raw/seeds-dataset.xlsx")
## [1] "seeds_dataset"

There is only one sheet called seeds_dataset.

Import the data:

seeds <- read_excel("data-raw/seeds-dataset.xlsx",
                    sheet = "seeds_dataset")

Note that we could have omitted the sheet argument because there is only one sheet in the Excel file. However, it is good practice to be explicit.

area perimeter compactness kernal_length kernel_width asymmetry_coef groove_length
15.26 14.84 0.8710 5.763 3.312 2.2210 5.220
14.88 14.57 0.8811 5.554 3.333 1.0180 4.956
14.29 14.09 0.9050 5.291 3.337 2.6990 4.825
13.84 13.94 0.8955 5.324 3.379 2.2590 4.805
16.14 14.99 0.9034 5.658 3.562 1.3550 5.175
14.38 14.21 0.8951 5.386 3.312 2.4620 4.956
14.69 14.49 0.8799 5.563 3.259 3.5860 5.219
14.11 14.10 0.8911 5.420 3.302 2.7000 5.000
16.63 15.46 0.8747 6.053 3.465 2.0400 5.877
16.44 15.25 0.8880 5.884 3.505 1.9690 5.533
15.26 14.85 0.8696 5.714 3.242 4.5430 5.314
14.03 14.16 0.8796 5.438 3.201 1.7170 5.001
13.89 14.02 0.8880 5.439 3.199 3.9860 4.738
13.78 14.06 0.8759 5.479 3.156 3.1360 4.872
13.74 14.05 0.8744 5.482 3.114 2.9320 4.825
14.59 14.28 0.8993 5.351 3.333 4.1850 4.781
13.99 13.83 0.9183 5.119 3.383 5.2340 4.781
15.69 14.75 0.9058 5.527 3.514 1.5990 5.046
14.70 14.21 0.9153 5.205 3.466 1.7670 4.649
12.72 13.57 0.8686 5.226 3.049 4.1020 4.914
14.16 14.40 0.8584 5.658 3.129 3.0720 5.176
14.11 14.26 0.8722 5.520 3.168 2.6880 5.219
15.88 14.90 0.8988 5.618 3.507 0.7651 5.091
12.08 13.23 0.8664 5.099 2.936 1.4150 4.961
15.01 14.76 0.8657 5.789 3.245 1.7910 5.001
16.19 15.16 0.8849 5.833 3.421 0.9030 5.307
13.02 13.76 0.8641 5.395 3.026 3.3730 4.825
12.74 13.67 0.8564 5.395 2.956 2.5040 4.869
14.11 14.18 0.8820 5.541 3.221 2.7540 5.038
13.45 14.02 0.8604 5.516 3.065 3.5310 5.097
13.16 13.82 0.8662 5.454 2.975 0.8551 5.056
15.49 14.94 0.8724 5.757 3.371 3.4120 5.228
14.09 14.41 0.8529 5.717 3.186 3.9200 5.299
13.94 14.17 0.8728 5.585 3.150 2.1240 5.012
15.05 14.68 0.8779 5.712 3.328 2.1290 5.360
16.12 15.00 0.9000 5.709 3.485 2.2700 5.443
16.20 15.27 0.8734 5.826 3.464 2.8230 5.527
17.08 15.38 0.9079 5.832 3.683 2.9560 5.484
14.80 14.52 0.8823 5.656 3.288 3.1120 5.309
14.28 14.17 0.8944 5.397 3.298 6.6850 5.001
13.54 13.85 0.8871 5.348 3.156 2.5870 5.178
13.50 13.85 0.8852 5.351 3.158 2.2490 5.176
13.16 13.55 0.9009 5.138 3.201 2.4610 4.783
15.50 14.86 0.8820 5.877 3.396 4.7110 5.528
15.11 14.54 0.8986 5.579 3.462 3.1280 5.180
13.80 14.04 0.8794 5.376 3.155 1.5600 4.961
15.36 14.76 0.8861 5.701 3.393 1.3670 5.132
14.99 14.56 0.8883 5.570 3.377 2.9580 5.175
14.79 14.52 0.8819 5.545 3.291 2.7040 5.111
14.86 14.67 0.8676 5.678 3.258 2.1290 5.351
14.43 14.40 0.8751 5.585 3.272 3.9750 5.144
15.78 14.91 0.8923 5.674 3.434 5.5930 5.136
14.49 14.61 0.8538 5.715 3.113 4.1160 5.396
14.33 14.28 0.8831 5.504 3.199 3.3280 5.224
14.52 14.60 0.8557 5.741 3.113 1.4810 5.487
15.03 14.77 0.8658 5.702 3.212 1.9330 5.439
14.46 14.35 0.8818 5.388 3.377 2.8020 5.044
14.92 14.43 0.9006 5.384 3.412 1.1420 5.088
15.38 14.77 0.8857 5.662 3.419 1.9990 5.222
12.11 13.47 0.8392 5.159 3.032 1.5020 4.519
11.42 12.86 0.8683 5.008 2.850 2.7000 4.607
11.23 12.63 0.8840 4.902 2.879 2.2690 4.703
12.36 13.19 0.8923 5.076 3.042 3.2200 4.605
13.22 13.84 0.8680 5.395 3.070 4.1570 5.088
12.78 13.57 0.8716 5.262 3.026 1.1760 4.782
12.88 13.50 0.8879 5.139 3.119 2.3520 4.607
14.34 14.37 0.8726 5.630 3.190 1.3130 5.150
14.01 14.29 0.8625 5.609 3.158 2.2170 5.132
14.37 14.39 0.8726 5.569 3.153 1.4640 5.300
12.73 13.75 0.8458 5.412 2.882 3.5330 5.067

These data are formatted usefully for analysis using correlation and plotting. Each row is a different seed and the columns are the variables we are interested in. This means that the first values in each column are related: they are all from the same seed.

In the first instance, it is sensible to create a rough plot of our data (See Figure 16.1). Plotting data early helps us in multiple ways:

  • it helps identify whether there are extreme values
  • it allows us to see if the association is roughly linear
  • it tells us whether any association is positive or negative

Scatter plots (geom_point()) are a good choice for exploratory plotting with data like these.

ggplot(data = seeds,
       aes(x = compactness, y = kernel_width)) +
  geom_point()
Figure 16.1: A default scatter plot of compactness and kernel width in seeds demonstrates an approximately linear positive association between the variables.

The figure suggests that there is a positive correlation between compactness and kernel_width and that correlation is roughly linear.

16.3.2 Check assumptions

We can see from Figure 16.1 that the relationship between compactness and kernel_width is roughly linear. This is a good sign for using Pearson’s correlation coefficient.

Our next check is to use common sense: both variables are continuous and we would expect them to be normally distributed. We can then plot histograms to examine the distributions (See Figure 16.2).

ggplot(data = seeds, aes(x = compactness)) +
  geom_histogram(bins = 12)

ggplot(data = seeds, aes(x = kernel_width)) +
  geom_histogram(bins = 12)
(a) Compactness
(b) Kernel width
Figure 16.2: The distributions of the two variables are slightly skewed but do not seem too different from a normal distribution. We will continue with the Pearson’s correlation coefficient.

The distributions of the two variables are slightly skewed but do not seem too different from a normal distribution.

Finally, we can use the Shapiro-Wilk test to test for normality.

shapiro.test(seeds$compactness)
## 
##  Shapiro-Wilk normality test
## 
## data:  seeds$compactness
## W = 0.99484, p-value = 0.9937
shapiro.test(seeds$kernel_width)
## 
##  Shapiro-Wilk normality test
## 
## data:  seeds$kernel_width
## W = 0.98989, p-value = 0.8486

The p-values are greater than 0.05 so these tests of the normality assumption are not significant. Note that “not significant” means not significantly different from a normal distribution. It does not mean definitely normally distributed.

It is not unreasonable to continue with the Pearson’s correlation coefficient. However, later we will also use the Spearman’s rank correlation coefficient, a non-parametric method which has fewer assumptions. Spearman’s rank correlation is a more conservative approach.

16.3.3 Do a Pearson’s correlation test with cor.test()

We can carry out a Pearson’s correlation test with cor.test() like this:

cor.test(data = seeds, ~ compactness + kernel_width)
## 
##  Pearson's product-moment correlation
## 
## data:  compactness and kernel_width
## t = 7.3738, df = 68, p-value = 2.998e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5117537 0.7794620
## sample estimates:
##       cor 
## 0.6665731

Neither variable is being explained by the other in this case, so there is no response variable: both variables go on the right-hand side of the formula. Pearson’s is the default method so we do not need to specify it.
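
Note that cor.test() will also accept the two variables directly rather than as a formula; this call is equivalent:

cor.test(seeds$compactness, seeds$kernel_width)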

What do all these results mean?

The last line gives the correlation coefficient, \(r\), that has been estimated from the data. Here, \(r\) = 0.67 which is a moderate positive correlation.

The fifth line tells you what test has been done: alternative hypothesis: true correlation is not equal to 0. That is, \(H_0\) is that \(\rho = 0\).

The fourth line gives the test result: t = 7.3738, df = 68, p-value = 2.998e-10

The \(p\)-value is very much less than 0.05 so we can reject the null hypothesis and conclude there is a significant positive correlation between compactness and kernel_width.
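
The values reported in the output can also be extracted from the test object, which is useful when writing up. A minimal sketch, using the fact that cor.test() returns a standard htest list:

# store the result and pull out the values we report
seed_cor <- cor.test(data = seeds, ~ compactness + kernel_width)
seed_cor$estimate  # the sample correlation coefficient, r
seed_cor$p.value   # the p-value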

16.3.4 Report

There was a significant positive correlation (\(r\) = 0.67) between compactness and kernel width (Pearson’s: t = 7.374; df = 68; p-value < 0.0001). See Figure 16.3.

ggplot(data = seeds,
       aes(x = compactness, y = kernel_width)) +
  geom_point() +
  scale_y_continuous(name = "Diameter (mm)") +
    scale_x_continuous(name = "Compactness") +
  annotate("text", x = 0.85,  y = 3.6, 
           label = expression(italic(r)~"= 0.67; "~italic(p)~"< 0.001")) +
  theme_classic()

Figure 16.3: Correlation between compactness and kernel width of wheat seeds. High quality images of the internal structure of wheat seeds were taken with a soft X-ray technique and the compactness and kernel width of the seeds were determined. There was a significant positive correlation (\(r\) = 0.67) between compactness and kernel width (Pearson’s: t = 7.374; df = 68; p-value < 0.0001). Note: axes do not start at 0. Data analysis was conducted in R (R Core Team 2023) with tidyverse packages (Wickham et al. 2019).

16.3.5 Spearman’s rank correlation coefficient

TODO
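
Until this section is written, here is a minimal sketch. Spearman’s rank correlation uses the same cor.test() call but with method = "spearman". Because these data contain tied values, R will use an approximate rather than exact p-value and warn accordingly:

cor.test(data = seeds, ~ compactness + kernel_width,
         method = "spearman")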

16.4 Contingency Chi-squared test

Researchers were interested in whether different pig breeds had the same food preferences. They offered individuals of three breeds, Welsh, Tamworth and Essex, a choice of three foods: cabbage, sugar beet and swede, and recorded the number of individuals that chose each food. The data are shown in Table 16.1.

Table 16.1: Food preferences of three pig breeds
          welsh  tamworth  essex
cabbage      11        19     22
sugarbeet    21        16      8
swede         7        12     11

We don’t know what proportion of the foods is expected to be preferred, but we do expect it to be the same for each breed if there is no association between breed and food preference. The null hypothesis is that the proportion of foods taken by each breed is the same.

For a contingency chi-squared test, the inbuilt chi-squared test can be used but we need to structure our data as a 3 x 3 table. The matrix() function is useful here and we can label the rows and columns to help us interpret the results.

Put the data into a matrix:

# create the data
food_pref <- matrix(c(11, 19, 22,
                      21, 16, 8,
                      7, 12, 11),
                    nrow = 3,
                    byrow = TRUE)
food_pref
##      [,1] [,2] [,3]
## [1,]   11   19   22
## [2,]   21   16    8
## [3,]    7   12   11

The byrow and nrow arguments allow us to lay out the data in the matrix as we need. To name the rows and columns we can use the dimnames() function. We need to create a “list” object to hold the names of the rows and columns and then assign this to the matrix object. The names of the rows and columns are called the “dimension names” in a matrix.

Make a list for the two vectors of names:

# row and column names to use as dimension names
vars <- list(food = c("cabbage",
                      "sugarbeet",
                      "swede"),
             breed = c("welsh",
                       "tamworth",
                       "essex"))

The vectors can be of different lengths in a list which would be important if we had four breeds and only two foods, for example.

Now assign the list to the dimension names in the matrix:

dimnames(food_pref) <- vars

food_pref
##            breed
## food        welsh tamworth essex
##   cabbage      11       19    22
##   sugarbeet    21       16     8
##   swede         7       12    11

The data are now in a form that can be used in the chisq.test() function:

chisq.test(food_pref)
## 
##  Pearson's Chi-squared test
## 
## data:  food_pref
## X-squared = 10.64, df = 4, p-value = 0.03092

The test is significant since the p-value is less than 0.05. We have evidence of a preference for particular foods by different breeds. But in what way? We need to know the “direction of the effect”, i.e., who likes what?

The object returned by chisq.test() includes the residuals. These are the differences between the observed and expected values, standardised by dividing by the square root of the expected values. The expected values are those that would be expected if there was no association between the rows and columns.

chisq.test(food_pref)$residuals
##            breed
## food             welsh    tamworth      essex
##   cabbage   -1.2433504 -0.05564283  1.2722209
##   sugarbeet  1.9317656 -0.16014783 -1.7125943
##   swede     -0.7289731  0.26939742  0.4225344

Where the residuals are positive, the observed value is greater than the expected value and where they are negative, the observed value is less than the expected value. Our results show the Welsh pigs much prefer sugarbeet and strongly dislike cabbage. The Essex pigs prefer cabbage and dislike sugarbeet, and the Tamworth pigs slightly prefer swede but have less strong likes and dislikes.
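
The expected counts themselves are also stored in the test object; each is the row total multiplied by the column total, divided by the grand total:

# expected counts under the null hypothesis of no association
chisq.test(food_pref)$expected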

The degrees of freedom are: (rows - 1)(cols - 1) = 2 * 2 = 4.

16.4.1 Report

Different pig breeds showed a significant preference for the different food types (\(\chi^2\) = 10.64; df = 4; p = 0.031) with Essex much preferring cabbage and disliking sugarbeet, Welsh showing a strong preference for sugarbeet and a dislike of cabbage and Tamworth showing no clear preference.

16.5 Summary

TODO