20 Pseudomonas putida grown in two soil types at two temperatures

Incomplete

You are reading a work in progress. This page is a dumping ground for ideas. Sections maybe missing, or contain a list of points to include.

20.1 Scenario

Pseudomonas putida, is a hydrocarbon-degrading bacterium used for cleaning up oil-contaminated soil. P.putida was growth on either hexadecane contaminated or control soil at two temperatures: 4 degrees or 25 degrees. The was is to determine the effect of temperature and soil type on the number of bacteria in the soil.

Each group in a large class did one replicate and added their data to a Google sheet: https://docs.google.com/spreadsheets/d/1k2Gq4tKGEI0v_tdfgI-1d-Lk4YaWAC50sHffOmnqESU/edit?gid=886771226#gid=886771226

library(googlesheets4)
library(tidyverse)
library(emmeans)

file <- "https://docs.google.com/spreadsheets/d/1k2Gq4tKGEI0v_tdfgI-1d-Lk4YaWAC50sHffOmnqESU/edit?gid=886771226#gid=886771226"

class_data <- read_sheet(file) |> janitor::clean_names()

You will be asked to authenticate in your browser. This message will appear in the console and the browser should open

Waiting for authentication in browser...
Press Esc/Ctrl + C to abort

You can Allow “Tidyverse API Packages wants to access your Google Account”.

You might also be asked

Is it OK to cache OAuth access credentials in the folder

Choose the option for yes

The httpuv package enables a nicer Google auth experience, in many cases, but it
isn't installed.
Would you like to install it now?

Choose the option for yes.

The data should then read in.

Next time you run read_sheet() from googlesheets4

Next time you run read_sheet() from googlesheets4 you will not have to authenticate in your browser if you chose “yes” for Is it OK to cache OAuth access credentials in the folder.

Instead you will see:

The googlesheets4 package is requesting access to your Google account.
Enter '1' to start a new auth process or select a pre-authorized account.
1: Send me to the browser for a new auth process.
2: your.email@whereever.ac.uk

You can chose 2

We don’t need the timestamp or group name columns and these can be dropped using the select() command

The data are in wide format - there is sample in column. We want to convert to long format so that each row contains a single count and the other columns give the soil and the temperature treatment. We can use pivot_longer() for this.

Drooping the columns (-) and pivoting the data can be done in one pipeline:

class_data <- class_data |> 
  select(-timestamp, -group_name) |>
  pivot_longer(cols = everything(),
               names_to = "treatment",
               values_to = "counts_cpg")


class_data <- class_data |>
  extract(
    col = treatment,
    into = c("soil", "temp"),
    regex = "^([uh]).([fi])_cells_gram_of_soil$"
  ) |>
  mutate(
    soil = recode(soil,
                  "u" = "uncontaminated",
                  "h" = "hexadecane contaminated"),
    temp = recode(temp,
                         "f" = "4 degrees",
                         "i" = "25 degrees")
  )

ggplot(class_data, 
       aes(x = soil, y = counts_cpg, fill = temp)) +
  geom_boxplot()

The data are not normally distributed - there are a few very large values. Logging the data will aid visualisation:

ggplot(class_data, 
       aes(x = soil, y = log10(counts_cpg), fill = temp)) +
  geom_boxplot()

That’s better but the warning: Removed 11 rows containing non-finite outside the scale range indicates that there are some zeros. We can add 1 to the counts before logging to avoid this. This is a common practice. It is useful because log(1) = 0 which simplifies interpretation and is better than removing them. Many of the values are very large - the next smallest number is 75 - and adding 1 to large values before logging will have very little impact.

ggplot(class_data, 
       aes(x = soil, y = log10(counts_cpg + 1), fill = temp)) +
  geom_boxplot()

Temperature seems to matter but there is little difference between the two soils. The difference between the two temperatures is about the same in both soils suggesting no interaction.

Add a new column to the data frame with the logged values that we can use in the model.

class_data <- class_data |> 
  mutate(log_counts = log10(counts_cpg + 1))

Summarise the data to get the means and se for each group.

class_data_summary <- class_data |> 
  group_by(soil, temp) |> 
  summarise(mean = mean(log_counts),
            sd = sd(log_counts),
            n = length(log_counts),
            se = sd/sqrt(n))

two-way anova

mod <- lm(data = class_data,
          log_counts ~ soil * temp)

summary(mod)
## 
## Call:
## lm(formula = log_counts ~ soil * temp, data = class_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5791 -0.3205  0.1573  0.6250  4.6301 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        7.5791     0.1851  40.943   <2e-16 ***
## soiluncontaminated                 0.0897     0.2618   0.343    0.732    
## temp4 degrees                     -2.4686     0.2618  -9.430   <2e-16 ***
## soiluncontaminated:temp4 degrees  -0.1270     0.3702  -0.343    0.732    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.571 on 284 degrees of freedom
## Multiple R-squared:  0.3973, Adjusted R-squared:  0.391 
## F-statistic: 62.41 on 3 and 284 DF,  p-value: < 2.2e-16

anova(mod)
## Analysis of Variance Table
## 
## Response: log_counts
##            Df Sum Sq Mean Sq  F value Pr(>F)    
## soil        1   0.05    0.05   0.0200 0.8876    
## temp        1 461.62  461.62 187.1022 <2e-16 ***
## soil:temp   1   0.29    0.29   0.1177 0.7318    
## Residuals 284 700.69    2.47                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is a strong and significant effect of temperature, no effect of soil and no interaction between temperature and soil. The effect of temperature is independent of soil.

examine the assumptions

ggplot(mapping = aes(x = mod$residuals)) +
  geom_histogram(bins = 30)

Too may values in the middle, too few in the shoulders for a normal distribution.

shapiro.test(mod$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  mod$residuals
## W = 0.83366, p-value < 2.2e-16

Significantly different from a normal distribution.

plot(mod, which = 1)

Variances are similar in each group

Not normal but reasonably symmetrical and variances are similar in each group. In addition, the effect is very clear - not at all borderline - so this case I will continue with the two-way anova.

Post-hoc tests. Expecting differences between temps regardless of soil type.

emmeans(mod, ~ soil * temp) |> pairs()
##  contrast                                                              
##  hexadecane contaminated 25 degrees - uncontaminated 25 degrees        
##  hexadecane contaminated 25 degrees - hexadecane contaminated 4 degrees
##  hexadecane contaminated 25 degrees - uncontaminated 4 degrees         
##  uncontaminated 25 degrees - hexadecane contaminated 4 degrees         
##  uncontaminated 25 degrees - uncontaminated 4 degrees                  
##  hexadecane contaminated 4 degrees - uncontaminated 4 degrees          
##  estimate    SE  df t.ratio p.value
##   -0.0897 0.262 284  -0.343  0.9861
##    2.4686 0.262 284   9.430  <.0001
##    2.5059 0.262 284   9.572  <.0001
##    2.5583 0.262 284   9.772  <.0001
##    2.5956 0.262 284   9.915  <.0001
##    0.0373 0.262 284   0.143  0.9990
## 
## P value adjustment: tukey method for comparing a family of 4 estimates


ggplot() + 
  geom_point(data = class_data, 
             aes(x = soil, y = log_counts, shape = temp), 
             position = position_jitterdodge(dodge.width = 1,
                                             jitter.width = 0.3, 
                                             jitter.height = 0), 
             size = 3, 
             colour = "gray50") +
  geom_errorbar(data = class_data_summary, 
                aes(x = soil, ymin = mean - se,
                    ymax = mean + se, group = temp), 
                width = 0.5, 
                linewidth = 1, 
                position = position_dodge(width = 1)) + 
  geom_errorbar(data = class_data_summary,
                aes(x = soil, ymin = mean, ymax = mean, group = temp), 
                width = 0.4, 
                linewidth = 1, 
                position = position_dodge(width = 1) ) +
  scale_x_discrete(name = "soil", expand = c(0, 0)) +
  scale_y_continuous(name = "Log Counts", expand = c(0, 0),                                                                                                              limits = c(0, 14)) +
  scale_shape_manual(values = c(19, 1), 
                     name = "Temperature") +
  # hexadecane contaminated 25 degrees - 
  # hexadecane contaminated 4 degrees <.0001 
  annotate("segment", 
           x = 0.75, xend = 1.25,
           y = 11.7, yend = 11.7, 
           colour = "black") +
  annotate("text", x = 1, y = 12, 
           label = "p < 0.0001") +
  # uncontaminated 25 degrees - 
  # uncontaminated 4 degrees <.0001 
  annotate("segment", 
           x = 1.75, xend = 2.25, 
           y = 11.7, yend = 11.7, 
           colour = "black") +
  annotate("text", x = 2, y = 12, 
           label = "p < 0.0001") +
  # hexadecane contaminated 25 degrees -
  # uncontaminated 4 degrees <.0001 
  annotate("segment", 
           x = 0.75, xend = 2.25, 
           y = 12.7, yend = 12.7, 
           colour = "black") +
  annotate("text", x = 1.5, y = 13, 
           label = "p < 0.0001") +
  # uncontaminated 25 degrees - 
  # hexadecane contaminated 4 degrees <.0001 
  annotate("segment", 
           x = 1.25, xend = 1.75, 
           y = 10.7, yend = 10.7, 
           colour = "black") +
  annotate("text",
           x = 1.5, y = 11, 
           label = "p < 0.0001") +
  theme_classic() +
  theme(legend.position = "inside",
        legend.position.inside = c(0.92, 0.9),
        legend.background = element_rect(colour = "black"))