Workshop

Data Analysis 4: Immunobiology - Your data presentation

In this workshop learn how to create density plots to visualise the distribution and gating of the signals and calculate the the percentage of cells in each quadrant of a quadrant gated plot of TNFa_APC_Lin signal against the E_coli_FITC_Lin signal. You will also grow your knowledge of ggplot annotation and data import from googlesheet.

Exercises

Add your data

Enter these in the BIO00066I Biomedical Sciences class data

The columns you must add are:

  • apc_mfi: Mean fluorescence intensity of the logicle transformed TNFa_APC_Lin in the TNF-α positive cells

  • perc_tfna_pos: % non debris cells that are TNF-α positive cells

The other columns are calculations you make along the way and may help you get to the apc_mfi and perc_tfna_pos values.

The column names are the same as those used in the Data Analysis 2: Immunobiology - Sample data analysis workshop.

Set up

🎬 Open the RStudio project you created in the Data Analysis 2: Immunobiology - Sample data analysis workshop.

🎬 Create a new script called data-presentation.R

🎬 Load packages:

🎬 Save a copy of live_labelled.csv to your data-processed folder. These cells have been AI cleaned, gated to remove debris and dead cells, and labelled as positive or negative for the E_coli_FITC_Lin and TNFa_APC_Lin signals1.

🎬 Import the data:

clean_trans_nondebris <- read_csv("data-processed/live_labelled.csv")

🎬 I’m also going to select only columns I need to keep my life simple:

clean_trans_nondebris <- clean_trans_nondebris |> 
  select(FS_Lin,
         SS_Lin,
         E_coli_FITC_Lin,
         TNFa_APC_Lin,
         antibody,
         treatment,
         tnfa,
         fitc)

These are the forward and side scatter, the two logicle-transformed signals the groups (treatment and antibody) and whether the cell is positive or negative for each of the signals.

🎬 Click on the data frame in the environment window to view it and make sure you have an understanding of the data.

FS_Lin SS_Lin E_coli_FITC_Lin TNFa_APC_Lin antibody treatment tnfa fitc
27304 21432 2.573944 1.797574 ISOTYPE ECOLIGreen TNF-α -'ve FITC +'ve
16437 18261 3.303969 1.553092 ISOTYPE ECOLIGreen TNF-α -'ve FITC +'ve
23356 20257 3.024955 1.432302 ISOTYPE ECOLIGreen TNF-α -'ve FITC +'ve
20567 19949 1.665994 1.602869 ISOTYPE ECOLIGreen TNF-α -'ve FITC -'ve
31556 24441 3.363674 1.674940 ISOTYPE ECOLIGreen TNF-α -'ve FITC +'ve
25858 18398 1.592289 1.469350 ISOTYPE ECOLIGreen TNF-α -'ve FITC -'ve
25512 11655 1.592289 1.535148 ISOTYPE ECOLIGreen TNF-α -'ve FITC -'ve
26949 18936 3.629136 1.762831 ISOTYPE ECOLIGreen TNF-α -'ve FITC +'ve
22715 24400 3.345617 1.810729 ISOTYPE ECOLIGreen TNF-α -'ve FITC +'ve
29398 21824 3.687067 1.688024 ISOTYPE ECOLIGreen TNF-α -'ve FITC +'ve

🎬 Use fct_relevel() to put treatment groups in order so that our graphs are better to interpret.

clean_trans_nondebris <- clean_trans_nondebris |> 
  mutate(treatment = fct_relevel(treatment, c("MEDIA",
                                              "LPS",
                                              "ECOLIGreen")))

We need to calculate some summary information such as the percent of TNF-α positive cells in each sample and the mean fluorescence intensity of the TNFa_APC_Lin signal in the TNF-α positive cells. If you especially hate R but are familiar with pivot tables in Excel, then you could use that….. but it would not be as reproducible as using R and tidyverse functions.

🎬 Calculate the number of cells in each sample:

# number of cells in each sample after gating
clean_trans_nondebris_n <-  clean_trans_nondebris |> 
  group_by(antibody, treatment) |> 
  summarise(n_nondebris = n()) 

🎬 Click on the dataframe in the environment window to view it and make sure you have an understanding of the summary.

antibody treatment n_nondebris
ISOTYPE MEDIA 13106
ISOTYPE LPS 45138
ISOTYPE ECOLIGreen 26866
TNFAPC MEDIA 24419
TNFAPC LPS 25479
TNFAPC ECOLIGreen 26002

🎬 Calculate the number of TNF-α positive cells in each sample and the mean fluorescence intensity of the TNFa_APC_Lin signal in the TNF-α positive cells:

## summarise the number of TNF-α +'ve cells in each sample  
clean_trans_nondebris_tfna_pos <- clean_trans_nondebris |> 
  filter(tnfa == "TNF-α +'ve") |>
  group_by(antibody, treatment) |>
  summarise(n_pos_tnfa = n(),
            mean_apc = round(mean(TNFa_APC_Lin), 2))

🎬 Click on the dataframe in the environment window to view it and make sure you have an understanding of the summary.

antibody treatment n_pos_tnfa mean_apc
ISOTYPE MEDIA 11 2.14
ISOTYPE LPS 24 2.15
ISOTYPE ECOLIGreen 23 2.14
TNFAPC MEDIA 13503 2.26
TNFAPC LPS 25414 3.14
TNFAPC ECOLIGreen 25965 3.27

Note: If your dataframe has 0 rows first check you have written “TNF-α +’ve” correctly in the filter function (other no rows will match). If you have written it correctly, and you are working on your own machine, you may have different character encoding. The simplest solution: is to repeat the step in workshop 2

  • is to repeat the step in workshop 2 “Add a label, tnfa, to the data to indicate if the cell is positive or negative for TNF-α” by using TNF-a rather than TNF-α in the labelling

  • make sure you resave the data (“data-processed/live_labelled.csv”) after that step

  • make sure you edit the code in this workshop to reflect the change in the label

In order to calculate the percentage of cells that are TNF-α positive, we need to join the two summaries together and add a column using mutate().

🎬 Join the summary of the total number of cells in each sample with the summary of the number of TNF-α positive cells in each sample and calculate the percentage of cells that are TNF-α positive:

## join the summary with the summary of the number of cells in each sample
## and calculate the percentage of cells that are TNF-α +'ve
clean_trans_nondebris_tfna_pos <- 
  clean_trans_nondebris_tfna_pos |> 
  left_join(clean_trans_nondebris_n, by = c("antibody", "treatment")) |> 
  mutate(perc_tfna_pos = round(n_pos_tnfa/n_nondebris * 100, 1) )

🎬 Click on the dataframe in the environment window to view it and make sure you have an understanding of the summary.

antibody treatment n_pos_tnfa mean_apc n_nondebris perc_tfna_pos
ISOTYPE MEDIA 11 2.14 13106 0.1
ISOTYPE LPS 24 2.15 45138 0.1
ISOTYPE ECOLIGreen 23 2.14 26866 0.1
TNFAPC MEDIA 13503 2.26 24419 55.3
TNFAPC LPS 25414 3.14 25479 99.7
TNFAPC ECOLIGreen 25965 3.27 26002 99.9
Tip for your own summaries

You can use the same process calculate the percentage of cells that are FITC positive in each sample and the mean fluorescence intensity of the FITC signal in the FITC positive cells.

Distribution of APC TNF-α signal with gate

To annotate the distribution of APC TNF-α signal with the gate used to define whether the cells are positive or negative for TNF-α it is useful to assign that value to a variable that we can use in our plots. We will do the FITC gate value at the same time2.

🎬 Assign the gate values to variables:

apc_cut <- 2
fitc_cut <- 2

Plot one sample

The pipe (|>) allows us to filter the data before plotting it which allows select the sample we want to plot.

🎬 Plot the distribution of the APC TNF-α signal for the MEDIA treatment and the ISOTYPE antibody:

clean_trans_nondebris |> 
    filter(treatment == "MEDIA",
           antibody == "ISOTYPE") |>
  ggplot(aes(x = TNFa_APC_Lin)) +
  geom_density(fill = "gray80") +
  geom_vline(xintercept = apc_cut, 
             color = "red") +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 2.5),
                     name = "Density") +
  scale_x_continuous(name = "Logicle transformed APC TNF-α signal") +
  theme_bw()

  • geom_vline() adds a vertical line to the plot at the value of apc_cut which is the gate value we determined for the APC TNF-α signal.

  • geom_density() is a smoothed version of a histogram and shows the distribution of the data. The fill argument sets the colour of the plot to a light grey.

  • the expand argument in a scale_x_.... or scale_y_.... sets the axis line at zero rather than being below it.

Plot annotation

You have (at least) three options for adding the summary statistics to the plot.

  1. Most simple: adding in word/googledocs (or whatever you write your report in).

    Save the plot using ggsave(), insert as an image into your report and a text box.

  2. In R by hard coding the values in the geom_text() function.

clean_trans_nondebris |> 
  filter(treatment == "MEDIA",
         antibody == "ISOTYPE") |>
  ggplot(aes(x = TNFa_APC_Lin)) +
  geom_density(fill = "gray80") +
  geom_vline(xintercept = apc_cut, 
             color = "red") +
  geom_text(label = "0.1 % cells\nTNF-α +'ve\nMFI = 2.14",
            x = 1.6,
            y = 2.2,
            colour = "red") +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 2.5),
                     name = "Density") +
  scale_x_continuous(name = "Logicle transformed APC TNF-α signal") +
  theme_bw()

  1. In R and fully reproducibly by using the clean_trans_nondebris_tfna_pos dataframe with the summary statistics in geom_text(). That dataframe also needs filtering to the sample you are plotting.
clean_trans_nondebris |> 
  filter(treatment == "MEDIA",
         antibody == "ISOTYPE") |>
  ggplot(aes(x = TNFa_APC_Lin)) +
  geom_density(fill = "gray80") +
  geom_vline(xintercept = apc_cut, 
             color = "red") +
   geom_text(data = clean_trans_nondebris_tfna_pos |> 
              filter(treatment == "MEDIA",
                     antibody == "ISOTYPE"), 
             aes(label = paste0(perc_tfna_pos, 
                                "% cells\nTNF-α +'ve\nMFI = ",
                                mean_apc)), 
             x = 1.7, 
             y = 2.1,
             colour = "red") +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 2.5),
                     name = "Density") +
  scale_x_continuous(name = "Logicle transformed APC TNF-α signal") +
  theme_bw()

This has several advantages:

  • if the data changes, the plot annotation will update automatically just as the distribution will.

  • extending to multiple facets requires little extra work.

Write to file

🎬 Assign the plot to apc_distibution_media_isotype

apc_distibution_media_isotype <- clean_trans_nondebris |> 
  filter(treatment == "MEDIA",
         antibody == "ISOTYPE") |>
  ggplot(aes(x = TNFa_APC_Lin)) +
  geom_density(fill = "gray80") +
  geom_vline(xintercept = apc_cut, 
             color = "red") +
   geom_text(data = clean_trans_nondebris_tfna_pos |> 
              filter(treatment == "MEDIA",
                     antibody == "ISOTYPE"), 
             aes(label = paste0(perc_tfna_pos, 
                                "% cells\nTNF-α +'ve\nMFI = ",
                                mean_apc)), 
             x = 1.7, 
             y = 2.1,
             colour = "red") +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 2.5),
                     name = "Density") +
  scale_x_continuous(name = "Logicle transformed APC TNF-α signal") +
  theme_bw()

🎬 Save the plot to a file:

ggsave("figures/apc_distibution_media_isotype.png",
       device = "png",
       plot = apc_distibution_media_isotype,
       width = 4,
       height = 2.5,
       units = "in",
       dpi = 300)

Multiple facets

To plot all the samples in one go we can use facet_grid(). treatment ~ antibody puts the treatments in rows and the antibodies in columns. Of course, we now don’t need to filter the data to a single sample.

🎬 Plot the distribution of the APC TNF-α signal for all samples:

clean_trans_nondebris |> 
  ggplot(aes(x = TNFa_APC_Lin)) +
  geom_density(fill = "gray80") +
  geom_vline(xintercept = apc_cut, 
             color = "red") +
   geom_text(data = clean_trans_nondebris_tfna_pos, 
             aes(label = paste0(perc_tfna_pos, 
                                "% cells\nTNF-α +'ve\nMFI = ",
                                mean_apc)), 
             x = 4, 
             y = 1.8,
             colour = "red") +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 2.5),
                     name = "Density") +
  scale_x_continuous(name = "Logicle transformed APC TNF-α signal") +
  facet_grid(treatment ~ antibody) +
  theme_bw()

Note that adding plot annotations manually (method 1) and using R fully reproducibly (method 3) are possible on faceted plots but method 2 is not because you need the annotations to change with values in treatment and antibody.

Tip for your own plots

You can combine filtering data and faceted plots to do facet plots of a subset of the samples.

Overlay instead of facets

Using facets is one way to show multiple samples in one plot. Another way is to overlay the plots by mapping the fill aesthetic to the antibody variable. Making the fill semi-transparent with alpha = 0.3 allows you to see the overlap of the distributions.

🎬 Overlay the distribution of the APC TNF-α signals for the media treated samples:

clean_trans_nondebris |> 
   filter(treatment == "MEDIA") |>
   ggplot(aes(x = TNFa_APC_Lin, fill = antibody)) +
   geom_density(alpha = 0.3) +
   geom_vline(xintercept = apc_cut, 
              color = "red") +
   scale_y_continuous(expand = c(0, 0),
                      limits = c(0, 2.5)) +
   scale_x_continuous(name = "Logicle transformed APC TNF-α signal") +
   theme_bw() 

To change the colours we need to used a scale_fill_... function. These functions can also be used to change the name (name = ...) of the legend and the names (labels = c(...)) of each group. scale_fill_manual() allows to specify the colours (values = c(...)) manually.

Instead of picking colours, I like to use the viridis scales. The viridis scales provide colour maps that are perceptually uniform in both colour and black-and-white. They are also designed to be perceived by viewers with common forms of colour blindness. See Introduction to viridis for more information.

Here I use scale_fill_viridis_d(). The d stands for discrete. The function scale_fill_viridis_c() would be used for continuous data. I’ve used the default “viridis” (or “D”) option (do ?scale_fill_viridis_d for all the options) and got rid of the “antibody” name. I also moved the legend.

clean_trans_nondebris |> 
   filter(treatment == "MEDIA") |>
   ggplot(aes(x = TNFa_APC_Lin, fill = antibody)) +
   geom_density(alpha = 0.3) +
   geom_vline(xintercept = apc_cut, 
              color = "red") +
   scale_fill_viridis_d(name = NULL) +
   scale_y_continuous(expand = c(0, 0),
                      limits = c(0, 2.5)) +
   scale_x_continuous(name = "Logicle transformed APC TNF-α signal") +
   theme_bw() +
   theme(legend.position = c(0.85, 0.85))

Percentage of cells in each quadrant

We want to calculate the the percentage of cells in each quadrant of a quadrant gated plot of TNFa_APC_Lin signal against the E_coli_FITC_Lin signal.

🎬 Calculate the number of cells in each quadrant for each sample:

## calculate the number of cells in each quadrant
all_combin_n <- clean_trans_nondebris |> 
  group_by(antibody, treatment, tnfa, fitc) |>
  summarise(n = n()) 

🎬 Click on the data frame in the environment window to view it and make sure you have an understanding of the data.

antibody treatment tnfa fitc n
ISOTYPE MEDIA TNF-α +'ve FITC -'ve 11
ISOTYPE MEDIA TNF-α -'ve FITC +'ve 4
ISOTYPE MEDIA TNF-α -'ve FITC -'ve 13091
ISOTYPE LPS TNF-α +'ve FITC +'ve 1
ISOTYPE LPS TNF-α +'ve FITC -'ve 23
ISOTYPE LPS TNF-α -'ve FITC +'ve 5
ISOTYPE LPS TNF-α -'ve FITC -'ve 45109
ISOTYPE ECOLIGreen TNF-α +'ve FITC +'ve 18
ISOTYPE ECOLIGreen TNF-α +'ve FITC -'ve 5
ISOTYPE ECOLIGreen TNF-α -'ve FITC +'ve 15457
ISOTYPE ECOLIGreen TNF-α -'ve FITC -'ve 11386
TNFAPC MEDIA TNF-α +'ve FITC +'ve 20
TNFAPC MEDIA TNF-α +'ve FITC -'ve 13483
TNFAPC MEDIA TNF-α -'ve FITC +'ve 2
TNFAPC MEDIA TNF-α -'ve FITC -'ve 10914
TNFAPC LPS TNF-α +'ve FITC +'ve 13
TNFAPC LPS TNF-α +'ve FITC -'ve 25401
TNFAPC LPS TNF-α -'ve FITC +'ve 1
TNFAPC LPS TNF-α -'ve FITC -'ve 64
TNFAPC ECOLIGreen TNF-α +'ve FITC +'ve 14969
TNFAPC ECOLIGreen TNF-α +'ve FITC -'ve 10996
TNFAPC ECOLIGreen TNF-α -'ve FITC +'ve 29
TNFAPC ECOLIGreen TNF-α -'ve FITC -'ve 8

As there are four quadrants and six samples, you would expect 24 rows in the data frame. However, there are only 23 rows. This is because there are no cells in the quadrant where both cells are positive for both TNF-α and FITC for the ISOTYOPE antibody and MEDIA treated samples. Since we are probably happy not to annotate a figure with 0 %, this is fine.

To calculate the percentage of cells in each quadrant, we need to join this dataframe with the number of non-debris cells for each sample, i.e., clean_trans_nondebris_n

🎬 Calculate the percentage of cells in each quadrant for each sample:

all_combin_perc <- all_combin_n |> 
  left_join(clean_trans_nondebris_n, 
            by = c("antibody", "treatment")) |> 
  mutate(perc = round(n / n_nondebris * 100, 1)) |> 
  filter(perc > 0)

I have additionally filtered out rows where the % cells rounds to 0. This is again because we are probably happy not to annotate a figure with 0 %.

🎬 Click on the data frame in the environment window to view it and make sure you have an understanding of the data.

antibody treatment tnfa fitc n n_nondebris perc
ISOTYPE MEDIA TNF-α +'ve FITC -'ve 11 13106 0.1
ISOTYPE MEDIA TNF-α -'ve FITC -'ve 13091 13106 99.9
ISOTYPE LPS TNF-α +'ve FITC -'ve 23 45138 0.1
ISOTYPE LPS TNF-α -'ve FITC -'ve 45109 45138 99.9
ISOTYPE ECOLIGreen TNF-α +'ve FITC +'ve 18 26866 0.1
ISOTYPE ECOLIGreen TNF-α -'ve FITC +'ve 15457 26866 57.5
ISOTYPE ECOLIGreen TNF-α -'ve FITC -'ve 11386 26866 42.4
TNFAPC MEDIA TNF-α +'ve FITC +'ve 20 24419 0.1
TNFAPC MEDIA TNF-α +'ve FITC -'ve 13483 24419 55.2
TNFAPC MEDIA TNF-α -'ve FITC -'ve 10914 24419 44.7
TNFAPC LPS TNF-α +'ve FITC +'ve 13 25479 0.1
TNFAPC LPS TNF-α +'ve FITC -'ve 25401 25479 99.7
TNFAPC LPS TNF-α -'ve FITC -'ve 64 25479 0.3
TNFAPC ECOLIGreen TNF-α +'ve FITC +'ve 14969 26002 57.6
TNFAPC ECOLIGreen TNF-α +'ve FITC -'ve 10996 26002 42.3
TNFAPC ECOLIGreen TNF-α -'ve FITC +'ve 29 26002 0.1
Tip for your own plots

You can combine the concept of annotation and faceted plots to annotate a quadrant gated plot of TNFa_APC_Lin signal against the E_coli_FITC_Lin signal you made in the Data Analysis 2: Immunobiology - Sample data analysis workshop.

Importing from google sheets

The BIO00066I Biomedical Sciences class data are in a google sheet. You can download the file as an excel file or .csv. file but it is also possible to import the data directly from the google sheet into R. Use whatever you prefer. An advantage of using the google sheet is you won’t have to remember to download the data when someone updates it.

You can use the googlesheets4 package (Bryan 2023) to do this.

file <- "https://docs.google.com/spreadsheets/d/104EXdgsiIq-FuRF9Ly9zewEVdpkVWbyOwxSAmiqJepg/edit#gid=0"
class_data <- read_sheet(file)

You will be asked:

Is it OK to cache OAuth access credentials in the folder
C:/Users/emmar/AppData/Local/gargle/gargle/Cache between R sessions?
1: Yes
2: No
Selection: 

Type the number for yes.

You will be prompted to authenticate with Google in a browser window.

I suggest clicking the box for:

See, edit, create, and delete all your Google Sheets spreadsheets. Learn more

but you do not have to. Click allow/continue.

You will see a message in the browser window:

Authentication complete. Please close this page and return to R.

You should see that the class data has been imported into R.

Note that you will probably want to do some quality control such has filtering out rows with missing data in some columns. We did this in the Data Analysis 1: Core workshop.

Analysis of the class data

You should be able to apply techniques you have learned in stage 1 to the class data. You can find those materials on the VLE in the 17C/8C module sites. However, you may find it useful to use the latest version of those material which are available outside the VLE at without 2FA.

Stage 1

Independent study following the workshop

Consolidate

The Code file

This contains all the code needed in the workshop even where it is not visible on the webpage.

The workshop.qmd file is the file I use to compile the practical. Qmd stands for Quarto markdown. It allows code and ordinary text to be interweaved to produce well-formatted reports including webpages. View the Qmd in Browser. Coding and thinking answers are marked with #---CODING ANSWER--- and #---THINKING ANSWER---

Pages made with R (R Core Team 2023), Quarto (Allaire et al. 2022), knitr (Xie 2022), kableExtra (Zhu 2021)

References

Allaire, J. J., Charles Teague, Carlos Scheidegger, Yihui Xie, and Christophe Dervieux. 2022. Quarto. https://doi.org/10.5281/zenodo.5960048.
Bryan, Jennifer. 2023. “Googlesheets4: Access Google Sheets Using the Sheets API V4.” https://CRAN.R-project.org/package=googlesheets4.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Rand, Emma. 2023a. Data Analysis in r for Becoming a Bioscientist. https://3mmarand.github.io/R4BABS/.
———. 2023b. Computational Analysis for Bioscientists (version 0.1). https://3mmarand.github.io/comp4biosci/.
Xie, Yihui. 2022. “Knitr: A General-Purpose Package for Dynamic Report Generation in r.” https://yihui.org/knitr/.
Zhu, Hao. 2021. “kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax.” https://CRAN.R-project.org/package=kableExtra.

Footnotes

  1. Data Analysis 2: Immunobiology - Sample data analysis workshop has been amended to include the instruction to save these data at the end.↩︎

  2. You might want to look back at Quality control 3: Gating to determine a ‘real’ signal in the week 2 workshop to remind yourself of how we determined the gate values for the sample data.↩︎