Workshop

Data Analysis 2: Biomedical sciences - Sample data analysis

Introduction

Session overview

In this workshop you will learn how to analyse some sample flow cytometry data which is formatted in the same way as the data you will produce.

Exercises

Set up

Make a RStudio Project

🎬 Start RStudio from the Start menu.

🎬 Make an RStudio project. Be deliberate about where you create it so that it is a good place for you.

🎬 Use the Files pane to make a folders for the data. I suggest data-raw/, data-processed/ and data-meta/

🎬 Make a new script called sample-data-analysis.R to carry out the rest of the work.

🎬 Record what you do and what you find out. All of it! Make notes on the inputs and outputs of each command. This will help you work out problems analysing your own data.

Load packages

We need three packages for this workshop:

  • tidyverse (Wickham et al. 2019): importing the meta data which is in a text file, working with the data once it is in a dataframe to filter, summarise and plot.

  • flowCore (Ellis et al. 2024): to import data from flow cytometry files (.fcs), apply the logicle transformation, and use functions for using a “flowSet” data structure.

  • flowAI (Monaco et al. 2016): which performs automated quality control by checking for changes in instrument speed and signal intensity that indicate a problem for the reading for that cells

Working on University computers or using the Virtual Desktop Service

These packages are installed already. You can go straight to loading them.

Working on your own machine

I recommend working on the University computers for this work. You can still work from home by using the Virtual Desktop Service. The VDS allows you to log on to a university computer from your own computer. It means you can access all software and filestores. When using the VDS for R a nd RStudio, it usually makes sense to use other software - such as a browser or file explorer - also through the VDS. Go to the Virtual Desktop Service for set up instructions.

However, If you are confident in your ability to set up your own machine, you will need to install these packages.

You can install tidyverse from CRAN in the normal way.

flowCore and flowAI are not on CRAN but come from Bioconductor.

  • first install BiocManager from CRAN in the normal way.

  • then install from Bioconductor using BiocManager::install("flowCore") and BiocManager::install("flowAI")

If you have difficulty installing these packages, use the Virtual Desktop Service which allows you to log into a university machine from your laptop.

🎬 Load tidyverse,flowCore and flowAI

library(flowCore)
library(flowAI)
library(tidyverse)
Order of package loading

Load tidyverse last because flowCore also has some functions with the same names as functions in tidyverse and you will want the tidyverse ones.

Import the data

There are six data files, one for each treatment and antibody combination:

🎬 Save copies of each file (right click and save as) to your data-raw/ folder.

These are six files associated with one experiment.

We are going to use the command read.flowSet() from the flowCore package to import the data. In R, the data from one file is stored in a “flowFrame” object. A flowFrame is a data structure in the same way that a dataframe or a vector is a data structure. A flowFrame has “slots”.

  • The first slot is called exprs and contains the data itself in matrix which looks just like a dataframe.
  • The second is called parameters and contains the names of the columns in exprs.
  • The third is called description and contains information about the .fcs file.

A “flowSet” is a list of related flowFrames. They are related in that they have the same column names, came from the same machine, and are part of the same experiment. Such data needs to be treated in the same way which is why it is useful to have a structure like a flowSet.

🎬 Make a variable to hold the file names:

myfiles <- list.files("data-raw", pattern = ".fcs$")

list.files() will return a character vector of the names of the files data-raw/. The pattern argument is a “regular expression” which is used to filter the names of the files. Here we are asking for the names of the files which end in .fcs. This would be very useful if you had other file types in that folder. In our case, there are only .fcs files in that folder.

🎬 Type myfiles to check it contains what you expect.

🎬 Use read.flowSet() to import the files from the folder data-raw/ that are listed in the variable myfiles and name the resulting flowSet fs:

fs <- read.flowSet(myfiles, 
                   path = "data-raw")

You now have a flowSet which is a list of 6 flowFrames, one for each .fcs file.

Explore the data structure

🎬 Type fs to see what it contains:

fs 
A flowSet with 6 experiments.

column names(16): TIME Time MSW ... FL 8 Log Event Count

We can index the list to access the first flowFrame in the flowSet.

🎬 Type fs[[1]] to see the information about the first flowFrame in the flowSet:

fs[[1]] 
flowFrame object 'ECOLIGreen_ISOTYPE.fcs'
with 50000 cells and 16 observables:
            name        desc     range  minRange  maxRange
$P1         TIME        TIME     65536         0     65536
$P2     Time MSW    Time MSW         7         0         7
$P3  Pulse Width Pulse Width       397         0       397
$P4       FS Lin      FS Lin     55001         0     55001
$P5      FS Area     FS Area     65536         0     65536
...          ...         ...       ...       ...       ...
$P12    FL 1 Log    FL 1 Log     64962         0     64962
$P13    FL 8 Lin    FL 8 Lin       758         0       758
$P14   FL 8 Area   FL 8 Area      2217         0      2217
$P15    FL 8 Log    FL 8 Log     28864         0     28864
$P16 Event Count Event Count     50002         0     50002
130 keywords are stored in the 'description' slot

This tells you there are 50000 cells (rows) and 16 columns in this dataset. The first two columns TIME and Time MSW give the time in binary format since the beginning of the data acquisition.

🎬 You can use the exprs() function to access the actual data and pipe it to View() to show it in a window:

exprs(fs[[1]]) |> View()

🎬 List the column names in each of the flowFrames:

 [1] "TIME"        "Time MSW"    "Pulse Width" "FS Lin"      "FS Area"    
 [6] "FS Log"      "SS Lin"      "SS Area"     "SS Log"      "FL 1 Lin"   
[11] "FL 1 Area"   "FL 1 Log"    "FL 8 Lin"    "FL 8 Area"   "FL 8 Log"   
[16] "Event Count"

The side scatter (SS) and forward scatter (FS) columns are used to select the live cells. The fluorescence measurements are in the FL columns and it is the linear values, FL 1 Lin and FL 8 Lin that have our FITC and APC signals respectively.

Tip

These exploratory steps are to help you understand the R objects you have. They are not essential parts of the analysis.

Improve the column names

Names like FL 1 Lin meaning “fluorescence channel 1 linear” are not very informative. A list of more useful column names are given in meta.csv. We can import this file and use the resulting dataframe to rename columns in our flowFrames. This will make it easier to understand our data and results.

🎬 Save a copy of the file (right click and save as) to your data-meta/ folder

🎬 Import the meta data and examine it:

meta <- read_csv("data-meta/meta.csv")
name
TIME
Time_MSW
Pulse_Width
FS_Lin
FS_Area
FS_Log
SS_Lin
SS_Area
SS_Log
E_coli_FITC_Lin
E_coli_FITC_Area
E_coli_FITC_Log
TNFa_APC_Lin
TNFa_APC_Area
TNFa_APC_Log
Event_Count

🎬 Assign the names in meta$name to the columns in the flowFrames:

colnames(fs) <- meta$name
# view the effect
colnames(fs) 
 [1] "TIME"             "Time_MSW"         "Pulse_Width"      "FS_Lin"          
 [5] "FS_Area"          "FS_Log"           "SS_Lin"           "SS_Area"         
 [9] "SS_Log"           "E_coli_FITC_Lin"  "E_coli_FITC_Area" "E_coli_FITC_Log" 
[13] "TNFa_APC_Lin"     "TNFa_APC_Area"    "TNFa_APC_Log"     "Event_Count"     
Tip

You don’t have to use the names I have used for your own analysis if you prefer different names. Simply edit the meta.csv file. However, you do need to keep the Time column named that way for the next step to proceed.

Quality control 1: Automated instrument issues

flowAI is a package that provides a set of tools for automated QC of flow cytometry data. It checks for:

  • changes in instrument speed using the Time column,
  • changes signal intensity
  • cells that are outside the dynamic range

It generates a new flowSet with the problematic events removed along with a folder containing filtered .fcs files and a quality control report in html format. It will take a few minutes to do all files.

🎬 Run the automated quality control:

Quality control for the file: ECOLIGreen_ISOTYPE
11.81% of anomalous cells detected in the flow rate check. 
0% of anomalous cells detected in signal acquisition check. 
0% of anomalous cells detected in the dynamic range check. 
Quality control for the file: ECOLIGreen_TNFAPC
34.82% of anomalous cells detected in the flow rate check. 
0% of anomalous cells detected in signal acquisition check. 
0% of anomalous cells detected in the dynamic range check. 
Quality control for the file: LPS_ISOTYPE
36.42% of anomalous cells detected in the flow rate check. 
0% of anomalous cells detected in signal acquisition check. 
0% of anomalous cells detected in the dynamic range check. 
Quality control for the file: LPS_TNFAPC
36.11% of anomalous cells detected in the flow rate check. 
0% of anomalous cells detected in signal acquisition check. 
0% of anomalous cells detected in the dynamic range check. 
Quality control for the file: MEDIA_ISOTYPE
26.04% of anomalous cells detected in the flow rate check. 
0% of anomalous cells detected in signal acquisition check. 
0% of anomalous cells detected in the dynamic range check. 
Quality control for the file: MEDIA_TNFAPC
37.82% of anomalous cells detected in the flow rate check. 
0% of anomalous cells detected in signal acquisition check. 
0% of anomalous cells detected in the dynamic range check. 
fs_clean <- flow_auto_qc(fs, 
                         folder_results = "sample-QC")

❓ What has been removed through the QC process?

❓ What is in the folder “sample-QC”?

Get a feel of your data: Plot distributions

Plotting the distributions of our variables is very useful for getting an overview of our data. We’ll just look at the first flowFrame to get over the principles. However, you may wish to examine all of the flow frames of your own data.

🎬 Plot the distribution of the linear TNFa_APC signal in the first flowFrame in the flowSet:

exprs(fs_clean[[1]]) |> 
  data.frame() |> 
  ggplot(aes(x = TNFa_APC_Lin)) +
  geom_histogram(bins = 100)

This is a very skewed distribution which makes visualisation hard. It is common to log skewed distributions to improve visualisation.

🎬 Plot the distribution of the logged TNFa_APC_Lin column:

exprs(fs_clean[[1]]) |> 
  data.frame() |> 
  ggplot(aes(x = log(TNFa_APC_Lin))) +
  geom_histogram(bins = 100)

Now our data are easier to see.

Transform the data

In fact, we will apply a different transformation called “logicle” (Parks, Roederer, and Moore 2006). This transformation is has a similar effect as logging but avoids some problems that can occur with flow cytometry data and especially flow cytometry data that have been “compensated” . Compensation is routinely applied to flow cytometry data to correct for fluorophores being detected in in multiple channels. This was not a problem in our experiment.

The transformation is applied in two steps: the transformation needed is estimated from the data and then that transformation is applied.

We need to apply the transformation only to the TNFa_APC_Lin and E_coli_FITC_Lin columns. The estimateLogicle() function refers to columns as channels.

🎬 Estimate the transformation from the data:

trans <- estimateLogicle(fs_clean[[1]],
                         channels = c("E_coli_FITC_Lin", "TNFa_APC_Lin"))

🎬 Apply the transformation to all theTNFa_APC_Lin and E_coli_FITC_Lin columns the flowSet:

# apply the transformation
fs_clean_trans <- transform(fs_clean, trans)

🎬 Examine the effect on the first flow frame:

exprs(fs_clean_trans[[1]]) |> 
  data.frame() |> 
  ggplot(aes(x = TNFa_APC_Lin)) +
  geom_histogram(bins = 100)

That looks better.

Make data easier to work with

We are going to get the data from the flowSet and put in a dataframe. FlowCore contains many functions for gating (filtering), plotting and summarising flowSets but putting the data into a dataframe will make it easier for you to use tools you already know and more easily follow the priciples of what you are doing.

We will be using familiar tools like group_by() and summarise(), filter() and ggplot()

🎬 Put the transformed data into a data frame:

# Put into a data frame for ease of use
clean_trans <- fsApply(fs_clean_trans, exprs) |> data.frame() 

fsApply(flowSet, function) applies a function to every flowFrame in a flowSet. Here we have accessed the data matirx in each with exprs().

The are 191663 rows (cells) in the dataset. You can view it by clicking on the dataframe in the Environment. At the moment, we cannot tell which row (cell) is from which sample but we can add the sample names as a column in that dataframe.

  • Each sample name needs to appear as many times as there are cells (events) in the corresponding flowFrame. For example, dim(fs_clean_trans[[1]])["events"] is 44095 cells.

  • sampleNames(fs_clean_trans) gives the names of the samples.

  • We can use the rep() function to repeat the sample names the correct number of times.

🎬 Add the sample name to each row:

clean_trans <- clean_trans |> 
  dplyr::mutate(sample = rep(sampleNames(fs_clean_trans),
                             times = c(dim(fs_clean_trans[[1]])["events"],
                                       dim(fs_clean_trans[[2]])["events"],
                                       dim(fs_clean_trans[[3]])["events"],
                                       dim(fs_clean_trans[[4]])["events"],
                                       dim(fs_clean_trans[[5]])["events"],
                                       dim(fs_clean_trans[[6]])["events"])))

The sample name contains the information about which antibody and treatment the sample was. Our analysis will be a little easier if we add this information to separate columns. We can extract the antibody and treatment from the sample name using the extract() function and two “regular expressions”.

🎬 Add columns for treatment and antibody by extracting that information from the sample name:

clean_trans <- clean_trans |> 
  extract(sample, 
          remove = FALSE,
          c("treatment", "antibody"),
          "([a-zA-Z]+)_([a-zA-Z]+).fcs")
  • the sample name is treatment_antibody
  • each pattern matching the treatment and antibody is enclosed in parentheses
  • we want to keep those patterns to go in the new columns
  • the underscore is matched but not enclosed in parentheses so it is not kept
  • .fcs is matched but not enclosed in parentheses so it is not kept
  • the remove = FALSE argument means that the original column is kept
  • [] enclose a set of characters
  • a-z means any lower case letter, A-Z means any upper case letter so [a-zA-Z] means any letter
  • + means one or more of the preceding character set
  • so[a-zA-Z]+ means one or more letters
Adding the treatment and antibody information

We have added the treatment and antibody information to each row by processing the names of the fcs files. This works because the file names have a specific format which we have matched with the regex. When you work with your own data, the easist thing to do is ensure your files have the same names as the same data.

🎬 View the dataframe to see the new columns.

Our treatments have an order. Media is the control, LPS should be next and ECOLIGreen should be last. We can use the fct_relevel() function to put groups in order so that our graphs are better to interpret.

clean_trans <- clean_trans |> 
  mutate(treatment = fct_relevel(treatment, c("MEDIA",
                                              "LPS",
                                              "ECOLIGreen")))

Save the data

We have cleaned and transform our data. It is a good idea to save it at this point so that we can start from here if we need to.

🎬 Save the data:

write_csv(clean_trans, "data-processed/ai_clean_logicle_trans.csv")

Quality control 2: ‘Gating’ to Removing debris

Some of the cells will have died and broken during the experiment. We need to perform quality control on our data to remove observations that have that have the size and granularity that is not typical of a live cell. This is done by filtering on the forward scatter (size) and side scatter (granularity) channels.

In flow cytometry, it is common to describe filtering as “gating” and the observations (cells) being kept as “in the gate”.

We create a scatter plot of the forward scatter (size) and side scatter (granularity) for each sample to see where we should put the gate.

We could use geom_point() but many points will overlap so it is difficult to see the density of points. geom_hex() puts the points in bins and colours the bin with the number of points.

🎬 Plot the forward scatter and side scatter for each sample

ggplot(clean_trans, aes(x = FS_Lin, y = SS_Lin)) +
  geom_hex(bins = 128) +
  scale_fill_viridis_c() +
  facet_grid(antibody ~ treatment) +
  theme_bw()

We can see a cloud of points in the middle of the plot. This is where the live cells are. Points towards the corners, especially in the bottom left corner are likely to be cell debris. We can draw a gate around the cloud of points to remove the debris. We should use the same filter/gate on all six data sets for consistency. The gates can be a rectangle, polygon, or ellipse. We will use a rectangle.

Manual rectangular gate

Use the zoom on the plot window to better see the cloud of points and the axis values. You need to use minimum and maximum \(x\) and \(y\) values to define the rectangle.

🎬 Define the minimum and maximum \(x\) and \(y\) values for the gate:

xmin <- 15000
xmax <- 35000
ymin <- 7500
ymax <- 28000
Choosing values for your own data

I chose theses values but you might judge the gate differently. You will certainly need to use different values for your own data. The next plot will help you refine them.

🎬 Put those values in a dataframe so we can plot the gate on the hexbin plot:

box <- data.frame(x = c(xmin, xmin, xmax, xmax),
                  y = c(ymin, ymax, ymax, ymin))

🎬 Plot the forward scatter and side scatter for each sample with the gate:

ggplot(clean_trans, aes(x = FS_Lin, y = SS_Lin)) +
  geom_hex(bins = 128) +
  scale_fill_viridis_c() +
  geom_polygon(data = box, aes(x = x, y = y), 
               fill = NA, 
               color = "red",
               linewidth = 1) +
  facet_grid(antibody ~ treatment) +
  theme_bw()

🎬 Adjust the values of xmin, xmax, ymin, and ymax to get a gate that you are happy with.

We have drawn the gate on the plot but we need to use it to filter out the debris (rows) from our data. We want only the cells with FS_Lin values that are between the xmin and xmax and SS_Lin that are between ymin and ymax values we chose.

🎬 Filter the data to remove the debris:

# filter out the debris
clean_trans_live <- clean_trans |> 
  filter(between(FS_Lin, xmin, xmax),
         between(SS_Lin, ymin, ymax)) 

clean_trans_live now has 179794 cells. We should report how many cells were in the gate for each sample.

🎬 Find the number of cells in each sample after flowAI cleaning:

# 
clean_trans_n <-  clean_trans |> 
  group_by(antibody, treatment) |> 
  summarise(n = n()) 

Note than we use the clean and transformed data for summarising the number in each set after cleaning.

antibody treatment n
ISOTYPE MEDIA 20154
ISOTYPE LPS 31789
ISOTYPE ECOLIGreen 44095
TNFAPC MEDIA 31089
TNFAPC LPS 31945
TNFAPC ECOLIGreen 32591

🎬 Find the number of cells in each sample after flowAI cleaning and removing debris

# number of cells in each sample
clean_trans_live_n <-  clean_trans_live |> 
  group_by(antibody, treatment) |> 
  summarise(n_live = n()) 

Now we use the clean, transformed and gated data.

antibody treatment n_live
ISOTYPE MEDIA 17856
ISOTYPE LPS 30675
ISOTYPE ECOLIGreen 41927
TNFAPC MEDIA 28085
TNFAPC LPS 30536
TNFAPC ECOLIGreen 30715

🎬 Join two data frames together using the combination of the treatment and the antibody and calculate what % cells remained in each sample after gating (i.e., the % live cells):

clean_trans_live_n <- clean_trans_live_n |> 
  left_join(clean_trans_n, by = c("antibody", "treatment")) |> 
  mutate(perc_live = round(n_live/n * 100, 1) )
antibody treatment n_live n perc_live
ISOTYPE MEDIA 17856 20154 88.6
ISOTYPE LPS 30675 31789 96.5
ISOTYPE ECOLIGreen 41927 44095 95.1
TNFAPC MEDIA 28085 31089 90.3
TNFAPC LPS 30536 31945 95.6
TNFAPC ECOLIGreen 30715 32591 94.2

🎬 Plot the forward scatter and side scatter for each sample with the gate and the number of cells in each sample.

ggplot(clean_trans, aes(x = FS_Lin, y = SS_Lin)) +
  geom_hex(bins = 128) +
  scale_fill_viridis_c() +
  geom_polygon(data = box, aes(x = x, y = y), 
               fill = NA, 
               color = "red",
               linewidth = 1) +
  geom_text(data = clean_trans_live_n, 
            aes(label = paste0(perc_live, "%")), 
            x = 45000, 
            y = 1000,
            colour = "red",
            size = 5) +
  facet_grid(antibody ~ treatment) +
  theme_bw() 

Reporting

This figure allows you to report what data you included in your analysis of the TNFa_APC_Lin and E_coli_FITC_Lin signals.

🎬 Plot the TNFa_APC_Lin and E_coli_FITC_Lin signals for these live cells:

ggplot(clean_trans_live, aes(x = E_coli_FITC_Lin, 
                                  y = TNFa_APC_Lin)) +
  geom_hex(bins = 128) +
  scale_fill_viridis_c() +
  facet_grid(antibody ~ treatment) +
  theme_bw()

Quality control 3: Gating to determine a ‘real’ signal?

From now on we are working with the cleaned and gated data, that is, the dataframe clean_trans_live.

We have two signals:

  • TNFa_APC_Lin is should indicate the amount of TNF-α protein in the cell
  • E_coli_FITC_Lin should indicate the amount of E. coli in the cell

The TNFa_APC_Lin signal

When the antibody is ISOTYPE there is no TNF-α so that level of TNFa_APC_Lin signal means no TNF-α. In other words, that signal is a control. If you look at the top row of scatter plots, that level seems to be about 3.8. In the rest of the data set we should assume any signal below 3.8 means there is no TNF-α.

We can use this to set a threshold for the TNFa_APC_Lin signal and then label cells as either positive or negative for TNF-α.

🎬 Define the threshold for the TNFa_APC_Lin signal:

apc_cut <- 3.8

🎬 Plot the TNFa_APC_Lin and E_coli_FITC_Lin with the threshold for the TNFa_APC_Lin signal:

ggplot(clean_trans_live, aes(x = E_coli_FITC_Lin, 
                                  y = TNFa_APC_Lin)) +
  geom_hex(bins = 128) +
  geom_hline(yintercept = apc_cut, 
             color = "red") +
  scale_fill_viridis_c() +
  facet_grid(antibody ~ treatment) +
  theme_bw()

That looks about right. We will use this value to label each cell as positive or negative for TNF-α.

🎬 Add a label, tnfa, to the data to indicate if the cell is positive or negative for TNF-α

clean_trans_live <- clean_trans_live |> 
  mutate(tnfa = case_when(TNFa_APC_Lin < apc_cut ~ "TNF-α -'ve",
                               TNFa_APC_Lin >= apc_cut ~ "TNF-α +'ve"))

🎬 Summarise each group by finding the number of TNF-α positive cells and the mean TNFa_APC_Lin signal for each group:

clean_trans_live_tfna_pos <- clean_trans_live |> 
  filter(tnfa == "TNF-α +'ve") |>
  group_by(antibody, treatment) |>
  summarise(n_pos_tnfa = n(),
            mean_apc = mean(TNFa_APC_Lin))
antibody treatment n_pos_tnfa mean_apc
ISOTYPE MEDIA 38 3.950328
ISOTYPE LPS 41 3.968852
ISOTYPE ECOLIGreen 278 3.867921
TNFAPC MEDIA 20066 4.121845
TNFAPC LPS 30481 5.075325
TNFAPC ECOLIGreen 30686 5.201248

🎬 We can add the number of live cells to this data frame and calculate the percentage of live cells that are positive for TNF-α:

clean_trans_live_tfna_pos <- 
  clean_trans_live_tfna_pos |> 
  left_join(clean_trans_live_n, by = c("antibody", "treatment")) |> 
  mutate(perc_pos_tnfa = round(n_pos_tnfa/n_live * 100, 1) )

🎬 And put the columns into a more logical order:

clean_trans_live_tfna_pos <- 
  clean_trans_live_tfna_pos |> 
  select(antibody,
         treatment,
         n,
         n_live,
         perc_live,
         mean_apc,
         n_pos_tnfa,
         perc_pos_tnfa)
antibody treatment n n_live perc_live mean_apc n_pos_tnfa perc_pos_tnfa
ISOTYPE MEDIA 20154 17856 88.6 3.950328 38 0.2
ISOTYPE LPS 31789 30675 96.5 3.968852 41 0.1
ISOTYPE ECOLIGreen 44095 41927 95.1 3.867921 278 0.7
TNFAPC MEDIA 31089 28085 90.3 4.121845 20066 71.4
TNFAPC LPS 31945 30536 95.6 5.075325 30481 99.8
TNFAPC ECOLIGreen 32591 30715 94.2 5.201248 30686 99.9

The fact that a very low percentage of cells are positive for TNF-α in the isotype control is a good indication that cut off we used for the TNFa_APC_Lin signal is a good one.

TNFAPC-MEDIA combination tells us how much TNF-α we expect in unstimulated cells.

This is important data for your write up and will be your contribution to the calss data set so you should write it to file.

🎬 Write clean_trans_live_tfna_pos to file:

write_csv(clean_trans_live_tfna_pos, 
          "data-processed/clean_trans_live_tfna_pos.csv")

The E_coli_FITC_Lin signal

We can apply the same logic to the E_coli_FITC_Lin signal.

For either antibody, when the treatment is media or LPS there is no E.coli so that level of E_coli_FITC_Lin signal means zero. That level looks to be about 2.

🎬 Define the threshold for the E_coli_FITC_Lin signal:

fitc_cut <- 2

🎬 Add a label, fitc, to the data to indicate if the cell is positive or negative for FITC

clean_trans_live <- clean_trans_live |> 
  mutate(fitc = case_when(E_coli_FITC_Lin < fitc_cut ~ "FITC -'ve",
                               E_coli_FITC_Lin >= fitc_cut ~ "FITC +'ve"))
Tip for your own summaries

You may wish to repeat the process we used for the TNFa_APC_Lin signal to find out the number of cells that are FITC positive in each sample and the mean intensity of the FITC signal in the FITC positive cells.

A figure to report baselines for both signals

🎬 Plot the TNFa_APC_Lin and E_coli_FITC_Lin with the thresholds for both signals:

ggplot(clean_trans_live, aes(x = E_coli_FITC_Lin, 
                                  y = TNFa_APC_Lin)) +
  geom_hex(bins = 128) +
  geom_hline(yintercept = apc_cut, 
             color = "red") +
    geom_vline(xintercept = fitc_cut, 
             color = "red") +
  scale_fill_viridis_c() +
  facet_grid(antibody ~ treatment) +
  theme_bw()

Save the data

We have filtered out the dead cells and debris and added labels to the data to indicate if the cells are positive or negative for TNF-α and E. coli. It would be a good idea to save this data to file so we can start from importing this data in the future rather than having to repeat all of the steps we have done so far.

🎬 Save the data:

write_csv(clean_trans_live, "data-processed/live_labelled.csv")

Look after future you!

🎬 This workshop is a template for the analysis of your own data. Go back through your script and check you understand what you have done. Can you identify where you will need to edit? Do you understand what you will need to assess in order to make edits needed for your own data? Examine all of the objects in the environment to check you understand what they are and why they are there.

You’re finished!

🥳 Well Done! 🎉

Independent study following the workshop

Consolidate

The Code file

This contains all the code needed in the workshop even where it is not visible on the webpage.

The workshop.qmd file is the file I use to compile the practical. Qmd stands for Quarto markdown. It allows code and ordinary text to be interweaved to produce well-formatted reports including webpages. View the Qmd in Browser. Coding and thinking answers are marked with #---CODING ANSWER--- and #---THINKING ANSWER---

Pages made with R (R Core Team 2024), Quarto (Allaire et al. 2022), knitr (Xie 2024, 2015, 2014), kableExtra (Zhu 2024)

References

Allaire, J. J., Charles Teague, Carlos Scheidegger, Yihui Xie, and Christophe Dervieux. 2022. Quarto. https://doi.org/10.5281/zenodo.5960048.
Ellis, B, Perry Haaland, Florian Hahne, Nolwenn Le Meur, Nishant Gopalakrishnan, Josef Spidlen, Mike Jiang, and Greg Finak. 2024. “flowCore: flowCore: Basic Structures for Flow Cytometry Data.”
Monaco, Gianni, Hao Chen, Michael Poidinger, Jinmiao Chen, Joao Pedro de Magalhaes, and Anis Larbi. 2016. “flowAI: Automatic and Interactive Anomaly Discerning Tools for Flow Cytometry Data” 32. 10.1093/bioinformatics/btw191.
Parks, David R., Mario Roederer, and Wayne A. Moore. 2006. “A New Logicle Display Method Avoids Deceptive Effects of Logarithmic Scaling for Low Signals and Compensated Data.” Cytometry Part A 69A (6): 541–51. https://doi.org/10.1002/cyto.a.20258.
R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse 4: 1686. https://doi.org/10.21105/joss.01686.
Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC.
———. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. https://yihui.org/knitr/.
———. 2024. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.
Zhu, Hao. 2024. kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.