Workshop

Transcripttranscriptomics 2: Statistical Analysis

Author

Emma Rand

Published

17 December, 2024

Introduction

Session overview

In the workshop, you will learn how to perform differential expression analysis on raw counts using DESeq2 (Love, Huber, and Anders 2014) or on logged normalised expression values using scran (Lun, McCarthy, and Marioni 2016) or both.

Set up

🐸 Frog development

🎬 Open the frogs-88H RStudio Project and the cont-fgf-s30.R script.

πŸŽ„ Arabidopisis

🎬 Open the arab-88H RStudio Project and the suff-def-wild.R script.

πŸ’‰ Leishmania

🎬 Open the leish-88H RStudio Project and the pro-meta.R script.

🐭 Stem cells

🎬 Open the mice-88H RStudio Project and the hspc-prog.R script.

Everyone

🎬 Make a new folder results in the project directory.

This is where we will save our results.

🎬 Load tidyverse (Wickham et al. 2019) You most likely have this code at the top of `your script already.

── Attaching core tidyverse packages ─────────────────────────────────────────────── tidyverse 2.0.0 ──
βœ” dplyr     1.1.3     βœ” readr     2.1.4
βœ” forcats   1.0.0     βœ” stringr   1.5.0
βœ” ggplot2   3.4.3     βœ” tibble    3.2.1
βœ” lubridate 1.9.3     βœ” tidyr     1.3.0
βœ” purrr     1.0.2     
── Conflicts ───────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
βœ– dplyr::filter() masks stats::filter()
βœ– dplyr::lag()    masks stats::lag()
β„Ή Use the conflicted package to force all conflicts to become errors

Have you ever stopped to think about this message? It is telling us that there are functions in the dplyr package that have the same name as functions in the stats package and that R will use the dplyr version. As this is what you want, this has always been fine. It still is fine in this case. However, as you start to load more packages, you will want to know if you are using a function from a package that has the same name as a function in another loaded package. This is where the conflicted (Wickham 2023) package comes in. Conflicted will warn you when you are using a function that has the same name as a function in another package. You can then choose which function to use.

🎬 Load the conflicted package:

Instead of getting a warning every time you are using a function that has a function with the same name in another package, we can declare a preference for one function over another. This is useful for the functions you use a lot or ones where you are certain you always want to use a particular function.

For example, to always use the dplyr version of filter() by default you can add this to the top of your script:

We will also want to ensure that we are using the setdiff() function from the GenomicRanges package which is used by DESeq2.

conflicts_prefer(GenomicRanges::setdiff)

Import

🐸 Frog development

We need to import the S30 data that were filtered to remove genes with 4, 5 or 6 zeros and those where the total counts was less than 20.

🎬 Import the data from the data-processed folder.

Now go to Differential Expression Analysis.

πŸŽ„ Arabidopisis

We need to import the wildtype data that were filtered to remove genes with 3 or 4 zeros and those where the total counts was less than 20.

🎬 Import the data from the data-processed folder.

Now go to Differential Expression Analysis.

πŸ’‰ Leishmania

We need to import the procyclic- and metacyclic-promastigote data that were filtered to remove genes with 4, 5 or 6 zeros and those where the total counts was less than 20.

🎬 Import the data from the data-processed folder.

Now go to Differential Expression Analysis.

🐭 Stem cells

We need to import the progenitor and HSPC cell data that were combined.

🎬 Import the data from the data-processed folder..

Now go to Differential Expression Analysis.

Differential Expression Analysis

🐸 Frog development

These are the steps we will take

  1. Find the genes that are expressed in only one treatment group.
  2. Create a DESeqDataSet object. This is a special object that is used by the DESeq2 package
  3. Prepare the normalised counts from the DESeqDataSet object.
  4. Do differential expression analysis on the genes. This needs to be done on the raw counts.

All but the first step are done with the DESeq2 package

1. Genes expressed in one treatment

The genes expressed in only one treatment group are those with zeros in all three replicates in one group and non-zero values in all three replicates in the other group. For example, those shown here:

xenbase_gene_id S30_C_1 S30_C_2 S30_C_3 S30_F_1 S30_F_2 S30_F_3
XB-GENE-1018260 0 0 0 10 2 16
XB-GENE-17330117 0 0 0 13 4 17
XB-GENE-17332184 0 0 0 6 19 6

We will use filter() to find these genes.

🎬 Find the genes that are expressed only in the FGF-treated group:

s30_fgf_only <- s30_filtered |> 
  filter(S30_C_1 == 0, 
         S30_C_2 == 0, 
         S30_C_3 == 0, 
         S30_F_1 > 0, 
         S30_F_2 > 0, 
         S30_F_3 > 0)

❓ How many genes are expressed only in the FGF-treated group?

🎬 Now you find any genes that are expressed only in the control group.

❓ How many genes are expressed only in the control group?

❓ Do the results make sense to you in light of what you know about the biology?

🎬 Write all the genes that are expressed one group only to file (saved in results)

2. Create DESeqDataSet object

🎬 Load the DESeq2 package:

A DEseqDataSet object is a custom data type that is used by DESeq2. Custom data types are common in the Bioconductor1 packages. They are used to store data in a way that is useful for the analysis. These data types typically have data, transformed data, metadata and experimental designs within them.

To create a DESeqDataSet object, we need to provide three things:

  • The raw counts - these are in s30_filtered
  • The meta data which gives information about the samples and which treatment groups they belong to
  • A design matrix which captures the design of the statistical model.

The counts must in a matrix rather than a dataframe. Unlike a dataframe, a matrix has columns of all the same type. That is, it will contain only the counts. The gene ids are given as row names rather than a column. The matrix() function will create a matrix from a dataframe of columns of the same type and the select() function can be used to remove the gene ids column.

🎬 Create a matrix of the counts:

s30_count_mat <- s30_filtered |>
  select(-xenbase_gene_id) |>
  as.matrix()

🎬 Add the gene ids as row names to the matrix:

# add the row names to the matrix
rownames(s30_count_mat) <- s30_filtered$xenbase_gene_id

You might want to view the matrix (click on it in your environment pane).

The metadata are in a file, frog_meta_data.txt. This is a tab-delimited file. The first column is the sample name and the other columns give the β€œtreatments”. In this case, the treatments stage (with three levels) and treatment (with two levels).

🎬 Make a folder called meta and save the file to it.

🎬 Read the metadata into a dataframe:

meta <- read_table("meta/frog_meta_data.txt")

🎬 Examine the resulting dataframe.

We need to add the sample names as row names to the metadata dataframe. This is because the DESeqDataSet object will use the row names to match the samples in the metadata to the samples in the counts matrix.

🎬 Add the sample names as row names to the metadata dataframe:

meta <- meta |>
  column_to_rownames("sample_id")

We are dealing only with the S30 data so we need to remove the samples that are not in the S30 data.

🎬 Filter the metadata to keep only the S30 information:

meta_s30 <- meta |>
  filter(stage == "stage_30")

We can now create the DESeqDataSet object. The design formula describes the statistical model. You should recognise the form from previous work. The ~ can be read as β€œexplain by” and on its right hand side are the explanatory variables. That is, the model is counts explained by treatment and sibling_rep. We are interested in the difference between the treatments but we include sibling_rep to account for the fact that the data are paired. The condition of interest should go at the end of the design formula.

Note that:

  • The names of the columns in the count matrix have to exactly match the names of the rows in the metadata dataframe. They also need to be in the same order.
  • The names of the explanatory variables in the design formula have to match the names of columns in the metadata.

🎬 Create the DESeqDataSet object:

dds <- DESeqDataSetFromMatrix(countData = s30_count_mat,
                              colData = meta_s30,
                              design = ~ sibling_rep + treatment)

The warning β€œWarning: some variables in design formula are characters, converting to factors” just means that the variable type of treatment and sibling_rep in the metadata dataframe are β€œchar” and they have been converted into the factors.

To help you understand what the DESeqDataSet object we have called dds contains, we can look its contents

The counts are in dds@assays@data@listData[["counts"]] and the metadata are in dds@colData but the easiest way to see them is to use the counts() and colData() functions from the DESeq2 package.

🎬 View the counts:

counts(dds) |> View()

You should be able to see that this is the same as in s30_count_mat.

🎬 View the column information:

colData(dds)
DataFrame with 6 rows and 3 columns
              stage treatment sibling_rep
        <character>  <factor>    <factor>
S30_C_1    stage_30   control       one  
S30_C_2    stage_30   control       two  
S30_C_3    stage_30   control       three
S30_F_1    stage_30   FGF           one  
S30_F_2    stage_30   FGF           two  
S30_F_3    stage_30   FGF           three

You should be able to see this is the same as in meta_s30.

3. Prepare the normalised counts

The normalised counts are the counts that have been transformed to account for the library size (i.e., the total number of reads in a sample) and the gene length. We have to first estimate the normalisation factors and store them in the DESeqDataSet object and then we can get the normalised counts.

🎬 Estimate the factors for normalisation and store them in the DESeqDataSet object:

dds <- estimateSizeFactors(dds)

🎬 Look at the factors (just for information):

sizeFactors(dds)
  S30_C_1   S30_C_2   S30_C_3   S30_F_1   S30_F_2   S30_F_3 
0.8812200 0.9454600 1.2989886 1.0881870 1.0518961 0.8322894 

The normalised counts will be useful to use later. To get the normalised counts we again used the counts() function but this time we use the normalized=TRUE argument.

🎬 Save the normalised to a matrix:

normalised_counts <- counts(dds, normalized = TRUE)

🎬 Make a dataframe of the normalised counts, adding a column for the gene ids at the same time:

s30_normalised_counts <- data.frame(normalised_counts,
                                    xenbase_gene_id = row.names(normalised_counts))

4. Differential expression analysis

We use the DESeq() function to do the differential expression analysis. This function fits the statistical model to the data and then uses the model to calculate the significance of the difference between the treatments. It again stores the results in the DESseqDataSet object. Note that the differential expression needs the raw (unnormalised counts) as it does its own normalisation as part of the process.

🎬 Run the differential expression analysis and store the results in the same object:

dds <- DESeq(dds)

The function will take only a few moments to run on this data but can take longer for bigger datasets.

We need to define the contrasts we want to test. We want to test the difference between the treatments so we will define the contrast as FGF and control.

🎬 Define the contrast:

contrast_fgf <- c("treatment", "FGF", "control")

Note that treatment is the name of the column in the metadata dataframe and FGF and control are the names of the levels in the treatment column. By putting them in the order FGF , control we are saying the fold change will be FGF / control. This means:

  • positive log fold changes indicate FGF > control and
  • negative log fold changes indicates control > FGF.

If we had put them in the order control, FGF we would have the reverse.

🎬 Extract the results from the DESseqDataSet object:

results_fgf <- results(dds,
                       contrast = contrast_fgf)

This will give us the log2 fold change, the p-value and the adjusted p-value for the comparison between the control and the FGF-treatment for each gene.

🎬 Put the results in a dataframe and add the gene ids as a column:

s30_results <- data.frame(results_fgf,
                          xenbase_gene_id = row.names(results_fgf))

It is useful to have the normalised counts and the statistical results in one dataframe.

🎬 Merge the two dataframes:

# merge the results with the normalised counts
s30_results <- s30_normalised_counts |>
  left_join(s30_results, by = "xenbase_gene_id")

Now go to Add gene information.

πŸŽ„ Arabidopisis

These are the steps we will take

  1. Find the genes that are expressed in only one treatment group.
  2. Create a DESeqDataSet object. This is a special object that is used by the DESeq2 package
  3. Prepare the normalised counts from the DESeqDataSet object.
  4. Do differential expression analysis on the genes. This needs to be done on the raw counts.

All but the first step are done with the DESeq2 package

1. Genes expressed in one treatment

The genes expressed in only one treatment group are those with zeros in both replicates in one group and non-zero values in both replicates in the other group. For example, those shown here:

gene_id SRX028956_wild_suf SRX028957_wild_def SRX028960_wild_suf SRX028961_wild_def
AT1G04513 11 0 25 0
AT1G22610 36 0 52 0
AT1G26290 12 0 23 0
AT1G59810 5 0 16 0
AT2G44130 28 0 18 0

We will use filter() to find these genes.

🎬 Find the genes that are expressed only in the sufficient copper group:

wild_suf_only <- wild_filtered |>
  filter(SRX028961_wild_def == 0,
         SRX028957_wild_def == 0,
         SRX028960_wild_suf > 0,
         SRX028956_wild_suf > 0)

❓ How many genes are expressed only in the sufficient copper group?

🎬 Now you find any genes that are expressed only in the deficient copper group.

❓ How many genes are expressed only in the deficient copper group?

❓ Do the results make sense to you in light of what you know about the biology?

🎬 Write all the genes that are expressed one group only to file (saved in results)

2. Create DESeqDataSet object

🎬 Load the DESeq2 package:

A DEseqDataSet object is a custom data type that is used by DESeq2. Custom data types are common in the Bioconductor2 packages. They are used to store data in a way that is useful for the analysis. These data types typically have data, transformed data, metadata and experimental designs within them.

To create a DESeqDataSet object, we need to provide three things:

  • The raw counts - these are in wild_filtered
  • The meta data which gives information about the samples and which treatment groups they belong to
  • A design matrix which captures the design of the statistical model.

The counts must in a matrix rather than a dataframe. Unlike a dataframe, a matrix has columns of all the same type. That is, it will contain only the counts. The gene ids are given as row names rather than a column. The matrix() function will create a matrix from a dataframe of columns of the same type and the select() function can be used to remove the gene ids column.

🎬 Create a matrix of the counts:

wild_count_mat <- wild_filtered |>
  select(-gene_id) |>
  as.matrix()

🎬 Add the gene ids as row names to the matrix:

# add the row names to the matrix
rownames(wild_count_mat) <- wild_filtered$gene_id

You might want to view the matrix (click on it in your environment pane).

The metadata are in a file, arab_meta_data.txt. This is a tab-delimited file. The first column is the sample name and the other columns give the β€œtreatments”. In this case, the treatments genotype (with two levels) and copper (with two levels).

🎬 Make a folder called meta and save the file to it.

🎬 Read the metadata into a dataframe:

meta <- read_table("meta/arab_meta_data.txt")

🎬 Examine the resulting dataframe.

We need to add the sample names as row names to the metadata dataframe. This is because the DESeqDataSet object will use the row names to match the samples in the metadata to the samples in the counts matrix.

🎬 Add the sample names as row names to the metadata dataframe:

meta <- meta |>
  column_to_rownames("sample_id")

We are dealing only with the wild data so we need to remove the samples that are not in the wild data.

🎬 Filter the metadata to keep only the wild information:

meta_wild <- meta |>
  filter(genotype == "wt")

We can now create the DESeqDataSet object. The design formula describes the statistical model. You should recognise the form from previous work. The ~ can be read as β€œexplain by” and on its right hand side are the explanatory variables. That is, the model is counts explained by copper status.

Note that:

  • The names of the columns in the count matrix have to exactly match the names of the rows in the metadata dataframe. They also need to be in the same order.
  • The names of the explanatory variables in the design formula have to match the names of columns in the metadata.

🎬 Create the DESeqDataSet object:

dds <- DESeqDataSetFromMatrix(wild_count_mat,
                              colData = meta_wild,
                              design = ~ copper)

The warning β€œWarning: some variables in design formula are characters, converting to factors” just means that the variable type of copper in the metadata dataframe is β€œchar” and it has been converted into a factor type.

To help you understand what the DESeqDataSet object we have called dds contains, we can look its contents

The counts are in dds@assays@data@listData[["counts"]] and the metadata are in dds@colData but the easiest way to see them is to use the counts() and colData() functions from the DESeq2 package.

🎬 View the counts:

counts(dds) |> View()

You should be able to see that this is the same as in wild_count_mat.

🎬 View the column information:

colData(dds)
DataFrame with 4 rows and 2 columns
                      genotype     copper
                   <character>   <factor>
SRX028956_wild_suf          wt sufficient
SRX028957_wild_def          wt deficient 
SRX028960_wild_suf          wt sufficient
SRX028961_wild_def          wt deficient 

You should be able to see this is the same as in meta_wild.

3. Prepare the normalised counts

The normalised counts are the counts that have been transformed to account for the library size (i.e., the total number of reads in a sample) and the gene length. We have to first estimate the normalisation factors and store them in the DESeqDataSet object and then we can get the normalised counts.

🎬 Estimate the factors for normalisation and store them in the DESeqDataSet object:

dds <- estimateSizeFactors(dds)

🎬 Look at the factors (just for information):

sizeFactors(dds)
SRX028956_wild_suf SRX028957_wild_def SRX028960_wild_suf SRX028961_wild_def 
         0.8200020          0.4653024          2.3002428          1.1965924 

The normalised counts will be useful to use later. To get the normalised counts we again used the counts() function but this time we use the normalized=TRUE argument.

🎬 Save the normalised to a matrix:

normalised_counts <- counts(dds, normalized = TRUE)

🎬 Make a dataframe of the normalised counts, adding a column for the gene ids at the same time:

wild_normalised_counts <- data.frame(normalised_counts,
                                    gene_id = row.names(normalised_counts))

4. Differential expression analysis

We use the DESeq() function to do the differential expression analysis. This function fits the statistical model to the data and then uses the model to calculate the significance of the difference between the treatments. It again stores the results in the DESseqDataSet object. Note that the differential expression needs the raw (unnormalised counts) as it does its own normalisation as part of the process.

🎬 Run the differential expression analysis and store the results in the same object:

dds <- DESeq(dds)

The function will take only a few moments to run on this data but can take longer for bigger datasets.

We need to define the contrasts we want to test. We want to test the difference between the treatments so we will define the contrast as sufficient and deficient.

🎬 Define the contrast:

contrast_suf <- c("copper", "sufficient", "deficient")

Note that copper is the name of the column in the metadata dataframe and sufficient and deficient are the names of the levels in the copper column. By putting them in the order sufficient , deficient we are saying the fold change will be sufficient / deficient. This means:

  • positive log fold changes indicate sufficient > deficient and
  • negative log fold changes indicates deficient > sufficient.

If we had put them in the order deficient, sufficient we would have the reverse.

🎬 Extract the results from the DESseqDataSet object:

results_suf <- results(dds,
                       contrast = contrast_suf)

This will give us the log2 fold change, the p-value and the adjusted p-value for the comparison between the sufficient- and
deficient-copper for each gene.

🎬 Put the results in a dataframe and add the gene ids as a column:

wild_results <- data.frame(results_suf,
                          gene_id = row.names(results_suf))

It is useful to have the normalised counts and the statistical results in one dataframe.

🎬 Merge the two dataframes:

# merge the results with the normalised counts
wild_results <- wild_normalised_counts |>
  left_join(wild_results, by = "gene_id")

Now go to Add gene information.

πŸ’‰ Leishmania

These are the steps we will take

  1. Find the genes that are expressed in only one treatment group.
  2. Create a DESeqDataSet object. This is a special object that is used by the DESeq2 package
  3. Prepare the normalised counts from the DESeqDataSet object.
  4. Do differential expression analysis on the genes. This needs to be done on the raw counts.

All but the first step are done with the DESeq2 package

1. Genes expressed in one treatment

The genes expressed in only one treatment group are those with zeros in all replicates in one group and non-zero values in all replicates in the other group.

We will use filter() to find these genes.

🎬 Find the genes that are expressed only at the procyclic-promastigote stage:

pro_meta_pro_only <- pro_meta_filtered  |>
  filter(lm_pro_1 > 0,
         lm_pro_2 > 0,
         lm_pro_3 > 0,
         lm_meta_1 == 0,
         lm_meta_2 == 0,
         lm_meta_2 == 0)

❓ How many genes are expressed only in the procyclic-promastigote stage group?

🎬 Now you find any genes that are expressed only at the metacyclic stage

❓ How many genes are expressed only at the metacyclic stage?

❓ Do the results make sense to you in light of what you know about the biology?

🎬 Write all the genes that are expressed one group only to file (saved in results)

2. Create DESeqDataSet object

🎬 Load the DESeq2 package:

A DEseqDataSet object is a custom data type that is used by DESeq2. Custom data types are common in the Bioconductor3 packages. They are used to store data in a way that is useful for the analysis. These data types typically have data, transformed data, metadata and experimental designs within them.

To create a DESeqDataSet object, we need to provide three things:

  • The raw counts - these are in pro_meta_filtered
  • The meta data which gives information about the samples and which treatment groups they belong to
  • A design matrix which captures the design of the statistical model.

The counts must in a matrix rather than a dataframe. Unlike a dataframe, a matrix has columns of all the same type. That is, it will contain only the counts. The gene ids are given as row names rather than a column. The matrix() function will create a matrix from a dataframe of columns of the same type and the select() function can be used to remove the gene ids column.

🎬 Create a matrix of the counts:

pro_meta_count_mat <- pro_meta_filtered  |>
  select(-gene_id) |>
  as.matrix()

🎬 Add the gene ids as row names to the matrix:

# add the row names to the matrix
rownames(pro_meta_count_mat) <- pro_meta_filtered$gene_id

You might want to view the matrix (click on it in your environment pane).

The metadata are in a file, leish_meta_data.txt. This is a tab-delimited file. The first column is the sample name and the other columns give the β€œtreatments”. In this case, the treatment is stage (with three levels).

🎬 Make a folder called meta and save the file to it.

🎬 Read the metadata into a dataframe:

meta <- read_table("meta/leish_meta_data.txt")

🎬 Examine the resulting dataframe.

We need to add the sample names as row names to the metadata dataframe. This is because the DESeqDataSet object will use the row names to match the samples in the metadata to the samples in the counts matrix.

🎬 Add the sample names as row names to the metadata dataframe:

meta <- meta |>
  column_to_rownames("sample_id")

We are dealing only with the wild data so we need to remove the samples that are not in the wild data.

🎬 Filter the metadata to keep only the procyclic and metacyclic information:

meta_pro_meta <- meta |>
  filter(stage != "amastigotes")

We can now create the DESeqDataSet object. The design formula describes the statistical model. You should recognise the form from previous work. The ~ can be read as β€œexplain by” and on its right hand side are the explanatory variables. That is, the model is counts explained by stage status.

Note that:

  • The names of the columns in the count matrix have to exactly match the names of the rows in the metadata dataframe. They also need to be in the same order.
  • The names of the explanatory variables in the design formula have to match the names of columns in the metadata.

🎬 Create the DESeqDataSet object:

dds <- DESeqDataSetFromMatrix(pro_meta_count_mat,
                              colData = meta_pro_meta,
                              design = ~ stage)

The warning β€œWarning: some variables in design formula are characters, converting to factors” just means that the variable type of stage in the metadata dataframe is β€œchar” and it has been converted into a factor type.

To help you understand what the DESeqDataSet object we have called dds contains, we can look its contents

The counts are in dds@assays@data@listData[["counts"]] and the metadata are in dds@colData but the easiest way to see them is to use the counts() and colData() functions from the DESeq2 package.

🎬 View the counts:

counts(dds) |> View()

You should be able to see that this is the same as in pro_meta_count_mat.

🎬 View the column information:

colData(dds)
DataFrame with 6 rows and 2 columns
               stage replicate
            <factor> <numeric>
lm_pro_1  procyclic          1
lm_pro_2  procyclic          2
lm_pro_3  procyclic          3
lm_meta_1 metacyclic         1
lm_meta_2 metacyclic         2
lm_meta_3 metacyclic         3

You should be able to see this is the same as in meta_pro_meta.

3. Prepare the normalised counts

The normalised counts are the counts that have been transformed to account for the library size (i.e., the total number of reads in a sample) and the gene length. We have to first estimate the normalisation factors and store them in the DESeqDataSet object and then we can get the normalised counts.

🎬 Estimate the factors for normalisation and store them in the DESeqDataSet object:

dds <- estimateSizeFactors(dds)

🎬 Look at the factors (just for information):

sizeFactors(dds)
 lm_pro_1  lm_pro_2  lm_pro_3 lm_meta_1 lm_meta_2 lm_meta_3 
1.3029351 0.9158157 0.9943186 0.7849299 0.8443586 1.3250409 

The normalised counts will be useful to use later. To get the normalised counts we again used the counts() function but this time we use the normalized=TRUE argument.

🎬 Save the normalised to a matrix:

normalised_counts <- counts(dds, normalized = TRUE)

🎬 Make a dataframe of the normalised counts, adding a column for the gene ids at the same time:

pro_meta_normalised_counts <- data.frame(normalised_counts,
                                    gene_id = row.names(normalised_counts))

4. Differential expression analysis

We use the DESeq() function to do the differential expression analysis. This function fits the statistical model to the data and then uses the model to calculate the significance of the difference between the treatments. It again stores the results in the DESseqDataSet object. Note that the differential expression needs the raw (unnormalised counts) as it does its own normalisation as part of the process.

🎬 Run the differential expression analysis and store the results in the same object:

dds <- DESeq(dds)

The function will take only a few moments to run on this data but can take longer for bigger datasets.

We need to define the contrasts we want to test. We want to test the difference between the treatments so we will define the contrast as procyclic and metacyclic.

🎬 Define the contrast:

contrast_pro_meta <- c("stage", "procyclic", "metacyclic")

Note that stage is the name of the column in the metadata dataframe and procyclic and metacyclic are the names of the levels in the stage column. By putting them in the order procyclic , metacyclic we are saying the fold change will be procyclic / metacyclic. This means:

  • positive log fold changes indicate procyclic > metacyclic and
  • negative log fold changes indicates metacyclic > procyclic.

If we had put them in the order metacyclic, procyclic we would have the reverse.

🎬 Extract the results from the DESseqDataSet object:

results_pro_meta <- results(dds,
                       contrast = contrast_pro_meta)

This will give us the log2 fold change, the p-value and the adjusted p-value for the comparison between procyclic and metacyclic stage for each gene

🎬 Put the results in a dataframe and add the gene ids as a column:

pro_meta_results <- data.frame(results_pro_meta,
                          gene_id = row.names(results_pro_meta))

It is useful to have the normalised counts and the statistical results in one dataframe.

🎬 Merge the two dataframes:

# merge the results with the normalised counts
pro_meta_results <- pro_meta_normalised_counts |>
  left_join(pro_meta_results, by = "gene_id")

Now go to Add gene information.

🐭 Stem cells

These are the steps we will take

  1. Find the genes that are expressed in only one cell type (the prog or the hspc).
  2. Prepare the data for differential expression analysis with the scran package.
  3. Do differential expression analysis on the genes using the scran package. This needs to be done on the logged normalised counts.

1. Genes expressed in one cell type

The genes expressed in only cell type are those with zeros in all the cells of the other type. We can find these by summing the expression values for each gene across the cells of one type and filtering for those that are zero.

To do row wise aggregates such as the sum across rows we can use the rowwise() function. c_across() allows us to use the colon notation to select columns. This is very useful when you have a lot of columns because it would be annoying to have to list all of them. HSPC_001:HSPC_852 means all the columns from HSPC_001 to HSPC_852.

🎬 Find the genes expressed only in progenitor cells (i.e., those that are 0 in every HSPC cell):

hspc_prog |> 
  rowwise() |> 
  filter(sum(c_across(HSPC_001:HSPC_852)) == 0)
# A tibble: 0 Γ— 1,500
# Rowwise: 
# β„Ή 1,500 variables: ensembl_gene_id <chr>, HSPC_001 <dbl>, HSPC_002 <dbl>,
#   HSPC_003 <dbl>, HSPC_004 <dbl>, HSPC_006 <dbl>, HSPC_008 <dbl>,
#   HSPC_009 <dbl>, HSPC_011 <dbl>, HSPC_012 <dbl>, HSPC_014 <dbl>,
#   HSPC_015 <dbl>, HSPC_016 <dbl>, HSPC_017 <dbl>, HSPC_018 <dbl>,
#   HSPC_020 <dbl>, HSPC_021 <dbl>, HSPC_022 <dbl>, HSPC_023 <dbl>,
#   HSPC_024 <dbl>, HSPC_025 <dbl>, HSPC_026 <dbl>, HSPC_027 <dbl>,
#   HSPC_028 <dbl>, HSPC_030 <dbl>, HSPC_031 <dbl>, HSPC_033 <dbl>, …

We already know from last week’s work that there are no genes that are zero across all the cells (both types). If we did not know that, we would need to add |> filter(sum(c_across(Prog_001:Prog_852)) != 0)

meaning zero in all the HSPC but not zero in all the Prog

❓ How many genes are expressed only in the progenitor cells only

🎬 Now you find any genes that are expressed only in the HSPC cells.

❓ How many genes are expressed only in the HSPC cells?

❓ Do the results make sense to you in light of what you know about the biology?

🎬 Write all the genes that are expressed one cell type only to file (saved in results)

2. Prepare the data for analysis with scran

scran can use a matrix or a dataframe of counts but these must be log normalised counts. If using a dataframe, the columns must only contain the expression values (not the gene ids). The rows can be named to retain the gene ids.

hspc_prog is a dataframe so we will use the ensembl gene ids to name the rows and remove the gene ids from the dataframe.

🎬 Add the gene ids as the row names:

hspc_prog <- hspc_prog |>
  column_to_rownames("ensembl_gene_id")

Like DESeq2, scran needs metadata to define which columns were in which group. Instead of having this is a file, we will create a vector that indicates which column belongs to which cell type.

🎬 Create a vector that indicates which column belongs to which cell type:

n_hspc <- 701
n_prog <- 798

cell_type <- rep(c("hspc","prog"), 
                 times = c(n_hspc, n_prog))

The number of times each cell type is repeated is the number of cells of that type. Do check that the length of the cell_type vector is the same as the number of columns in the hspc_prog dataframe.

3. Differential expression analysis

🎬 Load the scran package:

Differential expression is carried out with the findMarkers() function. It takes two arguments. The first argument is the dataframe containing the data and the second argument is the vector indicating which columns are in which cell type. Make sure that the order of the cell types in the vector is appropriate for the order of the columns in the dataframe.

🎬 Run the differential expression analysis:

results_hspc_prog <- findMarkers(hspc_prog, 
                             cell_type)

The output is a list object which, rather unnecessarily, includes two dataframes. This is not really necessary as the results are the same except for the fold change having a different sign.

  • The dataframe results_hspc_prog$prog is log prog - log hspc (i.e.,Prog/HSPC). This means:

    • Positive fold change: prog is higher than hspc
    • Negative fold change: hspc is higher than prog
The results_hspc_prog$prog dataframe
Top p.value FDR summary.logFC logFC.hspc ensembl_gene_id
ENSMUSG00000028639 1 0 0 1.596910 1.596910 ENSMUSG00000028639
ENSMUSG00000024053 2 0 0 3.035165 3.035165 ENSMUSG00000024053
ENSMUSG00000041329 3 0 0 3.261056 3.261056 ENSMUSG00000041329
ENSMUSG00000030336 4 0 0 -2.146491 -2.146491 ENSMUSG00000030336
ENSMUSG00000016494 5 0 0 -3.056730 -3.056730 ENSMUSG00000016494
ENSMUSG00000002808 6 0 0 3.000810 3.000810 ENSMUSG00000002808
  • The dataframe results_hspc_prog$hspc is log hspc - log prog (i.e., HSPC/Prog). This means:

    • Positive fold change: hspc is higher than prog
    • Negative fold change: prog is higher than hspc
The results_hspc_prog$hspc dataframe. Notice the sign of the fold change is the other way
Top p.value FDR summary.logFC logFC.prog ensembl_gene_id
ENSMUSG00000028639 1 0 0 -1.596910 -1.596910 ENSMUSG00000028639
ENSMUSG00000024053 2 0 0 -3.035165 -3.035165 ENSMUSG00000024053
ENSMUSG00000041329 3 0 0 -3.261056 -3.261056 ENSMUSG00000041329
ENSMUSG00000030336 4 0 0 2.146491 2.146491 ENSMUSG00000030336
ENSMUSG00000016494 5 0 0 3.056730 3.056730 ENSMUSG00000016494
ENSMUSG00000002808 6 0 0 -3.000810 -3.000810 ENSMUSG00000002808

We only need one of these results dataframes and we will use the prog one. It does not matter which one you use but you do need keep the direction of the foldchange in mind when interpreting the results.

Notice that the dataframes are ordered by significance rather than the original gene order.

It is useful to have the normalised counts and the statistical results in one dataframe to which we will add the gene information from Ensembl. Having all the information together will make it easier to interpret the results and select genes of interest. We will need to extract the results from the list object, add the gene ids and then join with the normalised counts.

🎬 Extract the results dataframe from the list object and add the gene ids as a column:

hspc_prog_results <- data.frame(results_hspc_prog$prog, 
                                ensembl_gene_id = row.names(results_hspc_prog$prog)) 

🎬 Return the ensembl gene ids as a column to the normalised counts:

hspc_prog <- hspc_prog |>
  rownames_to_column(var = "ensembl_gene_id")

🎬 Merge the results dataframe with the normalised counts:

# merge the results with the normalised counts
hspc_prog_results <- hspc_prog_results |>
  left_join(hspc_prog, by = "ensembl_gene_id")

Now go to Add gene information.

Add gene information

🐸 Frog development

If you want to emulate what I did you can use the following commands in the terminal after downloading the file:

gunzip xenbase.gpi.gz
less xenbase.gpi
q

gunzip unzips the file and less allows you to view the file. q quits the viewer. You will see the header lines and that the file contains both Xenopus tropicalis and Xenopus laevis. I read the file in with read_tsv (skipping the first header lines) then filtered out the Xenopus tropicalis entries, dropped some columns and saved the file as an excel file.

However, I have already done this for you and saved the file as xenbase_info.xlsx in the meta folder. We will import this file and join it to the results dataframe.

🎬 Load the readxl (Wickham and Bryan 2023) package:

🎬 Import the Xenbase gene information file:

gene_info <- read_excel("meta/xenbase_info.xlsx") 

You should view the resulting dataframe to see what information is available. You can use glimpse() or View().

🎬 Merge the gene information with the results:

# join the gene info with the results
s30_results <- s30_results |>
  left_join(gene_info, by = "xenbase_gene_id")

🎬 Save the results to a file:

write_csv(s30_results, file = "results/s30_results.csv")

πŸŽ„ Arabidopisis

Ensembl (Martin et al. 2023; Birney et al. 2004)is a bioinformatics project to organise all the biological information around the sequences of large genomes. The are a large number of databases and BioMart (Smedley et al. 2009) provides a consistent interface to the material. There are web-based tools to use these but the R package biomaRt (Durinck et al. 2009, 2005) gives you programmatic access making it easier to integrate information into R dataframes.

🎬 Load the biomaRt (Durinck et al. 2009, 2005) package:

The biomaRt package includes a function to list all the available datasets

🎬 List the Ensembl β€œmarts” available:

              biomart                        version
1       protists_mart      Ensembl Protists Genes 60
2 protists_variations Ensembl Protists Variations 60
3          fungi_mart         Ensembl Fungi Genes 60
4    fungi_variations    Ensembl Fungi Variations 60
5        metazoa_mart       Ensembl Metazoa Genes 60
6  metazoa_variations  Ensembl Metazoa Variations 60
7         plants_mart        Ensembl Plants Genes 60
8   plants_variations   Ensembl Plants Variations 60

plants_mart looks like the one we want. We can see what genomes are available with names like β€œArabidopsis” in this mart using the searchDatasets() function.

🎬

searchDatasets(useEnsemblGenomes(biomart = "plants_mart"), 
               pattern = "Arabidopsis")
             dataset                         description version
4   ahalleri_eg_gene Arabidopsis halleri genes (Ahal2.2) Ahal2.2
6    alyrata_eg_gene    Arabidopsis lyrata genes (v.1.0)   v.1.0
11 athaliana_eg_gene Arabidopsis thaliana genes (TAIR10)  TAIR10

athaliana_eg_gene is the Arabidopsis thaliana genes (TAIR10) dataset we want.

🎬 Connect to the athaliana_eg_gene database in plants_mart:

ensembl <- useEnsemblGenomes(biomart = "plants_mart",
                             dataset = "athaliana_eg_gene")

🎬 See the the types of information we can retrieve:

listAttributes(mart = ensembl) |> View()

There are many (1,714!) possible bits of information (attributes) that can be obtained.

We use the getBM() function to retrieve information from the database. The filters argument is used to specified what kind of identifier we are supplying in values to retrieve information. The attributes argument is used to select the information we want to retrieve. The values argument is used to specify the identifiers. The mart argument is used to specify the connection we created.

🎬 Get the the gene name and a description. We also retreive the gene id so we can later join the information with the results:

gene_info <- getBM(filters = "ensembl_gene_id",
                   attributes = c("ensembl_gene_id",
                                  "external_gene_name",
                                  "description"),
                   values = wild_results$gene_id,
                   mart = ensembl)

You should view the resulting dataframe to see what information is available. You can use glimpse() or View().

🎬 Merge the gene information with the results:

# join the gene info with the results
wild_results <- wild_results |>
  left_join(gene_info,
            by = join_by(gene_id == ensembl_gene_id))

🎬 Save the results to a file:

write_csv(wild_results, file = "results/wild_results.csv")

πŸ’‰ Leishmania

We will import this file and join it to the results dataframe.

🎬 Load the readxl (Wickham and Bryan 2023) package:

🎬 Import the Xenbase gene information file:

gene_info <- read_excel("meta/leishmania_mex.xlsx") 

You should view the resulting dataframe to see what information is available. You can use glimpse() or View().

🎬 Merge the gene information with the results:

# join the gene info with the results
pro_meta_results <- pro_meta_results |>
  left_join(gene_info, by = "gene_id")

🎬 Save the results to a file:

write_csv(pro_meta_results, file = "results/pro_meta_results.csv")

🐭 Stem cells

Ensembl (Martin et al. 2023; Birney et al. 2004)is a bioinformatics project to organise all the biological information around the sequences of large genomes. The are a large number of databases but BioMart (Smedley et al. 2009) provides a consistent interface to the material. There are web-based tools to use these but the R package biomaRt (Durinck et al. 2009, 2005) gives you programmatic access making it easier to integrate information into R dataframes

🎬 Load the biomaRt (Durinck et al. 2009, 2005) package:

🎬 Connect to the mouse database

# Connect to the mouse database

ensembl <- useMart(biomart = "ensembl", 
                   dataset = "mmusculus_gene_ensembl")

🎬 See information we can retrieve:

# See what information we can retrieve
listAttributes(mart = ensembl) |> View()

There are many (2,988!) possible bits of information (attributes) that can be obtained.

We use the getBM() function to retrieve information from the database. The filters argument is used to specified what kind of identifier we are supplying in values to retrieve information. The attributes argument is used to select the information we want to retrieve. The values argument is used to specify the identifiers. The mart argument is used to specify the connection we created.

🎬 Get the the gene name and a description. We also retrieve the gene id so we can later join the information with the results:

gene_info <- getBM(filters = "ensembl_gene_id",
                   attributes = c("ensembl_gene_id",
                                  "external_gene_name",
                                  "description"),
                   values = hspc_prog_results$ensembl_gene_id,
                   mart = ensembl)

You should view the resulting dataframe to see what information is available. You can use glimpse() or View(). Notice the dataframe returned only has 279 rows - one of the ids does not have information.

🎬 Merge the gene information with the results:

# join the gene info with the results
hspc_prog_results <- hspc_prog_results |> 
  left_join(gene_info, by = "ensembl_gene_id")

🎬 Save the results to a file:

write_csv(hspc_prog_results, file = "results/hspc_prog_results.csv")

πŸ€— Look after future you!

🎬 Go through you script and tidy up. You might need to :

  • collect together library statements at the top of the script
  • remove code that you needed to start today but which wouldn’t be needed if you were running the script from the top (e.g., importing)
  • edit your comments for clarity
  • rename variables for consistency or clarity
  • remove house keeping or exploratory code
  • restyle code, add code section headers etc

πŸ₯³ Finished

Well Done!

Independent study following the workshop

Consolidate

The Code file

These contain all the code needed in the workshop even where it is not visible on the webpage.

The workshop.qmd file is the file I use to compile the practical. Qmd stands for Quarto markdown. It allows code and ordinary text to be interleaved to produce well-formatted reports including webpages. Right-click on the link and choose Save-As to download. You will be able to open the Qmd file in RStudio. Alternatively, View in Browser. Coding and thinking answers are marked with #---CODING ANSWER--- and #---THINKING ANSWER---

Pages made with R (R Core Team 2024), Quarto (Allaire et al. 2024), knitr (Xie 2024, 2015, 2014), kableExtra (Zhu 2021)

References

Allaire, J. J., Charles Teague, Carlos Scheidegger, Yihui Xie, and Christophe Dervieux. 2024. β€œQuarto.” https://doi.org/10.5281/zenodo.5960048.
Birney, Ewan, T. Daniel Andrews, Paul Bevan, Mario Caccamo, Yuan Chen, Laura Clarke, Guy Coates, et al. 2004. β€œAn Overview of Ensembl.” Genome Research 14 (5): 925–28. https://doi.org/10.1101/gr.1860604.
Durinck, Steffen, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma, and Wolfgang Huber. 2005. β€œBioMart and Bioconductor: A Powerful Link Between Biological Databases and Microarray Data Analysis.” Bioinformatics 21: 3439–40.
Durinck, Steffen, Paul T. Spellman, Ewan Birney, and Wolfgang Huber. 2009. β€œMapping Identifiers for the Integration of Genomic Datasets with the r/Bioconductor Package biomaRt.” Nature Protocols 4: 1184–91.
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. β€œModerated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15: 550. https://doi.org/10.1186/s13059-014-0550-8.
Lun, Aaron T. L., Davis J. McCarthy, and John C. Marioni. 2016. β€œA Step-by-Step Workflow for Low-Level Analysis of Single-Cell RNA-Seq Data with Bioconductor.” F1000Res. 5: 2122. https://doi.org/10.12688/f1000research.9501.2.
Martin, Fergal J, M Ridwan Amode, Alisha Aneja, Olanrewaju Austine-Orimoloye, Andrey G Azov, If Barnes, Arne Becker, et al. 2023. β€œEnsembl 2023.” Nucleic Acids Research 51 (D1): D933–41. https://doi.org/10.1093/nar/gkac958.
R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Smedley, Damian, Syed Haider, Benoit Ballester, Richard Holland, Darin London, Gudmundur Thorisson, and Arek Kasprzyk. 2009. β€œBioMart Biological Queries Made Easy.” BMC Genomics 10 (1): 22. https://doi.org/10.1186/1471-2164-10-22.
Wickham, Hadley. 2023. Conflicted: An Alternative Conflict Resolution Strategy. https://conflicted.r-lib.org/.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain FranΓ§ois, Garrett Grolemund, et al. 2019. β€œWelcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, and Jennifer Bryan. 2023. Readxl: Read Excel Files. https://readxl.tidyverse.org.
Xie, Yihui. 2014. β€œKnitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC.
β€”β€”β€”. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. https://yihui.org/knitr/.
β€”β€”β€”. 2024. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.
Zhu, Hao. 2021. β€œkableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax.” https://CRAN.R-project.org/package=kableExtra.

Footnotes

  1. Bioconductor is a project that develops and supports R packages for bioinformatics.β†©οΈŽ

  2. Bioconductor is a project that develops and supports R packages for bioinformatics.β†©οΈŽ

  3. Bioconductor is a project that develops and supports R packages for bioinformatics.β†©οΈŽ