Workshop
Transcripttranscriptomics 2: Statistical Analysis
Introduction
Session overview
In the workshop, you will learn how to perform differential expression analysis on raw counts using DESeq2
(Love, Huber, and Anders 2014) or on logged normalised expression values using scran
(Lun, McCarthy, and Marioni 2016) or both.
Set up
πΈ Frog development
π¬ Open the frogs-88H
RStudio Project and the cont-fgf-s30.R
script.
π Arabidopisis
π¬ Open the arab-88H
RStudio Project and the suff-def-wild.R
script.
π Leishmania
π¬ Open the leish-88H
RStudio Project and the pro-meta.R
script.
π Stem cells
π¬ Open the mice-88H
RStudio Project and the hspc-prog.R
script.
Everyone
π¬ Make a new folder results
in the project directory.
This is where we will save our results.
π¬ Load tidyverse
(Wickham et al. 2019) You most likely have this code at the top of `your script already.
ββ Attaching core tidyverse packages βββββββββββββββββββββββββββββββββββββββββββββββ tidyverse 2.0.0 ββ
β dplyr 1.1.3 β readr 2.1.4
β forcats 1.0.0 β stringr 1.5.0
β ggplot2 3.4.3 β tibble 3.2.1
β lubridate 1.9.3 β tidyr 1.3.0
β purrr 1.0.2
ββ Conflicts βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ tidyverse_conflicts() ββ
β dplyr::filter() masks stats::filter()
β dplyr::lag() masks stats::lag()
βΉ Use the conflicted package to force all conflicts to become errors
Have you ever stopped to think about this message? It is telling us that there are functions in the dplyr
package that have the same name as functions in the stats
package and that R will use the dplyr
version. As this is what you want, this has always been fine. It still is fine in this case. However, as you start to load more packages, you will want to know if you are using a function from a package that has the same name as a function in another loaded package. This is where the conflicted
(Wickham 2023) package comes in. Conflicted
will warn you when you are using a function that has the same name as a function in another package. You can then choose which function to use.
π¬ Load the conflicted
package:
Instead of getting a warning every time you are using a function that has a function with the same name in another package, we can declare a preference for one function over another. This is useful for the functions you use a lot or ones where you are certain you always want to use a particular function.
For example, to always use the dplyr
version of filter()
by default you can add this to the top of your script:
conflicts_prefer(dplyr::filter)
We will also want to ensure that we are using the setdiff()
function from the GenomicRanges
package which is used by DESeq2
.
conflicts_prefer(GenomicRanges::setdiff)
Import
πΈ Frog development
We need to import the S30 data that were filtered to remove genes with 4, 5 or 6 zeros and those where the total counts was less than 20.
π¬ Import the data from the data-processed
folder.
Now go to Differential Expression Analysis.
π Arabidopisis
We need to import the wildtype data that were filtered to remove genes with 3 or 4 zeros and those where the total counts was less than 20.
π¬ Import the data from the data-processed
folder.
Now go to Differential Expression Analysis.
π Leishmania
We need to import the procyclic- and metacyclic-promastigote data that were filtered to remove genes with 4, 5 or 6 zeros and those where the total counts was less than 20.
π¬ Import the data from the data-processed
folder.
Now go to Differential Expression Analysis.
π Stem cells
We need to import the progenitor and HSPC cell data that were combined.
π¬ Import the data from the data-processed
folder..
Now go to Differential Expression Analysis.
Differential Expression Analysis
πΈ Frog development
These are the steps we will take
- Find the genes that are expressed in only one treatment group.
- Create a DESeqDataSet object. This is a special object that is used by the
DESeq2
package - Prepare the normalised counts from the DESeqDataSet object.
- Do differential expression analysis on the genes. This needs to be done on the raw counts.
All but the first step are done with the DESeq2
package
1. Genes expressed in one treatment
The genes expressed in only one treatment group are those with zeros in all three replicates in one group and non-zero values in all three replicates in the other group. For example, those shown here:
xenbase_gene_id | S30_C_1 | S30_C_2 | S30_C_3 | S30_F_1 | S30_F_2 | S30_F_3 |
---|---|---|---|---|---|---|
XB-GENE-1018260 | 0 | 0 | 0 | 10 | 2 | 16 |
XB-GENE-17330117 | 0 | 0 | 0 | 13 | 4 | 17 |
XB-GENE-17332184 | 0 | 0 | 0 | 6 | 19 | 6 |
We will use filter()
to find these genes.
π¬ Find the genes that are expressed only in the FGF-treated group:
s30_fgf_only <- s30_filtered |>
filter(S30_C_1 == 0,
S30_C_2 == 0,
S30_C_3 == 0,
S30_F_1 > 0,
S30_F_2 > 0,
S30_F_3 > 0)
β How many genes are expressed only in the FGF-treated group?
π¬ Now you find any genes that are expressed only in the control group.
β How many genes are expressed only in the control group?
β Do the results make sense to you in light of what you know about the biology?
π¬ Write all the genes that are expressed one group only to file (saved in results
)
2. Create DESeqDataSet object
π¬ Load the DESeq2 package:
A DEseqDataSet object is a custom data type that is used by DESeq2
. Custom data types are common in the Bioconductor1 packages. They are used to store data in a way that is useful for the analysis. These data types typically have data, transformed data, metadata and experimental designs within them.
To create a DESeqDataSet object, we need to provide three things:
- The raw counts - these are in
s30_filtered
- The meta data which gives information about the samples and which treatment groups they belong to
- A design matrix which captures the design of the statistical model.
The counts must in a matrix rather than a dataframe. Unlike a dataframe, a matrix has columns of all the same type. That is, it will contain only the counts. The gene ids are given as row names rather than a column. The matrix()
function will create a matrix from a dataframe of columns of the same type and the select()
function can be used to remove the gene ids column.
π¬ Create a matrix of the counts:
π¬ Add the gene ids as row names to the matrix:
# add the row names to the matrix
rownames(s30_count_mat) <- s30_filtered$xenbase_gene_id
You might want to view the matrix (click on it in your environment pane).
The metadata are in a file, frog_meta_data.txt. This is a tab-delimited file. The first column is the sample name and the other columns give the βtreatmentsβ. In this case, the treatments stage (with three levels) and treatment (with two levels).
π¬ Make a folder called meta
and save the file to it.
π¬ Read the metadata into a dataframe:
meta <- read_table("meta/frog_meta_data.txt")
π¬ Examine the resulting dataframe.
We need to add the sample names as row names to the metadata dataframe. This is because the DESeqDataSet object will use the row names to match the samples in the metadata to the samples in the counts matrix.
π¬ Add the sample names as row names to the metadata dataframe:
meta <- meta |>
column_to_rownames("sample_id")
We are dealing only with the S30 data so we need to remove the samples that are not in the S30 data.
π¬ Filter the metadata to keep only the S30 information:
meta_s30 <- meta |>
filter(stage == "stage_30")
We can now create the DESeqDataSet object. The design formula describes the statistical model. You should recognise the form from previous work. The ~
can be read as βexplain byβ and on its right hand side are the explanatory variables. That is, the model is counts explained by treatment and sibling_rep. We are interested in the difference between the treatments but we include sibling_rep
to account for the fact that the data are paired. The condition of interest should go at the end of the design formula.
Note that:
- The names of the columns in the count matrix have to exactly match the names of the rows in the metadata dataframe. They also need to be in the same order.
- The names of the explanatory variables in the design formula have to match the names of columns in the metadata.
π¬ Create the DESeqDataSet object:
dds <- DESeqDataSetFromMatrix(countData = s30_count_mat,
colData = meta_s30,
design = ~ sibling_rep + treatment)
The warning βWarning: some variables in design formula are characters, converting to factorsβ just means that the variable type of treatment and sibling_rep in the metadata dataframe are βcharβ and they have been converted into the factors.
To help you understand what the DESeqDataSet
object we have called dds
contains, we can look its contents
The counts are in dds@assays@data@listData[["counts"]]
and the metadata are in dds@colData
but the easiest way to see them is to use the counts()
and colData()
functions from the DESeq2
package.
π¬ View the counts:
counts(dds) |> View()
You should be able to see that this is the same as in s30_count_mat
.
π¬ View the column information:
colData(dds)
DataFrame with 6 rows and 3 columns
stage treatment sibling_rep
<character> <factor> <factor>
S30_C_1 stage_30 control one
S30_C_2 stage_30 control two
S30_C_3 stage_30 control three
S30_F_1 stage_30 FGF one
S30_F_2 stage_30 FGF two
S30_F_3 stage_30 FGF three
You should be able to see this is the same as in meta_s30
.
3. Prepare the normalised counts
The normalised counts are the counts that have been transformed to account for the library size (i.e., the total number of reads in a sample) and the gene length. We have to first estimate the normalisation factors and store them in the DESeqDataSet object and then we can get the normalised counts.
π¬ Estimate the factors for normalisation and store them in the DESeqDataSet object:
dds <- estimateSizeFactors(dds)
π¬ Look at the factors (just for information):
sizeFactors(dds)
S30_C_1 S30_C_2 S30_C_3 S30_F_1 S30_F_2 S30_F_3
0.8812200 0.9454600 1.2989886 1.0881870 1.0518961 0.8322894
The normalised counts will be useful to use later. To get the normalised counts we again used the counts()
function but this time we use the normalized=TRUE
argument.
π¬ Save the normalised to a matrix:
normalised_counts <- counts(dds, normalized = TRUE)
π¬ Make a dataframe of the normalised counts, adding a column for the gene ids at the same time:
s30_normalised_counts <- data.frame(normalised_counts,
xenbase_gene_id = row.names(normalised_counts))
4. Differential expression analysis
We use the DESeq()
function to do the differential expression analysis. This function fits the statistical model to the data and then uses the model to calculate the significance of the difference between the treatments. It again stores the results in the DESseqDataSet object. Note that the differential expression needs the raw (unnormalised counts) as it does its own normalisation as part of the process.
π¬ Run the differential expression analysis and store the results in the same object:
dds <- DESeq(dds)
The function will take only a few moments to run on this data but can take longer for bigger datasets.
We need to define the contrasts we want to test. We want to test the difference between the treatments so we will define the contrast as FGF
and control
.
π¬ Define the contrast:
contrast_fgf <- c("treatment", "FGF", "control")
Note that treatment
is the name of the column in the metadata dataframe and FGF
and control
are the names of the levels in the treatment
column. By putting them in the order FGF
, control
we are saying the fold change will be FGF / control. This means:
- positive log fold changes indicate FGF > control and
- negative log fold changes indicates control > FGF.
If we had put them in the order control
, FGF
we would have the reverse.
π¬ Extract the results from the DESseqDataSet object:
results_fgf <- results(dds,
contrast = contrast_fgf)
This will give us the log2 fold change, the p-value and the adjusted p-value for the comparison between the control and the FGF-treatment for each gene.
π¬ Put the results in a dataframe and add the gene ids as a column:
s30_results <- data.frame(results_fgf,
xenbase_gene_id = row.names(results_fgf))
It is useful to have the normalised counts and the statistical results in one dataframe.
π¬ Merge the two dataframes:
# merge the results with the normalised counts
s30_results <- s30_normalised_counts |>
left_join(s30_results, by = "xenbase_gene_id")
Now go to Add gene information.
π Arabidopisis
These are the steps we will take
- Find the genes that are expressed in only one treatment group.
- Create a DESeqDataSet object. This is a special object that is used by the
DESeq2
package - Prepare the normalised counts from the DESeqDataSet object.
- Do differential expression analysis on the genes. This needs to be done on the raw counts.
All but the first step are done with the DESeq2
package
1. Genes expressed in one treatment
The genes expressed in only one treatment group are those with zeros in both replicates in one group and non-zero values in both replicates in the other group. For example, those shown here:
gene_id | SRX028956_wild_suf | SRX028957_wild_def | SRX028960_wild_suf | SRX028961_wild_def |
---|---|---|---|---|
AT1G04513 | 11 | 0 | 25 | 0 |
AT1G22610 | 36 | 0 | 52 | 0 |
AT1G26290 | 12 | 0 | 23 | 0 |
AT1G59810 | 5 | 0 | 16 | 0 |
AT2G44130 | 28 | 0 | 18 | 0 |
We will use filter()
to find these genes.
π¬ Find the genes that are expressed only in the sufficient copper group:
wild_suf_only <- wild_filtered |>
filter(SRX028961_wild_def == 0,
SRX028957_wild_def == 0,
SRX028960_wild_suf > 0,
SRX028956_wild_suf > 0)
β How many genes are expressed only in the sufficient copper group?
π¬ Now you find any genes that are expressed only in the deficient copper group.
β How many genes are expressed only in the deficient copper group?
β Do the results make sense to you in light of what you know about the biology?
π¬ Write all the genes that are expressed one group only to file (saved in results
)
2. Create DESeqDataSet object
π¬ Load the DESeq2 package:
A DEseqDataSet object is a custom data type that is used by DESeq2
. Custom data types are common in the Bioconductor2 packages. They are used to store data in a way that is useful for the analysis. These data types typically have data, transformed data, metadata and experimental designs within them.
To create a DESeqDataSet object, we need to provide three things:
- The raw counts - these are in
wild_filtered
- The meta data which gives information about the samples and which treatment groups they belong to
- A design matrix which captures the design of the statistical model.
The counts must in a matrix rather than a dataframe. Unlike a dataframe, a matrix has columns of all the same type. That is, it will contain only the counts. The gene ids are given as row names rather than a column. The matrix()
function will create a matrix from a dataframe of columns of the same type and the select()
function can be used to remove the gene ids column.
π¬ Create a matrix of the counts:
π¬ Add the gene ids as row names to the matrix:
# add the row names to the matrix
rownames(wild_count_mat) <- wild_filtered$gene_id
You might want to view the matrix (click on it in your environment pane).
The metadata are in a file, arab_meta_data.txt. This is a tab-delimited file. The first column is the sample name and the other columns give the βtreatmentsβ. In this case, the treatments genotype (with two levels) and copper (with two levels).
π¬ Make a folder called meta
and save the file to it.
π¬ Read the metadata into a dataframe:
meta <- read_table("meta/arab_meta_data.txt")
π¬ Examine the resulting dataframe.
We need to add the sample names as row names to the metadata dataframe. This is because the DESeqDataSet object will use the row names to match the samples in the metadata to the samples in the counts matrix.
π¬ Add the sample names as row names to the metadata dataframe:
meta <- meta |>
column_to_rownames("sample_id")
We are dealing only with the wild data so we need to remove the samples that are not in the wild data.
π¬ Filter the metadata to keep only the wild information:
meta_wild <- meta |>
filter(genotype == "wt")
We can now create the DESeqDataSet object. The design formula describes the statistical model. You should recognise the form from previous work. The ~
can be read as βexplain byβ and on its right hand side are the explanatory variables. That is, the model is counts explained by copper status.
Note that:
- The names of the columns in the count matrix have to exactly match the names of the rows in the metadata dataframe. They also need to be in the same order.
- The names of the explanatory variables in the design formula have to match the names of columns in the metadata.
π¬ Create the DESeqDataSet object:
dds <- DESeqDataSetFromMatrix(wild_count_mat,
colData = meta_wild,
design = ~ copper)
The warning βWarning: some variables in design formula are characters, converting to factorsβ just means that the variable type of copper in the metadata dataframe is βcharβ and it has been converted into a factor type.
To help you understand what the DESeqDataSet
object we have called dds
contains, we can look its contents
The counts are in dds@assays@data@listData[["counts"]]
and the metadata are in dds@colData
but the easiest way to see them is to use the counts()
and colData()
functions from the DESeq2
package.
π¬ View the counts:
counts(dds) |> View()
You should be able to see that this is the same as in wild_count_mat
.
π¬ View the column information:
colData(dds)
DataFrame with 4 rows and 2 columns
genotype copper
<character> <factor>
SRX028956_wild_suf wt sufficient
SRX028957_wild_def wt deficient
SRX028960_wild_suf wt sufficient
SRX028961_wild_def wt deficient
You should be able to see this is the same as in meta_wild
.
3. Prepare the normalised counts
The normalised counts are the counts that have been transformed to account for the library size (i.e., the total number of reads in a sample) and the gene length. We have to first estimate the normalisation factors and store them in the DESeqDataSet object and then we can get the normalised counts.
π¬ Estimate the factors for normalisation and store them in the DESeqDataSet object:
dds <- estimateSizeFactors(dds)
π¬ Look at the factors (just for information):
sizeFactors(dds)
SRX028956_wild_suf SRX028957_wild_def SRX028960_wild_suf SRX028961_wild_def
0.8200020 0.4653024 2.3002428 1.1965924
The normalised counts will be useful to use later. To get the normalised counts we again used the counts()
function but this time we use the normalized=TRUE
argument.
π¬ Save the normalised to a matrix:
normalised_counts <- counts(dds, normalized = TRUE)
π¬ Make a dataframe of the normalised counts, adding a column for the gene ids at the same time:
wild_normalised_counts <- data.frame(normalised_counts,
gene_id = row.names(normalised_counts))
4. Differential expression analysis
We use the DESeq()
function to do the differential expression analysis. This function fits the statistical model to the data and then uses the model to calculate the significance of the difference between the treatments. It again stores the results in the DESseqDataSet object. Note that the differential expression needs the raw (unnormalised counts) as it does its own normalisation as part of the process.
π¬ Run the differential expression analysis and store the results in the same object:
dds <- DESeq(dds)
The function will take only a few moments to run on this data but can take longer for bigger datasets.
We need to define the contrasts we want to test. We want to test the difference between the treatments so we will define the contrast as sufficient
and deficient
.
π¬ Define the contrast:
contrast_suf <- c("copper", "sufficient", "deficient")
Note that copper
is the name of the column in the metadata dataframe and sufficient
and deficient
are the names of the levels in the copper
column. By putting them in the order sufficient
, deficient
we are saying the fold change will be sufficient / deficient. This means:
- positive log fold changes indicate sufficient > deficient and
- negative log fold changes indicates deficient > sufficient.
If we had put them in the order deficient
, sufficient
we would have the reverse.
π¬ Extract the results from the DESseqDataSet object:
results_suf <- results(dds,
contrast = contrast_suf)
This will give us the log2 fold change, the p-value and the adjusted p-value for the comparison between the sufficient- and
deficient-copper for each gene.
π¬ Put the results in a dataframe and add the gene ids as a column:
wild_results <- data.frame(results_suf,
gene_id = row.names(results_suf))
It is useful to have the normalised counts and the statistical results in one dataframe.
π¬ Merge the two dataframes:
# merge the results with the normalised counts
wild_results <- wild_normalised_counts |>
left_join(wild_results, by = "gene_id")
Now go to Add gene information.
π Leishmania
These are the steps we will take
- Find the genes that are expressed in only one treatment group.
- Create a DESeqDataSet object. This is a special object that is used by the
DESeq2
package - Prepare the normalised counts from the DESeqDataSet object.
- Do differential expression analysis on the genes. This needs to be done on the raw counts.
All but the first step are done with the DESeq2
package
1. Genes expressed in one treatment
The genes expressed in only one treatment group are those with zeros in all replicates in one group and non-zero values in all replicates in the other group.
We will use filter()
to find these genes.
π¬ Find the genes that are expressed only at the procyclic-promastigote stage:
pro_meta_pro_only <- pro_meta_filtered |>
filter(lm_pro_1 > 0,
lm_pro_2 > 0,
lm_pro_3 > 0,
lm_meta_1 == 0,
lm_meta_2 == 0,
lm_meta_2 == 0)
β How many genes are expressed only in the procyclic-promastigote stage group?
π¬ Now you find any genes that are expressed only at the metacyclic stage
β How many genes are expressed only at the metacyclic stage?
β Do the results make sense to you in light of what you know about the biology?
π¬ Write all the genes that are expressed one group only to file (saved in results
)
2. Create DESeqDataSet object
π¬ Load the DESeq2 package:
A DEseqDataSet object is a custom data type that is used by DESeq2
. Custom data types are common in the Bioconductor3 packages. They are used to store data in a way that is useful for the analysis. These data types typically have data, transformed data, metadata and experimental designs within them.
To create a DESeqDataSet object, we need to provide three things:
- The raw counts - these are in
pro_meta_filtered
- The meta data which gives information about the samples and which treatment groups they belong to
- A design matrix which captures the design of the statistical model.
The counts must in a matrix rather than a dataframe. Unlike a dataframe, a matrix has columns of all the same type. That is, it will contain only the counts. The gene ids are given as row names rather than a column. The matrix()
function will create a matrix from a dataframe of columns of the same type and the select()
function can be used to remove the gene ids column.
π¬ Create a matrix of the counts:
π¬ Add the gene ids as row names to the matrix:
# add the row names to the matrix
rownames(pro_meta_count_mat) <- pro_meta_filtered$gene_id
You might want to view the matrix (click on it in your environment pane).
The metadata are in a file, leish_meta_data.txt. This is a tab-delimited file. The first column is the sample name and the other columns give the βtreatmentsβ. In this case, the treatment is stage (with three levels).
π¬ Make a folder called meta
and save the file to it.
π¬ Read the metadata into a dataframe:
meta <- read_table("meta/leish_meta_data.txt")
π¬ Examine the resulting dataframe.
We need to add the sample names as row names to the metadata dataframe. This is because the DESeqDataSet object will use the row names to match the samples in the metadata to the samples in the counts matrix.
π¬ Add the sample names as row names to the metadata dataframe:
meta <- meta |>
column_to_rownames("sample_id")
We are dealing only with the wild data so we need to remove the samples that are not in the wild data.
π¬ Filter the metadata to keep only the procyclic and metacyclic information:
meta_pro_meta <- meta |>
filter(stage != "amastigotes")
We can now create the DESeqDataSet object. The design formula describes the statistical model. You should recognise the form from previous work. The ~
can be read as βexplain byβ and on its right hand side are the explanatory variables. That is, the model is counts explained by stage status.
Note that:
- The names of the columns in the count matrix have to exactly match the names of the rows in the metadata dataframe. They also need to be in the same order.
- The names of the explanatory variables in the design formula have to match the names of columns in the metadata.
π¬ Create the DESeqDataSet object:
dds <- DESeqDataSetFromMatrix(pro_meta_count_mat,
colData = meta_pro_meta,
design = ~ stage)
The warning βWarning: some variables in design formula are characters, converting to factorsβ just means that the variable type of stage in the metadata dataframe is βcharβ and it has been converted into a factor type.
To help you understand what the DESeqDataSet
object we have called dds
contains, we can look its contents
The counts are in dds@assays@data@listData[["counts"]]
and the metadata are in dds@colData
but the easiest way to see them is to use the counts()
and colData()
functions from the DESeq2
package.
π¬ View the counts:
counts(dds) |> View()
You should be able to see that this is the same as in pro_meta_count_mat
.
π¬ View the column information:
colData(dds)
DataFrame with 6 rows and 2 columns
stage replicate
<factor> <numeric>
lm_pro_1 procyclic 1
lm_pro_2 procyclic 2
lm_pro_3 procyclic 3
lm_meta_1 metacyclic 1
lm_meta_2 metacyclic 2
lm_meta_3 metacyclic 3
You should be able to see this is the same as in meta_pro_meta
.
3. Prepare the normalised counts
The normalised counts are the counts that have been transformed to account for the library size (i.e., the total number of reads in a sample) and the gene length. We have to first estimate the normalisation factors and store them in the DESeqDataSet object and then we can get the normalised counts.
π¬ Estimate the factors for normalisation and store them in the DESeqDataSet object:
dds <- estimateSizeFactors(dds)
π¬ Look at the factors (just for information):
sizeFactors(dds)
lm_pro_1 lm_pro_2 lm_pro_3 lm_meta_1 lm_meta_2 lm_meta_3
1.3029351 0.9158157 0.9943186 0.7849299 0.8443586 1.3250409
The normalised counts will be useful to use later. To get the normalised counts we again used the counts()
function but this time we use the normalized=TRUE
argument.
π¬ Save the normalised to a matrix:
normalised_counts <- counts(dds, normalized = TRUE)
π¬ Make a dataframe of the normalised counts, adding a column for the gene ids at the same time:
pro_meta_normalised_counts <- data.frame(normalised_counts,
gene_id = row.names(normalised_counts))
4. Differential expression analysis
We use the DESeq()
function to do the differential expression analysis. This function fits the statistical model to the data and then uses the model to calculate the significance of the difference between the treatments. It again stores the results in the DESseqDataSet object. Note that the differential expression needs the raw (unnormalised counts) as it does its own normalisation as part of the process.
π¬ Run the differential expression analysis and store the results in the same object:
dds <- DESeq(dds)
The function will take only a few moments to run on this data but can take longer for bigger datasets.
We need to define the contrasts we want to test. We want to test the difference between the treatments so we will define the contrast as procyclic
and metacyclic
.
π¬ Define the contrast:
contrast_pro_meta <- c("stage", "procyclic", "metacyclic")
Note that stage
is the name of the column in the metadata dataframe and procyclic
and metacyclic
are the names of the levels in the stage
column. By putting them in the order procyclic
, metacyclic
we are saying the fold change will be procyclic / metacyclic. This means:
- positive log fold changes indicate procyclic > metacyclic and
- negative log fold changes indicates metacyclic > procyclic.
If we had put them in the order metacyclic
, procyclic
we would have the reverse.
π¬ Extract the results from the DESseqDataSet object:
results_pro_meta <- results(dds,
contrast = contrast_pro_meta)
This will give us the log2 fold change, the p-value and the adjusted p-value for the comparison between procyclic and metacyclic stage for each gene
π¬ Put the results in a dataframe and add the gene ids as a column:
pro_meta_results <- data.frame(results_pro_meta,
gene_id = row.names(results_pro_meta))
It is useful to have the normalised counts and the statistical results in one dataframe.
π¬ Merge the two dataframes:
# merge the results with the normalised counts
pro_meta_results <- pro_meta_normalised_counts |>
left_join(pro_meta_results, by = "gene_id")
Now go to Add gene information.
π Stem cells
These are the steps we will take
- Find the genes that are expressed in only one cell type (the prog or the hspc).
- Prepare the data for differential expression analysis with the
scran
package. - Do differential expression analysis on the genes using the
scran
package. This needs to be done on the logged normalised counts.
1. Genes expressed in one cell type
The genes expressed in only cell type are those with zeros in all the cells of the other type. We can find these by summing the expression values for each gene across the cells of one type and filtering for those that are zero.
To do row wise aggregates such as the sum across rows we can use the rowwise()
function. c_across()
allows us to use the colon notation to select columns. This is very useful when you have a lot of columns because it would be annoying to have to list all of them. HSPC_001:HSPC_852
means all the columns from HSPC_001
to HSPC_852
.
π¬ Find the genes expressed only in progenitor cells (i.e., those that are 0 in every HSPC cell):
# A tibble: 0 Γ 1,500
# Rowwise:
# βΉ 1,500 variables: ensembl_gene_id <chr>, HSPC_001 <dbl>, HSPC_002 <dbl>,
# HSPC_003 <dbl>, HSPC_004 <dbl>, HSPC_006 <dbl>, HSPC_008 <dbl>,
# HSPC_009 <dbl>, HSPC_011 <dbl>, HSPC_012 <dbl>, HSPC_014 <dbl>,
# HSPC_015 <dbl>, HSPC_016 <dbl>, HSPC_017 <dbl>, HSPC_018 <dbl>,
# HSPC_020 <dbl>, HSPC_021 <dbl>, HSPC_022 <dbl>, HSPC_023 <dbl>,
# HSPC_024 <dbl>, HSPC_025 <dbl>, HSPC_026 <dbl>, HSPC_027 <dbl>,
# HSPC_028 <dbl>, HSPC_030 <dbl>, HSPC_031 <dbl>, HSPC_033 <dbl>, β¦
We already know from last weekβs work that there are no genes that are zero across all the cells (both types). If we did not know that, we would need to add |> filter(sum(c_across(Prog_001:Prog_852)) != 0)
meaning zero in all the HSPC but not zero in all the Prog
β How many genes are expressed only in the progenitor cells only
π¬ Now you find any genes that are expressed only in the HSPC cells.
β How many genes are expressed only in the HSPC cells?
β Do the results make sense to you in light of what you know about the biology?
π¬ Write all the genes that are expressed one cell type only to file (saved in results
)
2. Prepare the data for analysis with scran
scran
can use a matrix or a dataframe of counts but these must be log normalised counts. If using a dataframe, the columns must only contain the expression values (not the gene ids). The rows can be named to retain the gene ids.
hspc_prog
is a dataframe so we will use the ensembl gene ids to name the rows and remove the gene ids from the dataframe.
π¬ Add the gene ids as the row names:
hspc_prog <- hspc_prog |>
column_to_rownames("ensembl_gene_id")
Like DESeq2
, scran
needs metadata to define which columns were in which group. Instead of having this is a file, we will create a vector that indicates which column belongs to which cell type.
π¬ Create a vector that indicates which column belongs to which cell type:
The number of times each cell type is repeated is the number of cells of that type. Do check that the length of the cell_type
vector is the same as the number of columns in the hspc_prog
dataframe.
3. Differential expression analysis
π¬ Load the scran
package:
Differential expression is carried out with the findMarkers()
function. It takes two arguments. The first argument is the dataframe containing the data and the second argument is the vector indicating which columns are in which cell type. Make sure that the order of the cell types in the vector is appropriate for the order of the columns in the dataframe.
π¬ Run the differential expression analysis:
results_hspc_prog <- findMarkers(hspc_prog,
cell_type)
The output is a list object which, rather unnecessarily, includes two dataframes. This is not really necessary as the results are the same except for the fold change having a different sign.
-
The dataframe
results_hspc_prog$prog
is log prog - log hspc (i.e.,Prog/HSPC). This means:- Positive fold change: prog is higher than hspc
- Negative fold change: hspc is higher than prog
Top | p.value | FDR | summary.logFC | logFC.hspc | ensembl_gene_id | |
---|---|---|---|---|---|---|
ENSMUSG00000028639 | 1 | 0 | 0 | 1.596910 | 1.596910 | ENSMUSG00000028639 |
ENSMUSG00000024053 | 2 | 0 | 0 | 3.035165 | 3.035165 | ENSMUSG00000024053 |
ENSMUSG00000041329 | 3 | 0 | 0 | 3.261056 | 3.261056 | ENSMUSG00000041329 |
ENSMUSG00000030336 | 4 | 0 | 0 | -2.146491 | -2.146491 | ENSMUSG00000030336 |
ENSMUSG00000016494 | 5 | 0 | 0 | -3.056730 | -3.056730 | ENSMUSG00000016494 |
ENSMUSG00000002808 | 6 | 0 | 0 | 3.000810 | 3.000810 | ENSMUSG00000002808 |
-
The dataframe
results_hspc_prog$hspc
is log hspc - log prog (i.e., HSPC/Prog). This means:- Positive fold change: hspc is higher than prog
- Negative fold change: prog is higher than hspc
Top | p.value | FDR | summary.logFC | logFC.prog | ensembl_gene_id | |
---|---|---|---|---|---|---|
ENSMUSG00000028639 | 1 | 0 | 0 | -1.596910 | -1.596910 | ENSMUSG00000028639 |
ENSMUSG00000024053 | 2 | 0 | 0 | -3.035165 | -3.035165 | ENSMUSG00000024053 |
ENSMUSG00000041329 | 3 | 0 | 0 | -3.261056 | -3.261056 | ENSMUSG00000041329 |
ENSMUSG00000030336 | 4 | 0 | 0 | 2.146491 | 2.146491 | ENSMUSG00000030336 |
ENSMUSG00000016494 | 5 | 0 | 0 | 3.056730 | 3.056730 | ENSMUSG00000016494 |
ENSMUSG00000002808 | 6 | 0 | 0 | -3.000810 | -3.000810 | ENSMUSG00000002808 |
We only need one of these results dataframes and we will use the prog
one. It does not matter which one you use but you do need keep the direction of the foldchange in mind when interpreting the results.
Notice that the dataframes are ordered by significance rather than the original gene order.
It is useful to have the normalised counts and the statistical results in one dataframe to which we will add the gene information from Ensembl. Having all the information together will make it easier to interpret the results and select genes of interest. We will need to extract the results from the list object, add the gene ids and then join with the normalised counts.
π¬ Extract the results dataframe from the list object and add the gene ids as a column:
hspc_prog_results <- data.frame(results_hspc_prog$prog,
ensembl_gene_id = row.names(results_hspc_prog$prog))
π¬ Return the ensembl gene ids as a column to the normalised counts:
hspc_prog <- hspc_prog |>
rownames_to_column(var = "ensembl_gene_id")
π¬ Merge the results dataframe with the normalised counts:
# merge the results with the normalised counts
hspc_prog_results <- hspc_prog_results |>
left_join(hspc_prog, by = "ensembl_gene_id")
Now go to Add gene information.
Add gene information
πΈ Frog development
I got the information from the Xenbase information pages under Data Reports | Gene Information
This is listed: Xenbase Gene Product Information [readme] gzipped gpi (tab separated)
Click on the readme link to see the file format and columns
I downloaded xenbase.gpi.gz, unzipped it, removed header lines and the Xenopus tropicalis (taxon:8364) entries and saved it as xenbase_info.xlsx
If you want to emulate what I did you can use the following commands in the terminal after downloading the file:
gunzip xenbase.gpi.gz
less xenbase.gpi
q
gunzip
unzips the file and less
allows you to view the file. q
quits the viewer. You will see the header lines and that the file contains both Xenopus tropicalis and Xenopus laevis. I read the file in with read_tsv
(skipping the first header lines) then filtered out the Xenopus tropicalis entries, dropped some columns and saved the file as an excel file.
However, I have already done this for you and saved the file as xenbase_info.xlsx in the meta
folder. We will import this file and join it to the results dataframe.
π¬ Load the readxl
(Wickham and Bryan 2023) package:
π¬ Import the Xenbase gene information file:
gene_info <- read_excel("meta/xenbase_info.xlsx")
You should view the resulting dataframe to see what information is available. You can use glimpse()
or View()
.
π¬ Merge the gene information with the results:
# join the gene info with the results
s30_results <- s30_results |>
left_join(gene_info, by = "xenbase_gene_id")
π¬ Save the results to a file:
write_csv(s30_results, file = "results/s30_results.csv")
π Arabidopisis
Ensembl (Martin et al. 2023; Birney et al. 2004)is a bioinformatics project to organise all the biological information around the sequences of large genomes. The are a large number of databases and BioMart (Smedley et al. 2009) provides a consistent interface to the material. There are web-based tools to use these but the R package biomaRt
(Durinck et al. 2009, 2005) gives you programmatic access making it easier to integrate information into R dataframes.
π¬ Load the biomaRt
(Durinck et al. 2009, 2005) package:
The biomaRt
package includes a function to list all the available datasets
π¬ List the Ensembl βmartsβ available:
biomart version
1 protists_mart Ensembl Protists Genes 60
2 protists_variations Ensembl Protists Variations 60
3 fungi_mart Ensembl Fungi Genes 60
4 fungi_variations Ensembl Fungi Variations 60
5 metazoa_mart Ensembl Metazoa Genes 60
6 metazoa_variations Ensembl Metazoa Variations 60
7 plants_mart Ensembl Plants Genes 60
8 plants_variations Ensembl Plants Variations 60
plants_mart
looks like the one we want. We can see what genomes are available with names like βArabidopsisβ in this mart using the searchDatasets()
function.
π¬
searchDatasets(useEnsemblGenomes(biomart = "plants_mart"),
pattern = "Arabidopsis")
dataset description version
4 ahalleri_eg_gene Arabidopsis halleri genes (Ahal2.2) Ahal2.2
6 alyrata_eg_gene Arabidopsis lyrata genes (v.1.0) v.1.0
11 athaliana_eg_gene Arabidopsis thaliana genes (TAIR10) TAIR10
athaliana_eg_gene
is the Arabidopsis thaliana genes (TAIR10) dataset we want.
π¬ Connect to the athaliana_eg_gene
database in plants_mart
:
ensembl <- useEnsemblGenomes(biomart = "plants_mart",
dataset = "athaliana_eg_gene")
π¬ See the the types of information we can retrieve:
listAttributes(mart = ensembl) |> View()
There are many (1,714!) possible bits of information (attributes) that can be obtained.
We use the getBM()
function to retrieve information from the database. The filters
argument is used to specified what kind of identifier we are supplying in values
to retrieve information. The attributes
argument is used to select the information we want to retrieve. The values
argument is used to specify the identifiers. The mart argument is used to specify the connection we created.
π¬ Get the the gene name and a description. We also retreive the gene id so we can later join the information with the results:
You should view the resulting dataframe to see what information is available. You can use glimpse()
or View()
.
π¬ Merge the gene information with the results:
π¬ Save the results to a file:
write_csv(wild_results, file = "results/wild_results.csv")
π Leishmania
I got the information from TriTrypDB
which is a functional genomic resource for the Trypanosomatidae and PlasmodidaeI downloaded the L. mexicana MHOM/GT/2001/U1103 Full GFF and extracted the gene information and saved it as leishmania_mex.xlsx
We will import this file and join it to the results dataframe.
π¬ Load the readxl
(Wickham and Bryan 2023) package:
π¬ Import the Xenbase gene information file:
gene_info <- read_excel("meta/leishmania_mex.xlsx")
You should view the resulting dataframe to see what information is available. You can use glimpse()
or View()
.
π¬ Merge the gene information with the results:
# join the gene info with the results
pro_meta_results <- pro_meta_results |>
left_join(gene_info, by = "gene_id")
π¬ Save the results to a file:
write_csv(pro_meta_results, file = "results/pro_meta_results.csv")
π Stem cells
Ensembl (Martin et al. 2023; Birney et al. 2004)is a bioinformatics project to organise all the biological information around the sequences of large genomes. The are a large number of databases but BioMart (Smedley et al. 2009) provides a consistent interface to the material. There are web-based tools to use these but the R package biomaRt
(Durinck et al. 2009, 2005) gives you programmatic access making it easier to integrate information into R dataframes
π¬ Load the biomaRt
(Durinck et al. 2009, 2005) package:
π¬ Connect to the mouse database
# Connect to the mouse database
ensembl <- useMart(biomart = "ensembl",
dataset = "mmusculus_gene_ensembl")
π¬ See information we can retrieve:
# See what information we can retrieve
listAttributes(mart = ensembl) |> View()
There are many (2,988!) possible bits of information (attributes) that can be obtained.
We use the getBM()
function to retrieve information from the database. The filters
argument is used to specified what kind of identifier we are supplying in values
to retrieve information. The attributes
argument is used to select the information we want to retrieve. The values
argument is used to specify the identifiers. The mart argument is used to specify the connection we created.
π¬ Get the the gene name and a description. We also retrieve the gene id so we can later join the information with the results:
You should view the resulting dataframe to see what information is available. You can use glimpse()
or View()
. Notice the dataframe returned only has 279 rows - one of the ids does not have information.
π¬ Merge the gene information with the results:
# join the gene info with the results
hspc_prog_results <- hspc_prog_results |>
left_join(gene_info, by = "ensembl_gene_id")
π¬ Save the results to a file:
write_csv(hspc_prog_results, file = "results/hspc_prog_results.csv")
π€ Look after future you!
π¬ Go through you script and tidy up. You might need to :
- collect together library statements at the top of the script
- remove code that you needed to start today but which wouldnβt be needed if you were running the script from the top (e.g., importing)
- edit your comments for clarity
- rename variables for consistency or clarity
- remove house keeping or exploratory code
- restyle code, add code section headers etc
π₯³ Finished
Well Done!
Independent study following the workshop
The Code file
These contain all the code needed in the workshop even where it is not visible on the webpage.
The workshop.qmd file is the file I use to compile the practical. Qmd stands for Quarto markdown. It allows code and ordinary text to be interleaved to produce well-formatted reports including webpages. Right-click on the link and choose Save-As to download. You will be able to open the Qmd file in RStudio. Alternatively, View in Browser. Coding and thinking answers are marked with #---CODING ANSWER---
and #---THINKING ANSWER---
Pages made with R (R Core Team 2024), Quarto (Allaire et al. 2024), knitr
(Xie 2024, 2015, 2014), kableExtra
(Zhu 2021)
References
Footnotes
Bioconductor is a project that develops and supports R packages for bioinformatics.β©οΈ
Bioconductor is a project that develops and supports R packages for bioinformatics.β©οΈ
Bioconductor is a project that develops and supports R packages for bioinformatics.β©οΈ