Workshop
Transcripttranscriptomics 2: Statistical Analysis
Introduction
Session overview
In the workshop, you will learn how to perform differential expression analysis on raw counts using DESeq2
(Love, Huber, and Anders 2014) or on logged normalised expression values using scran
(Lun, McCarthy, and Marioni 2016) or both.
Set up
π Arabidopisis
π¬ Open the arab-88H
RStudio Project and the cont-low-root.R
script.
π Leishmania
π¬ Open the leish-88H
RStudio Project and the pro-meta.R
script.
π Stem cells
π¬ Open the mice-88H
RStudio Project and the hspc-prog.R
script.
Everyone
π¬ Make a new folder results
in the project directory.
This is where we will save our results.
π¬ Load tidyverse
(Wickham et al. 2019) You most likely have this code at the top of your script already.
ββ Attaching core tidyverse packages βββββββββββββββββββββββββββββββββββββββββββββββ tidyverse 2.0.0 ββ
β dplyr 1.1.3 β readr 2.1.4
β forcats 1.0.0 β stringr 1.5.0
β ggplot2 3.4.3 β tibble 3.2.1
β lubridate 1.9.3 β tidyr 1.3.0
β purrr 1.0.2
ββ Conflicts βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ tidyverse_conflicts() ββ
β dplyr::filter() masks stats::filter()
β dplyr::lag() masks stats::lag()
βΉ Use the conflicted package to force all conflicts to become errors
Have you ever stopped to think about this message? It is telling us that there are functions in the dplyr
package that have the same name as functions in the stats
package and that R will use the dplyr
version. As this is what you want, this has always been fine. It still is fine in this case. However, as you start to load more packages, you will want to know if you are using a function from a package that has the same name as a function in another loaded package. This is where the conflicted
(Wickham 2023) package comes in. Conflicted
will warn you when you are using a function that has the same name as a function in another package. You can then choose which function to use.
π¬ Load the conflicted
package:
Instead of getting a warning every time you are using a function that has a function with the same name in another package, we can declare a preference for one function over another. This is useful for the functions you use a lot or ones where you are certain you always want to use a particular function.
For example, to always use the dplyr
version of filter()
by default you can add this to the top of your script:
conflicts_prefer(dplyr::filter)
We will also want to ensure that we are using the setdiff()
function from the GenomicRanges
package which is used by DESeq2
.
conflicts_prefer(GenomicRanges::setdiff)
Import
π Arabidopisis
We need to import the root tissue data that were filtered to remove genes with 3 or 4 zeros and those where the total counts was less than 20.
π¬ Import the data from the data-processed
folder.
Now go to Differential Expression Analysis.
π Leishmania
We need to import the procyclic- and metacyclic-promastigote data that were filtered to remove genes with 4, 5 or 6 zeros and those where the total counts was less than 20.
π¬ Import the data from the data-processed
folder.
Now go to Differential Expression Analysis.
π Stem cells
We need to import the progenitor and HSPC cell data that were combined.
π¬ Import the data from the data-processed
folder..
Now go to Differential Expression Analysis.
Differential Expression Analysis
π Arabidopisis
These are the steps we will take
- Find the genes that are expressed in only one treatment group.
- Create a DESeqDataSet object. This is a special object that is used by the
DESeq2
package - Prepare the normalised counts from the DESeqDataSet object.
- Do differential expression analysis on the genes. This needs to be done on the raw counts.
All but the first step are done with the DESeq2
package
1. Genes expressed in one treatment
The genes expressed in only one treatment group are those with zeros in all replicates in one group and non-zero values in all replicates in the other group. For example, those shown here:
gene_id | gene_name | CTR1 | CTR2 | CTR3 | CTR4 | CTR5 | CTR6 | LWR1 | LWR2 | LWR3 | LWR4 | LWR5 | LWR6 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AT1G63820 | AT1G63820 | 0 | 0 | 0 | 0 | 0 | 0 | 13 | 4 | 4 | 7 | 10 | 9 |
AT1G27090 | AT1G27090 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 6 | 6 | 9 | 13 | 14 |
AT1G14950 | AT1G14950 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 7 | 4 | 9 | 13 | 17 |
AT1G18350 | MKK7 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 8 | 4 | 4 | 6 | 8 |
AT1G65910 | NAC028 | 0 | 0 | 0 | 0 | 0 | 0 | 15 | 11 | 5 | 1 | 3 | 15 |
We will use filter()
to find these genes.
π¬ Find the genes that are expressed only in the control group:
root_lowni_only <- root_filtered |>
filter(if_all(starts_with("CTR"), \(x) x == 0),
if_all(starts_with("LWR"), \(x) x > 0) )
This code creates a new data frame called root_lowni_only
that contains only the rows from root_filtered where:
- All CTR columns have a value of 0
- All LWR columns have a value greater than 0
filter()
keeps rows that match the conditions you give it. Here we give two conditions:
- All columns whose names start with βCTRβ must have a value of 0.
- All columns whose names start with βLWRβ must have a value greater than 0.
The if_all()
function applies a test to each selected column (those starting with βCTRβ or βLWRβ) for each row. It returns TRUE
only if all the selected columns pass the test.
The bit \(x) x == 0
is an anonymous function β a function without a name. In base R syntax, itβs equivalent to: function(x) x == 0
Here, x
will be the values from one column at a time, and the function checks whether they are equal to 0. We use an anonymous function because the condition needs to be applied separately to each column, but itβs simple enough that thereβs no need to define a full named function elsewhere in the script.
β How many genes are expressed only in the low nickel group?
Note: There may have been more genes with counts in the low nickel group that were removed when we filtered on the total number of counts being less that 50.
π¬ Now you find any genes that are expressed only in the control group.
β How many genes are expressed only in the control group?
β Do the results make sense to you in light of what you know about the biology?
π¬ Write all the genes that are expressed one group only to file (saved in results
)
2. Create DESeqDataSet object
π¬ Load the DESeq2 package:
A DEseqDataSet object is a custom data type that is used by DESeq2
. Custom data types are common in the Bioconductor1 packages. They are used to store data in a way that is useful for the analysis. These data types typically have data, transformed data, metadata and experimental designs within them.
To create a DESeqDataSet object, we need to provide three things:
- The raw counts - these are in
root_filtered
- The meta data which gives information about the samples and which treatment groups they belong to
- A design matrix which captures the design of the statistical model.
The counts must in a matrix rather than a dataframe. Unlike a dataframe, a matrix has columns of all the same type. That is, it will contain only the counts. The gene ids are given as row names rather than a column. The matrix()
function will create a matrix from a dataframe of columns of the same type and the select()
function can be used to remove the gene ids column.
π¬ Create a matrix of the counts:
π¬ Add the gene ids as row names to the matrix:
# add the row names to the matrix
rownames(root_count_mat) <- root_filtered$gene_id
You might want to view the matrix (click on it in your environment pane).
The metadata are in a file, arab_meta_data.txt. This is a tab-delimited file. The first column is the sample name and the other columns give the βtreatmentsβ. In this case, the treatments are tissue
(with two levels) and nickel
(with two levels).
π¬ Make a folder called meta
and save the file to it.
π¬ Read the metadata into a dataframe:
meta <- read_table("meta/arab_meta_data.txt")
π¬ Examine the resulting dataframe.
We need to add the sample names as row names to the metadata dataframe. This is because the DESeqDataSet object will use the row names to match the samples in the metadata to the samples in the counts matrix.
π¬ Add the sample names as row names to the metadata dataframe:
meta <- meta |>
column_to_rownames("sample_id")
We are dealing only with the root data so we need to remove the samples that are not in the root data.
π¬ Filter the metadata to keep only the root information:
meta_root <- meta |>
filter(tissue == "root")
We can now create the DESeqDataSet object. The design formula describes the statistical model. You should recognise the form from previous work. The ~
can be read as βexplain byβ and on its right hand side are the explanatory variables. That is, the model is counts explained by nickel status.
Note that:
- The names of the columns in the count matrix have to exactly match the names of the rows in the metadata dataframe. They also need to be in the same order.
- The names of the explanatory variables in the design formula have to match the names of columns in the metadata.
π¬ Create the DESeqDataSet object:
dds <- DESeqDataSetFromMatrix(root_count_mat,
colData = meta_root,
design = ~ nickel)
The warning βWarning: some variables in design formula are characters, converting to factorsβ just means that the variable type of nickel in the metadata dataframe is βcharβ and it has been converted into a factor type.
To help you understand what the DESeqDataSet
object we have called dds
contains, we can look its contents
The counts are in dds@assays@data@listData[["counts"]]
and the metadata are in dds@colData
but the easiest way to see them is to use the counts()
and colData()
functions from the DESeq2
package.
π¬ View the counts:
counts(dds) |> View()
You should be able to see that this is the same as in root_count_mat
.
π¬ View the column information:
colData(dds)
DataFrame with 12 rows and 2 columns
tissue nickel
<character> <factor>
CTR1 root control
CTR2 root control
CTR3 root control
CTR4 root control
CTR5 root control
... ... ...
LWR2 root low_ni
LWR3 root low_ni
LWR4 root low_ni
LWR5 root low_ni
LWR6 root low_ni
You should be able to see this is the same as in meta_root
.
3. Prepare the normalised counts
The normalised counts are the counts that have been transformed to account for the library size (i.e., the total number of reads in a sample) and the gene length. We have to first estimate the normalisation factors and store them in the DESeqDataSet object and then we can get the normalised counts.
π¬ Estimate the factors for normalisation and store them in the DESeqDataSet object:
dds <- estimateSizeFactors(dds)
π¬ Look at the factors (just for information):
sizeFactors(dds)
CTR1 CTR2 CTR3 CTR4 CTR5 CTR6 LWR1 LWR2
1.0241774 1.3312269 1.0264881 1.1672503 1.0274692 1.0104721 0.9390444 0.8343701
LWR3 LWR4 LWR5 LWR6
0.8624386 0.9117912 1.0260560 0.9740408
The normalised counts will be useful to use later. To get the normalised counts we again used the counts()
function but this time we use the normalized=TRUE
argument.
π¬ Save the normalised to a matrix:
normalised_counts <- counts(dds, normalized = TRUE)
π¬ Make a dataframe of the normalised counts, adding a column for the gene ids at the same time:
root_normalised_counts <- data.frame(normalised_counts,
gene_id = row.names(normalised_counts))
4. Differential expression analysis
We use the DESeq()
function to do the differential expression analysis. This function fits the statistical model to the data and then uses the model to calculate the significance of the difference between the treatments. It again stores the results in the DESseqDataSet object. Note that the differential expression needs the raw (unnormalised counts) as it does its own normalisation as part of the process.
π¬ Run the differential expression analysis and store the results in the same object:
dds <- DESeq(dds)
The function will take only a few moments to run on this data but can take longer for bigger datasets.
We need to define the contrasts we want to test. We want to test the difference between the treatments so we will define the contrast as control
and low_ni
.
π¬ Define the contrast:
contrast_suf <- c("nickel", "control", "low_ni")
Note that nickel
is the name of the column in the metadata dataframe and control
and low_ni
are the names of the levels in the nickel
column. By putting them in the order control
, low_ni
we are saying the fold change will be control / low_ni. This means:
- positive log fold changes indicate control > low_ni and
- negative log fold changes indicates low_ni > control.
If we had put them in the order low_ni
, control
we would have the reverse.
π¬ Extract the results from the DESseqDataSet object:
results_suf <- results(dds,
contrast = contrast_suf)
This will give us the log2 fold change, the p-value and the adjusted p-value for the comparison between the control and low_ni for each gene.
π¬ Put the results in a dataframe and add the gene ids as a column:
root_results <- data.frame(results_suf,
gene_id = row.names(results_suf))
It is useful to have the normalised counts and the statistical results in one dataframe.
π¬ Merge the two dataframes:
# merge the results with the normalised counts
root_results <- root_normalised_counts |>
left_join(root_results, by = "gene_id")
Now go to Add gene information.
π Leishmania
These are the steps we will take
- Find the genes that are expressed in only one treatment group.
- Create a DESeqDataSet object. This is a special object that is used by the
DESeq2
package - Prepare the normalised counts from the DESeqDataSet object.
- Do differential expression analysis on the genes. This needs to be done on the raw counts.
All but the first step are done with the DESeq2
package
1. Genes expressed in one treatment
The genes expressed in only one treatment group are those with zeros in all replicates in one group and non-zero values in all replicates in the other group.
We will use filter()
to find these genes.
π¬ Find the genes that are expressed only at the procyclic-promastigote stage:
pro_meta_pro_only <- pro_meta_filtered |>
filter(lm_pro_1 > 0,
lm_pro_2 > 0,
lm_pro_3 > 0,
lm_meta_1 == 0,
lm_meta_2 == 0,
lm_meta_2 == 0)
β How many genes are expressed only in the procyclic-promastigote stage group?
π¬ Now you find any genes that are expressed only at the metacyclic stage
β How many genes are expressed only at the metacyclic stage?
β Do the results make sense to you in light of what you know about the biology?
π¬ Write all the genes that are expressed one group only to file (saved in results
)
2. Create DESeqDataSet object
π¬ Load the DESeq2 package:
A DEseqDataSet object is a custom data type that is used by DESeq2
. Custom data types are common in the Bioconductor2 packages. They are used to store data in a way that is useful for the analysis. These data types typically have data, transformed data, metadata and experimental designs within them.
To create a DESeqDataSet object, we need to provide three things:
- The raw counts - these are in
pro_meta_filtered
- The meta data which gives information about the samples and which treatment groups they belong to
- A design matrix which captures the design of the statistical model.
The counts must in a matrix rather than a dataframe. Unlike a dataframe, a matrix has columns of all the same type. That is, it will contain only the counts. The gene ids are given as row names rather than a column. The matrix()
function will create a matrix from a dataframe of columns of the same type and the select()
function can be used to remove the gene ids column.
π¬ Create a matrix of the counts:
π¬ Add the gene ids as row names to the matrix:
# add the row names to the matrix
rownames(pro_meta_count_mat) <- pro_meta_filtered$gene_id
You might want to view the matrix (click on it in your environment pane).
The metadata are in a file, leish_meta_data.txt. This is a tab-delimited file. The first column is the sample name and the other columns give the βtreatmentsβ. In this case, the treatment is stage (with three levels).
π¬ Make a folder called meta
and save the file to it.
π¬ Read the metadata into a dataframe:
meta <- read_table("meta/leish_meta_data.txt")
π¬ Examine the resulting dataframe.
We need to add the sample names as row names to the metadata dataframe. This is because the DESeqDataSet object will use the row names to match the samples in the metadata to the samples in the counts matrix.
π¬ Add the sample names as row names to the metadata dataframe:
meta <- meta |>
column_to_rownames("sample_id")
We are dealing only with the wild data so we need to remove the samples that are not in the wild data.
π¬ Filter the metadata to keep only the procyclic and metacyclic information:
meta_pro_meta <- meta |>
filter(stage != "amastigotes")
We can now create the DESeqDataSet object. The design formula describes the statistical model. You should recognise the form from previous work. The ~
can be read as βexplain byβ and on its right hand side are the explanatory variables. That is, the model is counts explained by stage status.
Note that:
- The names of the columns in the count matrix have to exactly match the names of the rows in the metadata dataframe. They also need to be in the same order.
- The names of the explanatory variables in the design formula have to match the names of columns in the metadata.
π¬ Create the DESeqDataSet object:
dds <- DESeqDataSetFromMatrix(pro_meta_count_mat,
colData = meta_pro_meta,
design = ~ stage)
The warning βWarning: some variables in design formula are characters, converting to factorsβ just means that the variable type of stage in the metadata dataframe is βcharβ and it has been converted into a factor type.
To help you understand what the DESeqDataSet
object we have called dds
contains, we can look its contents
The counts are in dds@assays@data@listData[["counts"]]
and the metadata are in dds@colData
but the easiest way to see them is to use the counts()
and colData()
functions from the DESeq2
package.
π¬ View the counts:
counts(dds) |> View()
You should be able to see that this is the same as in pro_meta_count_mat
.
π¬ View the column information:
colData(dds)
DataFrame with 6 rows and 2 columns
stage replicate
<factor> <numeric>
lm_pro_1 procyclic 1
lm_pro_2 procyclic 2
lm_pro_3 procyclic 3
lm_meta_1 metacyclic 1
lm_meta_2 metacyclic 2
lm_meta_3 metacyclic 3
You should be able to see this is the same as in meta_pro_meta
.
3. Prepare the normalised counts
The normalised counts are the counts that have been transformed to account for the library size (i.e., the total number of reads in a sample) and the gene length. We have to first estimate the normalisation factors and store them in the DESeqDataSet object and then we can get the normalised counts.
π¬ Estimate the factors for normalisation and store them in the DESeqDataSet object:
dds <- estimateSizeFactors(dds)
π¬ Look at the factors (just for information):
sizeFactors(dds)
lm_pro_1 lm_pro_2 lm_pro_3 lm_meta_1 lm_meta_2 lm_meta_3
1.3029351 0.9158157 0.9943186 0.7849299 0.8443586 1.3250409
The normalised counts will be useful to use later. To get the normalised counts we again used the counts()
function but this time we use the normalized=TRUE
argument.
π¬ Save the normalised to a matrix:
normalised_counts <- counts(dds, normalized = TRUE)
π¬ Make a dataframe of the normalised counts, adding a column for the gene ids at the same time:
pro_meta_normalised_counts <- data.frame(normalised_counts,
gene_id = row.names(normalised_counts))
4. Differential expression analysis
We use the DESeq()
function to do the differential expression analysis. This function fits the statistical model to the data and then uses the model to calculate the significance of the difference between the treatments. It again stores the results in the DESseqDataSet object. Note that the differential expression needs the raw (unnormalised counts) as it does its own normalisation as part of the process.
π¬ Run the differential expression analysis and store the results in the same object:
dds <- DESeq(dds)
The function will take only a few moments to run on this data but can take longer for bigger datasets.
We need to define the contrasts we want to test. We want to test the difference between the treatments so we will define the contrast as procyclic
and metacyclic
.
π¬ Define the contrast:
contrast_pro_meta <- c("stage", "procyclic", "metacyclic")
Note that stage
is the name of the column in the metadata dataframe and procyclic
and metacyclic
are the names of the levels in the stage
column. By putting them in the order procyclic
, metacyclic
we are saying the fold change will be procyclic / metacyclic. This means:
- positive log fold changes indicate procyclic > metacyclic and
- negative log fold changes indicates metacyclic > procyclic.
If we had put them in the order metacyclic
, procyclic
we would have the reverse.
π¬ Extract the results from the DESseqDataSet object:
results_pro_meta <- results(dds,
contrast = contrast_pro_meta)
This will give us the log2 fold change, the p-value and the adjusted p-value for the comparison between procyclic and metacyclic stage for each gene
π¬ Put the results in a dataframe and add the gene ids as a column:
pro_meta_results <- data.frame(results_pro_meta,
gene_id = row.names(results_pro_meta))
It is useful to have the normalised counts and the statistical results in one dataframe.
π¬ Merge the two dataframes:
# merge the results with the normalised counts
pro_meta_results <- pro_meta_normalised_counts |>
left_join(pro_meta_results, by = "gene_id")
Now go to Add gene information.
π Stem cells
These are the steps we will take
- Find the genes that are expressed in only one cell type (the prog or the hspc).
- Prepare the data for differential expression analysis with the
scran
package. - Do differential expression analysis on the genes using the
scran
package. This needs to be done on the logged normalised counts.
1. Genes expressed in one cell type
The genes expressed in only cell type are those with zeros in all the cells of the other type. We can find these by summing the expression values for each gene across the cells of one type and filtering for those that are zero.
To do row wise aggregates such as the sum across rows we can use the rowwise()
function. c_across()
allows us to use the colon notation to select columns. This is very useful when you have a lot of columns because it would be annoying to have to list all of them. HSPC_001:HSPC_852
means all the columns from HSPC_001
to HSPC_852
.
π¬ Find the genes expressed only in progenitor cells (i.e., those that are 0 in every HSPC cell):
# A tibble: 0 Γ 1,500
# Rowwise:
# βΉ 1,500 variables: ensembl_gene_id <chr>, HSPC_001 <dbl>, HSPC_002 <dbl>,
# HSPC_003 <dbl>, HSPC_004 <dbl>, HSPC_006 <dbl>, HSPC_008 <dbl>,
# HSPC_009 <dbl>, HSPC_011 <dbl>, HSPC_012 <dbl>, HSPC_014 <dbl>,
# HSPC_015 <dbl>, HSPC_016 <dbl>, HSPC_017 <dbl>, HSPC_018 <dbl>,
# HSPC_020 <dbl>, HSPC_021 <dbl>, HSPC_022 <dbl>, HSPC_023 <dbl>,
# HSPC_024 <dbl>, HSPC_025 <dbl>, HSPC_026 <dbl>, HSPC_027 <dbl>,
# HSPC_028 <dbl>, HSPC_030 <dbl>, HSPC_031 <dbl>, HSPC_033 <dbl>, β¦
We already know from last weekβs work that there are no genes that are zero across all the cells (both types). If we did not know that, we would need to add |> filter(sum(c_across(Prog_001:Prog_852)) != 0)
meaning zero in all the HSPC but not zero in all the Prog
β How many genes are expressed only in the progenitor cells only
π¬ Now you find any genes that are expressed only in the HSPC cells.
β How many genes are expressed only in the HSPC cells?
β Do the results make sense to you in light of what you know about the biology?
π¬ Write all the genes that are expressed one cell type only to file (saved in results
)
2. Prepare the data for analysis with scran
scran
can use a matrix or a dataframe of counts but these must be log normalised counts. If using a dataframe, the columns must only contain the expression values (not the gene ids). The rows can be named to retain the gene ids.
hspc_prog
is a dataframe so we will use the ensembl gene ids to name the rows and remove the gene ids from the dataframe.
π¬ Add the gene ids as the row names:
hspc_prog <- hspc_prog |>
column_to_rownames("ensembl_gene_id")
Like DESeq2
, scran
needs metadata to define which columns were in which group. Instead of having this is a file, we will create a vector that indicates which column belongs to which cell type.
π¬ Create a vector that indicates which column belongs to which cell type:
The number of times each cell type is repeated is the number of cells of that type. Do check that the length of the cell_type
vector is the same as the number of columns in the hspc_prog
dataframe.
3. Differential expression analysis
π¬ Load the scran
package:
Differential expression is carried out with the findMarkers()
function. It takes two arguments. The first argument is the dataframe containing the data and the second argument is the vector indicating which columns are in which cell type. Make sure that the order of the cell types in the vector is appropriate for the order of the columns in the dataframe.
π¬ Run the differential expression analysis:
results_hspc_prog <- findMarkers(hspc_prog,
cell_type)
The output is a list object which, rather unnecessarily, includes two dataframes. This is not really necessary as the results are the same except for the fold change having a different sign.
-
The dataframe
results_hspc_prog$prog
is log prog - log hspc (i.e.,Prog/HSPC). This means:- Positive fold change: prog is higher than hspc
- Negative fold change: hspc is higher than prog
Top | p.value | FDR | summary.logFC | logFC.hspc | ensembl_gene_id | |
---|---|---|---|---|---|---|
ENSMUSG00000024399 | 1 | 0 | 0 | -4.3900364 | -4.3900364 | ENSMUSG00000024399 |
ENSMUSG00000079563 | 2 | 0 | 0 | -4.0005511 | -4.0005511 | ENSMUSG00000079563 |
ENSMUSG00000042817 | 3 | 0 | 0 | -3.6990453 | -3.6990453 | ENSMUSG00000042817 |
ENSMUSG00000063856 | 4 | 0 | 0 | 0.9115206 | 0.9115206 | ENSMUSG00000063856 |
ENSMUSG00000024053 | 5 | 0 | 0 | 3.0351647 | 3.0351647 | ENSMUSG00000024053 |
ENSMUSG00000069792 | 6 | 0 | 0 | -3.4659544 | -3.4659544 | ENSMUSG00000069792 |
-
The dataframe
results_hspc_prog$hspc
is log hspc - log prog (i.e., HSPC/Prog). This means:- Positive fold change: hspc is higher than prog
- Negative fold change: prog is higher than hspc
Top | p.value | FDR | summary.logFC | logFC.prog | ensembl_gene_id | |
---|---|---|---|---|---|---|
ENSMUSG00000024399 | 1 | 0 | 0 | 4.3900364 | 4.3900364 | ENSMUSG00000024399 |
ENSMUSG00000079563 | 2 | 0 | 0 | 4.0005511 | 4.0005511 | ENSMUSG00000079563 |
ENSMUSG00000042817 | 3 | 0 | 0 | 3.6990453 | 3.6990453 | ENSMUSG00000042817 |
ENSMUSG00000063856 | 4 | 0 | 0 | -0.9115206 | -0.9115206 | ENSMUSG00000063856 |
ENSMUSG00000024053 | 5 | 0 | 0 | -3.0351647 | -3.0351647 | ENSMUSG00000024053 |
ENSMUSG00000069792 | 6 | 0 | 0 | 3.4659544 | 3.4659544 | ENSMUSG00000069792 |
We only need one of these results dataframes and we will use the prog
one. It does not matter which one you use but you do need keep the direction of the foldchange in mind when interpreting the results.
Notice that the dataframes are ordered by significance rather than the original gene order.
It is useful to have the normalised counts and the statistical results in one dataframe to which we will add the gene information from Ensembl. Having all the information together will make it easier to interpret the results and select genes of interest. We will need to extract the results from the list object, add the gene ids and then join with the normalised counts.
π¬ Extract the results dataframe from the list object and add the gene ids as a column:
hspc_prog_results <- data.frame(results_hspc_prog$prog,
ensembl_gene_id = row.names(results_hspc_prog$prog))
π¬ Return the ensembl gene ids as a column to the normalised counts:
hspc_prog <- hspc_prog |>
rownames_to_column(var = "ensembl_gene_id")
π¬ Merge the results dataframe with the normalised counts:
# merge the results with the normalised counts
hspc_prog_results <- hspc_prog_results |>
left_join(hspc_prog, by = "ensembl_gene_id")
Now go to Add gene information.
Add gene information
π Arabidopisis
Ensembl (Martin et al. 2023; Birney et al. 2004)is a bioinformatics project to organise all the biological information around the sequences of large genomes. The are a large number of databases and BioMart (Smedley et al. 2009) provides a consistent interface to the material. There are web-based tools to use these but the R package biomaRt
(Durinck et al. 2009, 2005) gives you programmatic access making it easier to integrate information into R dataframes.
π¬ Load the biomaRt
(Durinck et al. 2009, 2005) package:
The biomaRt
package includes a function to list all the available datasets
π¬ List the Ensembl βmartsβ available:
biomart version
1 protists_mart Ensembl Protists Genes 61
2 protists_variations Ensembl Protists Variations 61
3 fungi_mart Ensembl Fungi Genes 61
4 fungi_variations Ensembl Fungi Variations 61
5 metazoa_mart Ensembl Metazoa Genes 61
6 metazoa_variations Ensembl Metazoa Variations 61
7 plants_mart Ensembl Plants Genes 61
8 plants_variations Ensembl Plants Variations 61
plants_mart
looks like the one we want. We can see what genomes are available with names like βArabidopsisβ in this mart using the searchDatasets()
function.
π¬
searchDatasets(useEnsemblGenomes(biomart = "plants_mart"),
pattern = "Arabidopsis")
dataset description version
4 ahalleri_eg_gene Arabidopsis halleri genes (Ahal2.2) Ahal2.2
6 alyrata_eg_gene Arabidopsis lyrata genes (v.1.0) v.1.0
11 athaliana_eg_gene Arabidopsis thaliana genes (TAIR10) TAIR10
athaliana_eg_gene
is the Arabidopsis thaliana genes (TAIR10) dataset we want.
π¬ Connect to the athaliana_eg_gene
database in plants_mart
:
ensembl <- useEnsemblGenomes(biomart = "plants_mart",
dataset = "athaliana_eg_gene")
π¬ See the the types of information we can retrieve:
listAttributes(mart = ensembl) |> View()
There are many (1,796!) possible bits of information (attributes) that can be obtained.
We use the getBM()
function to retrieve information from the database. The filters
argument is used to specified what kind of identifier we are supplying in values
to retrieve information. The attributes
argument is used to select the information we want to retrieve. The values
argument is used to specify the identifiers. The mart argument is used to specify the connection we created.
π¬ Get the the gene name and a description. We also retreive the gene id so we can later join the information with the results:
You should view the resulting dataframe to see what information is available. You can use glimpse()
or View()
.
π¬ Merge the gene information with the results:
π¬ Save the results to a file:
write_csv(root_results, file = "results/root_results.csv")
Now go to Look after future you.
π Leishmania
I got the information from TriTrypDB
which is a functional genomic resource for the Trypanosomatidae and PlasmodidaeI downloaded the L. mexicana MHOM/GT/2001/U1103 Full GFF and extracted the gene information and saved it as leishmania_mex.xlsx
We will import this file and join it to the results dataframe.
π¬ Load the readxl
(Wickham and Bryan 2023) package:
π¬ Import the Xenbase gene information file:
gene_info <- read_excel("meta/leishmania_mex.xlsx")
Error in read_excel("meta/leishmania_mex.xlsx"): could not find function "read_excel"
You should view the resulting dataframe to see what information is available. You can use glimpse()
or View()
.
π¬ Merge the gene information with the results:
# join the gene info with the results
pro_meta_results <- pro_meta_results |>
left_join(gene_info, by = "gene_id")
Error in `left_join()`:
! Join columns in `y` must be present in the data.
β Problem with `gene_id`.
π¬ Save the results to a file:
write_csv(pro_meta_results, file = "results/pro_meta_results.csv")
Now go to Look after future you.
π Stem cells
Ensembl (Martin et al. 2023; Birney et al. 2004)is a bioinformatics project to organise all the biological information around the sequences of large genomes. The are a large number of databases but BioMart (Smedley et al. 2009) provides a consistent interface to the material. There are web-based tools to use these but the R package biomaRt
(Durinck et al. 2009, 2005) gives you programmatic access making it easier to integrate information into R dataframes
π¬ Load the biomaRt
(Durinck et al. 2009, 2005) package:
π¬ Connect to the mouse database
# Connect to the mouse database
ensembl <- useMart(biomart = "ensembl",
dataset = "mmusculus_gene_ensembl")
π¬ See information we can retrieve:
# See what information we can retrieve
listAttributes(mart = ensembl) |> View()
There are many (~3000!) possible bits of information (attributes) that can be obtained.
We use the getBM()
function to retrieve information from the database. The filters
argument is used to specified what kind of identifier we are supplying in values
to retrieve information. The attributes
argument is used to select the information we want to retrieve. The values
argument is used to specify the identifiers. The mart argument is used to specify the connection we created.
π¬ Get the the gene name and a description. We also retrieve the gene id so we can later join the information with the results:
You should view the resulting dataframe to see what information is available. You can use glimpse()
or View()
. Notice the dataframe returned only has 422 rows - one of the ids does not have information.
π¬ Merge the gene information with the results:
# join the gene info with the results
hspc_prog_results <- hspc_prog_results |>
left_join(gene_info, by = "ensembl_gene_id")
π¬ Save the results to a file:
write_csv(hspc_prog_results, file = "results/hspc_prog_results.csv")
Now go to Look after future you.
π€ Look after future you!
π¬ Go through you script and tidy up. You might need to :
- collect together library statements at the top of the script
- remove code that you needed to start today but which wouldnβt be needed if you were running the script from the top (e.g., importing)
- edit your comments for clarity
- rename variables for consistency or clarity
- remove house keeping or exploratory code
- restyle code, add code section headers etc
π₯³ Finished
Well Done!
Independent study following the workshop
The Code file
These contain all the code needed in the workshop even where it is not visible on the webpage.
The workshop.qmd file is the file I use to compile the practical. Qmd stands for Quarto markdown. It allows code and ordinary text to be interleaved to produce well-formatted reports including webpages. View the source code for this workshop using the </> Code
button at the top of the page. Coding and thinking answers are marked with #---CODING ANSWER---
and #---THINKING ANSWER---
Pages made with R (R Core Team 2024), Quarto (Allaire et al. 2024), knitr
(Xie 2024, 2015, 2014), kableExtra
(Zhu 2021)
References
Footnotes
Bioconductor is a project that develops and supports R packages for bioinformatics.β©οΈ
Bioconductor is a project that develops and supports R packages for bioinformatics.β©οΈ