install.packages("BiocManager")
BiocManager::install(version = "3.19")
Transcriptomics 2: Statistical Analysis
6 November, 2024
In these slides we will:
Check where you are following week 3
learn some concepts in differential expression
Find out what packages we will use
After the Transcriptomics 1: 👋 Hello data! Workshop including:
the Independent Study to consolidate, you should have:
An RStudio Project called frogs-88H
which contains:
xlaevis_counts_S14.csv
, xlaevis_counts_S20.csv
, xlaevis_counts_S30.csv
s30_filtered.csv
, s20_filtered.csv
cont-fgf-s30.R
, cont-fgf-s20.R
An RStudio Project called arab-88H
which contains:
arabidopsis-wild.csv
, arabidopsis-spl7.csv
wild_filtered.csv
, spl7_filtered.csv
suff-def-wild.R
, suff-def-spl7.R
An RStudio Project called leish-88H
which contains:
leishmania-mex-ama.csv
, leishmania-mex-pro.csv
, leishmania-mex-meta.csv
pro_meta_filtered.csv
, pro_ama_filtered.csv
pro_meta.R
, pro_ama.R
An RStudio Project called mice-88H
which contains
surfaceome_hspc.csv
, surfaceome_prog.csv
, surfaceome_lthsc.csv
hspc_prog.csv
, hspc_lthsc.csv
hspc-prog.R
, hspc-lthsc.R
Files should be organised into folders. Code should well commented and easy to read. You should have curated your code to remove unnecessary commands that were useful to troubleshoot or understand objects in your environment but which are not needed for the final analysis.
If you are missing files, go through:
The goal of differential expression is to test whether there is a significant difference in gene expression between groups.
A large number of computational methods have been developed for differential expression analysis
R is the leading language for differential expression analysis
the statistical concepts are very similar to those you have already encountered in stages 1 and 2
you are essentially doing paired- or independent-samples tests
but you are doing a lot of them! One for every gene
data need normalisation before comparison
Like familiar tests:
the type of test (the function) you use depends on the type of data you have and the type of assumptions you want to make
the tests work by comparing the variation between groups to the variation within groups.
you will get: the difference between groups, a test statistic, and a p-value
you also get an adjusted p-value which is the ‘correction’ for multiple testing
The difference between groups is given as the log2 fold change in expression between groups
A fold change is the expression in one group divided by the expression in the other group: \(\frac{A}{B}\)
we use fold changes because the absolute expression values may not be accurate and relative changes are what matters
we use log2 fold changes because they are symmetrical around 0
log2 means log to the base 2
Suppose the expression in group A is 5 and the expression in group B is 8
\(\frac{A}{B} = \frac{5}{8}\) = 0.625 and \(\frac{B}{A} = \frac{8}{5}\) = 1.6
If B > A the range of \(\frac{A}{B}\) is 0 - 1 but the range of \(\frac{B}{A}\) is 1 - \(\infty\)
However, if we take the log2 of \(\frac{A}{B}\) we get -0.678 and the log2 of \(\frac{B}{A}\) is 0.678.
The p-value has to be adjusted because of the number of tested being done
In stage 1, we used Tukey’s HSD to adjust for multiple testing following an ANOVA
Here the Benjamini-Hochberg procedure (Benjamini and Hochberg 1995) is used to adjust for multiple testing
BH controls the False Discovery Rate (FDR)
The FDR is the proportion of false positives among the genes called significant
Normalisation adjusts raw counts to account for factors that prevent direct comparisons
Normalisation usually influences the experimental design as well as the analysis
🐭 mice data are normalised
🐸 frog, 🎄 Arabidopisis and 💉 Leishmania data are raw counts (not normalised) because the differential expression method will do this.
Normalisation is a big topic. See Düren, Lederer, and Qin (2022); Bullard et al. (2010); Lytal, Ran, and An (2020); Abrams et al. (2019); Vallejos et al. (2017); Evans, Hardin, and Stoebel (2017)
A large number of computational methods have been developed for differential expression analysis
Methods vary in the types of normalisation they do, the statistical model they use, and the assumptions they make
Some of the most well-known methods are provided by: DESeq2
(Love, Huber, and Anders 2014), edgeR
(Robinson, McCarthy, and Smyth 2010; McCarthy, Chen, and Smyth 2012; Chen, Lun, and Smyth 2016), limma
(Ritchie et al. 2015) and scran
(Lun, McCarthy, and Marioni 2016)
DESeq2
and edgeR
DESeq2
uses the median of ratios method; edgeR
uses the trimmed mean of M values (TMM) methodscran
DE methods require two types of data: the expression data and the meta data
The meta data gives the information about the samples
It says which samples (which columns of data) are in which treatment group (s)
Meta data is usually stored in a separate file
Expression for the whole transcriptome X. laevis v10.1 genome assembly
Values are raw counts
The statistical analysis method we will use DESeq2
(Love, Huber, and Anders 2014) requires raw counts and performs the normalisation itself
Expression for the whole transcriptome ENSEMBL Arabidopsis TAIR10(Yates et al. 2022)
Values are raw counts
The statistical analysis method we will use DESeq2
(Love, Huber, and Anders 2014) requires raw counts and performs the normalisation itself
Expression for the whole transcriptome L. mexicana MHOM/GT/2001/U1103(Rogers et al. 2011)
Values are raw counts
The statistical analysis method we will use DESeq2
(Love, Huber, and Anders 2014) requires raw counts and performs the normalisation itself
Expression for a subset of the transcriptome, the surfaceome
Values are log2 normalised values
The statistical analysis method we will use scran
(Lun, McCarthy, and Marioni 2016) requires normalised values
The gene id is difficult to interpret
Therefore we need to add information such as the gene name and a description to the results
🐸 Frog data information comes from Xenbase (Fisher et al. 2023)
🎄 Arabidopisis information comes from TAIR10 (Yates et al. 2022)
💉 Leishmania information comes TriTrypDB (Rogers et al. 2011)
🐭 Stem cell information comes from Ensembl (Birney et al. 2004)
Xenbase is a model organism database that provides genomic, molecular, and developmental biology information about Xenopus laevis and Xenopus tropicalis.
It took me some time to find the information you need.
I got the information from the Xenbase information pages under Data Reports | Gene Information
This is listed: Xenbase Gene Product Information [readme] gzipped gpi (tab separated)
Click on the readme link to see the file format and columns
I downloaded xenbase.gpi.gz, unzipped it, removed header lines and the Xenopus tropicalis (taxon:8364) entries and saved it as xenbase_info.xlsx
In the workshop you will import this file and merge the information with the results file
Ensembl creates, integrates and distributes reference datasets and analysis tools that enable genomics
BioMart (Smedley et al. 2009) provides uniform access to these large datasets
biomaRt
(Durinck et al. 2009, 2005) is a Bioconductor package gives you programmatic access to BioMart.
In the workshop you use this package to get information you can merge with the results file
I got the information from TriTrypDB
which is a functional genomic resource for the Trypanosomatidae and Plasmodidae
I downloaded the L. mexicana MHOM/GT/2001/U1103 Full GFF and extracted the gene information and saved it as leishmania_mex.xlsx
In the workshop you will import this file and merge the information with the results file
Ensembl creates, integrates and distributes reference datasets and analysis tools that enable genomics
BioMart (Smedley et al. 2009) provides uniform access to these large datasets
biomaRt
(Durinck et al. 2009, 2005) is a Bioconductor package gives you programmatic access to BioMart.
In the workshop you use this package to get information you can merge with the results file
These packages are all on the University computers which you can access on campus or remotely using the VDS
If you want to use your own machine you will need to install the packages.
Install BiocManager
from CRAN in the the normal way and set the version of Bioconductor packages to install:
install.packages("BiocManager")
BiocManager::install(version = "3.19")
Install DESeq2
from Bioconductor using BiocManager:
BiocManager::install("DESeq2")
Install scran
from Bioconductor using BiocManager:
BiocManager::install("scran")
Install biomaRt
from Bioconductor using BiocManager:
BiocManager::install("biomaRt")
Transcriptomics 1: Hello data. Getting to know the data. Checking the distributions of values overall, across rows and columns to check things are as we expect and detect rows/columns that need to be removed
Transcriptomics 2: Statistical Analysis. Identifying which genes are differentially expressed between treatments. This is the main analysis step. We will use different methods for bulk and single cell data.
Transcriptomics 3: Visualising. Principal Component Analysis (PCA) volcano plots to visualise the results of the
Pages made with R (R Core Team 2024), Quarto (Allaire et al. 2024), knitr
(Xie 2024, 2015, 2014), kableExtra
(Zhu 2021)