Transcriptomics 2: Statistical Analysis
18 September, 2024
In these slides we will:
Check where you are
Learn some concepts in differential expression
Find out what packages to install before the workshop
After the Transcriptomics 1: 👋 Hello data! Workshop, including the Independent Study to consolidate, you should have:
frogs-88H, which contains:
s30_filtered.csv, s30_summary_gene.csv, s30_summary_gene_filtered.csv, s30_summary_samp.csv (and equivalents for S14 OR S20)
cont-fgf-s30.R and cont-fgf-s20.R OR cont-fgf-s14.R
Files should be organised into folders. Code should be well commented and easy to read.
mice-88H, which contains:
hspc_summary_gene.csv, hspc_summary_samp.csv, prog_summary_gene.csv, prog_summary_samp.csv
hspc-prog.R
Files should be organised into folders. Code should be well commented and easy to read.
Either of the other examples.
Go through:
The goal of differential expression is to test whether there is a significant difference in gene expression between groups.
A large number of computational methods have been developed for differential expression analysis
R is the leading language for differential expression analysis
the statistical concepts are very similar to those you have already encountered in stages 1 and 2
you are essentially doing paired- or independent-samples tests
but you are doing a lot of them! One for every gene
data need normalisation before comparison
Like familiar tests:
the type of test (the function) you use depends on the type of data you have and the type of assumptions you want to make
the tests work by comparing the variation between groups to the variation within groups.
you will get: the difference between groups, a test statistic, and a p-value
you also get an adjusted p-value which is the ‘correction’ for multiple testing
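The idea of "one familiar test per gene" can be sketched in base R. This is only an illustration on simulated data: the matrix `expr`, the gene and sample names, and the grouping are all made up, and it uses `t.test()` rather than the methods the DE packages use.

```r
set.seed(1)  # simulated values only, so the example is reproducible

# Hypothetical log2 expression values: 5 genes (rows) x 6 samples (columns)
expr <- matrix(rnorm(30, mean = 8), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("s", 1:6)))

# First three samples are group A, last three are group B
group <- factor(rep(c("A", "B"), each = 3))

# One independent-samples t-test per gene (row)
results <- t(apply(expr, 1, function(x) {
  tt <- t.test(x[group == "B"], x[group == "A"])
  c(diff      = unname(tt$estimate[1] - tt$estimate[2]),  # mean B - mean A
    statistic = unname(tt$statistic),
    p_value   = tt$p.value)
}))
results <- as.data.frame(results)
results
```

Each row of `results` holds the difference between groups, the test statistic and the unadjusted p-value for one gene.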
The difference between groups is given as the log2 fold change in expression between groups
A fold change is the expression in one group divided by the expression in the other group
we use fold changes because the absolute expression values may not be accurate and relative changes are what matters
we use log2 fold changes because they are symmetrical around 0
log2 means log to the base 2
Suppose the expression in group A is 5 and the expression in group B is 8
A/B = 5/8 = 0.625 and B/A = 8/5 = 1.6
If B is greater than A the range of A/B is 0 to 1 but the range of B/A is 1 to infinity
However, if we take the log2 of A/B we get -0.678 and the log2 of B/A is 0.678.
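The arithmetic in this example can be checked directly in R. The variable names below are just for illustration.

```r
a <- 5  # expression in group A
b <- 8  # expression in group B

fc_ab <- a / b  # fold change A relative to B: 0.625
fc_ba <- b / a  # fold change B relative to A: 1.6

log2(fc_ab)  # about -0.678
log2(fc_ba)  # about  0.678

# The two log2 fold changes have the same size and opposite signs
log2(fc_ab) + log2(fc_ba)
```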
The p-value has to be adjusted because of the number of tests being done
In stage 1, we used Tukey’s HSD to adjust for multiple testing following an ANOVA
Here the Benjamini-Hochberg procedure (Benjamini and Hochberg 1995) is used to adjust for multiple testing
BH controls the False Discovery Rate (FDR)
The FDR is the expected proportion of false positives among the genes called significant
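A minimal sketch of the BH adjustment using base R's `p.adjust()`. The six p-values are made up for illustration. The "by hand" version is included only to show the steps and should give the same answer.

```r
# Hypothetical unadjusted p-values for six genes
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.60)

# Built-in Benjamini-Hochberg adjustment
p_adj <- p.adjust(p, method = "BH")

# The same thing by hand:
# 1. sort the p-values, 2. multiply each by n / rank,
# 3. make sure adjusted values never decrease with rank, 4. cap at 1
n   <- length(p)
ord <- order(p)
scaled  <- p[ord] * n / seq_len(n)
by_hand <- pmin(1, rev(cummin(rev(scaled))))
by_hand <- by_hand[order(ord)]  # put back in the original order

# Genes called significant at 0.05 before and after adjustment
sum(p < 0.05)
sum(p_adj < 0.05)
```

Fewer genes pass the 0.05 threshold after adjustment, which is the cost of controlling the FDR across many tests.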
Normalisation adjusts raw counts to account for factors that prevent direct comparisons
Normalisation usually influences the experimental design as well as the analysis
The 🐭 mouse data have been normalised to simplify the analysis for you; the 🐸 frog data have not but the DE method will do this for you.
Normalisation is a big topic. See Düren, Lederer, and Qin (2022); Bullard et al. (2010); Lytal, Ran, and An (2020); Abrams et al. (2019); Vallejos et al. (2017); Evans, Hardin, and Stoebel (2017)
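To give a feel for what normalisation does, here is a simplified sketch of a median of ratios style calculation in base R. It is not the DESeq2 implementation: the counts are made up, the names are hypothetical, and it assumes there are no zero counts (real data would need those genes handled).

```r
# Toy raw counts: 4 genes (rows) x 3 samples (columns); values are made up.
# sample2 and sample3 were "sequenced" 2x and 4x as deeply as sample1.
counts <- matrix(
  c( 10,  20,  40,
    100, 200, 400,
      5,  10,  20,
     50, 100, 200),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# Log of the geometric mean of each gene across samples
log_geo_means <- rowMeans(log(counts))

# Size factor for each sample: the median, over genes, of the ratio of
# that sample's count to the gene's geometric mean
size_factors <- apply(counts, 2, function(col) {
  exp(median(log(col) - log_geo_means))
})
size_factors

# Divide each sample (column) by its size factor
norm_counts <- sweep(counts, 2, size_factors, "/")
norm_counts
```

After scaling, the three samples have the same values for every gene, because in this toy example the only difference between them was sequencing depth.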
A large number of computational methods have been developed for differential expression analysis
Methods vary in the types of normalisation they do, the statistical model they use, and the assumptions they make
Some of the most well-known methods are provided by: DESeq2 (Love, Huber, and Anders 2014), edgeR (Robinson, McCarthy, and Smyth 2010; McCarthy, Chen, and Smyth 2012; Chen, Lun, and Smyth 2016), limma (Ritchie et al. 2015) and scran (Lun, McCarthy, and Marioni 2016)
DESeq2 and edgeR
DESeq2 uses the median of ratios method; edgeR uses the trimmed mean of M values (TMM) method
scran
DE methods require two types of data: the expression data and the metadata
The metadata is the information about the samples
It says which samples (columns) are in which group(s)
It is usually stored in a separate file
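A small sketch of how the two types of data relate. Everything here is hypothetical: the counts, the gene names, the sample names and the column names are invented for illustration and are not taken from the workshop files.

```r
# Expression data: genes in rows, samples in columns (made-up counts)
counts <- matrix(
  c(12, 15, 40, 38,
     0,  2,  1,  0,
    90, 85, 30, 33),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("control_1", "control_2", "fgf_1", "fgf_2"))
)

# Metadata: one row per sample, saying which group it belongs to
meta <- data.frame(
  sample    = c("control_1", "control_2", "fgf_1", "fgf_2"),
  treatment = factor(c("control", "control", "fgf", "fgf")),
  stringsAsFactors = FALSE
)

# The samples in the metadata should match the columns of the
# expression data, in the same order
identical(colnames(counts), meta$sample)
```

Checking that the column names and the metadata agree before running any analysis is a cheap way to catch mislabelled samples.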
🐸 Frog data
Expression for the whole transcriptome (X. laevis v10.1 genome assembly)
Values are raw counts
The statistical analysis method we will use, DESeq2 (Love, Huber, and Anders 2014), requires raw counts and performs the normalisation itself
🐭 Mouse data
Expression for a subset of genes, the surfaceome
Values are log2 normalised values
The statistical analysis method we will use, scran (Lun, McCarthy, and Marioni 2016), requires normalised values
Install BiocManager from CRAN in the normal way and set the version of Bioconductor packages to install:
install.packages("BiocManager")
BiocManager::install(version = "3.19")
Install DESeq2 from Bioconductor using BiocManager:
BiocManager::install("DESeq2")
Install scran from Bioconductor using BiocManager:
BiocManager::install("scran")
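One possible way to check that the installs worked, using only base R. The object names `pkgs` and `installed` are just for illustration.

```r
# Packages needed for the workshop
pkgs <- c("BiocManager", "DESeq2", "scran")

# TRUE for each package that can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
installed

# Names of any packages that still need installing
names(installed)[!installed]
```

If the last line prints `character(0)`, all three packages are available.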
Transcriptomics 1: Hello data. Getting to know the data. Checking the distributions of values.
Transcriptomics 2: Statistical Analysis. Identifying which genes are differentially expressed between treatments.
Transcriptomics 3: Visualising and Interpreting. PCA, Volcano plots and heatmaps to visualise results. Interpreting the results and finding out more about genes of interest.
Pages made with R (R Core Team 2024), Quarto (Allaire et al. 2024), knitr (Xie 2024), kableExtra (Zhu 2021)