Independent Study to prepare for workshop

Transcriptomics 2: Statistical Analysis

Emma Rand

11 February, 2025

Overview

In these slides we will:

Check where you are following week 3
Learn some concepts in differential expression
- log₂ fold changes
- multiple testing correction
- normalisation
- statistical model
Find out what packages we will use

Where should you be?

What we did in Transcriptomics 1: 👋 Hello data!

Discovered how many rows and columns we had in our datasets and what these were.
Examined the distribution of values
- across the whole dataset
- across the samples/cells (i.e., averaged over genes) to see variation between samples/cells
- across the genes (i.e., averaged over samples/cells) to see variation between genes
Filtered data for quality control and wrote to file (except 🐭)

Where should you be?

After the Transcriptomics 1: 👋 Hello data! Workshop including:

🤗 Look after future you! and
the Independent Study to consolidate, you should have:

🐸 Frog development

An RStudio Project called frogs-88H which contains:
- data-raw: xlaevis_counts_S14.csv, xlaevis_counts_S20.csv, xlaevis_counts_S30.csv
- data-processed: s30_filtered.csv, s20_filtered.csv
- Two scripts: cont-fgf-s30.R, cont-fgf-s20.R

🎄 Arabidopsis

An RStudio Project called arab-88H which contains:
- data-raw: arabidopsis-wild.csv, arabidopsis-spl7.csv
- data-processed: wild_filtered.csv, spl7_filtered.csv
- Two scripts: suff-def-wild.R, suff-def-spl7.R

💉 Leishmania

An RStudio Project called leish-88H which contains:
- data-raw: leishmania-mex-ama.csv, leishmania-mex-pro.csv, leishmania-mex-meta.csv
- data-processed: pro_meta_filtered.csv, pro_ama_filtered.csv
- Two scripts: pro_meta.R, pro_ama.R

🐭 Stem cells

An RStudio Project called mice-88H which contains:
- data-raw: surfaceome_hspc.csv, surfaceome_prog.csv, surfaceome_lthsc.csv
- data-processed: hspc_prog.csv, hspc_lthsc.csv
- Two scripts: hspc-prog.R, hspc-lthsc.R

Additionally…

Files should be organised into folders. Code should be well commented and easy to read. You should have curated your code to remove commands that were useful for troubleshooting or for understanding objects in your environment but which are not needed for the final analysis.

If you are missing files, go through:

Transcriptomics 1: 👋 Hello data! Workshop including:
🤗 Look after future you! and
the Independent Study to consolidate

Differential expression

The goal of differential expression is to test whether there is a significant difference in gene expression between groups.
A large number of computational methods have been developed for differential expression analysis
R is the leading language for differential expression analysis

Differential expression

the statistical concepts are very similar to those you have already encountered in stages 1 and 2
you are essentially doing paired- or independent-samples tests
but you are doing a lot of them! One for every gene
data need normalisation before comparison

Statistical concepts

Like familiar tests:

the type of test (the function) you use depends on the type of data you have and the type of assumptions you want to make
the tests work by comparing the variation between groups to the variation within groups.
you will get: the difference between groups, a test statistic, and a p-value
you also get an adjusted p-value which is the ‘correction’ for multiple testing

The difference between groups

The difference between groups is given as the log₂ fold change in expression between groups
A fold change is the expression in one group divided by the expression in the other group: \(\frac{A}{B}\)
we use fold changes because the absolute expression values may not be accurate and relative changes are what matters
we use log₂ fold changes because they are symmetrical around 0

Why log₂ fold change?

log₂ means log to the base 2
Suppose the expression in group A is 5 and the expression in group B is 8
\(\frac{A}{B} = \frac{5}{8}\) = 0.625 and \(\frac{B}{A} = \frac{8}{5}\) = 1.6
If B > A the range of \(\frac{A}{B}\) is 0 - 1 but the range of \(\frac{B}{A}\) is 1 - \(\infty\)
However, if we take the log₂ of \(\frac{A}{B}\) we get -0.678 and the log₂ of \(\frac{B}{A}\) is 0.678.
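
The arithmetic above can be checked in R. This is a minimal sketch; the object names (a, b, lfc_ab, lfc_ba) are illustrative only.

```r
# Expression values from the example above
a <- 5
b <- 8

# Fold changes are not symmetrical around 1
a / b   # 0.625
b / a   # 1.6

# log2 fold changes are symmetrical around 0
lfc_ab <- log2(a / b)
lfc_ba <- log2(b / a)

round(lfc_ab, 3)   # -0.678
round(lfc_ba, 3)   # 0.678
```

A log₂ fold change of 1 means a doubling and a log₂ fold change of -1 means a halving.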

Adjusted p-value

The p-value has to be adjusted because of the number of tests being done
In stage 1, we used Tukey’s HSD to adjust for multiple testing following an ANOVA
Here the Benjamini-Hochberg procedure (Benjamini and Hochberg 1995) is used to adjust for multiple testing
BH controls the False Discovery Rate (FDR)
The FDR is the expected proportion of false positives among the genes called significant
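
As a rough illustration, base R's p.adjust() applies the Benjamini-Hochberg procedure. The eight p-values below are made up for the example; the DE packages do this adjustment for you and report it alongside the unadjusted p-value.

```r
# Hypothetical unadjusted p-values for eight genes (made-up numbers)
p <- c(0.001, 0.008, 0.012, 0.03, 0.04, 0.2, 0.5, 0.9)

# Benjamini-Hochberg adjustment
p_adj <- p.adjust(p, method = "BH")

# Compare unadjusted and adjusted values side by side
data.frame(p = p, p_adj = round(p_adj, 3))

# Number of genes called significant at 0.05 before and after adjustment
sum(p < 0.05)       # 5
sum(p_adj < 0.05)   # 3
```

Adjusted p-values are never smaller than the unadjusted ones, so fewer genes pass the same threshold.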

Normalisation

Normalisation adjusts raw counts to account for factors that prevent direct comparisons
Normalisation usually influences the experimental design as well as the analysis

Normalisation

🐭 mice data are normalised
🐸 frog, 🎄 Arabidopsis and 💉 Leishmania data are raw counts (not normalised) because the differential expression method will do this.
Normalisation is a big topic. See Düren, Lederer, and Qin (2022); Bullard et al. (2010); Lytal, Ran, and An (2020); Abrams et al. (2019); Vallejos et al. (2017); Evans, Hardin, and Stoebel (2017)

Type of DE tests

A large number of computational methods have been developed for differential expression analysis
Methods vary in the types of normalisation they do, the statistical model they use, and the assumptions they make
Some of the most well-known methods are provided by: DESeq2 (Love, Huber, and Anders 2014), edgeR (Robinson, McCarthy, and Smyth 2010; McCarthy, Chen, and Smyth 2012; Chen, Lun, and Smyth 2016), limma (Ritchie et al. 2015) and scran (Lun, McCarthy, and Marioni 2016)

Type of test (the function)

DESeq2 and edgeR
- both require raw counts as input
- both assume that most genes are not DE
- both use a negative binomial distribution to model the data
- use slightly different normalisation methods: DESeq2 uses the median of ratios method; edgeR uses the trimmed mean of M values (TMM) method
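
To give a feel for the median of ratios idea, here is a simplified base R sketch. It is not DESeq2's own code: it assumes there are no zero counts, and the tiny count matrix is invented so that sample s2 is sequenced twice as deeply as s1 and s3 half as deeply.

```r
# Hypothetical raw counts: 4 genes (rows) x 3 samples (columns)
counts <- matrix(
  c( 10,  20,  5,
    100, 200, 50,
     30,  60, 15,
      8,  16,  4),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3"))
)

# Log of the geometric mean of each gene across samples
log_geo_means <- rowMeans(log(counts))

# Size factor for a sample: the median, over genes, of
# (count in that sample) / (geometric mean for that gene)
size_factors <- apply(counts, 2, function(sample_counts) {
  exp(median(log(sample_counts) - log_geo_means))
})
size_factors

# Dividing each column by its size factor makes samples comparable
normalised <- sweep(counts, 2, size_factors, "/")
normalised
```

In a real analysis DESeq2 estimates the size factors itself, which is why it needs raw counts as input.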

Type of test (the function)

scran
- works on normalised log-expression values
- performs Welch t-tests
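
A Welch t-test on a single gene can be sketched with base R's t.test(). This is not scran itself, which runs such a test for every gene; the values and the object names prog and hspc are invented for illustration.

```r
# Hypothetical log2-normalised expression for one gene in two cell types
prog <- c(5.1, 4.8, 5.6, 5.3, 4.9)
hspc <- c(6.2, 6.8, 5.9, 7.1, 6.5)

# t.test() does a Welch test (unequal variances) when var.equal = FALSE,
# which is also its default
res <- t.test(hspc, prog, var.equal = FALSE)

# On the log2 scale, the difference in means is the log2 fold change
lfc <- mean(hspc) - mean(prog)
lfc

res$statistic   # the test statistic
res$p.value     # the unadjusted p-value
```

With thousands of genes you would get thousands of these p-values, which is why they then need adjusting.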

Meta data

DE methods require two types of data: the expression data and the meta data
The meta data gives the information about the samples
It says which samples (which columns of data) are in which treatment group(s)
Meta data is usually stored in a separate file
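
A minimal sketch of how the two types of data relate. The sample names, gene names and counts are all invented; your own meta data will have its own column names.

```r
# Hypothetical expression data: 3 genes (rows) x 4 samples (columns)
counts <- data.frame(
  control_1 = c(12, 0, 340),
  control_2 = c(15, 2, 310),
  treated_1 = c(40, 1, 120),
  treated_2 = c(38, 0, 150),
  row.names = c("gene_a", "gene_b", "gene_c")
)

# Matching meta data: one row per sample, in the same order as the
# columns of the expression data
meta <- data.frame(
  sample_id = colnames(counts),
  treatment = factor(c("control", "control", "treated", "treated"))
)
meta

# Check that the samples are in the same order in both
all(meta$sample_id == colnames(counts))
```

The order matters: the DE method uses the meta data to work out which columns belong to which group.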

🐸 Frog development

Expression for the whole transcriptome (X. laevis v10.1 genome assembly)
Values are raw counts
The statistical analysis method we will use, DESeq2 (Love, Huber, and Anders 2014), requires raw counts and performs the normalisation itself
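
The general shape of a DESeq2 analysis is sketched below. It uses data simulated by DESeq2's own makeExampleDESeqDataSet() rather than any of the workshop datasets, so the column name condition and the object names are assumptions for illustration only. The workshop code will differ in detail.

```r
library(DESeq2)

set.seed(1)
# Simulated stand-in for real data: 500 genes, 6 samples, with a
# 'condition' column (A or B) in the sample information
sim <- makeExampleDESeqDataSet(n = 500, m = 6, betaSD = 1)
count_matrix <- counts(sim)             # raw counts, genes x samples
meta <- as.data.frame(colData(sim))     # meta data, one row per sample

# Combine raw counts, meta data and the design into one object
dds <- DESeqDataSetFromMatrix(countData = count_matrix,
                              colData = meta,
                              design = ~ condition)

# Normalisation, model fitting and testing happen in this one step
dds <- DESeq(dds)

# Results include log2FoldChange, pvalue and padj for every gene
res <- results(dds)
head(as.data.frame(res))
```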

🎄 Arabidopsis

Expression for the whole transcriptome, Ensembl Arabidopsis TAIR10 (Yates et al. 2022)
Values are raw counts
The statistical analysis method we will use, DESeq2 (Love, Huber, and Anders 2014), requires raw counts and performs the normalisation itself

💉 Leishmania

Expression for the whole transcriptome, L. mexicana MHOM/GT/2001/U1103 (Rogers et al. 2011)
Values are raw counts
The statistical analysis method we will use, DESeq2 (Love, Huber, and Anders 2014), requires raw counts and performs the normalisation itself

🐭 Stem cells

Expression for a subset of the transcriptome, the surfaceome
Values are log₂ normalised
The statistical analysis method we will use, scran (Lun, McCarthy, and Marioni 2016), requires normalised values

Adding gene information

The gene id is difficult to interpret
Therefore we need to add information such as the gene name and a description to the results

🐸 Frog data information comes from Xenbase (Fisher et al. 2023)
🎄 Arabidopsis information comes from TAIR10 (Yates et al. 2022)
💉 Leishmania information comes from TriTrypDB (Rogers et al. 2011)
🐭 Stem cell information comes from Ensembl (Birney et al. 2004)
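
Adding the information is a merge on the gene id. A minimal base R sketch, in which the gene ids, names, descriptions and column names are all invented:

```r
# Hypothetical results table from a DE analysis
results <- data.frame(
  gene_id = c("gene001", "gene002", "gene003"),
  log2FoldChange = c(1.8, -0.4, -2.3),
  padj = c(0.002, 0.610, 0.010)
)

# Hypothetical gene information table from a database
gene_info <- data.frame(
  gene_id = c("gene001", "gene002", "gene003", "gene004"),
  gene_name = c("abc1", "def2", "ghi3", "jkl4"),
  description = c("made-up kinase", "made-up transporter",
                  "made-up transcription factor", "made-up protease")
)

# Keep every row of the results and add the matching gene information
results_annotated <- merge(results, gene_info,
                           by = "gene_id", all.x = TRUE)
results_annotated
```

Using all.x = TRUE keeps genes that have no entry in the information table; their name and description are then NA.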

🐸 Xenbase

Xenbase is a model organism database that provides genomic, molecular, and developmental biology information about Xenopus laevis and Xenopus tropicalis.

It took me some time to find the information you need.

🐸 Xenbase

I got the information from the Xenbase information pages under Data Reports | Gene Information
This is listed: Xenbase Gene Product Information [readme] gzipped gpi (tab separated)
Click on the readme link to see the file format and columns
I downloaded xenbase.gpi.gz, unzipped it, removed header lines and the Xenopus tropicalis (taxon:8364) entries and saved it as xenbase_info.xlsx
In the workshop you will import this file and merge the information with the results file

🎄 TAIR10 through Ensembl

Ensembl creates, integrates and distributes reference datasets and analysis tools that enable genomics
BioMart (Smedley et al. 2009) provides uniform access to these large datasets
biomaRt (Durinck et al. 2009, 2005) is a Bioconductor package that gives you programmatic access to BioMart.
In the workshop you will use this package to get information you can merge with the results file

💉 TriTrypDB

I got the information from the downloads section of TriTrypDB (https://tritrypdb.org/tritrypdb/app/downloads), which is a functional genomic resource for the Trypanosomatidae
I downloaded the L. mexicana MHOM/GT/2001/U1103 Full GFF and extracted the gene information and saved it as leishmania_mex.xlsx
In the workshop you will import this file and merge the information with the results file

🐭 Ensembl

Ensembl creates, integrates and distributes reference datasets and analysis tools that enable genomics
BioMart (Smedley et al. 2009) provides uniform access to these large datasets
biomaRt (Durinck et al. 2009, 2005) is a Bioconductor package that gives you programmatic access to BioMart.
In the workshop you will use this package to get information you can merge with the results file
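
A hedged sketch of what a biomaRt query might look like for mouse genes. The function name fetch_gene_info is hypothetical and the choice of attributes is an assumption; the workshop may use different ones. Running the query needs an internet connection, so it is wrapped in a function that is only defined here, not called.

```r
library(biomaRt)

# Hypothetical helper: look up the name and description for a vector
# of Ensembl mouse gene ids
fetch_gene_info <- function(gene_ids) {
  # Connect to the Ensembl genes BioMart and choose the mouse dataset
  ensembl <- useEnsembl(biomart = "ensembl",
                        dataset = "mmusculus_gene_ensembl")

  # Ask for selected attributes, filtered to the ids we care about
  getBM(attributes = c("ensembl_gene_id",
                       "external_gene_name",
                       "description"),
        filters = "ensembl_gene_id",
        values = gene_ids,
        mart = ensembl)
}
```

You would call it with the gene ids from your results and get back a data frame with one row per gene found, ready to merge with the results.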

Packages

These packages are all on the University computers which you can access on campus or remotely using the VDS

If you want to use your own machine you will need to install the packages.

Install BiocManager from CRAN in the normal way and set the version of Bioconductor packages to install:

install.packages("BiocManager")
BiocManager::install(version = "3.19")

Install DESeq2 from Bioconductor using BiocManager:

BiocManager::install("DESeq2")

Install scran from Bioconductor using BiocManager:

BiocManager::install("scran")

Install biomaRt from Bioconductor using BiocManager:

BiocManager::install("biomaRt")

Workshops

Transcriptomics 1: Hello data. Getting to know the data. Checking the distributions of values overall, across rows and columns to check things are as we expect and detect rows/columns that need to be removed
Transcriptomics 2: Statistical Analysis. Identifying which genes are differentially expressed between treatments. This is the main analysis step. We will use different methods for bulk and single cell data.
Transcriptomics 3: Visualising. Principal Component Analysis (PCA) and volcano plots to visualise the results.

References

Pages made with R (R Core Team 2024), Quarto (Allaire et al. 2024), knitr (Xie 2024, 2015, 2014), kableExtra (Zhu 2021)

Abrams, Zachary B., Travis S. Johnson, Kun Huang, Philip R. O. Payne, and Kevin Coombes. 2019. “A Protocol to Evaluate RNA Sequencing Normalization Methods.” BMC Bioinformatics 20 (24): 679. https://doi.org/10.1186/s12859-019-3247-x.

Allaire, J. J., Charles Teague, Carlos Scheidegger, Yihui Xie, and Christophe Dervieux. 2024. “Quarto.” https://doi.org/10.5281/zenodo.5960048.

Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” J. R. Stat. Soc. Series B Stat. Methodol. 57 (1): 289–300. http://www.jstor.org/stable/2346101.

Birney, Ewan, T. Daniel Andrews, Paul Bevan, Mario Caccamo, Yuan Chen, Laura Clarke, Guy Coates, et al. 2004. “An Overview of Ensembl.” Genome Research 14 (5): 925–28. https://doi.org/10.1101/gr.1860604.

Bullard, James H., Elizabeth Purdom, Kasper D. Hansen, and Sandrine Dudoit. 2010. “Evaluation of Statistical Methods for Normalization and Differential Expression in mRNA-Seq Experiments.” BMC Bioinformatics 11 (1): 94. https://doi.org/10.1186/1471-2105-11-94.

Chen, Yunshun, Aaron T. L. Lun, and Gordon K. Smyth. 2016. “From Reads to Genes to Pathways: Differential Expression Analysis of RNA-Seq Experiments Using Rsubread and the edgeR Quasi-Likelihood Pipeline.” https://doi.org/10.12688/f1000research.8987.2.

Düren, Yannick, Johannes Lederer, and Li-Xuan Qin. 2022. “Depth Normalization of Small RNA Sequencing: Using Data and Biology to Select a Suitable Method.” Nucleic Acids Research 50 (10): e56. https://doi.org/10.1093/nar/gkac064.

Durinck, Steffen, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma, and Wolfgang Huber. 2005. “BioMart and Bioconductor: A Powerful Link Between Biological Databases and Microarray Data Analysis.” Bioinformatics 21: 3439–40.

Durinck, Steffen, Paul T. Spellman, Ewan Birney, and Wolfgang Huber. 2009. “Mapping Identifiers for the Integration of Genomic Datasets with the R/Bioconductor Package biomaRt.” Nature Protocols 4: 1184–91.

Evans, Ciaran, Johanna Hardin, and Daniel M Stoebel. 2017. “Selecting Between-Sample RNA-Seq Normalization Methods from the Perspective of Their Assumptions.” Briefings in Bioinformatics 19 (5): 776–92. https://doi.org/10.1093/bib/bbx008.

Fisher, Malcolm, Christina James-Zorn, Virgilio Ponferrada, Andrew J Bell, Nivitha Sundararaj, Erik Segerdell, Praneet Chaturvedi, et al. 2023. “Xenbase: Key Features and Resources of the Xenopus Model Organism Knowledgebase.” Genetics 224 (1): iyad018. https://doi.org/10.1093/genetics/iyad018.

Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15: 550. https://doi.org/10.1186/s13059-014-0550-8.

Lun, Aaron T. L., Davis J. McCarthy, and John C. Marioni. 2016. “A Step-by-Step Workflow for Low-Level Analysis of Single-Cell RNA-Seq Data with Bioconductor.” F1000Res. 5: 2122. https://doi.org/10.12688/f1000research.9501.2.

Lytal, Nicholas, Di Ran, and Lingling An. 2020. “Normalization Methods on Single-Cell RNA-Seq Data: An Empirical Survey.” Frontiers in Genetics 11. https://www.frontiersin.org/articles/10.3389/fgene.2020.00041.

McCarthy, Davis J., Yunshun Chen, and Gordon K. Smyth. 2012. “Differential Expression Analysis of Multifactor RNA-Seq Experiments with Respect to Biological Variation.” Nucleic Acids Research 40 (10): 4288–97. https://doi.org/10.1093/nar/gks042.

R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Ritchie, Matthew E., Belinda Phipson, Di Wu, Yifang Hu, Charity W. Law, Wei Shi, and Gordon K. Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.

Robinson, Mark D., Davis J. McCarthy, and Gordon K. Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40. https://doi.org/10.1093/bioinformatics/btp616.

Rogers, Matthew B., James D. Hilley, Nicholas J. Dickens, Jon Wilkes, Paul A. Bates, Daniel P. Depledge, David Harris, et al. 2011. “Chromosome and Gene Copy Number Variation Allow Major Structural Change Between Species and Strains of Leishmania.” Genome Research 21 (12): 2129–42. https://doi.org/10.1101/gr.122945.111.

Smedley, Damian, Syed Haider, Benoit Ballester, Richard Holland, Darin London, Gudmundur Thorisson, and Arek Kasprzyk. 2009. “BioMart Biological Queries Made Easy.” BMC Genomics 10 (1): 22. https://doi.org/10.1186/1471-2164-10-22.

Vallejos, Catalina A., Davide Risso, Antonio Scialdone, Sandrine Dudoit, and John C. Marioni. 2017. “Normalizing Single-Cell RNA Sequencing Data: Challenges and Opportunities.” Nature Methods 14 (6): 565–71. https://doi.org/10.1038/nmeth.4292.

Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC.

———. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. https://yihui.org/knitr/.

———. 2024. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.

Yates, Andrew D, James Allen, Ridwan M Amode, Andrey G Azov, Matthieu Barba, Andrés Becerra, Jyothish Bhai, et al. 2022. “Ensembl Genomes 2022: An Expanding Genome Resource for Non-Vertebrates.” Nucleic Acids Research 50 (D1): D996–1003. https://doi.org/10.1093/nar/gkab1007.

Zhu, Hao. 2021. “kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax.” https://CRAN.R-project.org/package=kableExtra.
