Independent Study to prepare for workshop

Transcriptomics 2: Statistical Analysis

Emma Rand

6 November, 2024

Overview

In these slides we will:

  • Check where you are following week 3

  • Learn some concepts in differential expression

    • log2 fold changes
    • Multiple testing correction
    • Normalisation
    • Statistical model
  • Find out what packages we will use

Where should you be?

What we did in Transcriptomics 1: 👋 Hello data!

  • Discovered how many rows and columns we had in our datasets and what these were.
  • Examined the distribution of values
    • across the whole dataset
    • across the samples/cells (i.e., averaged over genes) to see variation between samples/cells
    • across the genes (i.e., averaged over samples/cells) to see variation between genes
  • Filtered data for quality control and wrote to file (except 🐭)

Where should you be?

After the Transcriptomics 1: 👋 Hello data! Workshop including:

🐸 Frog development

  • An RStudio Project called frogs-88H which contains:

    • data-raw: xlaevis_counts_S14.csv, xlaevis_counts_S20.csv, xlaevis_counts_S30.csv
    • data-processed: s30_filtered.csv, s20_filtered.csv
    • Two scripts: cont-fgf-s30.R, cont-fgf-s20.R

🎄 Arabidopsis

  • An RStudio Project called arab-88H which contains:

    • data-raw: arabidopsis-wild.csv, arabidopsis-spl7.csv
    • data-processed: wild_filtered.csv, spl7_filtered.csv
    • Two scripts: suff-def-wild.R, suff-def-spl7.R

💉 Leishmania

  • An RStudio Project called leish-88H which contains:

    • data-raw: leishmania-mex-ama.csv, leishmania-mex-pro.csv, leishmania-mex-meta.csv
    • data-processed: pro_meta_filtered.csv, pro_ama_filtered.csv
    • Two scripts: pro_meta.R, pro_ama.R

🐭 Stem cells

  • An RStudio Project called mice-88H which contains:

    • data-raw: surfaceome_hspc.csv, surfaceome_prog.csv, surfaceome_lthsc.csv
    • data-processed: hspc_prog.csv, hspc_lthsc.csv
    • Two scripts: hspc-prog.R, hspc-lthsc.R

Additionally…

Files should be organised into folders. Code should be well commented and easy to read. You should have curated your code to remove commands that were useful for troubleshooting or for understanding objects in your environment but which are not needed for the final analysis.

If you are missing files, go through:

Differential expression

  • The goal of differential expression is to test whether there is a significant difference in gene expression between groups.

  • A large number of computational methods have been developed for differential expression analysis

  • R is the leading language for differential expression analysis

Differential expression

  • the statistical concepts are very similar to those you have already encountered in stages 1 and 2

  • you are essentially doing paired- or independent-samples tests

  • but you are doing a lot of them! One for every gene

  • data need normalisation before comparison

Statistical concepts

Like familiar tests:

  • the type of test (the function) you use depends on the type of data you have and the type of assumptions you want to make

  • the tests work by comparing the variation between groups to the variation within groups.

  • you will get: the difference between groups, a test statistic, and a p-value

  • you also get an adjusted p-value which is the ‘correction’ for multiple testing

The difference between groups

  • The difference between groups is given as the log2 fold change in expression between groups

  • A fold change is the expression in one group divided by the expression in the other group: \(\frac{A}{B}\)

  • we use fold changes because the absolute expression values may not be accurate and relative changes are what matters

  • we use log2 fold changes because they are symmetrical around 0

Why log2 fold change?

  • log2 means log to the base 2

  • Suppose the expression in group A is 5 and the expression in group B is 8

  • \(\frac{A}{B} = \frac{5}{8}\) = 0.625 and \(\frac{B}{A} = \frac{8}{5}\) = 1.6

  • If B > A the range of \(\frac{A}{B}\) is 0 - 1 but the range of \(\frac{B}{A}\) is 1 - \(\infty\)

  • However, if we take the log2 of \(\frac{A}{B}\) we get -0.678 and the log2 of \(\frac{B}{A}\) is 0.678.
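
The arithmetic above can be checked in R. This is a minimal sketch using the example values of 5 and 8; the object names (A, B, lfc_ab, lfc_ba) are just for illustration.

```r
# Expression values from the example above
A <- 5
B <- 8

# Plain fold changes are not symmetrical: one is squeezed between 0 and 1
A / B   # 0.625
B / A   # 1.6

# log2 fold changes are symmetrical around 0
lfc_ab <- log2(A / B)
lfc_ba <- log2(B / A)
round(c(lfc_ab, lfc_ba), 3)   # -0.678  0.678

# Same size, opposite sign
all.equal(lfc_ab, -lfc_ba)    # TRUE
```

A log2 fold change of 1 means a doubling and -1 means a halving, so the size of the change reads the same in both directions.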

Adjusted p-value

  • The p-value has to be adjusted because of the number of tests being done

  • In stage 1, we used Tukey’s HSD to adjust for multiple testing following an ANOVA

  • Here the Benjamini-Hochberg procedure (Benjamini and Hochberg 1995) is used to adjust for multiple testing

  • BH controls the False Discovery Rate (FDR)

  • The FDR is the proportion of false positives among the genes called significant
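
To see what the adjustment does, here is a small sketch using the base R function p.adjust(). The five p-values are made up for illustration and are not from any of the course datasets.

```r
# Made-up p-values for five genes (illustration only, not course data)
p <- c(gene1 = 0.001, gene2 = 0.008, gene3 = 0.039, gene4 = 0.041, gene5 = 0.20)

# Benjamini-Hochberg adjustment
p_adj <- p.adjust(p, method = "BH")
p_adj

# Number of genes called significant at 0.05 before and after adjustment
sum(p < 0.05)       # 4
sum(p_adj < 0.05)   # 2
```

The adjusted p-values are never smaller than the unadjusted ones, so fewer genes pass the same threshold. DE packages normally report the adjusted p-value for you; this only shows the idea.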

Normalisation

  • Normalisation adjusts raw counts to account for factors that prevent direct comparisons

  • Normalisation usually influences the experimental design as well as the analysis

Normalisation

  • 🐭 mice data are normalised

  • 🐸 frog, 🎄 Arabidopsis and 💉 Leishmania data are raw counts (not normalised) because the differential expression method will do this.

  • Normalisation is a big topic. See Düren, Lederer, and Qin (2022); Bullard et al. (2010); Lytal, Ran, and An (2020); Abrams et al. (2019); Vallejos et al. (2017); Evans, Hardin, and Stoebel (2017)

Type of DE tests

Type of test (the function)

  • DESeq2 and edgeR
    • both require raw counts as input
    • both assume that most genes are not DE
    • both use a negative binomial distribution to model the data
    • use slightly different normalisation methods: DESeq2 uses the median of ratios method; edgeR uses the trimmed mean of M values (TMM) method
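
The idea behind the median of ratios method can be sketched in base R. This is a simplified illustration, not the DESeq2 code itself: the counts are made up, and the object names (counts, log_geo_mean, size_factors, normalised) are assumptions for the example. Sample s2 is sequenced twice as deeply as s1, and only gene5 is really different.

```r
# Made-up raw counts: 5 genes (rows) x 2 samples (columns)
counts <- cbind(
  s1 = c(10, 20, 30, 40, 50),
  s2 = c(20, 40, 60, 80, 400)
)
rownames(counts) <- paste0("gene", 1:5)

# Per-gene geometric mean across samples, calculated on the log scale
log_geo_mean <- rowMeans(log(counts))

# Genes with a zero count have a log geometric mean of -Inf and are skipped
keep <- is.finite(log_geo_mean)

# Size factor for each sample: the median ratio of its counts to the
# per-gene geometric mean
size_factors <- apply(counts, 2, function(sample_counts) {
  exp(median(log(sample_counts[keep]) - log_geo_mean[keep]))
})
size_factors

# Divide each column by its size factor
normalised <- sweep(counts, 2, size_factors, "/")
normalised
```

After normalisation, gene1 to gene4 have the same value in both samples and only gene5 differs. Using the median is what relies on the assumption that most genes are not DE: a few strongly changed genes do not move it.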

Type of test (the function)

  • scran
    • works on normalized log-expression values
    • performs Welch t-tests
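
A Welch t-test for every gene can be sketched with the base R function t.test(), which does not assume equal variances by default. This does not use scran; the log-expression values, gene names and group labels below are made up for illustration.

```r
# Made-up normalised log-expression values: 2 genes x 6 cells
expr <- rbind(
  geneA = c(5.1, 4.9, 5.3, 7.0, 7.2, 6.8),
  geneB = c(3.0, 3.4, 2.9, 3.1, 3.3, 3.0)
)

# Which group each column (cell) belongs to
group <- factor(c("hspc", "hspc", "hspc", "prog", "prog", "prog"))

# One Welch t-test per gene (row); var.equal = FALSE is the default
welch_p <- apply(expr, 1, function(x) {
  t.test(x[group == "hspc"], x[group == "prog"])$p.value
})
welch_p

# Adjust for the number of tests, as for any DE analysis
p.adjust(welch_p, method = "BH")
```

Here geneA differs clearly between the groups and geneB does not. In the workshop the package does this for every gene at once and returns the results as a table.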

Meta data

  • DE methods require two types of data: the expression data and the meta data

  • The meta data gives the information about the samples

  • It says which samples (which columns of data) are in which treatment group(s)

  • Meta data is usually stored in a separate file
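
As a rough sketch of how the two types of data fit together, the example below builds a tiny count matrix and a matching meta data frame. All sample names, gene names and values are hypothetical.

```r
# Hypothetical count matrix: 2 genes (rows) x 4 samples (columns)
counts <- matrix(
  c(12, 15, 30, 28,
     0,  3,  1,  2),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("gene1", "gene2"),
                  c("ctrl_1", "ctrl_2", "trt_1", "trt_2"))
)

# Hypothetical meta data: one row per sample
meta <- data.frame(
  sample_id = c("ctrl_1", "ctrl_2", "trt_1", "trt_2"),
  treatment = factor(c("control", "control", "treated", "treated")),
  stringsAsFactors = FALSE
)

# The rows of the meta data should match the columns of the counts, in order
stopifnot(identical(colnames(counts), meta$sample_id))

# Use the meta data to pick out the columns in one treatment group
treated <- counts[, meta$sample_id[meta$treatment == "treated"]]
treated
```

Checking that the sample names match, and are in the same order, is worth doing before any analysis because the DE method relies on the meta data to know which column is which.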

🐸 Frog development

🎄 Arabidopsis

💉 Leishmania

🐭 Stem cells

  • Expression for a subset of the transcriptome, the surfaceome

  • Values are log2 normalised values

  • The statistical analysis method we will use, scran (Lun, McCarthy, and Marioni 2016), requires normalised values

Adding gene information

  • The gene id is difficult to interpret

  • Therefore we need to add information such as the gene name and a description to the results
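
Adding gene information amounts to joining two tables on the gene id. Here is a minimal sketch with base R merge(); the gene ids, names, descriptions and column names are all hypothetical, and the workshop may use a different function to do the join.

```r
# Hypothetical DE results: one row per gene id
results <- data.frame(
  gene_id        = c("g001", "g002", "g003"),
  log2FoldChange = c(2.1, -0.4, 1.3),
  padj           = c(0.001, 0.60, 0.03),
  stringsAsFactors = FALSE
)

# Hypothetical gene information from a database
gene_info <- data.frame(
  gene_id     = c("g001", "g003", "g004"),
  gene_name   = c("abc1", "xyz2", "def3"),
  description = c("example kinase", "example transporter", "example receptor"),
  stringsAsFactors = FALSE
)

# Keep every row of the results; add name and description where available
results_annotated <- merge(results, gene_info, by = "gene_id", all.x = TRUE)
results_annotated
```

Using all.x = TRUE keeps genes that have no entry in the information file (g002 here), with NA for the name and description, rather than dropping them from the results.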

🐸 Xenbase

Xenbase is a model organism database that provides genomic, molecular, and developmental biology information about Xenopus laevis and Xenopus tropicalis.

It took me some time to find the information you need.

🐸 Xenbase

  • I got the information from the Xenbase information pages under Data Reports | Gene Information

  • This is listed: Xenbase Gene Product Information [readme] gzipped gpi (tab separated)

  • Click on the readme link to see the file format and columns

  • I downloaded xenbase.gpi.gz, unzipped it, removed header lines and the Xenopus tropicalis (taxon:8364) entries and saved it as xenbase_info.xlsx

  • In the workshop you will import this file and merge the information with the results file

🎄 TAIR10 through Ensembl

  • Ensembl creates, integrates and distributes reference datasets and analysis tools that enable genomics

  • BioMart (Smedley et al. 2009) provides uniform access to these large datasets

  • biomaRt (Durinck et al. 2009, 2005) is a Bioconductor package that gives you programmatic access to BioMart.

  • In the workshop you will use this package to get information you can merge with the results file

💉 TriTrypDB

🐭 Ensembl

  • Ensembl creates, integrates and distributes reference datasets and analysis tools that enable genomics

  • BioMart (Smedley et al. 2009) provides uniform access to these large datasets

  • biomaRt (Durinck et al. 2009, 2005) is a Bioconductor package that gives you programmatic access to BioMart.

  • In the workshop you will use this package to get information you can merge with the results file

Packages

These packages are all on the University computers which you can access on campus or remotely using the VDS

If you want to use your own machine you will need to install the packages.

Install BiocManager from CRAN in the normal way and set the version of Bioconductor packages to install:

install.packages("BiocManager")
BiocManager::install(version = "3.19")

Install DESeq2 from Bioconductor using BiocManager:

BiocManager::install("DESeq2")

Install scran from Bioconductor using BiocManager:

BiocManager::install("scran")

Install biomaRt from Bioconductor using BiocManager:

BiocManager::install("biomaRt")
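
If you are not sure whether the installs worked, one way to check is sketched below. It uses the base R function requireNamespace(); the object names pkgs and installed are just for illustration.

```r
# Packages used in the workshop
pkgs <- c("DESeq2", "scran", "biomaRt")

# TRUE for each package that can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
installed

# Names of any packages that still need installing
pkgs[!installed]
```

An empty result from the last line means all three packages are available.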

Workshops

  • Transcriptomics 1: Hello data. Getting to know the data. Checking the distributions of values overall, across rows and columns to check things are as we expect and detect rows/columns that need to be removed

  • Transcriptomics 2: Statistical Analysis. Identifying which genes are differentially expressed between treatments. This is the main analysis step. We will use different methods for bulk and single cell data.

  • Transcriptomics 3: Visualising. Principal Component Analysis (PCA) and volcano plots to visualise the results of the

References

Pages made with R (R Core Team 2024), Quarto (Allaire et al. 2024), knitr (Xie 2024, 2015, 2014), kableExtra (Zhu 2021)

Abrams, Zachary B., Travis S. Johnson, Kun Huang, Philip R. O. Payne, and Kevin Coombes. 2019. “A Protocol to Evaluate RNA Sequencing Normalization Methods.” BMC Bioinformatics 20 (24): 679. https://doi.org/10.1186/s12859-019-3247-x.
Allaire, J. J., Charles Teague, Carlos Scheidegger, Yihui Xie, and Christophe Dervieux. 2024. “Quarto.” https://doi.org/10.5281/zenodo.5960048.
Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” J. R. Stat. Soc. Series B Stat. Methodol. 57 (1): 289–300. http://www.jstor.org/stable/2346101.
Birney, Ewan, T. Daniel Andrews, Paul Bevan, Mario Caccamo, Yuan Chen, Laura Clarke, Guy Coates, et al. 2004. “An Overview of Ensembl.” Genome Research 14 (5): 925–28. https://doi.org/10.1101/gr.1860604.
Bullard, James H., Elizabeth Purdom, Kasper D. Hansen, and Sandrine Dudoit. 2010. “Evaluation of Statistical Methods for Normalization and Differential Expression in mRNA-Seq Experiments.” BMC Bioinformatics 11 (1): 94. https://doi.org/10.1186/1471-2105-11-94.
Chen, Yunshun, Aaron T. L. Lun, and Gordon K. Smyth. 2016. “From Reads to Genes to Pathways: Differential Expression Analysis of RNA-Seq Experiments Using Rsubread and the edgeR Quasi-Likelihood Pipeline.” https://doi.org/10.12688/f1000research.8987.2.
Düren, Yannick, Johannes Lederer, and Li-Xuan Qin. 2022. “Depth Normalization of Small RNA Sequencing: Using Data and Biology to Select a Suitable Method.” Nucleic Acids Research 50 (10): e56. https://doi.org/10.1093/nar/gkac064.
Durinck, Steffen, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma, and Wolfgang Huber. 2005. “BioMart and Bioconductor: A Powerful Link Between Biological Databases and Microarray Data Analysis.” Bioinformatics 21: 3439–40.
Durinck, Steffen, Paul T. Spellman, Ewan Birney, and Wolfgang Huber. 2009. “Mapping Identifiers for the Integration of Genomic Datasets with the R/Bioconductor Package biomaRt.” Nature Protocols 4: 1184–91.
Evans, Ciaran, Johanna Hardin, and Daniel M Stoebel. 2017. “Selecting Between-Sample RNA-Seq Normalization Methods from the Perspective of Their Assumptions.” Briefings in Bioinformatics 19 (5): 776–92. https://doi.org/10.1093/bib/bbx008.
Fisher, Malcolm, Christina James-Zorn, Virgilio Ponferrada, Andrew J Bell, Nivitha Sundararaj, Erik Segerdell, Praneet Chaturvedi, et al. 2023. “Xenbase: Key Features and Resources of the Xenopus Model Organism Knowledgebase.” Genetics 224 (1): iyad018. https://doi.org/10.1093/genetics/iyad018.
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15: 550. https://doi.org/10.1186/s13059-014-0550-8.
Lun, Aaron T. L., Davis J. McCarthy, and John C. Marioni. 2016. “A Step-by-Step Workflow for Low-Level Analysis of Single-Cell RNA-Seq Data with Bioconductor.” F1000Res. 5: 2122. https://doi.org/10.12688/f1000research.9501.2.
Lytal, Nicholas, Di Ran, and Lingling An. 2020. “Normalization Methods on Single-Cell RNA-Seq Data: An Empirical Survey.” Frontiers in Genetics 11. https://www.frontiersin.org/articles/10.3389/fgene.2020.00041.
McCarthy, Davis J., Yunshun Chen, and Gordon K. Smyth. 2012. “Differential Expression Analysis of Multifactor RNA-Seq Experiments with Respect to Biological Variation.” Nucleic Acids Research 40 (10): 4288–97. https://doi.org/10.1093/nar/gks042.
R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Ritchie, Matthew E., Belinda Phipson, Di Wu, Yifang Hu, Charity W. Law, Wei Shi, and Gordon K. Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.
Robinson, Mark D., Davis J. McCarthy, and Gordon K. Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40. https://doi.org/10.1093/bioinformatics/btp616.
Rogers, Matthew B., James D. Hilley, Nicholas J. Dickens, Jon Wilkes, Paul A. Bates, Daniel P. Depledge, David Harris, et al. 2011. “Chromosome and Gene Copy Number Variation Allow Major Structural Change Between Species and Strains of Leishmania.” Genome Research 21 (12): 2129–42. https://doi.org/10.1101/gr.122945.111.
Smedley, Damian, Syed Haider, Benoit Ballester, Richard Holland, Darin London, Gudmundur Thorisson, and Arek Kasprzyk. 2009. “BioMart Biological Queries Made Easy.” BMC Genomics 10 (1): 22. https://doi.org/10.1186/1471-2164-10-22.
Vallejos, Catalina A., Davide Risso, Antonio Scialdone, Sandrine Dudoit, and John C. Marioni. 2017. “Normalizing Single-Cell RNA Sequencing Data: Challenges and Opportunities.” Nature Methods 14 (6): 565–71. https://doi.org/10.1038/nmeth.4292.
Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC.
———. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. https://yihui.org/knitr/.
———. 2024. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.
Yates, Andrew D, James Allen, Ridwan M Amode, Andrey G Azov, Matthieu Barba, Andrés Becerra, Jyothish Bhai, et al. 2022. “Ensembl Genomes 2022: An Expanding Genome Resource for Non-Vertebrates.” Nucleic Acids Research 50 (D1): D996–1003. https://doi.org/10.1093/nar/gkab1007.
Zhu, Hao. 2021. “kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax.” https://CRAN.R-project.org/package=kableExtra.