Independent Study to prepare for workshop

Transcriptomics 2: Statistical Analysis

Emma Rand

18 September, 2024

Overview

In these slides we will:

  • Check where you are

  • learn some concepts in differential expression

    • log2 fold changes
    • Multiple testing correction
    • normalisation
    • statistical model
  • Find out what packages to install before the workshop

Where should you be?

What we did in Transcriptomics 1: 👋 Hello data!

  • Discovered how many rows and columns we had in our datasets and what these were.
  • Examined the distribution
    • of values across the whole dataset
    • of values across the samples/cells (i.e., averaged across genes) to see variation between samples/cells
    • of values across the genes (i.e., averaged across samples/cells) to see variation between genes
  • Saved files of filtered or summarised data.

Where should you be?

After the Transcriptomics 1: 👋 Hello data! Workshop including:

🐸 Frogs

  • An RStudio Project called frogs-88H which contains:
    • Raw data (S14, S20 and S30)
    • Processed data (s30_filtered.csv, s30_summary_gene.csv, s30_summary_gene_filtered.csv, s30_summary_samp.csv and equivalents for S14 OR S20)
    • Two scripts called cont-fgf-s30.R and cont-fgf-s20.R OR cont-fgf-s14.R

Files should be organised into folders. Code should be well commented and easy to read.

🐭 Mice

  • An RStudio Project called mice-88H which contains:
    • Raw data (hspc, prog, lthsc)
    • Processed data (hspc_summary_gene.csv, hspc_summary_samp.csv, prog_summary_gene.csv, prog_summary_samp.csv)
    • One script called hspc-prog.R

Files should be organised into folders. Code should be well commented and easy to read.

🍂

Either of the other examples.

If you do not have those

Go through:

Differential expression

Differential expression

  • The goal of differential expression is to test whether there is a significant difference in gene expression between groups.

  • A large number of computational methods have been developed for differential expression analysis

  • R is the leading language for differential expression analysis

Differential expression

  • the statistical concepts are very similar to those you have already encountered in stages 1 and 2

  • you are essentially doing paired- or independent-samples tests

  • but you are doing a lot of them! One for every gene

  • data need normalisation before comparison

Statistical concepts

Like familiar tests:

  • the type of test (the function) you use depends on the type of data you have and the type of assumptions you want to make

  • the tests work by comparing the variation between groups to the variation within groups.

  • you will get: the difference between groups, a test statistic, and a p-value

  • you also get an adjusted p-value which is the ‘correction’ for multiple testing
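As a minimal sketch of what one of these tests returns for a single gene, the base R example below runs an independent-samples t-test on made-up values. The numbers and object names are hypothetical, not taken from the workshop data. The adjusted p-value only appears once many genes have been tested, so it is not shown here.

```r
# Hypothetical log2 expression values for ONE gene in two groups
control   <- c(5.1, 4.8, 5.3, 5.0)
treatment <- c(6.2, 6.0, 6.5, 5.9)

# Independent-samples t-test (Welch by default in base R)
res <- t.test(treatment, control)

# The three things you get back for each gene:
diff_means <- unname(res$estimate[1] - res$estimate[2])
diff_means      # difference between the group means
res$statistic   # test statistic
res$p.value     # unadjusted p-value
```

In a differential expression analysis the same calculation is repeated for every gene, which is why a correction for multiple testing is needed.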

The difference between groups

  • The difference between groups is given as the log2 fold change in expression between groups

  • A fold change is the expression in one group divided by the expression in the other group

  • we use fold changes because the absolute expression values may not be accurate and relative changes are what matters

  • we use log2 fold changes because they are symmetrical around 0

log2 fold change

  • log2 means log to the base 2

  • Suppose the expression in group A is 5 and the expression in group B is 8

  • A/B = 5/8 = 0.625 and B/A = 8/5 = 1.6

  • If B is greater than A the range of A/B is 0 to 1 but the range of B/A is 1 to infinity

  • However, if we take the log2 of A/B we get -0.678 and the log2 of B/A is 0.678.
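The arithmetic above can be checked in R. This is a small sketch using the same values (5 and 8); the object names a and b are just for illustration.

```r
a <- 5   # expression in group A
b <- 8   # expression in group B

a / b          # 0.625
b / a          # 1.6

log2(a / b)    # about -0.678
log2(b / a)    # about  0.678

# The two log2 fold changes are equal in size and opposite in sign
all.equal(log2(a / b), -log2(b / a))

# A log2 fold change is also the difference of the log2 values
log2(a) - log2(b)
```

A log2 fold change of 1 means a doubling and -1 means a halving, so up- and down-regulation of the same size sit the same distance from 0.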

Adjusted p-value

  • The p-value has to be adjusted because of the number of tests being done

  • In stage 1, we used Tukey’s HSD to adjust for multiple testing following an ANOVA

  • Here the Benjamini-Hochberg procedure (Benjamini and Hochberg 1995) is used to adjust for multiple testing

  • BH controls the False Discovery Rate (FDR)

  • The FDR is the expected proportion of false positives among the genes called significant
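To see what the Benjamini-Hochberg adjustment does, here is a minimal sketch using base R's p.adjust() on five made-up p-values. The values are hypothetical; in a real analysis there is one p-value per gene and the DE package usually does the adjustment for you.

```r
# Hypothetical unadjusted p-values for five genes (already sorted)
p <- c(0.001, 0.008, 0.039, 0.041, 0.60)

# Benjamini-Hochberg adjustment
p_adj <- p.adjust(p, method = "BH")
p_adj

# The same thing by hand: multiply each sorted p-value by n / rank,
# then make sure the adjusted values never decrease with rank
n <- length(p)
by_hand <- pmin(1, rev(cummin(rev(sort(p) * n / seq_len(n)))))

sum(p < 0.05)      # 4 genes significant before adjustment
sum(p_adj < 0.05)  # 2 genes significant after adjustment
```

Two of the genes that looked significant at 0.05 are no longer significant once the number of tests is taken into account.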

Normalisation

  • Normalisation adjusts raw counts to account for factors that prevent direct comparisons

  • Normalisation usually influences the experimental design as well as the analysis

  • The 🐭 mouse data have been normalised to simplify the analysis for you; the 🐸 frog data have not but the DE method will do this for you.

  • Normalisation is a big topic. See Düren, Lederer, and Qin (2022); Bullard et al. (2010); Lytal, Ran, and An (2020); Abrams et al. (2019); Vallejos et al. (2017); Evans, Hardin, and Stoebel (2017)
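One factor that prevents direct comparison is that samples are rarely sequenced to the same depth. The sketch below uses made-up counts and a very simple scaling to counts per million to show the idea. It is only an illustration and is not the method DESeq2, edgeR or scran use.

```r
# Hypothetical raw counts: 3 genes (rows) x 2 samples (columns).
# sample_b was sequenced twice as deeply as sample_a.
counts <- matrix(c(10, 20, 70,
                   20, 40, 140),
                 nrow = 3,
                 dimnames = list(c("gene1", "gene2", "gene3"),
                                 c("sample_a", "sample_b")))

# Total counts per sample (the library size)
lib_size <- colSums(counts)
lib_size

# Scale each sample by its library size to get counts per million
cpm <- t(t(counts) / lib_size) * 1e6
cpm
```

The raw counts differ twofold between the samples but the scaled values are identical, so the apparent difference was due to sequencing depth only.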

Type of test (the function)

Type of test (the function)

  • DESeq2 (Love, Huber, and Anders 2014) and edgeR (Robinson, McCarthy, and Smyth 2010)
    • both require raw counts as input
    • both assume that most genes are not DE
    • both use a negative binomial distribution to model the data
    • use slightly different normalisation methods: DESeq2 uses the median of ratios method; edgeR uses the trimmed mean of M values (TMM) method
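The idea behind the median of ratios method can be sketched in a few lines of base R. This is a simplified illustration on made-up counts, not DESeq2's own code (in DESeq2 the calculation is done by estimateSizeFactors()). It relies on the assumption that most genes are not DE.

```r
# Hypothetical raw counts: 4 genes x 3 samples.
# sample_b is sequenced 2x as deeply as sample_a, sample_c half as deeply.
# gene4 is strongly up-regulated in sample_c.
counts <- matrix(c(10, 20, 30, 40,
                   20, 40, 60, 80,
                    5, 10, 15, 400),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4),
                                 c("sample_a", "sample_b", "sample_c")))

# 1. Geometric mean of each gene across samples (on the log scale)
log_geo_mean <- rowMeans(log(counts))

# 2. Genes with a zero count have an infinite log mean and are skipped
keep <- is.finite(log_geo_mean)

# 3. Size factor for a sample = median ratio of its counts to the gene means
size_factors <- apply(counts, 2, function(sample_counts) {
  exp(median(log(sample_counts[keep]) - log_geo_mean[keep]))
})
size_factors

# 4. Divide each sample by its size factor
normalised <- sweep(counts, 2, size_factors, "/")
normalised
```

Because the median is used, the one DE gene (gene4) does not distort the size factors: they come out as 1, 2 and 0.5, and after normalisation only gene4 differs between samples.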

Type of test (the function)

  • scran
    • works on normalized log-expression values
    • performs Welch t-tests
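A Welch t-test is an independent-samples t-test that does not assume equal variances in the two groups. The sketch below applies one to each gene of a small made-up matrix using base R's t.test(). It illustrates the kind of test involved and is not how scran is called in the workshop; all names and values are hypothetical.

```r
# Hypothetical log2 normalised expression: 3 genes (rows) x 6 cells (columns)
expr <- matrix(c(2.1, 5.0, 7.2,
                 2.3, 5.2, 7.0,
                 1.9, 4.9, 7.1,
                 4.0, 5.1, 7.3,
                 4.4, 4.8, 6.9,
                 4.1, 5.0, 7.2),
               nrow = 3,
               dimnames = list(c("gene1", "gene2", "gene3"),
                               paste0("cell", 1:6)))

# Which group each cell (column) belongs to
group <- factor(rep(c("A", "B"), each = 3))

# One Welch t-test per gene (row)
results <- t(apply(expr, 1, function(x) {
  tt <- t.test(x[group == "B"], x[group == "A"], var.equal = FALSE)
  c(log2fc  = unname(tt$estimate[1] - tt$estimate[2]),
    p_value = tt$p.value)
}))
results <- as.data.frame(results)

# Adjust for the number of genes tested
results$fdr <- p.adjust(results$p_value, method = "BH")
results
```

Because the values are already on the log2 scale, the difference between the group means is a log2 fold change.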

Meta data

  • DE methods require two types of data: the expression data and the meta data

  • The meta data is the information about the samples

  • It says which samples (columns) are in which group(s)

  • It is usually stored in a separate file
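As a rough illustration, meta data can be as simple as a data frame with one row per sample. The sample names, the treatment column and the values below are all made up for this sketch; the real meta data will be provided in the workshop.

```r
# Hypothetical expression data: 3 genes (rows) x 4 samples (columns)
expr <- matrix(1:12,
               nrow = 3,
               dimnames = list(paste0("gene", 1:3),
                               c("s1", "s2", "s3", "s4")))

# Hypothetical meta data: one row per sample, saying which group it is in
meta <- data.frame(sample    = c("s1", "s2", "s3", "s4"),
                   treatment = c("control", "control", "treated", "treated"))

# The samples in the meta data should match the columns of the
# expression data, in the same order
identical(colnames(expr), meta$sample)

# The meta data can then be used to pick out the columns for one group
treated <- expr[, meta$sample[meta$treatment == "treated"], drop = FALSE]
treated
```

Checking that the sample names match before running an analysis is a useful habit, because DE methods rely on the meta data to know which column belongs to which group.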

🐸 Data

🐭 Data

  • Expression for a subset of genes, the surfaceome

  • Values are log2 normalised values

  • The statistical analysis method we will use, scran (Lun, McCarthy, and Marioni 2016), requires normalised values

Packages to install before the workshop

Install BiocManager from CRAN in the normal way and set the version of Bioconductor packages to install:

install.packages("BiocManager")
BiocManager::install(version = "3.19")

Install DESeq2 from Bioconductor using BiocManager:

BiocManager::install("DESeq2")

Install scran from Bioconductor using BiocManager:

BiocManager::install("scran")

Workshops

Workshops

  • Transcriptomics 1: Hello data. Getting to know the data. Checking the distributions of values.

  • Transcriptomics 2: Statistical Analysis. Identifying which genes are differentially expressed between treatments.

  • Transcriptomics 3: Visualising and Interpreting. PCA, volcano plots and heatmaps to visualise results. Interpreting the results and finding out more about genes of interest.

References

Pages made with R (R Core Team 2024), Quarto (Allaire et al. 2024), knitr (Xie 2024), kableExtra (Zhu 2021)

Abrams, Zachary B., Travis S. Johnson, Kun Huang, Philip R. O. Payne, and Kevin Coombes. 2019. “A Protocol to Evaluate RNA Sequencing Normalization Methods.” BMC Bioinformatics 20 (24): 679. https://doi.org/10.1186/s12859-019-3247-x.
Allaire, J. J., Charles Teague, Carlos Scheidegger, Yihui Xie, and Christophe Dervieux. 2024. “Quarto.” https://doi.org/10.5281/zenodo.5960048.
Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” J. R. Stat. Soc. Series B Stat. Methodol. 57 (1): 289–300. http://www.jstor.org/stable/2346101.
Bullard, James H., Elizabeth Purdom, Kasper D. Hansen, and Sandrine Dudoit. 2010. “Evaluation of Statistical Methods for Normalization and Differential Expression in mRNA-Seq Experiments.” BMC Bioinformatics 11 (1): 94. https://doi.org/10.1186/1471-2105-11-94.
Chen, Yunshun, Aaron T. L. Lun, and Gordon K. Smyth. 2016. “From Reads to Genes to Pathways: Differential Expression Analysis of RNA-Seq Experiments Using Rsubread and the edgeR Quasi-Likelihood Pipeline.” https://doi.org/10.12688/f1000research.8987.2.
Düren, Yannick, Johannes Lederer, and Li-Xuan Qin. 2022. “Depth Normalization of Small RNA Sequencing: Using Data and Biology to Select a Suitable Method.” Nucleic Acids Research 50 (10): e56. https://doi.org/10.1093/nar/gkac064.
Evans, Ciaran, Johanna Hardin, and Daniel M Stoebel. 2017. “Selecting Between-Sample RNA-Seq Normalization Methods from the Perspective of Their Assumptions.” Briefings in Bioinformatics 19 (5): 776–92. https://doi.org/10.1093/bib/bbx008.
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15: 550. https://doi.org/10.1186/s13059-014-0550-8.
Lun, Aaron T. L., Davis J. McCarthy, and John C. Marioni. 2016. “A Step-by-Step Workflow for Low-Level Analysis of Single-Cell RNA-Seq Data with Bioconductor.” F1000Res. 5: 2122. https://doi.org/10.12688/f1000research.9501.2.
Lytal, Nicholas, Di Ran, and Lingling An. 2020. “Normalization Methods on Single-Cell RNA-Seq Data: An Empirical Survey.” Frontiers in Genetics 11. https://www.frontiersin.org/articles/10.3389/fgene.2020.00041.
McCarthy, Davis J., Yunshun Chen, and Gordon K. Smyth. 2012. “Differential Expression Analysis of Multifactor RNA-Seq Experiments with Respect to Biological Variation.” Nucleic Acids Research 40 (10): 4288–97. https://doi.org/10.1093/nar/gks042.
R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Ritchie, Matthew E., Belinda Phipson, Di Wu, Yifang Hu, Charity W. Law, Wei Shi, and Gordon K. Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.
Robinson, Mark D., Davis J. McCarthy, and Gordon K. Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40. https://doi.org/10.1093/bioinformatics/btp616.
Vallejos, Catalina A., Davide Risso, Antonio Scialdone, Sandrine Dudoit, and John C. Marioni. 2017. “Normalizing Single-Cell RNA Sequencing Data: Challenges and Opportunities.” Nature Methods 14 (6): 565–71. https://doi.org/10.1038/nmeth.4292.
Xie, Yihui. 2024. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.
Zhu, Hao. 2021. “kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax.” https://CRAN.R-project.org/package=kableExtra.