Independent Study to prepare for workshop

Transcriptomics 3: Visualising and Interpreting

Emma Rand

18 September, 2024

Overview

In these slides we will:

  • Check where you are

  • learn some concepts used omics visualisation

    • Principle Component Analysis (PCA)
    • Volcano plots
    • Heatmaps
  • Find out what packages to install before the workshop

Where should you be?

What we did in Transcriptomics 2: Statistical Analysis

  • carried out differential expression analysis

  • found genes not expressed at all, or expressed in one group only

  • Saved results files

Where should you be?

After the Transcriptomics 2: 👋 Statistical Analysis Workshop including:

🐸 Frogs

  • An RStudio Project called frogs-88H which contains:
    • Raw data (S14, S20 and S30)
    • Processed data (s30_filtered.csv, s30_summary_gene.csv, s30_summary_gene_filtered.csv, s30_summary_samp.csv and equivalents for S14 OR S20)
    • Results files (s30_fgf_only.csv, S30_normalised_counts.csv, S30_results.csv and equivalents for S14 OR S20)
    • Two scripts called cont-fgf-s30.R and either cont-fgf-s20.R OR cont-fgf-s14.R

Files should be organised into folders. Code should well commented and easy to read.

🐭 Mice

  • An RStudio Project called mice-88H which contains
    • Raw data (hspc, prog, lthsc)
    • Processed data (hspc_summary_gene.csv, hspc_summary_samp.csv, prog_summary_gene.csv, prog_summary_samp.csv, lthsc_summary_gene.csv, lthsc_summary_samp.csv)
  • Results files (prog_hspc_results.csv and an equivalent for lthsc vs prog or hspc vs lthsc)
  • Two scripts called hspc-prog.R and either hspc-lthsc.R OR prog-lthsc.R

Files should be organised into folders. Code should well commented and easy to read.

🍂

Either of the other examples.

If you do not have those

Go through:

Examine the results files

Examine the results files

Remind yourself of the key columns you have in the results files:

  • a log2 fold change
  • an unadjusted p-value
  • a p value adjusted for multiple testing (FDR or padj)
  • a gene id

🐸 Frogs

Rows: 10,136
Columns: 7
$ baseMean        <dbl> 237.553928, 531.565700, 86.392830, 49.813502, 419.9983…
$ log2FoldChange  <dbl> 0.096601855, -0.089588528, -0.192811203, -0.008858703,…
$ lfcSE           <dbl> 0.2079396, 0.1557384, 0.3253216, 0.4342614, 0.1685420,…
$ stat            <dbl> 0.46456683, -0.57525007, -0.59267874, -0.02039947, -0.…
$ pvalue          <dbl> 0.64224169, 0.56512218, 0.55339617, 0.98372471, 0.8699…
$ padj            <dbl> 0.9998970, 0.9998970, 0.9998970, 0.9998970, 0.9998970,…
$ xenbase_gene_id <chr> "XB-GENE-1000007", "XB-GENE-1000023", "XB-GENE-1000062…
  • baseMean is the mean of the normalised counts for the gene across all samples
  • lfcSE standard error of the fold change
  • stat is the test statistic (the Wald statistic)
  • Generated by DESeq2 (Love, Huber, and Anders 2014)

🐭 Mice

Rows: 280
Columns: 6
$ Top             <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ p.value         <dbl> 7.038138e-117, 4.736622e-90, 1.832630e-88, 4.211954e-7…
$ FDR             <dbl> 1.970679e-114, 6.631271e-88, 1.710455e-86, 2.948368e-7…
$ summary.logFC   <dbl> 1.596910, 3.035165, 3.261056, -2.146491, -3.056730, 3.…
$ logFC.hspc      <dbl> 1.596910, 3.035165, 3.261056, -2.146491, -3.056730, 3.…
$ ensembl_gene_id <chr> "ENSMUSG00000028639", "ENSMUSG00000024053", "ENSMUSG00…
  • Top is the rank of the gene ordered by the p-value (smallest first)
  • summary.logFC and logFC.hspc give the same value (in this case since comparing two cell types)
  • generated by scran (Lun, McCarthy, and Marioni 2016)

Adding gene information

Adding gene information

  • The gene id is difficult to interpret in plots/tables

  • Therefore we need to add information such as the gene name and a description to the results

  • For the 🐸 Frog data information comes from Xenbase (Fisher et al. 2023)

  • For the 🐭 Mice data information comes from Ensembl (Birney et al. 2004)

🐸 Xenbase

xenbase logo

Xenbase is a model organism database that provides genomic, molecular, and developmental biology information about Xenopus laevis and Xenopus tropicalis.

It took me some time to find the information you need.

🐸 Xenbase

  • I got the information from the Xenbase information pages under Data Reports | Gene Information

  • This is listed: Xenbase Gene Product Information [readme] gzipped gpi (tab separated)

  • Click on the readme link to see the file format and columns

  • I downloaded xenbase.gpi.gz, unzipped it, removed header lines and the Xenopus tropicalis (taxon:8364) entries and saved it as xenbase_info.xlsx

  • In the workshop you will import this file and merge the information with the results file

🐭 Ensembl

  • Ensembl creates, integrates and distributes reference datasets and analysis tools that enable genomics

  • BioMart provides a access to these large datasets

  • biomaRt (biomaRt?) is a Bioconductor package gives you programmatic access to BioMart.

  • In the workshop you use this package to get information you can merge with the results file

Plots

What is the purpose of an Transcriptomics plot?

  • In general, we plot data to help us summarise and understand it

  • This is especially import for omics data where we have a very large number of variables and often a large number of observations

  • We will look at three plots very commonly used in omics analysis: Principal Component Analysis (PCA) plot, Heatmaps and Volcano Plots

Principal Component Analysis (PCA)

PCA

  • Principal Component Analysis is an unsupervised machine learning technique

  • Unsupervised methods1 are unsupervised in that they do not use/optimise to a particular output. The goal is to uncover structure. They do not test hypotheses

  • It is often used to visualise high dimensional data because it is a dimension reduction technique

PCA

  • It takes a large number of continuous variables (like gene expression) and reduces them to a smaller number of variables (called principal components) that explain most of the variation in the data

  • The principal components can be plotted to see how samples cluster together

PCA

  • To see if samples cluster as we would expect, we might plot the expression of one gene against another

Samples

Cells

This gives some insight but we have 280 (mice) or 10,000+(frogs) genes to consider. How do we know if the pair we use is typical? How can we consider al the genes at once?

PCA

  • PCA is a solution for this - It takes a large number of continuous variables (like gene expression) and reduces them to a smaller number of “principal components” that explain most of the variation in the data.

Samples

Cells

PCA

We have done PCA in Transcriptomics 3, but often PCA might be one of the first exploratory steps because it gives you an idea whether you expect general patterns in gene expression that distinguish groups.

Heatmaps

Heatmaps

  • are a grid of genes on one axis and samples on the other with each grid cell coloured by another variable

  • in this case the other variable is gene expression

  • they allow you to quickly get an overview of the expression patterns across genes and samples

  • we often couple them with clustering to group genes and samples with similar expression patterns together which helps us see which genes are responsible for distinguishing groups

Heat map for the frog data

See next slide for information

Heatmaps

  • On the vertical axis are genes which are differentially expressed at the 0.01 level

  • On the horizontal axis are samples

  • We can see that the FGF-treated samples cluster together and the control samples cluster together

  • We can also see two clusters of genes; one of these shows genes upregulated (more yellow) in the FGF-treated samples and the other shows genes downregulated (more blue) in the FGF-treated samples

Volcano plots

Volcano plots

  • Volcano plots often used to visualise the results of differential expression analysis

  • They are just a scatter of the corrected p value against the fold change….

  • almost - the we actually plot the negative log of the corrected p value against the fold change

Volcano plots

  • This is because just plotting the p-value means the axis is counter intuitive. Small p-values (i.e., significant values) are at the bottom of the axis)

  • And since p-values range from 1 to very tiny the points are all squashed at the bottom of the axis

Volcano plot FDR against fold change

Volcano plots

  • Plotting the negative log of the corrected p-value means that the values are spread out and the significant values are at the top of the axis

Volcano plot -log(FDR) against fold change

Visualisations

  • Should be done on normalised data so meaningful comparisons can be made

  • The 🐭 mouse data were already log2normalised

  • The 🐸 frog data were normalised by the DE method and saved to file. We will log2 transform before doing visualisations

Packages to install before the workshop

heatmaply (Galili et al. 2017) and ggrepel (Slowikowski 2024) from CRAN in the the normal way:

install.packages("heatmaply")
install.packages("ggrepel")

biomaRt (biomaRt?) from Bioconductor using BiocManager (Morgan and Ramos 2024)

BiocManager::install("biomaRt")

Workshops

Workshops

  • Transcriptomics 1: Hello data Getting to know the data. Checking the distributions of values

  • Transcriptomics 2: Statistical Analysis Identifying which genes are differentially expressed between treatments.

  • Transcriptomics 3: Visualising and Interpreting. PCA, Volcano plots and heatmaps to visualise results. Interpreting the results and finding out more about genes of interest.

References

Birney, Ewan, T. Daniel Andrews, Paul Bevan, Mario Caccamo, Yuan Chen, Laura Clarke, Guy Coates, et al. 2004. “An Overview of Ensembl.” Genome Research 14 (5): 925–28. https://doi.org/10.1101/gr.1860604.
Fisher, Malcolm, Christina James-Zorn, Virgilio Ponferrada, Andrew J Bell, Nivitha Sundararaj, Erik Segerdell, Praneet Chaturvedi, et al. 2023. “Xenbase: Key Features and Resources of the Xenopus Model Organism Knowledgebase.” Genetics 224 (1): iyad018. https://doi.org/10.1093/genetics/iyad018.
Galili, Tal, O’Callaghan, Alan, Sidi, Jonathan, Sievert, and Carson. 2017. “Heatmaply: An r Package for Creating Interactive Cluster Heatmaps for Online Publishing.” Bioinformatics. https://doi.org/10.1093/bioinformatics/btx657.
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15: 550. https://doi.org/10.1186/s13059-014-0550-8.
Lun, Aaron T. L., Davis J. McCarthy, and John C. Marioni. 2016. “A Step-by-Step Workflow for Low-Level Analysis of Single-Cell RNA-Seq Data with Bioconductor.” F1000Res. 5: 2122. https://doi.org/10.12688/f1000research.9501.2.
Morgan, Martin, and Marcel Ramos. 2024. BiocManager: Access the Bioconductor Project Package Repository. https://bioconductor.github.io/BiocManager/.
Rand, Emma. 2021. Data Science Strand of BIO00058M. https://doi.org/10.5281/zenodo.5527705.
Slowikowski, Kamil. 2024. Ggrepel: Automatically Position Non-Overlapping Text Labels with ’Ggplot2’. https://ggrepel.slowkow.com/.