Independent Study to prepare for workshop

Transcriptomics 1: 👋 Hello data!

Emma Rand

18 September, 2024

Overview

  • Concise summary of the experimental design and aims

  • What the raw data consist of

  • What has been done to the data so far

  • What steps we will take in the workshop

The Data

There are three datasets

  • 🐸 transcriptomic data (bulk RNA-seq) from frog embryos.

  • 🐭 transcriptomic data (single cell RNA-seq) from stem cells.

  • 🍂 ??????? Metabolomic / Metagenomic data from anaerobic digesters

Experimental design

🐸 Experimental design

Schematic of frog development experiment

  • 3 fertilisations. These are the replicates: 1, 2, 3

  • two siblings from each fertilisation: one control, one FGF-treated. The treatments are paired

  • sequenced at three time points: S14, S20, S30

  • 3 x 2 x 3 = 18 groups
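The 3 x 2 x 3 layout above can be enumerated in a few lines. This is a minimal sketch in Python; the `label` format is a hypothetical naming scheme for illustration, not necessarily the column names used in the workshop data.

```python
from itertools import product

replicates = [1, 2, 3]            # one per fertilisation
treatments = ["control", "FGF"]   # paired siblings
stages = ["S14", "S20", "S30"]    # time points

# Hypothetical labels; the real sample names may differ.
samples = [
    {"stage": s, "treatment": t, "replicate": r, "label": f"{s}_{t}_{r}"}
    for s, t, r in product(stages, treatments, replicates)
]

# Subset for a single time point, e.g. S30.
s30 = [x for x in samples if x["stage"] == "S30"]

print(len(samples))  # 18 combinations in total
print(len(s30))      # 6 at any one stage
```

Each stage contributes 6 combinations: 3 replicates of each of the 2 treatments.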

🐸 Aim

  • find genes important in frog development

  • Important means the genes that are differentially expressed between the control-treated and the FGF-treated siblings

  • Differentially expressed means the expression in one group is significantly higher than in the other
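Because the treatments are paired, the comparison for a single gene can be sketched as a paired test on the log2 ratios within each fertilisation. This is a minimal illustration in Python with made-up expression values; it is a plain paired t statistic, not the method a dedicated differential expression package would apply.

```python
from math import log2, sqrt
from statistics import mean, stdev

# Made-up normalised expression for one gene at one stage,
# ordered by replicate (fertilisation) 1, 2, 3.
control = [120.0, 95.0, 150.0]
fgf = [250.0, 180.0, 310.0]

# Paired design: compare siblings within each fertilisation.
log2_ratios = [log2(f / c) for f, c in zip(fgf, control)]

# Mean log2 fold change (FGF relative to control).
mean_lfc = mean(log2_ratios)

# Paired t statistic: mean difference divided by its standard error.
t_stat = mean_lfc / (stdev(log2_ratios) / sqrt(len(log2_ratios)))

print(round(mean_lfc, 2), round(t_stat, 2))
```

A positive mean log2 fold change here means higher expression in the FGF-treated siblings. Whether the difference is significant depends on comparing the statistic with its reference distribution.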

🐸 Guided analysis

  • The workshops will take you through comparing the control and FGF-treated siblings at S30

  • This is the “least interesting” comparison

  • You will be guided to carefully document your work so you can apply the same methods to other comparisons

🐭 Experimental design

Schematic of stem cell experiment

  • Cells were sorted using flow cytometry on the basis of cell surface markers

  • There are three cell types: LT-HSCs, HSPCs, Progs. These are the “treatments”

  • Many cells of each cell type were sequenced. These are the replicates

  • 155 LT-HSCs, 701 HSPCs, 798 Progs

🐭 Aim

  • find genes for cell surface proteins that are important in stem cell identity

  • Important means genes that are differentially expressed between at least two cell types

  • Differentially expressed means the expression in one group is significantly higher than in the other

🐭 Guided analysis

  • The workshops will take you through comparing the HSPC and Prog cells

  • This is the “least interesting” comparison

  • You will be guided to carefully document your work so you can apply the same methods to other comparisons

The raw data

Raw Sequence data

  • The raw data are “reads” from a sequencing machine.

  • A read is a sequence of DNA or RNA shorter than the whole genome or transcriptome

  • The length of the reads depends on the type of sequencing machine

    • Short-read technologies, e.g. Illumina, have higher base accuracy but are harder to align
    • Long-read technologies, e.g. Nanopore, have lower base accuracy but are easier to align

  • The RNA-seq data are from an Illumina machine (reads of 150-300 bp)

  • Reads are in FASTQ files

  • FASTQ files contain the sequence of each read and a quality score for each base
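To make the FASTQ layout concrete, here is a minimal Python sketch that reads four-line records and decodes the per-base quality scores. It assumes the common Phred+33 encoding and uses a made-up read; real files are usually compressed and much larger, and are normally handled by dedicated tools.

```python
def parse_fastq(text):
    """Yield (read_id, sequence, phred_scores) for each four-line record."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        # Assumption: Phred+33 encoding (quality = ASCII code - 33).
        scores = [ord(ch) - 33 for ch in qual]
        yield header[1:], seq, scores  # drop the leading "@"


# Made-up record: header, bases, separator, one quality character per base.
example = "@read1\nGATTACA\n+\nIIIIII#\n"

for read_id, seq, scores in parse_fastq(example):
    print(read_id, seq, scores)
    # read1 GATTACA [40, 40, 40, 40, 40, 40, 2]
```

The last base has a score of 2, which is the kind of low-quality position that filtering and trimming aim to remove.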

What has been done to the data so far

General steps

  • Reads are filtered and trimmed on the basis of the quality score

  • They are then aligned/pseudo-aligned to a reference genome/transcriptome

  • Reads are then counted to quantify the expression

  • Counts will need to be normalised to account for differences in sequencing depth and gene/transcript length before, or as part of, statistical analysis.
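The two corrections named above, sequencing depth and gene/transcript length, can be illustrated with two widely used measures: counts per million (CPM) and transcripts per million (TPM). This is a sketch in Python with made-up counts and lengths; statistical packages often apply their own normalisation instead.

```python
def cpm(counts):
    """Counts per million: corrects for sequencing depth only."""
    total = sum(counts.values())
    return {g: c / total * 1e6 for g, c in counts.items()}


def tpm(counts, lengths_bp):
    """Transcripts per million: corrects for length first, then depth."""
    # Reads per kilobase for each gene.
    rate = {g: counts[g] / (lengths_bp[g] / 1000) for g in counts}
    total = sum(rate.values())
    return {g: r / total * 1e6 for g, r in rate.items()}


# Made-up counts and gene lengths for one sample.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

print(cpm(counts))
print(tpm(counts, lengths))
```

In this example geneB has three times the count of geneA but is also three times as long, so the two genes end up with the same TPM.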

🐸 Data

🐭 Data

  • Published in Nestorowa et al. (2016)

  • Expression for a subset of genes, the surfaceome

  • Values are log2 normalised values

  • The statistical analysis method we will use, scran (Lun, McCarthy, and Marioni 2016), requires normalised values

Workshops

  • Transcriptomics 1: Hello data. Getting to know the data. Checking the distributions of values overall, across samples and across genes to check things are as we expect and detect genes/samples that need to be removed.

  • Transcriptomics 2: Statistical Analysis. Identifying which genes are differentially expressed between treatments. This is the main analysis step. We will use different methods for bulk and single cell data.

  • Transcriptomics 3: Visualising and Interpreting. Production of volcano plots and heatmaps to visualise the results of the statistical analysis. We will also look at how to interpret the results and how to find out more about the genes of interest.

References

Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15: 550. https://doi.org/10.1186/s13059-014-0550-8.
Lun, Aaron T. L., Davis J. McCarthy, and John C. Marioni. 2016. “A Step-by-Step Workflow for Low-Level Analysis of Single-Cell RNA-Seq Data with Bioconductor.” F1000Research 5: 2122. https://doi.org/10.12688/f1000research.9501.2.
Nestorowa, Sonia, Fiona K. Hamey, Blanca Pijuan Sala, Evangelia Diamanti, Mairi Shepherd, Elisa Laurenti, Nicola K. Wilson, David G. Kent, and Berthold Göttgens. 2016. “A Single-Cell Resolution Map of Mouse Hematopoietic Stem and Progenitor Cell Differentiation.” Blood 128 (8): e20–31. https://doi.org/10.1182/blood-2016-05-716480.
Rand, Emma, and Sarah Forrester. 2022. “Statistically Useful Experimental Design.” https://cloud-span.github.io/experimental_design00-overview/.