Self study

Semester 2: Additional 🐭 stem cell labels

Author

Emma Rand

Published

11 February, 2025

Introduction

We have analysed the Nestorowa et al. (2016) stem cells categorised as LTHSC, HSPC, or Prog. The cell types were determined from flowcytometry markers. Additional markers allow the cells to be categorised into 12 other types (including unknown): CMP, GMP, LMPP, LTHSC, LTHSC-ESLAM, MEP, MPP_broad, MPP1, MPP2, MPP3, STHSC, unknown. These cell types are mutually exclusive, meaning that a cell can only be one of these types.

In this self-study, you will:

import all the expression data and gene information for the three cell types
import the data that gives the additional labels
combine the data and pivot it to have the expression values in a single column
plot the expression of one gene, Emb, in each cell type for the 3-type categorisation and the 12-type categorisation
conduct a PCA on all the data and plot the first two components using colour/shapes for the 12-type categorisation, the 3-type categorisation and the expression of Emb.

Set up

🎬 Open your mice-88H RStudio Project.

🎬 Save er_cell_types.csv.

🎬 Open a new or existing script

🎬 Load tidyverse (Wickham et al. 2019).

Import and combine data

The raw data

Import the data for the LT-HSC, HSPC and the Progenitor cells.

🎬 Import surfaceome_lthsc.csv, surfaceome_hspc.csv and surfaceome_hspc.csv

# 🐭 import the three datasets
lthsc <- read_csv("data-raw/surfaceome_lthsc.csv")
hspc <- read_csv("data-raw/surfaceome_hspc.csv")
prog <- read_csv("data-raw/surfaceome_prog.csv")

Note there are 155 LT-HSC + 701 HSPC + 798 Progenitor = 1654 total cells.

We will need to join the three datasets using ensembl_gene_id to match the rows.

🎬 Combine the three datasets by ensembl_gene_id and save the result as cell_expr.

#  combine the three datasets
cell_expr <- hspc |>
  left_join(prog, 
            by = "ensembl_gene_id") |> 
  left_join(lthsc, 
            by = "ensembl_gene_id")

Gene information

You will likely want the gene information in the dataframe along with the ensembl_gene_id. Rather the connecting again to Ensembl (Martin et al. 2023; Birney et al. 2004) and BioMart (Smedley et al. 2009) using the R package biomaRt (Durinck et al. 2009, 2005), which means loading more packages, managing conflicts and using more compute, we can use the information we already have in the results/hspc_prog_results.csv file. This works because the same 280 genes are in the three raw data files and this results file.

We need only the ensembl_gene_id, (to join the information), the external_gene_name and the description.

🎬 Import the gene information from results/hspc_prog_results.csv and save as gene_info.

# 🐭 import the gene information
gene_info <- read_csv("results/hspc_prog_results.csv") |> 
  select(ensembl_gene_id, external_gene_name, description)

🎬 Join the gene_info to the cell_expr data.

# 🐭 join the gene information to the cell_expr data
cell_expr <- cell_expr |> 
  left_join(gene_info, 
            by = "ensembl_gene_id")

12-type categorisation data

The information in the er_cell_types.csv has 14 columns. The first column gives the cell_id (e.g., HSPC_001) the next 12 columns are named with the cell type or unknown and contain a 0 or 1. These columns are useful if you want to compare (plot or DE), for example, CMP cells vs others. The last column gives the cell type which is useful to include all the labels in a plot.

🎬 Import er_cell_types.csv

cell <- read_csv("data-raw/er_cell_types.csv")

🎬 Check the dataframe has the number of rows and columns you were expecting and that column types and names are as expected.

🎬 Examine the number of cells of each type:

cell |> 
  group_by(cell_type_12) |>
  count()

# A tibble: 12 × 2
# Groups:   cell_type_12 [12]
   cell_type_12     n
   <chr>        <int>
 1 CMP            317
 2 GMP            120
 3 LMPP           246
 4 LTHSC          233
 5 LTHSC-ESLAM     38
 6 MEP            354
 7 MPP1            27
 8 MPP2            11
 9 MPP3            60
10 MPP_broad      202
11 STHSC           36
12 unknown         10

Expression of a specific gene in each cell type

Pivot to long form

The cell_expr data has the expression of each gene in each cell. The cell_ids are in columns and the gene are in rows.

We will need to pivot the cell_expr data to have the expression values in a single column with additional columns giving the cell_id and gene.

# 🐭 pivot the cell_expr data
cell_expr_long <- cell_expr |> 
  pivot_longer(cols = -c(ensembl_gene_id, 
                         external_gene_name, 
                         description),
               names_to = "cell_id", 
               values_to = "expression")

Note that this data frame will have: [280 genes * 1654 cells = 463120] rows.

Add the two types of categorisation to the expression data

We also want to add the two types of categorisation (the 3-type and the 12-types) to the cell_expr_long data. To add the 3-type categorisation we extract the information in the cell_id column into two columns: one with the cell type and one with the number. We did this before in making a PCA plot

🎬 Extract the cell type and cell number from the cell_id column (keeping the cell_id column):

cell_expr_long <- cell_expr_long |> 
  extract(cell_id, 
          remove = FALSE,
          c("cell_type_3", "cell_number"),
          "([a-zA-Z.]{4,6})_([0-9]{3})")

Let’s just check we have the right number of cells in each of the three groups:

cell_expr_long |> 
  group_by(cell_type_3) |> 
  summarise(n()/ 280)

# A tibble: 3 × 2
  cell_type_3 `n()/280`
  <chr>           <dbl>
1 HSPC              701
2 LT.HSC            155
3 Prog              798

To add the 12-type categorisation we need to join the cell_expr_long data to the cell data on the cell_id

🎬 Add the 12-type categorisation to the cell_expr_long data.

cell_expr_long <- cell_expr_long |> 
  left_join(cell |> select(cell_id, cell_type_12), 
            by = "cell_id")

Plot gene expression under 3-type categorisation

I have chosen the gene Emb to plot the expression in each cell type. I prefer the use of violin plots and points because multimodal distributions are more obvious than in boxplots

🎬 Plot the expression of Emb in each cell type for the 3-type categorisation.

# 🐭 plot the expression of Emb in each cell type
cell_expr_long |> 
  filter(external_gene_name == "Emb") |> 
  ggplot(aes(x = cell_type_3, y = expression)) +
  geom_violin(linewidth = 1) +
  geom_jitter(width = 0.4, alpha = 0.2, pch = 16) +
  theme_classic()

Plot expression under 12-type categorisation

# 🐭 plot the expression of Emb in each cell type
cell_expr_long |> 
  filter(external_gene_name == "Emb") |> 
  ggplot(aes(x = cell_type_12, y = expression)) +
  geom_violin(linewidth = 1) +
  geom_jitter(width = 0.2, alpha = 0.2, pch = 16) +
  theme_classic()

PCA

We will carry out the the same steps to do the PCA:

Select the log₂ transformed normalised counts. You can only use numerical data in PCA.
Transpose our data. We have genes in rows and samples in columns (this is common for gene expression data). However, to treat the genes as variables, PCA expects samples in rows and genes in columns.
Add the gene names as column names in the transposed data
Perform the PCA
Extract the scores on the first two principal components and label the data
Plot the the first two principal components as a scatter plot

Refer to the workshop for more explanation on PCA.

🎬 Transpose the log₂ transformed normalised counts:

cell_expr_trans <- cell_expr |> 
  dplyr::select(starts_with(c("HSPC_", "Prog_", "LT.HSC_"))) |>
  t() |> 
  data.frame()

🎬 Set the column names to the gene ids:

colnames(cell_expr_trans) <- cell_expr$ensembl_gene_id

🎬 Perform PCA on the log₂ transformed normalised counts:

pca <- cell_expr_trans |>
  prcomp(rank. = 8)

To plot the PC1 against PC2 we will need to extract the PC1 and PC2 “scores” from the PCA object and add labels for the cells. Our PCA object is called pca and the scores are in pca$x. The cells labels will come from the row names of the transformed data.

🎬 Create a dataframe of the PC1 and PC2 scores (in pca$x) and add the cell ids:

pca_labelled <- data.frame(pca$x,
                           cell_id = row.names(cell_expr_trans))

🎬 Extract the cell type and cell number from the cell_id column (keeping the cell_id column):

pca_labelled <- pca_labelled |> 
  extract(cell_id, 
          remove = FALSE,
          c("cell_type_3", "cell_number"),
          "([a-zA-Z.]{4,6})_([0-9]{3})")

🎬 Add the 12-type categorisation to the pca_labelled data.

pca_labelled <- pca_labelled |> 
  left_join(cell |> select(cell_id, cell_type_12), 
            by = "cell_id")

🎬 Add the expression of Emb to pca_labelled data. I’ve also renamed the expression column from the filtered cell_expr_long so it is Emb_expr in the pca_labelled data.

pca_labelled <- pca_labelled |> 
  left_join(cell_expr_long |> 
              filter(external_gene_name == "Emb") |> 
              select(cell_id, "Emb_expr" = expression), 
            by = "cell_id")

🎬 Plot PC2 against PC1 and colour by cell_type_3 type:

pca_labelled |> 
  ggplot(aes(x = PC1, y = PC2, 
             colour = cell_type_3)) +
  geom_point(alpha = 0.4) +
  scale_colour_viridis_d(end = 0.95, begin = 0.15,
                         name = NULL) +
  theme_classic()

🎬 Plot PC2 against PC1 and colour by cell_type_12 type:

pca_labelled |> 
  ggplot(aes(x = PC1, y = PC2, 
             colour = cell_type_12)) +
  geom_point(alpha = 0.5) +
  scale_colour_viridis_d(name = NULL) +
  theme_classic()

🎬 Plot PC2 against PC1 and colour by expression of Emb and shape by cell_type_3 type:

pca_labelled |> 
  ggplot(aes(x = PC1, y = PC2, 
             colour = Emb_expr,
             shape = cell_type_3)) +
  geom_point() +
  scale_colour_viridis_c(end = 0.95, begin = 0.15,
                         name = NULL) +
  theme_classic()

🎬 Easier to see as facets?:

pca_labelled |> 
  ggplot(aes(x = PC1, y = PC2, 
             colour = Emb_expr)) +
  geom_point() +
  scale_colour_viridis_c(end = 0.95, begin = 0.15,
                         name = NULL) +
  facet_wrap(~cell_type_3)

Pages made with R (R Core Team 2024), Quarto (Allaire et al. 2024), knitr (Xie 2024, 2015, 2014), kableExtra (Zhu 2021)

References

Allaire, J. J., Charles Teague, Carlos Scheidegger, Yihui Xie, and Christophe Dervieux. 2024. “Quarto.” https://doi.org/10.5281/zenodo.5960048.

Birney, Ewan, T. Daniel Andrews, Paul Bevan, Mario Caccamo, Yuan Chen, Laura Clarke, Guy Coates, et al. 2004. “An Overview of Ensembl.” Genome Research 14 (5): 925–28. https://doi.org/10.1101/gr.1860604.

Durinck, Steffen, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma, and Wolfgang Huber. 2005. “BioMart and Bioconductor: A Powerful Link Between Biological Databases and Microarray Data Analysis.” Bioinformatics 21: 3439–40.

Durinck, Steffen, Paul T. Spellman, Ewan Birney, and Wolfgang Huber. 2009. “Mapping Identifiers for the Integration of Genomic Datasets with the r/Bioconductor Package biomaRt.” Nature Protocols 4: 1184–91.

Martin, Fergal J, M Ridwan Amode, Alisha Aneja, Olanrewaju Austine-Orimoloye, Andrey G Azov, If Barnes, Arne Becker, et al. 2023. “Ensembl 2023.” Nucleic Acids Research 51 (D1): D933–41. https://doi.org/10.1093/nar/gkac958.

Nestorowa, Sonia, Fiona K. Hamey, Blanca Pijuan Sala, Evangelia Diamanti, Mairi Shepherd, Elisa Laurenti, Nicola K. Wilson, David G. Kent, and Berthold Göttgens. 2016. “A Single-Cell Resolution Map of Mouse Hematopoietic Stem and Progenitor Cell Differentiation.” Blood 128 (8): e20–31. https://doi.org/10.1182/blood-2016-05-716480.

R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Smedley, Damian, Syed Haider, Benoit Ballester, Richard Holland, Darin London, Gudmundur Thorisson, and Arek Kasprzyk. 2009. “BioMart Biological Queries Made Easy.” BMC Genomics 10 (1): 22. https://doi.org/10.1186/1471-2164-10-22.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC.

———. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. https://yihui.org/knitr/.

———. 2024. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.

Zhu, Hao. 2021. “kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax.” https://CRAN.R-project.org/package=kableExtra.