Workshop

Transcriptomics 3: Visualising

Author

Emma Rand

Published

17 December, 2024

Introduction

Session overview

In the workshop, you will learn how to conduct and plot a Principle Component Analysis (PCA) as well as how to create a nicely formatted Volcano plot. You will also save significant genes to file to make it easier to identify genes of interest. and perform Gene Ontology (GO) term enrichment analysis.

Set up

🐸 Frog development

🎬 Open the frogs-88H RStudio Project and the cont-fgf-s30.R script.

πŸŽ„ Arabidopsis

🎬 Open the arabi-88H RStudio Project and the wildsuf-wilddef-s30.R script.

πŸ’‰ Leishmania

🎬 Open the leish-88H RStudio Project and the pro-meta-s30.R script.

🐭 Stem cells

🎬 Open the mice-88H RStudio Project and the hspc-prog.R script.

Everyone

🎬 Make a new folder figures in the project directory.

This is where we will save our figure files

🎬 Load tidyverse (Wickham et al. 2019) and conflicted (Wickham 2023). You most likely have this code at the top of your script already.

── Attaching core tidyverse packages ─────────────────────────────────────────────── tidyverse 2.0.0 ──
βœ” dplyr     1.1.3     βœ” readr     2.1.4
βœ” forcats   1.0.0     βœ” stringr   1.5.0
βœ” ggplot2   3.4.3     βœ” tibble    3.2.1
βœ” lubridate 1.9.3     βœ” tidyr     1.3.0
βœ” purrr     1.0.2     
── Conflicts ───────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
βœ– dplyr::filter() masks stats::filter()
βœ– dplyr::lag()    masks stats::lag()
β„Ή Use the conflicted package to force all conflicts to become errors

I recommend you set the dplyr versions of filter() and select() to use by default

🎬 Use the dplyr version of filter() by default:

Import

Everyone

🎬 Import your results data. This should be a file in the results folder called xxxx_results.csv where xxxx indicates the comparison you made.

🎬 Remind yourself what is in the rows and columns and the structure of the dataframes (perhaps using glimpse())

When we do PCA we will want to label the samples with their treatment for figures. For 🐸 Frog development, πŸŽ„ Arabidopsis and πŸ’‰ Leishmania, this labelling information is most easily added using the metadata. You will need to filter for only the samples in the comparison that was made in the results file.

You may need to refer back to the Week 4 Statistical Analysis workshop (in section 2. Create DESeqDataSet object) to remind yourself how to import and filter the metadata you need.

🎬 Import the metadata that maps the sample names to treatments. Remember to select only the samples for comparison that was made.

For 🐭 Stem cells, cell types are encoded in the column names. We will do some regular expression magic to extract the cell type from the column names.

log2 transform the normalised counts

We use the normalised counts for data visualisations so that the comparisons are meaningful. Since the fold changes are given is log2 it is useful to log2 transform the normalised counts too. We will add columns to the dataframe with these transformed values. Since we have some counts of 0 we will add a tiny amount to avoid -Inf values.

log2 transformation would be applied to one column like this:

# DO NOT DO
# log2 transform the counts plus a tiny amount to avoid log(0)
dataframe <- dataframe |>
  mutate(log2_mycolumn = log2(mycolumn + 0.001))

We are going to use a wonderful bit of R wizardry to apply a transformation to multiple columns. This is the across() function which has three arguments:

across(.cols, .fns, .names)

where:

  • .cols is the selection of columns to transform
  • .fns is the function we want to apply to the selected columns
  • .names is the naming convention for the new columns

The general form of the code you need is:

# DO NOT DO
# log2 transform the counts plus a tiny amount to avoid log(0)
xxxx_results <- xxxx_results |>
  mutate(across(starts_with("pattern"), 
                \(x) log2(x + 0.001),
                .names = "log2_{.col}"))

where:

  • xxxx_results is the name of the dataframe of results
  • pattern matches the starting letters for all of the normalised counts so that starts_with("pattern") gives the selection of columns to transform
  • the bit after the \(x) is the function we want to apply to the selected columns
  • the \(x) means it is an β€œanonymous” function which means we don’t have to define a function name.
  • "log2_{.col}" means the columns will have the same name (in the .col) but with the prefix log2_ added.

You can read more about across() and anonymous functions from my posit::conf(2024) workshop

🐸 Frog development, πŸŽ„ Arabidopsis and πŸ’‰ Leishmania

🎬 Design the code to log2 transform the normalised counts using the template given

I recommend viewing the dataframe to see the new columns. Check you have the expected number of columns.

🐭 Stem cells

You do not need to apply this transformation because the data is already log2 transformed.

Everyone

We now all have dataframes with all the information we need: normalised counts, log2 normalised counts, statistical comparisons with fold changes and p-values, and information about the gene.

Write the significant genes to file

Everyone

We will create dataframe of the significant genes and write them to file. This is subset from the results file but will make it a little easier to examine and select genes of interest.

The general form of the code you need is:

# DO NOT DO
# create a dataframe of genes significant at 0.05 level
xxxx_results_sig0.05 <- xxxx_results |> 
  filter(pvalue <= 0.05)

Note that you determine the significance level using the adjusted p-values rather than the uncorrected p-values. This column is name padj in DESeq2 output and FDR in scran output.

🎬 Create a dataframe of the genes significant at the 0.05 level using filter(). You will need to know the name of column with the adjusted p-values.

❓How many genes are significant at the 0.05 levels?

If you have a very large number of genes significant at the 0.05 level, you may want to consider a more stringent cut-off such as 0.01.

🎬 Write the dataframe to a csv file. I recommend using the same file name as you used for the dataframe.

Principal Component Analysis (PCA)

We have many genes in our datasets. PCA will allow us to plot our samples in the β€œgene expression” space so we can see if replicates with the same treatment cluster together as we would expect. PCA is a dimension reduction technique that finds the directions of maximum variance in the data. We are doing PCA after our statistical analysis but often it is one of the first techniques used to explore the data. If we do not see treatment effects in a PCA plot then we are not likely to see them in the statistical analysis. PCA can also help you identify any outlier replicates. The PCA is best conducted on the normalised counts or the log2 transformed normalised counts. We will use the log2 transformed normalised counts.

We will carry out the following steps to do the PCA:

  • Select the log2 transformed normalised counts. You can only use numerical data in PCA.
  • Transpose our data. We have genes in rows and samples in columns (this is common for gene expression data). However, to treatment the genes as variables, PCA expects samples in rows and genes in columns.
  • Add the gene names as column names in the transposed data
  • Perform the PCA
  • Extract the scores on the first two principal components and label the data
  • Plot the the first two principal components as a scatter plot

In each case we will use only the samples for the comparison that was made in the PCA. However, it would be informative to do PCA on all the samples you have and that is something you may want to consider in your independent study.

Now go to PCA for:

🐸 Frog development

🎬 Transpose the log2 transformed normalised counts:

s30_log2_trans <- s30_results |> 
  select(starts_with("log2_")) |>
  t() |> 
  data.frame()

We have used the select() function to select all the columns that start with log2_. We then use the t() function to transpose the dataframe. We then convert the resulting matrix to a dataframe using data.frame(). If you view that dataframe you’ll see it has default column name which we can fix using colnames() to set the column names to the Xenbase gene ids.

🎬 Set the column names to the Xenbase gene ids:

colnames(s30_log2_trans) <- s30_results$xenbase_gene_id

🎬 Perform PCA on the log2 transformed normalised counts:

pca <- s30_log2_trans |>
  prcomp(rank. = 4) 

The rank. argument tells prcomp() to only calculate the first 4 principal components. This is useful for visualisation as we can only plot in 2 or 3 dimensions. We can see the results of the PCA by viewing the summary() of the pca object.

summary(pca)
Importance of first k=4 (out of 6) components:
                           PC1     PC2     PC3     PC4
Standard deviation     64.0124 47.3351 38.4706 31.4111
Proportion of Variance  0.4243  0.2320  0.1532  0.1022
Cumulative Proportion   0.4243  0.6562  0.8095  0.9116

The Proportion of Variance tells us how much of the variance is explained by each component. We can see that the first component explains 0.4243 of the variance, the second 0.2320, and the third 0.1532. Together the first three components explain nearly 81% of the total variance in the data. Plotting PC1 against PC2 will capture about 66% of the variance which is likely very much better than we would get plotting any two genes against each other. To plot the PC1 against PC2 we will need to extract the PC1 and PC2 β€œscores” from the PCA object and add labels for the samples. Those labels will come from the row names of the transformed data which has the sample ids and from the metadata.

🎬 Create a vector of the sample ids from the row names. These include the log2 prefix which we can removed for labelling:

sample_id <- row.names(s30_log2_trans) |> str_remove("log2_")

You might want to check the result.

Now we will extract the PC1 and PC2 scores from the PCA object and add. Our PCA object is called pca and the scores are in pca$x. We will create a dataframe of the scores and add the sample ids.

🎬 Create a dataframe of PC1 and PC2 scores and add the sample ids:

pca_labelled <- data.frame(pca$x,
                           sample_id)

🎬 Merge with the metadata so we can label points by treatment and sibling pair:

pca_labelled <- pca_labelled |> 
  left_join(meta_s30, 
            by = "sample_id")

Since the metadata contained the sample ids, it was especially important to remove the log2_ from the row names so that the join would work.

The dataframe should look like this:

PC1 PC2 PC3 PC4 sample_id stage treatment sibling_rep
-76.38391 0.814699 -60.728327 -5.820669 S30_C_1 stage_30 control one
-67.02571 25.668563 51.476835 28.480254 S30_C_2 stage_30 control two
-14.02772 -78.474054 15.282058 -9.213076 S30_C_3 stage_30 control three
47.60726 49.035510 -19.288753 20.928290 S30_F_1 stage_30 FGF one
26.04954 32.914201 20.206072 -55.752818 S30_F_2 stage_30 FGF two
83.78054 -29.958919 -6.947884 21.378020 S30_F_3 stage_30 FGF three

The next task is to plot PC2 against PC1 and colour by sibling pair. This is just a scatterplot so we can use geom_point(). We will use colour to indicate the sibling pair and shape to indicate the treatment.

🎬 Customise the PC2 against PC1 plot:

pca_labelled |> 
  ggplot(aes(x = PC1, y = PC2, 
             colour = sibling_rep,
             shape = treatment)) +
  geom_point(size = 3) +
  theme_classic()

There is a good separation between treatments on PCA1. The sibling pairs do not seem to cluster together. You can also try plotting PC3 or PC4.

I prefer to customise the colours and shapes. I especially like the
viridis colour scales which provide colour scales that are perceptually uniform in both colour and black-and-white. They are also designed to be perceived by viewers with common forms of colour blindness. See Introduction to viridis for more information.

ggplot provides functions to access the viridis scales. Here I use scale_fill_viridis_d(). The d stands for discrete. The function scale_fill_viridis_c() would be used for continuous data. I’ve used the default β€œviridis” (or β€œD”) option (do ?scale_fill_viridis_d for all the options) and used the begin and end arguments to control the range of colour - I have set the range to be from 0.15 to 0.95 the avoid the strongest contrast. I have also set the name argument to provide a label for the legend.

I have used scale_shape_manual() to set the shapes for the treatments. I have used the values 21 and 19 which are the codes for filled and open circles and filled triangles. I have set the name argument to NULL to remove the label (it’s obvious what that categories are treatments) and the labels argument to improve the legend.

🎬 Plot PC2 against PC1 and colour by sibling pair and shape by treatment:

pca_labelled |> 
  ggplot(aes(x = PC1, y = PC2, 
             colour = sibling_rep,
             shape = treatment)) +
  geom_point(size = 3) +
  scale_colour_viridis_d(end = 0.95, begin = 0.15,
                         name = "Sibling pair") +
  scale_shape_manual(values = c(21, 19),
                     name = NULL,
                     labels = c("Control", "FGF-Treated")) +
  theme_classic()

Now go to Volcano plots for 🐸 Frog development

πŸŽ„ Arabidopsis

🎬 Transpose the log2 transformed normalised counts:

wild_log2_trans <- wild_results |> 
  select(starts_with("log2_")) |>
  t() |> 
  data.frame()

We have used the select() function to select all the columns that start with log2_. We then use the t() function to transpose the dataframe. We then convert the resulting matrix to a dataframe using data.frame(). If you view that dataframe you’ll see it has default column name which we can fix using colnames() to set the column names to the gene ids.

🎬 Set the column names to the gene ids:

colnames(wild_log2_trans) <- wild_results$gene_id

🎬 Perform PCA on the log2 transformed normalised counts:

pca <- wild_log2_trans |>
  prcomp(rank. = 4, scale = TRUE)

The scale argument tells prcomp() to scale the data before doing the PCA. This is important when the variables are on different scales to stop variables with large values dominating the PCA. The rank. argument tells prcomp() to only calculate the first 4 principal components. This is useful for visualisation as we can only plot in 2 or 3 dimensions. We can see the results of the PCA by viewing the summary() of the pca object.

summary(pca)
Importance of components:
                            PC1     PC2     PC3       PC4
Standard deviation     123.0171 85.6822 62.2924 2.186e-13
Proportion of Variance   0.5742  0.2786  0.1472 0.000e+00
Cumulative Proportion    0.5742  0.8528  1.0000 1.000e+00

The Proportion of Variance tells us how much of the variance is explained by each component. We can see that the first component explains 0.5742 of the variance, the second 0.2786, and the third 0.1472. Together the first three components explain nearly 100% of the total variance in the data. Plotting PC1 against PC2 will capture about 92% of the variance which is very likely very much better than we would get plotting any two genes against each other. To plot the PC1 against PC2 we will need to extract the PC1 and PC2 β€œscores” from the PCA object and add labels for the samples. Those labels will come from the row names of the transformed data which has the sample ids and from the metadata.

🎬 Create a vector of the sample ids from the row names. These include the log2 prefix which we can removed for labelling:

sample_id <- row.names(wild_log2_trans) |> str_remove("log2_")

You might want to check the result.

Now we will extract the PC1 and PC2 scores from the PCA object and add. Our PCA object is called pca and the scores are in pca$x. We will create a dataframe of the scores and add the sample ids.

🎬 Create a dataframe of PC1 and PC2 scores and add the sample ids:

pca_labelled <- data.frame(pca$x,
                           sample_id)

🎬 Merge with the metadata so we can label points by treatment and sibling pair:

pca_labelled <- pca_labelled |> 
  left_join(meta_wild, 
            by = "sample_id")

Since the metadata contained the sample ids, it was especially important to remove the log2_ from the row names so that the join would work.

The dataframe should look like this:

PC1 PC2 PC3 PC4 sample_id genotype copper
-91.23038 105.92675 25.80774 0 SRX028956_wild_suf wt sufficient
-120.00085 -95.79115 -13.72618 0 SRX028957_wild_def wt deficient
92.06667 23.00841 -79.23090 0 SRX028960_wild_suf wt sufficient
119.16456 -33.14401 67.14935 0 SRX028961_wild_def wt deficient

The next task is to plot PC2 against PC1 and colour by copper conditions. This is just a scatterplot so we can use geom_point(). We will use colour to indicate the copper conditions.

🎬 Plot PC2 against PC1 and colour by copper conditions:

pca_labelled |> 
  ggplot(aes(x = PC3, y = PC2, 
             colour = copper)) +
  geom_point(size = 3) +
  theme_classic()

We do not see particularly good separation between treatments, though perhaps there is some separation between the copper sufficient and copper deficient on PC2. It is also difficult to see if the replicates cluster together (the treatments separate) when there are only two reps. You can also try plotting PC3 or PC4.

I prefer to customise the colours and shapes. I especially like the
viridis colour scales which provide colour scales that are perceptually uniform in both colour and black-and-white. They are also designed to be perceived by viewers with common forms of colour blindness. See Introduction to viridis for more information.

ggplot provides functions to access the viridis scales. Here I use scale_fill_viridis_d(). The d stands for discrete. The function scale_fill_viridis_c() would be used for continuous data. I’ve used the default β€œviridis” (or β€œD”) option (do ?scale_fill_viridis_d for all the options) and used the begin and end arguments to control the range of colour - I have set the range to be from 0.15 to 0.95 the avoid the strongest contrast. I have also set the name argument to provide a label for the legend.

🎬 Customise the PC2 against PC1 plot:

pca_labelled |> 
  ggplot(aes(x = PC1, y = PC2, 
             colour = copper)) +
  geom_point(size = 3) +
  scale_colour_viridis_d(end = 0.95, begin = 0.15,
                         name = "Copper") +
  theme_classic()

Now go to Volcano plots for πŸŽ„ Arabidopsis

πŸ’‰ Leishmania

🎬 Transpose the log2 transformed normalised counts:

pro_meta_log2_trans <- pro_meta_results |> 
  select(starts_with("log2_")) |>
  t() |> 
  data.frame()

We have used the select() function to select all the columns that start with log2_. We then use the t() function to transpose the dataframe. We then convert the resulting matrix to a dataframe using data.frame(). If you view that dataframe you’ll see it has default column name which we can fix using colnames() to set the column names to the gene ids.

🎬 Set the column names to the gene ids:

colnames(pro_meta_log2_trans) <- pro_meta_results$gene_id

🎬 Perform PCA on the log2 transformed normalised counts:

pca <- pro_meta_log2_trans |>
  prcomp(rank. = 4, scale = TRUE) 

The scale argument tells prcomp() to scale the data before doing the PCA. This is important when the variables are on different scales to stop variables with large values dominating the PCA. The rank. argument tells prcomp() to only calculate the first 4 principal components. This is useful for visualisation as we can only plot in 2 or 3 dimensions. We can see the results of the PCA by viewing the summary() of the pca object.

summary(pca)
Importance of first k=4 (out of 6) components:
                           PC1     PC2     PC3      PC4
Standard deviation     66.0744 41.9741 29.4077 27.61377
Proportion of Variance  0.5175  0.2088  0.1025  0.09038
Cumulative Proportion   0.5175  0.7263  0.8288  0.91916

The Proportion of Variance tells us how much of the variance is explained by each component. We can see that the first component explains 0.5175 of the variance, the second 0.2088, and the third 0.1025. Together the first three components explain nearly 92% of the total variance in the data. Plotting PC1 against PC2 will capture about 66% of the variance which is likely very much better than we would get plotting any two genes against each other. To plot the PC1 against PC2 we will need to extract the PC1 and PC2 β€œscores” from the PCA object and add labels for the samples. Those labels will come from the row names of the transformed data which has the sample ids and from the metadata.

🎬 Create a vector of the sample ids from the row names. These include the log2 prefix which we can removed for labelling:

sample_id <- row.names(pro_meta_log2_trans) |> str_remove("log2_")

You might want to check the result.

Now we will extract the PC1 and PC2 scores from the PCA object and add. Our PCA object is called pca and the scores are in pca$x. We will create a dataframe of the scores and add the sample ids.

🎬 Create a dataframe of PC1 and PC2 scores and add the sample ids:

pca_labelled <- data.frame(pca$x,
                           sample_id)

🎬 Merge with the metadata so we can label points by life cycle stage:

pca_labelled <- pca_labelled |> 
  left_join(meta_pro_meta, 
            by = "sample_id")

Since the metadata contained the sample ids, it was especially important to remove the log2_ from the row names so that the join would work.

The dataframe should look like this:

PC1 PC2 PC3 PC4 sample_id stage replicate
-59.42107 -2.922229 -11.043368 -13.4641060 lm_pro_1 procyclic 1
-63.25818 -15.985856 -36.248191 0.7025242 lm_pro_2 procyclic 2
-57.92065 25.679645 47.373874 13.7893070 lm_pro_3 procyclic 3
56.20326 -59.707787 5.958727 31.5528028 lm_meta_1 metacyclic 1
57.41205 -11.738157 14.157242 -47.2285609 lm_meta_2 metacyclic 2
66.98459 64.674383 -20.198284 14.6480329 lm_meta_3 metacyclic 3

The next task is to plot PC2 against PC1 and colour by sibling pair. This is just a scatterplot so we can use geom_point(). We will use colour to indicate the life cycle stage.

🎬 Plot PC2 against PC1 and colour by copper conditions:

pca_labelled |> 
  ggplot(aes(x = PC1, y = PC2, 
             colour = stage)) +
  geom_point(size = 3) +
  theme_classic()

There is a good separation between treatments on PCA1. The replicates do seem to cluster together. You can also try plotting PC3 or PC4.

I prefer to customise the colours. I especially like the
viridis colour scales which provide colour scales that are perceptually uniform in both colour and black-and-white. They are also designed to be perceived by viewers with common forms of colour blindness. See Introduction to viridis for more information.

ggplot provides functions to access the viridis scales. Here I use scale_fill_viridis_d(). The d stands for discrete. The function scale_fill_viridis_c() would be used for continuous data. I’ve used the default β€œviridis” (or β€œD”) option (do ?scale_fill_viridis_d for all the options) and used the begin and end arguments to control the range of colour - I have set the range to be from 0.15 to 0.95 the avoid the strongest contrast. I have also set the name argument to provide a label for the legend.

🎬 Plot PC2 against PC1 and colour by life stage:

pca_labelled |> 
  ggplot(aes(x = PC1, y = PC2, 
             colour = stage)) +
  geom_point(size = 3) +
  scale_colour_viridis_d(end = 0.95, begin = 0.15,
                         name = "Stage") +
  theme_classic()

Now go to Volcano plots for πŸ’‰ Leishmania

🐭 Stem cells

🎬 Transpose the log2 transformed normalised counts:

hspc_prog_trans <- hspc_prog_results |> 
  dplyr::select(starts_with(c("HSPC_", "Prog_"))) |>
  t() |> 
  data.frame()

We have used the select() function to select all the columns that start with Prog_ or HSPC_. We then use the t() function to transpose the dataframe. We then convert the resulting matrix to a dataframe using data.frame(). If you view that dataframe you’ll see it has default column name which we can fix using colnames() to set the column names to the gene ids.

🎬 Set the column names to the gene ids:

colnames(hspc_prog_trans) <- hspc_prog_results$ensembl_gene_id

🎬 Perform PCA on the log2 transformed normalised counts:

pca <- hspc_prog_trans |>
  prcomp(rank. = 8)

The rank. argument tells prcomp() to only calculate the first 8 principal components. This is useful for visualisation as we can only plot in 2 or 3 dimensions. We can see the results of the PCA by viewing the summary() of the pca object.

summary(pca)
Importance of first k=8 (out of 280) components:
                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     12.5612 8.36646 5.98988 5.41386 4.55730 4.06142 3.84444
Proportion of Variance  0.1099 0.04874 0.02498 0.02041 0.01446 0.01149 0.01029
Cumulative Proportion   0.1099 0.15861 0.18359 0.20400 0.21846 0.22995 0.24024
                           PC8
Standard deviation     3.70848
Proportion of Variance 0.00958
Cumulative Proportion  0.24982

The Proportion of Variance tells us how much of the variance is explained by each component. We can see that the first component explains 0.1099 of the variance, the second 0.04874, and the third 0.02498. Together the first three components explain 18% of the total variance in the data. This is not that high but it is also quite a lot better than than we would get plotting any two genes randomly chosen against each other.

To plot the PC1 against PC2 we will need to extract the PC1 and PC2 β€œscores” from the PCA object and add labels for the cells. Our PCA object is called pca and the scores are in pca$x. The cells labels will come from the row names of the transformed data.

🎬 Create a dataframe of the PC1 and PC2 scores (in pca$x) and add the cell ids:

pca_labelled <- data.frame(pca$x,
                           cell_id = row.names(hspc_prog_trans))

It will be helpful to add a column for the cell type so we can label points. One way to do this is to extract the information in the cell_id column into two columns: one with the complete cell id and one with just the cell type. This extraction is done with Regular Expression magic.

🎬 Extract the cell type and cell number from the cell_id column (keeping the cell_id column):

pca_labelled <- pca_labelled |> 
  extract(cell_id, 
          remove = FALSE,
          c("cell_type", "cell_number"),
          "([a-zA-Z]{4})_([0-9]{3})")

What this code does is take what is in the cell_id column (something like Prog_001 or HSPC_001) and split it into two columns (β€œcell_type” and β€œcell_number”). The reason why we want to do that is to colour the points by cell type. We would not want to use cell_id to colour the points because each cell id is unique and that would be 1000+ colours. The last argument in the extract() function is the pattern to match described with a regular expression. Three patterns are being matched, and two of those are in brackets meaning they are kept to fill the two new columns.

The first pattern is ([a-zA-Z]{4})

  • it is brackets because we want to keep it and put it in cell_type
  • [a-zA-Z] means any lower (a-z) or upper case letter (A-Z).
  • The square brackets means any of the characters in the square brackets will be matched
  • {4} means 4 of them.

So the first pattern inside the first (…) will match exactly 4 upper or lower case letters (like Prog or HSPC)

The second pattern is _ to match the underscore in every cell id that separates the cell type from the number. It is not in brackets because we do not want to keep it.

The third pattern is ([0-9]{3})

  • [0-9] means any number
  • {3} means 3 of them.

So the second pattern inside the second (…) will match exactly 3 numbers (like 001 or 851).

Important: Prog and HPSC have 4 letters. The column names, LT.HSPC_ have 6 characters and includes a dot. You will need to adjust the regex when make comparison between LT-HSC and other cell types. The pattern to match the LT.HSC as well as the Prog and HSPC is ([a-zA-Z.]{4, 6}). Note the dot inside the square brackets and numbers meaning 4 or 6 of. The pattern to match the underscore and the cell number is the same.

The top of the dataframe should look like this (but with more decimal places)

PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 cell_id cell_type cell_number
HSPC_001 -2.62 -15.39 -7.94 -4.87 -2.52 4.29 -1.69 4.45 HSPC_001 HSPC 001
HSPC_002 0.70 -15.22 0.13 3.74 -6.47 -1.70 0.83 -3.35 HSPC_002 HSPC 002
HSPC_003 -15.52 7.80 0.53 -4.04 0.10 -1.48 -0.32 -7.40 HSPC_003 HSPC 003
HSPC_004 -4.11 -7.04 0.45 10.58 -3.09 -8.30 -6.01 2.27 HSPC_004 HSPC 004
HSPC_006 -7.89 5.15 -2.95 12.45 4.77 4.63 5.46 1.97 HSPC_006 HSPC 006
HSPC_008 -10.29 -8.06 -8.33 -2.69 1.86 -2.42 0.03 -2.01 HSPC_008 HSPC 008

The next task is to plot PC2 against PC1 and colour by cell type. This is just a scatterplot so we can use geom_point(). We will use colour to indicate the cell type.

🎬 Plot PC2 against PC1 and colour by cell type:

pca_labelled |> 
  ggplot(aes(x = PC1, y = PC2, 
             colour = cell_type)) +
  geom_point(alpha = 0.4) +
  theme_classic()

There is a good clustering of cell types but plenty of overlap. You can also try plotting PC3 or PC4 (or others).

I prefer to customise the colours. I especially like the viridis colour scales which provide colour scales that are perceptually uniform in both colour and black-and-white. They are also designed to be perceived by viewers with common forms of colour blindness. See Introduction to viridis for more information.

ggplot provides functions to access the viridis scales. Here I use scale_fill_viridis_d(). The d stands for discrete used because cell type is a discrete variable. The function scale_fill_viridis_c() would be used for continuous data. I’ve used the default β€œviridis” (or β€œD”) option (do ?scale_fill_viridis_d for all the options) and used the begin and end arguments to control the range of colour - I have set the range to be from 0.15 to 0.95 the avoid the strongest contrast. I have also set the name argument to NULL because that the legend refers to cell types is obvious.

🎬 Plot PC2 against PC1 and colour by cell type:

pca_labelled |> 
  ggplot(aes(x = PC1, y = PC2, 
             colour = cell_type)) +
  geom_point(alpha = 0.4) +
  scale_colour_viridis_d(end = 0.95, begin = 0.15,
                         name = NULL) +
  theme_classic()

Now go to Volcano plots for 🐭 Stem cells

Visualise all the results with a volcano plot

A volcano plot is a scatterplot that shows statistical significance (p-value) versus magnitude of change (fold change). As the independent study explained, we plot -log10(adjusted p-value) on the y-axis so that the most significant genes are uppermost.

🐸 Frog development

We will add a column to the results dataframe that contains the -log10(padj). You could perform this transformation within the plot command without adding a column to the data if you prefer.

🎬 Add a column to the results dataframe that contains the -log10(padj):

s30_results <- s30_results |> 
  mutate(log10_padj = -log10(padj)) 

🎬 Create a volcano plot of the results:

s30_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj)) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  theme_classic() +
  theme(legend.position = "none")

Our dashed lines are at -log10(0.05) and log2(2) and log2(-2) to make more clear which genes (points) are significantly different between the control and the FGF-treated samples and have a fold change of at least 2.

In most cases, people colour the points to show that the quadrants. I like to add columns to the dataframe to indicate if the gene is significant and if the fold change is large and use those variables in the plot.

🎬 Add columns to the results dataframe to indicate if the gene is significant and if the fold change is large:

s30_results <- s30_results |> 
  mutate(sig = padj <= 0.05,
         bigfc = abs(log2FoldChange) >= 2) 

The use of abs() (absolute) means genes with a fold change of at least 2 in either direction will be considered to have a large fold change.

Now we can colour the points by these new columns. I use interaction() to create four categories:

  • not significant and not large fold change (FF)
  • significant and not large fold change (TF)
  • not significant and large fold (FT)
  • significant and large fold change (TT)

And I use scale_colour_manual() to set the colours for these categories.

🎬 Create a volcano plot of the results with the points coloured by significance and fold change:

s30_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                                 "pink",
                                 "gray30",
                                 "deeppink")) +
  theme_classic() +
  theme(legend.position = "none")

For exploring the data, I like add labels to all the significant genes with a large fold change so I can very quickly identity them. The ggrepel package has a function geom_text_repel() that is useful for adding labels so that they don’t overlap.

🎬 Load the package:

🎬 Add labels to the significant genes with a large fold change:

s30_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                                 "pink",
                                 "gray30",
                                 "deeppink")) +
  geom_text_repel(data = s30_results |> 
                    filter(bigfc == TRUE, sig == TRUE),
                  aes(label = xenbase_gene_symbol),
                  size = 3,
                  max.overlaps = 50) +
  theme_classic() +
  theme(legend.position = "none")

Notice that I have used filter() label only the genes that are both significant and have a large fold change. In systems you are familiar with, this labelling is very informative and can help you quickly identify common themes. Key to interpreting the volcano plot is to remember that positive fold changes means the gene is up-regulated in the FGF-treated samples and negative fold changes means the gene is down-regulated (i.e., higher in the control). This was determined by the order of the treatments in the contrast used in the DESeq2 analysis

If you do forget which way round you did the comparison, you can always examine the results dataframe to see which of the treatments seem to be higher for the positive fold changes.

Please note that Betsy doesn’t like graphs like this in the report!

When you have a gene of interest, you may wish to label it, and only it, on the plot.
This is done in the same way except that you filter the data to only include the gene of interest. I have used and then use geom_label_repel() rather than geom_text_repel() to put the label in a box and nudged it’s position to get a line connecting the point and the label. I have also increased the size of the point.

🎬 Add a label to one gene of interest (hoxb9.S) and :

s30_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                                 "pink",
                                 "gray30",
                                 "deeppink")) +
  geom_label_repel(data = s30_results |> 
                    filter(xenbase_gene_symbol == "hoxb9.S"),
                  aes(label = xenbase_gene_symbol),
                  size = 4,
                  nudge_x = .5,
                  nudge_y = 1.5) +
  geom_point(data = s30_results |> 
                    filter(xenbase_gene_symbol == "hoxb9.S"),
                size = 3) +
  theme_classic() +
  theme(legend.position = "none")

Should you want to label more than one gene, you will need to use (for example): filter(xenbase_gene_symbol %in% c("hoxb9.S", "fzd7.S"))

Now go to Save your plots

πŸŽ„ Arabidopsis

We will add a column to the results dataframe that contains the -log10(padj). You could perform this transformation within the plot command without adding a column to the data if you prefer.

🎬 Add a column to the results dataframe that contains the -log10(padj):

wild_results <- wild_results |> 
  mutate(log10_padj = -log10(padj)) 

🎬 Create a volcano plot of the results:

wild_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj)) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  theme_classic() +
  theme(legend.position = "none")

Our dashed lines are at -log10(0.05) and log2(2) and log2(-2) to make more clear which genes (points) are significantly different between the copper sufficient and deficient conditions samples and have a fold change of at least 2.

In most cases, people colour the points to show that the quadrants. I like to add columns to the dataframe to indicate if the gene is significant and if the fold change is large and use those variables in the plot.

🎬 Add columns to the results dataframe to indicate if the gene is significant and if the fold change is large:

wild_results <- wild_results |> 
  mutate(sig = padj <= 0.05,
         bigfc = abs(log2FoldChange) >= 2) 

The use of abs() (absolute) means genes with a fold change of at least 2 in either direction will be considered to have a large fold change.

Now we can colour the points by these new columns. I use interaction() to create four categories:

  • not significant and not large fold change (FF)
  • significant and not large fold change (TF)
  • not significant and large fold (FT)
  • significant and large fold change (TT)

And I use scale_colour_manual() to set the colours for these categories.

NOTE: there are no β€œsignificant and not large fold change (TF)” in this case.This means we do not need four colours. I have put the β€œpink” colour in the code but commented it out.

🎬 Create a volcano plot of the results with the points coloured by significance and fold change:

wild_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                                # "pink",
                                 "gray30",
                                 "deeppink")) +
  theme_classic() +
  theme(legend.position = "none")

For exploring the data, I like add labels to all the significant genes with a large fold change so I can very quickly identity them. The ggrepel package has a function geom_text_repel() that is useful for adding labels so that they don’t overlap.

🎬 Load the package:

🎬 Add labels to the significant genes with a large fold change:

wild_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                               #  "pink",
                                 "gray30",
                                 "deeppink")) +
  geom_text_repel(data = wild_results |> 
                    filter(bigfc == TRUE, sig == TRUE),
                  aes(label = external_gene_name),
                  size = 3,
                  max.overlaps = 50) +
  theme_classic() +
  theme(legend.position = "none")

Notice that I have used filter() label only the genes that are both significant and have a large fold change. In systems you are familiar with, this labelling is very informative and can help you quickly identify common themes. Key to interpreting the volcano plot is to remember that positive fold changes means the gene is up-regulated in the sufficient conditions and negative fold changes means the gene is down-regulated (i.e., higher in the deficient). This was determined by the order of the treatments in the contrast used in the DESeq2 analysis

If you do forget which way round you did the comparison, you can always examine the results dataframe to see which of the treatments seem to be higher for the positive fold changes.

When you have a gene of interest, you may wish to label it, and only it, on the plot. This is done in the same way except that you filter the data to only include the gene of interest. I have used and then use geom_label_repel() rather than geom_text_repel() to put the label in a box and nudged it’s position to get a line connecting the point and the label. I have also increased the size of the point.

🎬 Add a label to one gene of interest (hoxb9.S) and :

wild_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                               #  "pink",
                                 "gray30",
                                 "deeppink")) +
  geom_label_repel(data = wild_results |> 
                    filter(external_gene_name == "FRO4"),
                  aes(label = external_gene_name),
                  size = 4,
                  nudge_x = .5,
                  nudge_y = 1.5) +
  geom_point(data = wild_results |> 
                    filter(external_gene_name == "FRO4"),
                size = 3) +
  theme_classic() +
  theme(legend.position = "none")

Should you want to label more than one gene, you will need to use (for example): filter(external_gene_name %in% c("FRO4", "FRO5"))

Now go to Save your plots

πŸ’‰ Leishmania

We will add a column to the results dataframe that contains the -log10(padj). You could perform this transformation within the plot command without adding a column to the data if you prefer.

🎬 Add a column to the results dataframe that contains the -log10(padj):

pro_meta_results <- pro_meta_results |> 
  mutate(log10_padj = -log10(padj)) 

🎬 Create a volcano plot of the results:

pro_meta_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj)) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  theme_classic() +
  theme(legend.position = "none")

Our dashed lines are at -log10(0.05) and log2(2) and log2(-2) to make more clear which genes (points) are significantly different between the life stages and have a fold change of at least 2. Whilst we have very many genes that are significant, we have fewer that are significant and have a large fold change.

In most cases, people colour the points to show that the quadrants. I like to add columns to the dataframe to indicate if the gene is significant and if the fold change is large and use those variables in the plot.

🎬 Add columns to the results dataframe to indicate if the gene is significant and if the fold change is large:

pro_meta_results <- pro_meta_results |> 
  mutate(sig = padj <= 0.05,
         bigfc = abs(log2FoldChange) >= 2) 

The use of abs() (absolute) means genes with a fold change of at least 2 in either direction will be considered to have a large fold change.

Now we can colour the points by these new columns. I use interaction() to create four categories:

  • not significant and not large fold change (FF)
  • significant and not large fold change (TF)
  • not significant and large fold (FT)
  • significant and large fold change (TT)

And I use scale_colour_manual() to set the colours for these categories.

🎬 Create a volcano plot of the results with the points coloured by significance and fold change:

pro_meta_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                                 "pink",
                                 "gray30",
                                 "deeppink")) +
  theme_classic() +
  theme(legend.position = "none")

For exploring the data, I like add labels to all the significant genes with a large fold change so I can very quickly identity them. The ggrepel package has a function geom_text_repel() that is useful for adding labels so that they don’t overlap.

🎬 Load the package:

🎬 Add labels to the significant genes with a large fold change:

pro_meta_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                                 "pink",
                                 "gray30",
                                 "deeppink")) +
  geom_text_repel(data = pro_meta_results |> 
                    filter(bigfc == TRUE, sig == TRUE),
                  aes(label = description),
                  size = 3,
                  max.overlaps = 50) +
  theme_classic() +
  theme(legend.position = "none")

Notice that I have used filter() label only the genes that are both significant and have a large fold change. In systems you are familiar with, this labelling is very informative and can help you quickly identify common themes. However, in this case, we do not have good annotation for the genes. We can label only with the gene id or the description. Many of the descriptions are

🎬 Add labels to the significant genes with a large fold change only where description doesn’t contain β€œhypothetical” or β€œunspecified :

pro_meta_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                                 "pink",
                                 "gray30",
                                 "deeppink")) +
  geom_text_repel(data = pro_meta_results |> 
                    filter(bigfc == TRUE, sig == TRUE,
                           !str_detect(description, "hypothetical|unspecified")),
                  aes(label = description),
                  size = 3,
                  max.overlaps = 50) +
  theme_classic() +
  theme(legend.position = "none")

Key to interpreting the volcano plot is to remember that positive fold changes means the gene is up-regulated in the procyclic stage and negative fold changes means the gene is down-regulated (i.e., higher in the metacyclic). This was determined by the order of the treatments in the contrast used in the DESeq2 analysis

If you do forget which way round you did the comparison, you can always examine the results dataframe to see which of the treatments seem to be higher for the positive fold changes.

When you have a gene of interest, you may wish to label it, and only it, on the plot. This is done in the same way except that you filter the data to only include the gene of interest. I have used and then use geom_label_repel() rather than geom_text_repel() to put the label in a box and nudged it’s position to get a line connecting the point and the label. I have also increased the size of the point.

🎬 Add a label to one gene of interest (elongation factor 1-alpha) and :

pro_meta_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                                 "pink",
                                 "gray30",
                                 "deeppink")) +
  geom_label_repel(data = pro_meta_results |> 
                    filter(description == "elongation factor 1-alpha"),
                  aes(label = description),
                  size = 4,
                  nudge_x = .5,
                  nudge_y = 1.5) +
  geom_point(data = pro_meta_results |> 
                    filter(description == "elongation factor 1-alpha"),
                size = 3) +
  theme_classic() +
  theme(legend.position = "none")

Should you want to label more than one gene, you will need to use (for example): filter(description %in% c("elongation factor 1-alpha", "ADP/ATP translocase 1 putative"))

Now go to Save your plots

🐭 Stem cells

We will add a column to the results dataframe that contains the -log10(FDR). You could perform this transformation within the plot command without adding a column to the data if you prefer.

🎬 Add a column to the results dataframe that contains the -log10(FDR):

hspc_prog_results <- hspc_prog_results |> 
  mutate(log10_FDR = -log10(FDR)) 

🎬 Create a volcano plot of the results:

hspc_prog_results |> 
  ggplot(aes(x = summary.logFC, 
             y = log10_FDR)) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  theme_classic() +
  theme(legend.position = "none")

Our dashed lines are at -log10(0.05) and log2(2) and log2(-2) to make more clear which genes (points) are significantly different between the cell types and have a fold change of at least 2.

In most cases, people colour the points to show that the quadrants. I like to add columns to the dataframe to indicate if the gene is significant and if the fold change is large and use those variables in the plot.

🎬 Add columns to the results dataframe to indicate if the gene is significant and if the fold change is large:

hspc_prog_results <- hspc_prog_results |> 
  mutate(sig = FDR <= 0.05,
         bigfc = abs(summary.logFC) >= 2) 

The use of abs() (absolute) means genes with a fold change of at least 2 in either direction will be considered to have a large fold change.

Now we can colour the points by these new columns. I use interaction() to create four categories:

  • not significant and not large fold change (FF)
  • significant and not large fold change (TF)
  • not significant and large fold (FT)
  • significant and large fold change (TT)

And I use scale_colour_manual() to set the colours for these categories.

NOTE: there are no β€œnot significant and large fold change (FT)” in this case. This means we do not need four colours. I have put the β€œgray30” colour in the code but commented it out.

🎬 Create a volcano plot of the results with the points coloured by significance and fold change:

hspc_prog_results |> 
  ggplot(aes(x = summary.logFC, 
             y = log10_FDR, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                                 "pink",
                                # "gray30",
                                 "deeppink")) +
  theme_classic() +
  theme(legend.position = "none")

For exploring the data, I like add labels to all the significant genes with a large fold change so I can very quickly identity them. The ggrepel package has a function geom_text_repel() that is useful for adding labels so that they don’t overlap.

🎬 Load the package:

🎬 Add labels to the significant genes with a large fold change:

hspc_prog_results |> 
  ggplot(aes(x = summary.logFC, 
             y = log10_FDR, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                                 "pink",
                              #   "gray30",
                                 "deeppink")) +
  geom_text_repel(data = hspc_prog_results |> 
                    filter(bigfc == TRUE, sig == TRUE),
                  aes(label = external_gene_name),
                  size = 3,
                  max.overlaps = 50) +
  theme_classic() +
  theme(legend.position = "none")

Notice that I have used filter() label only the genes that are both significant and have a large fold change. In systems you are familiar with, this labelling is very informative and can help you quickly identify common themes. Key to interpreting the volcano plot is to remember that positive fold changes means the gene is up-regulated in the Prog and negative fold changes means the gene is down-regulated (i.e., higher HSPC). This was determined by us choosing the results_hspc_prog$prog dataframe from the list object

If you do forget which way round you did the comparison, you can always examine the gene summary dataframe to see which of the treatments seem to be higher for the positive fold changes.

When you have a gene of interest, you may wish to label it on the plot. This is done in the same way except that you filter the data to only include the gene of interest. I have used and then use geom_label_repel() rather than geom_text_repel() to put the label in a box and nudged it’s position to get a line connecting the point and the label. I have also increased the size of the point.

🎬 Add a label to one gene of interest (Procr) and :

hspc_prog_results |> 
  ggplot(aes(x = summary.logFC, 
             y = log10_FDR, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                                 "pink",
                                # "gray30",
                                 "deeppink")) +
  geom_label_repel(data = hspc_prog_results |> 
                    filter(external_gene_name == "Procr"),
                  aes(label = external_gene_name),
                  size = 4,
                  nudge_x = .5,
                  nudge_y = 1.5) +
  geom_point(data = hspc_prog_results |> 
                    filter(external_gene_name == "Procr"),
                size = 3) +
  theme_classic() +
  theme(legend.position = "none")

Should you want to label more than one gene, you will need to use (for example): filter(external_gene_name %in% c("Procr", "Emb"))

Now go to Save your plots

Save your plots

Once you have finished designing your plots, you should save them using ggsave(). You will want to assign the plot to a variable and use code similar to this:

ggsave("figures/my-informative-file-name.png",
       plot = vol,
       height = 4.5, 
       width = 4.5,
       units = "in",
       device = "png")

I recommend saving your figure in the approximate size it will appear in your report, that is, don’t resize it in word/google docs. This because the font will then be an appropriate size for the report. For journals, we often have to creat figure files of very high resolution. You can do this by increasing the height and width and then resizing the figure in the document. however, this requires resizing the all font and lines in the figure. I also recommend saving it as a png file as this is a lossless format.

πŸ€— Look after future you!

🎬 Go through your script and tidy up. I would suggest restarting R and trying to run the full pipeline from start to finish. You might need to:

  • collect together library statements at the top of the script
  • remove code that you needed to start today but which wouldn’t be needed running the script from the top (e.g., importing the results)
  • edit your comments for clarity
  • rename variables for consistency or clarity
  • remove house keeping or exploratory code
  • restyle code, add code section headers etc

πŸ₯³ Finished

Well Done!

Independent study following the workshop

Consolidate

The Code file

These contain all the code needed in the workshop even where it is not visible on the webpage.

The workshop.qmd file is the file I use to compile the practical. Qmd stands for Quarto markdown. It allows code and ordinary text to be interleaved to produce well-formatted reports including webpages. Right-click on the link and choose Save-As to download. You will be able to open the Qmd file in RStudio. Alternatively, View in Browser. Coding and thinking answers are marked with #---CODING ANSWER--- and #---THINKING ANSWER---

Pages made with R (R Core Team 2024), Quarto (Allaire et al. 2024), knitr (Xie 2024, 2015, 2014), kableExtra (Zhu 2021)

References

Allaire, J. J., Charles Teague, Carlos Scheidegger, Yihui Xie, and Christophe Dervieux. 2024. β€œQuarto.” https://doi.org/10.5281/zenodo.5960048.
R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Wickham, Hadley. 2023. Conflicted: An Alternative Conflict Resolution Strategy. https://conflicted.r-lib.org/.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain FranΓ§ois, Garrett Grolemund, et al. 2019. β€œWelcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Xie, Yihui. 2014. β€œKnitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC.
β€”β€”β€”. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. https://yihui.org/knitr/.
β€”β€”β€”. 2024. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.
Zhu, Hao. 2021. β€œkableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax.” https://CRAN.R-project.org/package=kableExtra.