Workshop

Transcriptomics 3: Visualising

Author

Emma Rand

Published

11 February, 2025

Introduction

Session overview

In the workshop, you will learn how to conduct and plot a Principle Component Analysis (PCA) as well as how to create a nicely formatted Volcano plot. You will also save significant genes to file to make it easier to identify genes of interest. ~~and perform Gene Ontology (GO) term enrichment analysis.~~

Set up

🐸 Frog development

🎬 Open the frogs-88H RStudio Project and the cont-fgf-s30.R script.

🎄 Arabidopsis

🎬 Open the arabi-88H RStudio Project and the wildsuf-wilddef-s30.R script.

💉 Leishmania

🎬 Open the leish-88H RStudio Project and the pro-meta-s30.R script.

🐭 Stem cells

🎬 Open the mice-88H RStudio Project and the hspc-prog.R script.

Everyone

🎬 Make a new folder figures in the project directory.

This is where we will save our figure files

🎬 Load tidyverse (Wickham et al. 2019) and conflicted (Wickham 2023). You most likely have this code at the top of your script already.

library(tidyverse)
library(conflicted)

── Attaching core tidyverse packages ─────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ───────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package to force all conflicts to become errors

I recommend you set the dplyr versions of filter() and select() to use by default

🎬 Use the dplyr version of filter() by default:

conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::select)

Import

Everyone

🎬 Import your results data. This should be a file in the results folder called xxxx_results.csv where xxxx indicates the comparison you made.

🎬 Remind yourself what is in the rows and columns and the structure of the dataframes (perhaps using glimpse())

When we do PCA we will want to label the samples with their treatment for figures. For 🐸 Frog development, 🎄 Arabidopsis and 💉 Leishmania, this labelling information is most easily added using the metadata. You will need to filter for only the samples in the comparison that was made in the results file.

You may need to refer back to the Week 4 Statistical Analysis workshop (in section 2. Create DESeqDataSet object) to remind yourself how to import and filter the metadata you need.

🎬 Import the metadata that maps the sample names to treatments. Remember to select only the samples for comparison that was made.

For 🐭 Stem cells, cell types are encoded in the column names. We will do some regular expression magic to extract the cell type from the column names.

log₂ transform the normalised counts

We use the normalised counts for data visualisations so that the comparisons are meaningful. Since the fold changes are given is log₂ it is useful to log₂ transform the normalised counts too. We will add columns to the dataframe with these transformed values. Since we have some counts of 0 we will add a tiny amount to avoid -Inf values.

log₂ transformation would be applied to one column like this:

# DO NOT DO
# log2 transform the counts plus a tiny amount to avoid log(0)
dataframe <- dataframe |>
  mutate(log2_mycolumn = log2(mycolumn + 0.001))

We are going to use a wonderful bit of R wizardry to apply a transformation to multiple columns. This is the across() function which has three arguments:

across(.cols, .fns, .names)

where:

.cols is the selection of columns to transform
.fns is the function we want to apply to the selected columns
.names is the naming convention for the new columns

The general form of the code you need is:

# DO NOT DO
# log2 transform the counts plus a tiny amount to avoid log(0)
xxxx_results <- xxxx_results |>
  mutate(across(starts_with("pattern"), 
                \(x) log2(x + 0.001),
                .names = "log2_{.col}"))

where:

xxxx_results is the name of the dataframe of results
pattern matches the starting letters for all of the normalised counts so that starts_with("pattern") gives the selection of columns to transform
the bit after the \(x) is the function we want to apply to the selected columns
the \(x) means it is an “anonymous” function which means we don’t have to define a function name.
"log2_{.col}" means the columns will have the same name (in the .col) but with the prefix log2_ added.

You can read more about across() and anonymous functions from my posit::conf(2024) workshop

🐸 Frog development, 🎄 Arabidopsis and 💉 Leishmania

🎬 Design the code to log₂ transform the normalised counts using the template given

I recommend viewing the dataframe to see the new columns. Check you have the expected number of columns.

🐭 Stem cells

You do not need to apply this transformation because the data is already log₂ transformed.

Everyone

We now all have dataframes with all the information we need: normalised counts, log₂ normalised counts, statistical comparisons with fold changes and p-values, and information about the gene.

Write the significant genes to file

Everyone

We will create dataframe of the significant genes and write them to file. This is subset from the results file but will make it a little easier to examine and select genes of interest.

The general form of the code you need is:

# DO NOT DO
# create a dataframe of genes significant at 0.05 level
xxxx_results_sig0.05 <- xxxx_results |> 
  filter(pvalue <= 0.05)

Note that you determine the significance level using the adjusted p-values rather than the uncorrected p-values. This column is name padj in DESeq2 output and FDR in scran output.

🎬 Create a dataframe of the genes significant at the 0.05 level using filter(). You will need to know the name of column with the adjusted p-values.

❓How many genes are significant at the 0.05 levels?

If you have a very large number of genes significant at the 0.05 level, you may want to consider a more stringent cut-off such as 0.01.

🎬 Write the dataframe to a csv file. I recommend using the same file name as you used for the dataframe.

Principal Component Analysis (PCA)

We have many genes in our datasets. PCA will allow us to plot our samples in the “gene expression” space so we can see if replicates with the same treatment cluster together as we would expect. PCA is a dimension reduction technique that finds the directions of maximum variance in the data. We are doing PCA after our statistical analysis but often it is one of the first techniques used to explore the data. If we do not see treatment effects in a PCA plot then we are not likely to see them in the statistical analysis. PCA can also help you identify any outlier replicates. The PCA is best conducted on the normalised counts or the log₂ transformed normalised counts. We will use the log₂ transformed normalised counts.

We will carry out the following steps to do the PCA:

Select the log₂ transformed normalised counts. You can only use numerical data in PCA.
Transpose our data. We have genes in rows and samples in columns (this is common for gene expression data). However, to treatment the genes as variables, PCA expects samples in rows and genes in columns.
Add the gene names as column names in the transposed data
Perform the PCA
Extract the scores on the first two principal components and label the data
Plot the the first two principal components as a scatter plot

In each case we will use only the samples for the comparison that was made in the PCA. However, it would be informative to do PCA on all the samples you have and that is something you may want to consider in your independent study.

Now go to PCA for:

🐸 Frog development
🎄 Arabidopsis
💉 Leishmania
🐭 Stem cells

🐸 Frog development

🎬 Transpose the log₂ transformed normalised counts:

s30_log2_trans <- s30_results |> 
  select(starts_with("log2_")) |>
  t() |> 
  data.frame()

We have used the select() function to select all the columns that start with log2_. We then use the t() function to transpose the dataframe. We then convert the resulting matrix to a dataframe using data.frame(). If you view that dataframe you’ll see it has default column name which we can fix using colnames() to set the column names to the Xenbase gene ids.

🎬 Set the column names to the Xenbase gene ids:

colnames(s30_log2_trans) <- s30_results$xenbase_gene_id

🎬 Perform PCA on the log₂ transformed normalised counts:

pca <- s30_log2_trans |>
  prcomp(rank. = 4)

The rank. argument tells prcomp() to only calculate the first 4 principal components. This is useful for visualisation as we can only plot in 2 or 3 dimensions. We can see the results of the PCA by viewing the summary() of the pca object.

summary(pca)

Importance of first k=4 (out of 6) components:
                           PC1     PC2     PC3     PC4
Standard deviation     64.0124 47.3351 38.4706 31.4111
Proportion of Variance  0.4243  0.2320  0.1532  0.1022
Cumulative Proportion   0.4243  0.6562  0.8095  0.9116

The Proportion of Variance tells us how much of the variance is explained by each component. We can see that the first component explains 0.4243 of the variance, the second 0.2320, and the third 0.1532. Together the first three components explain nearly 81% of the total variance in the data. Plotting PC1 against PC2 will capture about 66% of the variance which is likely very much better than we would get plotting any two genes against each other. To plot the PC1 against PC2 we will need to extract the PC1 and PC2 “scores” from the PCA object and add labels for the samples. Those labels will come from the row names of the transformed data which has the sample ids and from the metadata.

🎬 Create a vector of the sample ids from the row names. These include the log2 prefix which we can removed for labelling:

sample_id <- row.names(s30_log2_trans) |> str_remove("log2_")

You might want to check the result.

Now we will extract the PC1 and PC2 scores from the PCA object and add. Our PCA object is called pca and the scores are in pca$x. We will create a dataframe of the scores and add the sample ids.

🎬 Create a dataframe of PC1 and PC2 scores and add the sample ids:

pca_labelled <- data.frame(pca$x,
                           sample_id)

🎬 Merge with the metadata so we can label points by treatment and sibling pair:

pca_labelled <- pca_labelled |> 
  left_join(meta_s30, 
            by = "sample_id")

Since the metadata contained the sample ids, it was especially important to remove the log2_ from the row names so that the join would work.

The dataframe should look like this:

PC1	PC2	PC3	PC4	sample_id	stage	treatment	sibling_rep
-76.38391	0.814699	-60.728327	-5.820669	S30_C_1	stage_30	control	one
-67.02571	25.668563	51.476835	28.480254	S30_C_2	stage_30	control	two
-14.02772	-78.474054	15.282058	-9.213076	S30_C_3	stage_30	control	three
47.60726	49.035510	-19.288753	20.928290	S30_F_1	stage_30	FGF	one
26.04954	32.914201	20.206072	-55.752818	S30_F_2	stage_30	FGF	two
83.78054	-29.958919	-6.947884	21.378020	S30_F_3	stage_30	FGF	three

The next task is to plot PC2 against PC1 and colour by sibling pair. This is just a scatterplot so we can use geom_point(). We will use colour to indicate the sibling pair and shape to indicate the treatment.

🎬 Customise the PC2 against PC1 plot:

pca_labelled |> 
  ggplot(aes(x = PC1, y = PC2, 
             colour = sibling_rep,
             shape = treatment)) +
  geom_point(size = 3) +
  theme_classic()

There is a good separation between treatments on PCA1. The sibling pairs do not seem to cluster together. You can also try plotting PC3 or PC4.

I prefer to customise the colours and shapes. I especially like the
viridis colour scales which provide colour scales that are perceptually uniform in both colour and black-and-white. They are also designed to be perceived by viewers with common forms of colour blindness. See Introduction to viridis for more information.

ggplot provides functions to access the viridis scales. Here I use scale_fill_viridis_d(). The d stands for discrete. The function scale_fill_viridis_c() would be used for continuous data. I’ve used the default “viridis” (or “D”) option (do ?scale_fill_viridis_d for all the options) and used the begin and end arguments to control the range of colour - I have set the range to be from 0.15 to 0.95 the avoid the strongest contrast. I have also set the name argument to provide a label for the legend.

I have used scale_shape_manual() to set the shapes for the treatments. I have used the values 21 and 19 which are the codes for filled and open circles and filled triangles. I have set the name argument to NULL to remove the label (it’s obvious what that categories are treatments) and the labels argument to improve the legend.

🎬 Plot PC2 against PC1 and colour by sibling pair and shape by treatment:

pca_labelled |> 
  ggplot(aes(x = PC1, y = PC2, 
             colour = sibling_rep,
             shape = treatment)) +
  geom_point(size = 3) +
  scale_colour_viridis_d(end = 0.95, begin = 0.15,
                         name = "Sibling pair") +
  scale_shape_manual(values = c(21, 19),
                     name = NULL,
                     labels = c("Control", "FGF-Treated")) +
  theme_classic()

Now go to Volcano plots for 🐸 Frog development

🎄 Arabidopsis

🎬 Transpose the log₂ transformed normalised counts:

wild_log2_trans <- wild_results |> 
  select(starts_with("log2_")) |>
  t() |> 
  data.frame()

🎬 Set the column names to the gene ids:

colnames(wild_log2_trans) <- wild_results$gene_id

🎬 Perform PCA on the log₂ transformed normalised counts:

pca <- wild_log2_trans |>
  prcomp(rank. = 4, scale = TRUE)

The scale argument tells prcomp() to scale the data before doing the PCA. This is important when the variables are on different scales to stop variables with large values dominating the PCA. The rank. argument tells prcomp() to only calculate the first 4 principal components. This is useful for visualisation as we can only plot in 2 or 3 dimensions. We can see the results of the PCA by viewing the summary() of the pca object.

summary(pca)

Importance of components:
                            PC1     PC2     PC3       PC4
Standard deviation     123.0171 85.6822 62.2924 2.186e-13
Proportion of Variance   0.5742  0.2786  0.1472 0.000e+00
Cumulative Proportion    0.5742  0.8528  1.0000 1.000e+00

The Proportion of Variance tells us how much of the variance is explained by each component. We can see that the first component explains 0.5742 of the variance, the second 0.2786, and the third 0.1472. Together the first three components explain nearly 100% of the total variance in the data. Plotting PC1 against PC2 will capture about 92% of the variance which is very likely very much better than we would get plotting any two genes against each other. To plot the PC1 against PC2 we will need to extract the PC1 and PC2 “scores” from the PCA object and add labels for the samples. Those labels will come from the row names of the transformed data which has the sample ids and from the metadata.

🎬 Create a vector of the sample ids from the row names. These include the log2 prefix which we can removed for labelling:

sample_id <- row.names(wild_log2_trans) |> str_remove("log2_")

You might want to check the result.

Now we will extract the PC1 and PC2 scores from the PCA object and add. Our PCA object is called pca and the scores are in pca$x. We will create a dataframe of the scores and add the sample ids.

🎬 Create a dataframe of PC1 and PC2 scores and add the sample ids:

pca_labelled <- data.frame(pca$x,
                           sample_id)

🎬 Merge with the metadata so we can label points by treatment and sibling pair:

pca_labelled <- pca_labelled |> 
  left_join(meta_wild, 
            by = "sample_id")

Since the metadata contained the sample ids, it was especially important to remove the log2_ from the row names so that the join would work.

The dataframe should look like this:

PC1	PC2	PC3	sample_id	genotype	copper
-91.23038	105.92675	25.80774	SRX028956_wild_suf	wt	sufficient
-120.00085	-95.79115	-13.72618	SRX028957_wild_def	wt	deficient
92.06667	23.00841	-79.23090	SRX028960_wild_suf	wt	sufficient
119.16456	-33.14401	67.14935	SRX028961_wild_def	wt	deficient

The next task is to plot PC2 against PC1 and colour by copper conditions. This is just a scatterplot so we can use geom_point(). We will use colour to indicate the copper conditions.

🎬 Plot PC2 against PC1 and colour by copper conditions:

pca_labelled |> 
  ggplot(aes(x = PC3, y = PC2, 
             colour = copper)) +
  geom_point(size = 3) +
  theme_classic()

We do not see particularly good separation between treatments, though perhaps there is some separation between the copper sufficient and copper deficient on PC2. It is also difficult to see if the replicates cluster together (the treatments separate) when there are only two reps. You can also try plotting PC3 or PC4.

🎬 Customise the PC2 against PC1 plot:

pca_labelled |> 
  ggplot(aes(x = PC1, y = PC2, 
             colour = copper)) +
  geom_point(size = 3) +
  scale_colour_viridis_d(end = 0.95, begin = 0.15,
                         name = "Copper") +
  theme_classic()

Now go to Volcano plots for 🎄 Arabidopsis

💉 Leishmania

🎬 Transpose the log₂ transformed normalised counts:

pro_meta_log2_trans <- pro_meta_results |> 
  select(starts_with("log2_")) |>
  t() |> 
  data.frame()

🎬 Set the column names to the gene ids:

colnames(pro_meta_log2_trans) <- pro_meta_results$gene_id

🎬 Perform PCA on the log₂ transformed normalised counts:

pca <- pro_meta_log2_trans |>
  prcomp(rank. = 4, scale = TRUE)

summary(pca)

Importance of first k=4 (out of 6) components:
                           PC1     PC2     PC3      PC4
Standard deviation     66.0744 41.9741 29.4077 27.61377
Proportion of Variance  0.5175  0.2088  0.1025  0.09038
Cumulative Proportion   0.5175  0.7263  0.8288  0.91916

The Proportion of Variance tells us how much of the variance is explained by each component. We can see that the first component explains 0.5175 of the variance, the second 0.2088, and the third 0.1025. Together the first three components explain nearly 92% of the total variance in the data. Plotting PC1 against PC2 will capture about 66% of the variance which is likely very much better than we would get plotting any two genes against each other. To plot the PC1 against PC2 we will need to extract the PC1 and PC2 “scores” from the PCA object and add labels for the samples. Those labels will come from the row names of the transformed data which has the sample ids and from the metadata.

🎬 Create a vector of the sample ids from the row names. These include the log2 prefix which we can removed for labelling:

sample_id <- row.names(pro_meta_log2_trans) |> str_remove("log2_")

You might want to check the result.

Now we will extract the PC1 and PC2 scores from the PCA object and add. Our PCA object is called pca and the scores are in pca$x. We will create a dataframe of the scores and add the sample ids.

🎬 Create a dataframe of PC1 and PC2 scores and add the sample ids:

pca_labelled <- data.frame(pca$x,
                           sample_id)

🎬 Merge with the metadata so we can label points by life cycle stage:

pca_labelled <- pca_labelled |> 
  left_join(meta_pro_meta, 
            by = "sample_id")

Since the metadata contained the sample ids, it was especially important to remove the log2_ from the row names so that the join would work.

The dataframe should look like this:

PC1	PC2	PC3	PC4	sample_id	stage	replicate
-59.42107	-2.922229	-11.043368	-13.4641060	lm_pro_1	procyclic	1
-63.25818	-15.985856	-36.248191	0.7025242	lm_pro_2	procyclic	2
-57.92065	25.679645	47.373874	13.7893070	lm_pro_3	procyclic	3
56.20326	-59.707787	5.958727	31.5528028	lm_meta_1	metacyclic	1
57.41205	-11.738157	14.157242	-47.2285609	lm_meta_2	metacyclic	2
66.98459	64.674383	-20.198284	14.6480329	lm_meta_3	metacyclic	3

The next task is to plot PC2 against PC1 and colour by sibling pair. This is just a scatterplot so we can use geom_point(). We will use colour to indicate the life cycle stage.

🎬 Plot PC2 against PC1 and colour by copper conditions:

pca_labelled |> 
  ggplot(aes(x = PC1, y = PC2, 
             colour = stage)) +
  geom_point(size = 3) +
  theme_classic()

There is a good separation between treatments on PCA1. The replicates do seem to cluster together. You can also try plotting PC3 or PC4.

I prefer to customise the colours. I especially like the
viridis colour scales which provide colour scales that are perceptually uniform in both colour and black-and-white. They are also designed to be perceived by viewers with common forms of colour blindness. See Introduction to viridis for more information.

🎬 Plot PC2 against PC1 and colour by life stage:

pca_labelled |> 
  ggplot(aes(x = PC1, y = PC2, 
             colour = stage)) +
  geom_point(size = 3) +
  scale_colour_viridis_d(end = 0.95, begin = 0.15,
                         name = "Stage") +
  theme_classic()

Now go to Volcano plots for 💉 Leishmania

🐭 Stem cells

🎬 Transpose the log₂ transformed normalised counts:

hspc_prog_trans <- hspc_prog_results |> 
  dplyr::select(starts_with(c("HSPC_", "Prog_"))) |>
  t() |> 
  data.frame()

We have used the select() function to select all the columns that start with Prog_ or HSPC_. We then use the t() function to transpose the dataframe. We then convert the resulting matrix to a dataframe using data.frame(). If you view that dataframe you’ll see it has default column name which we can fix using colnames() to set the column names to the gene ids.

🎬 Set the column names to the gene ids:

colnames(hspc_prog_trans) <- hspc_prog_results$ensembl_gene_id

🎬 Perform PCA on the log₂ transformed normalised counts:

pca <- hspc_prog_trans |>
  prcomp(rank. = 8)

The rank. argument tells prcomp() to only calculate the first 8 principal components. This is useful for visualisation as we can only plot in 2 or 3 dimensions. We can see the results of the PCA by viewing the summary() of the pca object.

summary(pca)

Importance of first k=8 (out of 280) components:
                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     12.5612 8.36646 5.98988 5.41386 4.55730 4.06142 3.84444
Proportion of Variance  0.1099 0.04874 0.02498 0.02041 0.01446 0.01149 0.01029
Cumulative Proportion   0.1099 0.15861 0.18359 0.20400 0.21846 0.22995 0.24024
                           PC8
Standard deviation     3.70848
Proportion of Variance 0.00958
Cumulative Proportion  0.24982

The Proportion of Variance tells us how much of the variance is explained by each component. We can see that the first component explains 0.1099 of the variance, the second 0.04874, and the third 0.02498. Together the first three components explain 18% of the total variance in the data. This is not that high but it is also quite a lot better than than we would get plotting any two genes randomly chosen against each other.

To plot the PC1 against PC2 we will need to extract the PC1 and PC2 “scores” from the PCA object and add labels for the cells. Our PCA object is called pca and the scores are in pca$x. The cells labels will come from the row names of the transformed data.

🎬 Create a dataframe of the PC1 and PC2 scores (in pca$x) and add the cell ids:

pca_labelled <- data.frame(pca$x,
                           cell_id = row.names(hspc_prog_trans))

It will be helpful to add a column for the cell type so we can label points. One way to do this is to extract the information in the cell_id column into two columns: one with the complete cell id and one with just the cell type. This extraction is done with Regular Expression magic.

🎬 Extract the cell type and cell number from the cell_id column (keeping the cell_id column):

pca_labelled <- pca_labelled |> 
  extract(cell_id, 
          remove = FALSE,
          c("cell_type", "cell_number"),
          "([a-zA-Z]{4})_([0-9]{3})")

What this code does is take what is in the cell_id column (something like Prog_001 or HSPC_001) and split it into two columns (“cell_type” and “cell_number”). The reason why we want to do that is to colour the points by cell type. We would not want to use cell_id to colour the points because each cell id is unique and that would be 1000+ colours. The last argument in the extract() function is the pattern to match described with a regular expression. Three patterns are being matched, and two of those are in brackets meaning they are kept to fill the two new columns.

The first pattern is ([a-zA-Z]{4})

it is brackets because we want to keep it and put it in cell_type
[a-zA-Z] means any lower (a-z) or upper case letter (A-Z).
The square brackets means any of the characters in the square brackets will be matched
{4} means 4 of them.

So the first pattern inside the first (…) will match exactly 4 upper or lower case letters (like Prog or HSPC)

The second pattern is _ to match the underscore in every cell id that separates the cell type from the number. It is not in brackets because we do not want to keep it.

The third pattern is ([0-9]{3})

[0-9] means any number
{3} means 3 of them.

So the second pattern inside the second (…) will match exactly 3 numbers (like 001 or 851).

Important: Prog and HPSC have 4 letters. The column names, LT.HSPC_ have 6 characters and includes a dot. You will need to adjust the regex when make comparison between LT-HSC and other cell types. The pattern to match the LT.HSC as well as the Prog and HSPC is ([a-zA-Z.]{4,6}). Note the dot inside the square brackets and numbers meaning 4 or 6 of. The pattern to match the underscore and the cell number is the same.

The top of the dataframe should look like this (but with more decimal places)

	PC1	PC2	PC3	PC4	PC5	PC6	PC7	PC8	cell_id	cell_type	cell_number
HSPC_001	-2.62	-15.39	-7.94	-4.87	-2.52	4.29	-1.69	4.45	HSPC_001	HSPC	001
HSPC_002	0.70	-15.22	0.13	3.74	-6.47	-1.70	0.83	-3.35	HSPC_002	HSPC	002
HSPC_003	-15.52	7.80	0.53	-4.04	0.10	-1.48	-0.32	-7.40	HSPC_003	HSPC	003
HSPC_004	-4.11	-7.04	0.45	10.58	-3.09	-8.30	-6.01	2.27	HSPC_004	HSPC	004
HSPC_006	-7.89	5.15	-2.95	12.45	4.77	4.63	5.46	1.97	HSPC_006	HSPC	006
HSPC_008	-10.29	-8.06	-8.33	-2.69	1.86	-2.42	0.03	-2.01	HSPC_008	HSPC	008

The next task is to plot PC2 against PC1 and colour by cell type. This is just a scatterplot so we can use geom_point(). We will use colour to indicate the cell type.

🎬 Plot PC2 against PC1 and colour by cell type:

pca_labelled |> 
  ggplot(aes(x = PC1, y = PC2, 
             colour = cell_type)) +
  geom_point(alpha = 0.4) +
  theme_classic()

There is a good clustering of cell types but plenty of overlap. You can also try plotting PC3 or PC4 (or others).

I prefer to customise the colours. I especially like the viridis colour scales which provide colour scales that are perceptually uniform in both colour and black-and-white. They are also designed to be perceived by viewers with common forms of colour blindness. See Introduction to viridis for more information.

ggplot provides functions to access the viridis scales. Here I use scale_fill_viridis_d(). The d stands for discrete used because cell type is a discrete variable. The function scale_fill_viridis_c() would be used for continuous data. I’ve used the default “viridis” (or “D”) option (do ?scale_fill_viridis_d for all the options) and used the begin and end arguments to control the range of colour - I have set the range to be from 0.15 to 0.95 the avoid the strongest contrast. I have also set the name argument to NULL because that the legend refers to cell types is obvious.

🎬 Plot PC2 against PC1 and colour by cell type:

pca_labelled |> 
  ggplot(aes(x = PC1, y = PC2, 
             colour = cell_type)) +
  geom_point(alpha = 0.4) +
  scale_colour_viridis_d(end = 0.95, begin = 0.15,
                         name = NULL) +
  theme_classic()

Now go to Volcano plots for 🐭 Stem cells

Visualise all the results with a volcano plot

A volcano plot is a scatterplot that shows statistical significance (p-value) versus magnitude of change (fold change). As the independent study explained, we plot -log₁₀(adjusted p-value) on the y-axis so that the most significant genes are uppermost.

🐸 Frog development

We will add a column to the results dataframe that contains the -log₁₀(padj). You could perform this transformation within the plot command without adding a column to the data if you prefer.

🎬 Add a column to the results dataframe that contains the -log₁₀(padj):

s30_results <- s30_results |> 
  mutate(log10_padj = -log10(padj))

🎬 Create a volcano plot of the results:

s30_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj)) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  theme_classic() +
  theme(legend.position = "none")

Our dashed lines are at -log₁₀(0.05) and log₂(2) and log₂(-2) to make more clear which genes (points) are significantly different between the control and the FGF-treated samples and have a fold change of at least 2.

In most cases, people colour the points to show that the quadrants. I like to add columns to the dataframe to indicate if the gene is significant and if the fold change is large and use those variables in the plot.

🎬 Add columns to the results dataframe to indicate if the gene is significant and if the fold change is large:

s30_results <- s30_results |> 
  mutate(sig = padj <= 0.05,
         bigfc = abs(log2FoldChange) >= 2)

The use of abs() (absolute) means genes with a fold change of at least 2 in either direction will be considered to have a large fold change.

Now we can colour the points by these new columns. I use interaction() to create four categories:

not significant and not large fold change (FF)
significant and not large fold change (TF)
not significant and large fold (FT)
significant and large fold change (TT)

And I use scale_colour_manual() to set the colours for these categories.

🎬 Create a volcano plot of the results with the points coloured by significance and fold change:

s30_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                                 "pink",
                                 "gray30",
                                 "deeppink")) +
  theme_classic() +
  theme(legend.position = "none")

For exploring the data, I like add labels to all the significant genes with a large fold change so I can very quickly identity them. The ggrepel package has a function geom_text_repel() that is useful for adding labels so that they don’t overlap.

🎬 Load the package:

library(ggrepel)

🎬 Add labels to the significant genes with a large fold change:

s30_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                                 "pink",
                                 "gray30",
                                 "deeppink")) +
  geom_text_repel(data = s30_results |> 
                    filter(bigfc == TRUE, sig == TRUE),
                  aes(label = xenbase_gene_symbol),
                  size = 3,
                  max.overlaps = 50) +
  theme_classic() +
  theme(legend.position = "none")

Notice that I have used filter() label only the genes that are both significant and have a large fold change. In systems you are familiar with, this labelling is very informative and can help you quickly identify common themes. Key to interpreting the volcano plot is to remember that positive fold changes means the gene is up-regulated in the FGF-treated samples and negative fold changes means the gene is down-regulated (i.e., higher in the control). This was determined by the order of the treatments in the contrast used in the DESeq2 analysis

If you do forget which way round you did the comparison, you can always examine the results dataframe to see which of the treatments seem to be higher for the positive fold changes.

Please note that Betsy doesn’t like graphs like this in the report!

When you have a gene of interest, you may wish to label it, and only it, on the plot.
This is done in the same way except that you filter the data to only include the gene of interest. I have used and then use geom_label_repel() rather than geom_text_repel() to put the label in a box and nudged it’s position to get a line connecting the point and the label. I have also increased the size of the point.

🎬 Add a label to one gene of interest (hoxb9.S) and :

s30_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                                 "pink",
                                 "gray30",
                                 "deeppink")) +
  geom_label_repel(data = s30_results |> 
                    filter(xenbase_gene_symbol == "hoxb9.S"),
                  aes(label = xenbase_gene_symbol),
                  size = 4,
                  nudge_x = .5,
                  nudge_y = 1.5) +
  geom_point(data = s30_results |> 
                    filter(xenbase_gene_symbol == "hoxb9.S"),
                size = 3) +
  theme_classic() +
  theme(legend.position = "none")

Should you want to label more than one gene, you will need to use (for example): filter(xenbase_gene_symbol %in% c("hoxb9.S", "fzd7.S"))

Now go to Save your plots

🎄 Arabidopsis

We will add a column to the results dataframe that contains the -log₁₀(padj). You could perform this transformation within the plot command without adding a column to the data if you prefer.

🎬 Add a column to the results dataframe that contains the -log₁₀(padj):

wild_results <- wild_results |> 
  mutate(log10_padj = -log10(padj))

🎬 Create a volcano plot of the results:

wild_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj)) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  theme_classic() +
  theme(legend.position = "none")

Our dashed lines are at -log₁₀(0.05) and log₂(2) and log₂(-2) to make more clear which genes (points) are significantly different between the copper sufficient and deficient conditions samples and have a fold change of at least 2.

🎬 Add columns to the results dataframe to indicate if the gene is significant and if the fold change is large:

wild_results <- wild_results |> 
  mutate(sig = padj <= 0.05,
         bigfc = abs(log2FoldChange) >= 2)

The use of abs() (absolute) means genes with a fold change of at least 2 in either direction will be considered to have a large fold change.

Now we can colour the points by these new columns. I use interaction() to create four categories:

not significant and not large fold change (FF)
significant and not large fold change (TF)
not significant and large fold (FT)
significant and large fold change (TT)

And I use scale_colour_manual() to set the colours for these categories.

NOTE: there are no “significant and not large fold change (TF)” in this case.This means we do not need four colours. I have put the “pink” colour in the code but commented it out.

🎬 Create a volcano plot of the results with the points coloured by significance and fold change:

wild_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                                # "pink",
                                 "gray30",
                                 "deeppink")) +
  theme_classic() +
  theme(legend.position = "none")

🎬 Load the package:

library(ggrepel)

🎬 Add labels to the significant genes with a large fold change:

wild_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                               #  "pink",
                                 "gray30",
                                 "deeppink")) +
  geom_text_repel(data = wild_results |> 
                    filter(bigfc == TRUE, sig == TRUE),
                  aes(label = external_gene_name),
                  size = 3,
                  max.overlaps = 50) +
  theme_classic() +
  theme(legend.position = "none")

Notice that I have used filter() label only the genes that are both significant and have a large fold change. In systems you are familiar with, this labelling is very informative and can help you quickly identify common themes. Key to interpreting the volcano plot is to remember that positive fold changes means the gene is up-regulated in the sufficient conditions and negative fold changes means the gene is down-regulated (i.e., higher in the deficient). This was determined by the order of the treatments in the contrast used in the DESeq2 analysis

If you do forget which way round you did the comparison, you can always examine the results dataframe to see which of the treatments seem to be higher for the positive fold changes.

When you have a gene of interest, you may wish to label it, and only it, on the plot. This is done in the same way except that you filter the data to only include the gene of interest. I have used and then use geom_label_repel() rather than geom_text_repel() to put the label in a box and nudged it’s position to get a line connecting the point and the label. I have also increased the size of the point.

🎬 Add a label to one gene of interest (hoxb9.S) and :

wild_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                               #  "pink",
                                 "gray30",
                                 "deeppink")) +
  geom_label_repel(data = wild_results |> 
                    filter(external_gene_name == "FRO4"),
                  aes(label = external_gene_name),
                  size = 4,
                  nudge_x = .5,
                  nudge_y = 1.5) +
  geom_point(data = wild_results |> 
                    filter(external_gene_name == "FRO4"),
                size = 3) +
  theme_classic() +
  theme(legend.position = "none")

Should you want to label more than one gene, you will need to use (for example): filter(external_gene_name %in% c("FRO4", "FRO5"))

Now go to Save your plots

💉 Leishmania

We will add a column to the results dataframe that contains the -log₁₀(padj). You could perform this transformation within the plot command without adding a column to the data if you prefer.

🎬 Add a column to the results dataframe that contains the -log₁₀(padj):

pro_meta_results <- pro_meta_results |> 
  mutate(log10_padj = -log10(padj))

🎬 Create a volcano plot of the results:

pro_meta_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj)) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  theme_classic() +
  theme(legend.position = "none")

Our dashed lines are at -log₁₀(0.05) and log₂(2) and log₂(-2) to make more clear which genes (points) are significantly different between the life stages and have a fold change of at least 2. Whilst we have very many genes that are significant, we have fewer that are significant and have a large fold change.

🎬 Add columns to the results dataframe to indicate if the gene is significant and if the fold change is large:

pro_meta_results <- pro_meta_results |> 
  mutate(sig = padj <= 0.05,
         bigfc = abs(log2FoldChange) >= 2)

The use of abs() (absolute) means genes with a fold change of at least 2 in either direction will be considered to have a large fold change.

Now we can colour the points by these new columns. I use interaction() to create four categories:

not significant and not large fold change (FF)
significant and not large fold change (TF)
not significant and large fold (FT)
significant and large fold change (TT)

And I use scale_colour_manual() to set the colours for these categories.

🎬 Create a volcano plot of the results with the points coloured by significance and fold change:

pro_meta_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                                 "pink",
                                 "gray30",
                                 "deeppink")) +
  theme_classic() +
  theme(legend.position = "none")

🎬 Load the package:

library(ggrepel)

🎬 Add labels to the significant genes with a large fold change:

pro_meta_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                                 "pink",
                                 "gray30",
                                 "deeppink")) +
  geom_text_repel(data = pro_meta_results |> 
                    filter(bigfc == TRUE, sig == TRUE),
                  aes(label = description),
                  size = 3,
                  max.overlaps = 50) +
  theme_classic() +
  theme(legend.position = "none")

Notice that I have used filter() label only the genes that are both significant and have a large fold change. In systems you are familiar with, this labelling is very informative and can help you quickly identify common themes. However, in this case, we do not have good annotation for the genes. We can label only with the gene id or the description. Many of the descriptions are

🎬 Add labels to the significant genes with a large fold change only where description doesn’t contain “hypothetical” or “unspecified :

pro_meta_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                                 "pink",
                                 "gray30",
                                 "deeppink")) +
  geom_text_repel(data = pro_meta_results |> 
                    filter(bigfc == TRUE, sig == TRUE,
                           !str_detect(description, "hypothetical|unspecified")),
                  aes(label = description),
                  size = 3,
                  max.overlaps = 50) +
  theme_classic() +
  theme(legend.position = "none")

Key to interpreting the volcano plot is to remember that positive fold changes means the gene is up-regulated in the procyclic stage and negative fold changes means the gene is down-regulated (i.e., higher in the metacyclic). This was determined by the order of the treatments in the contrast used in the DESeq2 analysis

If you do forget which way round you did the comparison, you can always examine the results dataframe to see which of the treatments seem to be higher for the positive fold changes.

When you have a gene of interest, you may wish to label it, and only it, on the plot. This is done in the same way except that you filter the data to only include the gene of interest. I have used and then use geom_label_repel() rather than geom_text_repel() to put the label in a box and nudged it’s position to get a line connecting the point and the label. I have also increased the size of the point.

🎬 Add a label to one gene of interest (elongation factor 1-alpha) and :

pro_meta_results |> 
  ggplot(aes(x = log2FoldChange, 
             y = log10_padj, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                                 "pink",
                                 "gray30",
                                 "deeppink")) +
  geom_label_repel(data = pro_meta_results |> 
                    filter(description == "elongation factor 1-alpha"),
                  aes(label = description),
                  size = 4,
                  nudge_x = .5,
                  nudge_y = 1.5) +
  geom_point(data = pro_meta_results |> 
                    filter(description == "elongation factor 1-alpha"),
                size = 3) +
  theme_classic() +
  theme(legend.position = "none")

Should you want to label more than one gene, you will need to use (for example): filter(description %in% c("elongation factor 1-alpha", "ADP/ATP translocase 1 putative"))

Now go to Save your plots

🐭 Stem cells

We will add a column to the results dataframe that contains the -log₁₀(FDR). You could perform this transformation within the plot command without adding a column to the data if you prefer.

🎬 Add a column to the results dataframe that contains the -log₁₀(FDR):

hspc_prog_results <- hspc_prog_results |> 
  mutate(log10_FDR = -log10(FDR))

🎬 Create a volcano plot of the results:

hspc_prog_results |> 
  ggplot(aes(x = summary.logFC, 
             y = log10_FDR)) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  theme_classic() +
  theme(legend.position = "none")

Our dashed lines are at -log₁₀(0.05) and log₂(2) and log₂(-2) to make more clear which genes (points) are significantly different between the cell types and have a fold change of at least 2.

🎬 Add columns to the results dataframe to indicate if the gene is significant and if the fold change is large:

hspc_prog_results <- hspc_prog_results |> 
  mutate(sig = FDR <= 0.05,
         bigfc = abs(summary.logFC) >= 2)

The use of abs() (absolute) means genes with a fold change of at least 2 in either direction will be considered to have a large fold change.

Now we can colour the points by these new columns. I use interaction() to create four categories:

not significant and not large fold change (FF)
significant and not large fold change (TF)
not significant and large fold (FT)
significant and large fold change (TT)

And I use scale_colour_manual() to set the colours for these categories.

NOTE: there are no “not significant and large fold change (FT)” in this case. This means we do not need four colours. I have put the “gray30” colour in the code but commented it out.

🎬 Create a volcano plot of the results with the points coloured by significance and fold change:

hspc_prog_results |> 
  ggplot(aes(x = summary.logFC, 
             y = log10_FDR, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                                 "pink",
                                # "gray30",
                                 "deeppink")) +
  theme_classic() +
  theme(legend.position = "none")

🎬 Load the package:

library(ggrepel)

🎬 Add labels to the significant genes with a large fold change:

hspc_prog_results |> 
  ggplot(aes(x = summary.logFC, 
             y = log10_FDR, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                                 "pink",
                              #   "gray30",
                                 "deeppink")) +
  geom_text_repel(data = hspc_prog_results |> 
                    filter(bigfc == TRUE, sig == TRUE),
                  aes(label = external_gene_name),
                  size = 3,
                  max.overlaps = 50) +
  theme_classic() +
  theme(legend.position = "none")

Notice that I have used filter() label only the genes that are both significant and have a large fold change. In systems you are familiar with, this labelling is very informative and can help you quickly identify common themes. Key to interpreting the volcano plot is to remember that positive fold changes means the gene is up-regulated in the Prog and negative fold changes means the gene is down-regulated (i.e., higher HSPC). This was determined by us choosing the results_hspc_prog$prog dataframe from the list object

If you do forget which way round you did the comparison, you can always examine the gene summary dataframe to see which of the treatments seem to be higher for the positive fold changes.

When you have a gene of interest, you may wish to label it on the plot. This is done in the same way except that you filter the data to only include the gene of interest. I have used and then use geom_label_repel() rather than geom_text_repel() to put the label in a box and nudged it’s position to get a line connecting the point and the label. I have also increased the size of the point.

🎬 Add a label to one gene of interest (Procr) and :

hspc_prog_results |> 
  ggplot(aes(x = summary.logFC, 
             y = log10_FDR, 
             colour = interaction(sig, bigfc))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), 
             linetype = "dashed") +
  geom_vline(xintercept = 2, 
             linetype = "dashed") +
  geom_vline(xintercept = -2, 
             linetype = "dashed") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_manual(values = c("gray", 
                                 "pink",
                                # "gray30",
                                 "deeppink")) +
  geom_label_repel(data = hspc_prog_results |> 
                    filter(external_gene_name == "Procr"),
                  aes(label = external_gene_name),
                  size = 4,
                  nudge_x = .5,
                  nudge_y = 1.5) +
  geom_point(data = hspc_prog_results |> 
                    filter(external_gene_name == "Procr"),
                size = 3) +
  theme_classic() +
  theme(legend.position = "none")

Should you want to label more than one gene, you will need to use (for example): filter(external_gene_name %in% c("Procr", "Emb"))

Now go to Save your plots

Save your plots

Once you have finished designing your plots, you should save them using ggsave(). You will want to assign the plot to a variable and use code similar to this:

ggsave("figures/my-informative-file-name.png",
       plot = vol,
       height = 4.5, 
       width = 4.5,
       units = "in",
       device = "png")

I recommend saving your figure in the approximate size it will appear in your report, that is, don’t resize it in word/google docs. This because the font will then be an appropriate size for the report. For journals, we often have to creat figure files of very high resolution. You can do this by increasing the height and width and then resizing the figure in the document. however, this requires resizing the all font and lines in the figure. I also recommend saving it as a png file as this is a lossless format.

🤗 Look after future you!

🎬 Go through your script and tidy up. I would suggest restarting R and trying to run the full pipeline from start to finish. You might need to:

collect together library statements at the top of the script
remove code that you needed to start today but which wouldn’t be needed running the script from the top (e.g., importing the results)
edit your comments for clarity
rename variables for consistency or clarity
remove house keeping or exploratory code
restyle code, add code section headers etc

🥳 Finished

Well Done!

Independent study following the workshop

Consolidate

The Code file

These contain all the code needed in the workshop even where it is not visible on the webpage.

The workshop.qmd file is the file I use to compile the practical. Qmd stands for Quarto markdown. It allows code and ordinary text to be interleaved to produce well-formatted reports including webpages. Right-click on the link and choose Save-As to download. You will be able to open the Qmd file in RStudio. Alternatively, View in Browser. Coding and thinking answers are marked with #---CODING ANSWER--- and #---THINKING ANSWER---

Pages made with R (R Core Team 2024), Quarto (Allaire et al. 2024), knitr (Xie 2024, 2015, 2014), kableExtra (Zhu 2021)

References

Allaire, J. J., Charles Teague, Carlos Scheidegger, Yihui Xie, and Christophe Dervieux. 2024. “Quarto.” https://doi.org/10.5281/zenodo.5960048.

R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Wickham, Hadley. 2023. Conflicted: An Alternative Conflict Resolution Strategy. https://conflicted.r-lib.org/.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC.

———. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. https://yihui.org/knitr/.

———. 2024. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.

Zhu, Hao. 2021. “kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax.” https://CRAN.R-project.org/package=kableExtra.

Introduction

Session overview

Set up

🐸 Frog development

🎄 Arabidopsis

💉 Leishmania

🐭 Stem cells

Everyone

Import

Everyone

log2 transform the normalised counts

🐸 Frog development, 🎄 Arabidopsis and 💉 Leishmania

🐭 Stem cells

Everyone

Write the significant genes to file

Everyone

Principal Component Analysis (PCA)

🐸 Frog development

🎄 Arabidopsis

💉 Leishmania

🐭 Stem cells

Visualise all the results with a volcano plot

🐸 Frog development

🎄 Arabidopsis

💉 Leishmania

🐭 Stem cells

Save your plots

🤗 Look after future you!

🥳 Finished

Independent study following the workshop

The Code file

References

log₂ transform the normalised counts