Workshop
Transcriptomics 3: Visualising
Introduction
Session overview
In the workshop, you will learn how to conduct and plot a Principle Component Analysis (PCA) as well as how to create a nicely formatted Volcano plot. You will also save significant genes to file to make it easier to identify genes of interest. and perform Gene Ontology (GO) term enrichment analysis.
Set up
πΈ Frog development
π¬ Open the frogs-88H
RStudio Project and the cont-fgf-s30.R
script.
π Arabidopsis
π¬ Open the arabi-88H
RStudio Project and the wildsuf-wilddef-s30.R
script.
π Leishmania
π¬ Open the leish-88H
RStudio Project and the pro-meta-s30.R
script.
π Stem cells
π¬ Open the mice-88H
RStudio Project and the hspc-prog.R
script.
Everyone
π¬ Make a new folder figures
in the project directory.
This is where we will save our figure files
π¬ Load tidyverse
(Wickham et al. 2019) and conflicted
(Wickham 2023). You most likely have this code at the top of your script already.
ββ Attaching core tidyverse packages βββββββββββββββββββββββββββββββββββββββββββββββ tidyverse 2.0.0 ββ
β dplyr 1.1.3 β readr 2.1.4
β forcats 1.0.0 β stringr 1.5.0
β ggplot2 3.4.3 β tibble 3.2.1
β lubridate 1.9.3 β tidyr 1.3.0
β purrr 1.0.2
ββ Conflicts βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ tidyverse_conflicts() ββ
β dplyr::filter() masks stats::filter()
β dplyr::lag() masks stats::lag()
βΉ Use the conflicted package to force all conflicts to become errors
I recommend you set the dplyr
versions of filter()
and select()
to use by default
π¬ Use the dplyr
version of filter()
by default:
conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::select)
Import
Everyone
π¬ Import your results data. This should be a file in the results
folder called xxxx_results.csv
where xxxx indicates the comparison you made.
π¬ Remind yourself what is in the rows and columns and the structure of the dataframes (perhaps using glimpse()
)
When we do PCA we will want to label the samples with their treatment for figures. For πΈ Frog development, π Arabidopsis and π Leishmania, this labelling information is most easily added using the metadata. You will need to filter for only the samples in the comparison that was made in the results file.
You may need to refer back to the Week 4 Statistical Analysis workshop (in section 2. Create DESeqDataSet object) to remind yourself how to import and filter the metadata you need.
π¬ Import the metadata that maps the sample names to treatments. Remember to select only the samples for comparison that was made.
For π Stem cells, cell types are encoded in the column names. We will do some regular expression magic to extract the cell type from the column names.
log2 transform the normalised counts
We use the normalised counts for data visualisations so that the comparisons are meaningful. Since the fold changes are given is log2 it is useful to log2 transform the normalised counts too. We will add columns to the dataframe with these transformed values. Since we have some counts of 0 we will add a tiny amount to avoid -Inf
values.
log2 transformation would be applied to one column like this:
We are going to use a wonderful bit of R wizardry to apply a transformation to multiple columns. This is the across()
function which has three arguments:
across(.cols, .fns, .names)
where:
-
.cols
is the selection of columns to transform -
.fns
is the function we want to apply to the selected columns -
.names
is the naming convention for the new columns
The general form of the code you need is:
# DO NOT DO
# log2 transform the counts plus a tiny amount to avoid log(0)
xxxx_results <- xxxx_results |>
mutate(across(starts_with("pattern"),
\(x) log2(x + 0.001),
.names = "log2_{.col}"))
where:
-
xxxx_results
is the name of the dataframe of results -
pattern
matches the starting letters for all of the normalised counts so thatstarts_with("pattern")
gives the selection of columns to transform - the bit after the
\(x)
is the function we want to apply to the selected columns - the
\(x)
means it is an βanonymousβ function which means we donβt have to define a function name. -
"log2_{.col}"
means the columns will have the same name (in the.col
) but with the prefixlog2_
added.
You can read more about across()
and anonymous functions from my posit::conf(2024) workshop
πΈ Frog development, π Arabidopsis and π Leishmania
π¬ Design the code to log2 transform the normalised counts using the template given
I recommend viewing the dataframe to see the new columns. Check you have the expected number of columns.
π Stem cells
You do not need to apply this transformation because the data is already log2 transformed.
Everyone
We now all have dataframes with all the information we need: normalised counts, log2 normalised counts, statistical comparisons with fold changes and p-values, and information about the gene.
Write the significant genes to file
Everyone
We will create dataframe of the significant genes and write them to file. This is subset from the results file but will make it a little easier to examine and select genes of interest.
The general form of the code you need is:
# DO NOT DO
# create a dataframe of genes significant at 0.05 level
xxxx_results_sig0.05 <- xxxx_results |>
filter(pvalue <= 0.05)
Note that you determine the significance level using the adjusted p-values rather than the uncorrected p-values. This column is name padj
in DESeq2
output and FDR
in scran
output.
π¬ Create a dataframe of the genes significant at the 0.05 level using filter()
. You will need to know the name of column with the adjusted p-values.
βHow many genes are significant at the 0.05 levels?
If you have a very large number of genes significant at the 0.05 level, you may want to consider a more stringent cut-off such as 0.01.
π¬ Write the dataframe to a csv file. I recommend using the same file name as you used for the dataframe.
Principal Component Analysis (PCA)
We have many genes in our datasets. PCA will allow us to plot our samples in the βgene expressionβ space so we can see if replicates with the same treatment cluster together as we would expect. PCA is a dimension reduction technique that finds the directions of maximum variance in the data. We are doing PCA after our statistical analysis but often it is one of the first techniques used to explore the data. If we do not see treatment effects in a PCA plot then we are not likely to see them in the statistical analysis. PCA can also help you identify any outlier replicates. The PCA is best conducted on the normalised counts or the log2 transformed normalised counts. We will use the log2 transformed normalised counts.
We will carry out the following steps to do the PCA:
- Select the log2 transformed normalised counts. You can only use numerical data in PCA.
- Transpose our data. We have genes in rows and samples in columns (this is common for gene expression data). However, to treatment the genes as variables, PCA expects samples in rows and genes in columns.
- Add the gene names as column names in the transposed data
- Perform the PCA
- Extract the scores on the first two principal components and label the data
- Plot the the first two principal components as a scatter plot
In each case we will use only the samples for the comparison that was made in the PCA. However, it would be informative to do PCA on all the samples you have and that is something you may want to consider in your independent study.
Now go to PCA for:
πΈ Frog development
π¬ Transpose the log2 transformed normalised counts:
s30_log2_trans <- s30_results |>
select(starts_with("log2_")) |>
t() |>
data.frame()
We have used the select()
function to select all the columns that start with log2_
. We then use the t()
function to transpose the dataframe. We then convert the resulting matrix to a dataframe using data.frame()
. If you view that dataframe youβll see it has default column name which we can fix using colnames()
to set the column names to the Xenbase gene ids.
π¬ Set the column names to the Xenbase gene ids:
colnames(s30_log2_trans) <- s30_results$xenbase_gene_id
π¬ Perform PCA on the log2 transformed normalised counts:
pca <- s30_log2_trans |>
prcomp(rank. = 4)
The rank.
argument tells prcomp()
to only calculate the first 4 principal components. This is useful for visualisation as we can only plot in 2 or 3 dimensions. We can see the results of the PCA by viewing the summary()
of the pca
object.
summary(pca)
Importance of first k=4 (out of 6) components:
PC1 PC2 PC3 PC4
Standard deviation 64.0124 47.3351 38.4706 31.4111
Proportion of Variance 0.4243 0.2320 0.1532 0.1022
Cumulative Proportion 0.4243 0.6562 0.8095 0.9116
The Proportion of Variance tells us how much of the variance is explained by each component. We can see that the first component explains 0.4243 of the variance, the second 0.2320, and the third 0.1532. Together the first three components explain nearly 81% of the total variance in the data. Plotting PC1 against PC2 will capture about 66% of the variance which is likely very much better than we would get plotting any two genes against each other. To plot the PC1 against PC2 we will need to extract the PC1 and PC2 βscoresβ from the PCA object and add labels for the samples. Those labels will come from the row names of the transformed data which has the sample ids and from the metadata.
π¬ Create a vector of the sample ids from the row names. These include the log2
prefix which we can removed for labelling:
sample_id <- row.names(s30_log2_trans) |> str_remove("log2_")
You might want to check the result.
Now we will extract the PC1 and PC2 scores from the PCA object and add. Our PCA object is called pca
and the scores are in pca$x. We will create a dataframe of the scores and add the sample ids.
π¬ Create a dataframe of PC1 and PC2 scores and add the sample ids:
pca_labelled <- data.frame(pca$x,
sample_id)
π¬ Merge with the metadata so we can label points by treatment and sibling pair:
pca_labelled <- pca_labelled |>
left_join(meta_s30,
by = "sample_id")
Since the metadata contained the sample ids, it was especially important to remove the log2_
from the row names so that the join would work.
The dataframe should look like this:
PC1 | PC2 | PC3 | PC4 | sample_id | stage | treatment | sibling_rep |
---|---|---|---|---|---|---|---|
-76.38391 | 0.814699 | -60.728327 | -5.820669 | S30_C_1 | stage_30 | control | one |
-67.02571 | 25.668563 | 51.476835 | 28.480254 | S30_C_2 | stage_30 | control | two |
-14.02772 | -78.474054 | 15.282058 | -9.213076 | S30_C_3 | stage_30 | control | three |
47.60726 | 49.035510 | -19.288753 | 20.928290 | S30_F_1 | stage_30 | FGF | one |
26.04954 | 32.914201 | 20.206072 | -55.752818 | S30_F_2 | stage_30 | FGF | two |
83.78054 | -29.958919 | -6.947884 | 21.378020 | S30_F_3 | stage_30 | FGF | three |
The next task is to plot PC2 against PC1 and colour by sibling pair. This is just a scatterplot so we can use geom_point()
. We will use colour to indicate the sibling pair and shape to indicate the treatment.
π¬ Customise the PC2 against PC1 plot:
pca_labelled |>
ggplot(aes(x = PC1, y = PC2,
colour = sibling_rep,
shape = treatment)) +
geom_point(size = 3) +
theme_classic()
There is a good separation between treatments on PCA1. The sibling pairs do not seem to cluster together. You can also try plotting PC3 or PC4.
I prefer to customise the colours and shapes. I especially like the
viridis colour scales which provide colour scales that are perceptually uniform in both colour and black-and-white. They are also designed to be perceived by viewers with common forms of colour blindness. See Introduction to viridis for more information.
ggplot
provides functions to access the viridis scales. Here I use scale_fill_viridis_d()
. The d stands for discrete. The function scale_fill_viridis_c()
would be used for continuous data. Iβve used the default βviridisβ (or βDβ) option (do ?scale_fill_viridis_d for all the options) and used the begin
and end
arguments to control the range of colour - I have set the range to be from 0.15 to 0.95 the avoid the strongest contrast. I have also set the name
argument to provide a label for the legend.
I have used scale_shape_manual()
to set the shapes for the treatments. I have used the values 21 and 19 which are the codes for filled and open circles and filled triangles. I have set the name
argument to NULL
to remove the label (itβs obvious what that categories are treatments) and the labels
argument to improve the legend.
π¬ Plot PC2 against PC1 and colour by sibling pair and shape by treatment:
pca_labelled |>
ggplot(aes(x = PC1, y = PC2,
colour = sibling_rep,
shape = treatment)) +
geom_point(size = 3) +
scale_colour_viridis_d(end = 0.95, begin = 0.15,
name = "Sibling pair") +
scale_shape_manual(values = c(21, 19),
name = NULL,
labels = c("Control", "FGF-Treated")) +
theme_classic()
Now go to Volcano plots for πΈ Frog development
π Arabidopsis
π¬ Transpose the log2 transformed normalised counts:
wild_log2_trans <- wild_results |>
select(starts_with("log2_")) |>
t() |>
data.frame()
We have used the select()
function to select all the columns that start with log2_
. We then use the t()
function to transpose the dataframe. We then convert the resulting matrix to a dataframe using data.frame()
. If you view that dataframe youβll see it has default column name which we can fix using colnames()
to set the column names to the gene ids.
π¬ Set the column names to the gene ids:
colnames(wild_log2_trans) <- wild_results$gene_id
π¬ Perform PCA on the log2 transformed normalised counts:
pca <- wild_log2_trans |>
prcomp(rank. = 4, scale = TRUE)
The scale
argument tells prcomp()
to scale the data before doing the PCA. This is important when the variables are on different scales to stop variables with large values dominating the PCA. The rank.
argument tells prcomp()
to only calculate the first 4 principal components. This is useful for visualisation as we can only plot in 2 or 3 dimensions. We can see the results of the PCA by viewing the summary()
of the pca
object.
summary(pca)
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 123.0171 85.6822 62.2924 2.185e-13
Proportion of Variance 0.5742 0.2786 0.1472 0.000e+00
Cumulative Proportion 0.5742 0.8528 1.0000 1.000e+00
The Proportion of Variance tells us how much of the variance is explained by each component. We can see that the first component explains 0.5742 of the variance, the second 0.2786, and the third 0.1472. Together the first three components explain nearly 100% of the total variance in the data. Plotting PC1 against PC2 will capture about 92% of the variance which is very likely very much better than we would get plotting any two genes against each other. To plot the PC1 against PC2 we will need to extract the PC1 and PC2 βscoresβ from the PCA object and add labels for the samples. Those labels will come from the row names of the transformed data which has the sample ids and from the metadata.
π¬ Create a vector of the sample ids from the row names. These include the log2
prefix which we can removed for labelling:
sample_id <- row.names(wild_log2_trans) |> str_remove("log2_")
You might want to check the result.
Now we will extract the PC1 and PC2 scores from the PCA object and add. Our PCA object is called pca
and the scores are in pca$x. We will create a dataframe of the scores and add the sample ids.
π¬ Create a dataframe of PC1 and PC2 scores and add the sample ids:
pca_labelled <- data.frame(pca$x,
sample_id)
π¬ Merge with the metadata so we can label points by treatment and sibling pair:
pca_labelled <- pca_labelled |>
left_join(meta_wild,
by = "sample_id")
Since the metadata contained the sample ids, it was especially important to remove the log2_
from the row names so that the join would work.
The dataframe should look like this:
PC1 | PC2 | PC3 | PC4 | sample_id | genotype | copper |
---|---|---|---|---|---|---|
-91.23038 | 105.92675 | 25.80774 | 0 | SRX028956_wild_suf | wt | sufficient |
-120.00085 | -95.79115 | -13.72618 | 0 | SRX028957_wild_def | wt | deficient |
92.06667 | 23.00841 | -79.23090 | 0 | SRX028960_wild_suf | wt | sufficient |
119.16456 | -33.14401 | 67.14935 | 0 | SRX028961_wild_def | wt | deficient |
The next task is to plot PC2 against PC1 and colour by copper conditions. This is just a scatterplot so we can use geom_point()
. We will use colour to indicate the copper conditions.
π¬ Plot PC2 against PC1 and colour by copper conditions:
pca_labelled |>
ggplot(aes(x = PC3, y = PC2,
colour = copper)) +
geom_point(size = 3) +
theme_classic()
We do not see particularly good separation between treatments, though perhaps there is some separation between the copper sufficient and copper deficient on PC2. It is also difficult to see if the replicates cluster together (the treatments separate) when there are only two reps. You can also try plotting PC3 or PC4.
I prefer to customise the colours and shapes. I especially like the
viridis colour scales which provide colour scales that are perceptually uniform in both colour and black-and-white. They are also designed to be perceived by viewers with common forms of colour blindness. See Introduction to viridis for more information.
ggplot
provides functions to access the viridis scales. Here I use scale_fill_viridis_d()
. The d stands for discrete. The function scale_fill_viridis_c()
would be used for continuous data. Iβve used the default βviridisβ (or βDβ) option (do ?scale_fill_viridis_d for all the options) and used the begin
and end
arguments to control the range of colour - I have set the range to be from 0.15 to 0.95 the avoid the strongest contrast. I have also set the name
argument to provide a label for the legend.
π¬ Customise the PC2 against PC1 plot:
pca_labelled |>
ggplot(aes(x = PC1, y = PC2,
colour = copper)) +
geom_point(size = 3) +
scale_colour_viridis_d(end = 0.95, begin = 0.15,
name = "Copper") +
theme_classic()
Now go to Volcano plots for π Arabidopsis
π Leishmania
π¬ Transpose the log2 transformed normalised counts:
pro_meta_log2_trans <- pro_meta_results |>
select(starts_with("log2_")) |>
t() |>
data.frame()
We have used the select()
function to select all the columns that start with log2_
. We then use the t()
function to transpose the dataframe. We then convert the resulting matrix to a dataframe using data.frame()
. If you view that dataframe youβll see it has default column name which we can fix using colnames()
to set the column names to the gene ids.
π¬ Set the column names to the gene ids:
colnames(pro_meta_log2_trans) <- pro_meta_results$gene_id
π¬ Perform PCA on the log2 transformed normalised counts:
pca <- pro_meta_log2_trans |>
prcomp(rank. = 4, scale = TRUE)
The scale
argument tells prcomp()
to scale the data before doing the PCA. This is important when the variables are on different scales to stop variables with large values dominating the PCA. The rank.
argument tells prcomp()
to only calculate the first 4 principal components. This is useful for visualisation as we can only plot in 2 or 3 dimensions. We can see the results of the PCA by viewing the summary()
of the pca
object.
summary(pca)
Importance of first k=4 (out of 6) components:
PC1 PC2 PC3 PC4
Standard deviation 66.0744 41.9741 29.4077 27.61377
Proportion of Variance 0.5175 0.2088 0.1025 0.09038
Cumulative Proportion 0.5175 0.7263 0.8288 0.91916
The Proportion of Variance tells us how much of the variance is explained by each component. We can see that the first component explains 0.5175 of the variance, the second 0.2088, and the third 0.1025. Together the first three components explain nearly 92% of the total variance in the data. Plotting PC1 against PC2 will capture about 66% of the variance which is likely very much better than we would get plotting any two genes against each other. To plot the PC1 against PC2 we will need to extract the PC1 and PC2 βscoresβ from the PCA object and add labels for the samples. Those labels will come from the row names of the transformed data which has the sample ids and from the metadata.
π¬ Create a vector of the sample ids from the row names. These include the log2
prefix which we can removed for labelling:
sample_id <- row.names(pro_meta_log2_trans) |> str_remove("log2_")
You might want to check the result.
Now we will extract the PC1 and PC2 scores from the PCA object and add. Our PCA object is called pca
and the scores are in pca$x. We will create a dataframe of the scores and add the sample ids.
π¬ Create a dataframe of PC1 and PC2 scores and add the sample ids:
pca_labelled <- data.frame(pca$x,
sample_id)
π¬ Merge with the metadata so we can label points by life cycle stage:
pca_labelled <- pca_labelled |>
left_join(meta_pro_meta,
by = "sample_id")
Since the metadata contained the sample ids, it was especially important to remove the log2_
from the row names so that the join would work.
The dataframe should look like this:
PC1 | PC2 | PC3 | PC4 | sample_id | stage | replicate |
---|---|---|---|---|---|---|
-59.42107 | -2.922229 | -11.043368 | -13.4641060 | lm_pro_1 | procyclic | 1 |
-63.25818 | -15.985856 | -36.248191 | 0.7025242 | lm_pro_2 | procyclic | 2 |
-57.92065 | 25.679645 | 47.373874 | 13.7893070 | lm_pro_3 | procyclic | 3 |
56.20326 | -59.707787 | 5.958727 | 31.5528028 | lm_meta_1 | metacyclic | 1 |
57.41205 | -11.738157 | 14.157242 | -47.2285609 | lm_meta_2 | metacyclic | 2 |
66.98459 | 64.674383 | -20.198284 | 14.6480329 | lm_meta_3 | metacyclic | 3 |
The next task is to plot PC2 against PC1 and colour by sibling pair. This is just a scatterplot so we can use geom_point()
. We will use colour to indicate the life cycle stage.
π¬ Plot PC2 against PC1 and colour by copper conditions:
pca_labelled |>
ggplot(aes(x = PC1, y = PC2,
colour = stage)) +
geom_point(size = 3) +
theme_classic()
There is a good separation between treatments on PCA1. The replicates do seem to cluster together. You can also try plotting PC3 or PC4.
I prefer to customise the colours. I especially like the
viridis colour scales which provide colour scales that are perceptually uniform in both colour and black-and-white. They are also designed to be perceived by viewers with common forms of colour blindness. See Introduction to viridis for more information.
ggplot
provides functions to access the viridis scales. Here I use scale_fill_viridis_d()
. The d stands for discrete. The function scale_fill_viridis_c()
would be used for continuous data. Iβve used the default βviridisβ (or βDβ) option (do ?scale_fill_viridis_d for all the options) and used the begin
and end
arguments to control the range of colour - I have set the range to be from 0.15 to 0.95 the avoid the strongest contrast. I have also set the name
argument to provide a label for the legend.
π¬ Plot PC2 against PC1 and colour by life stage:
pca_labelled |>
ggplot(aes(x = PC1, y = PC2,
colour = stage)) +
geom_point(size = 3) +
scale_colour_viridis_d(end = 0.95, begin = 0.15,
name = "Stage") +
theme_classic()
Now go to Volcano plots for π Leishmania
π Stem cells
π¬ Transpose the log2 transformed normalised counts:
hspc_prog_trans <- hspc_prog_results |>
dplyr::select(starts_with(c("HSPC_", "Prog_"))) |>
t() |>
data.frame()
We have used the select()
function to select all the columns that start with Prog_
or HSPC_
. We then use the t()
function to transpose the dataframe. We then convert the resulting matrix to a dataframe using data.frame()
. If you view that dataframe youβll see it has default column name which we can fix using colnames()
to set the column names to the gene ids.
π¬ Set the column names to the gene ids:
colnames(hspc_prog_trans) <- hspc_prog_results$ensembl_gene_id
π¬ Perform PCA on the log2 transformed normalised counts:
pca <- hspc_prog_trans |>
prcomp(rank. = 8)
The rank.
argument tells prcomp()
to only calculate the first 8 principal components. This is useful for visualisation as we can only plot in 2 or 3 dimensions. We can see the results of the PCA by viewing the summary()
of the pca
object.
summary(pca)
Importance of first k=8 (out of 280) components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 12.5612 8.36646 5.98988 5.41386 4.55730 4.06142 3.84444
Proportion of Variance 0.1099 0.04874 0.02498 0.02041 0.01446 0.01149 0.01029
Cumulative Proportion 0.1099 0.15861 0.18359 0.20400 0.21846 0.22995 0.24024
PC8
Standard deviation 3.70848
Proportion of Variance 0.00958
Cumulative Proportion 0.24982
The Proportion of Variance tells us how much of the variance is explained by each component. We can see that the first component explains 0.1099 of the variance, the second 0.04874, and the third 0.02498. Together the first three components explain 18% of the total variance in the data. This is not that high but it is also quite a lot better than than we would get plotting any two genes randomly chosen against each other.
To plot the PC1 against PC2 we will need to extract the PC1 and PC2 βscoresβ from the PCA object and add labels for the cells. Our PCA object is called pca
and the scores are in pca$x. The cells labels will come from the row names of the transformed data.
π¬ Create a dataframe of the PC1 and PC2 scores (in pca$x
) and add the cell ids:
pca_labelled <- data.frame(pca$x,
cell_id = row.names(hspc_prog_trans))
It will be helpful to add a column for the cell type so we can label points. One way to do this is to extract the information in the cell_id column into two columns: one with the complete cell id and one with just the cell type. This extraction is done with Regular Expression magic.
π¬ Extract the cell type and cell number from the cell_id
column (keeping the cell_id
column):
What this code does is take what is in the cell_id
column (something like Prog_001
or HSPC_001
) and split it into two columns (βcell_typeβ and βcell_numberβ). The reason why we want to do that is to colour the points by cell type. We would not want to use cell_id
to colour the points because each cell id is unique and that would be 1000+ colours. The last argument in the extract()
function is the pattern to match described with a regular expression. Three patterns are being matched, and two of those are in brackets meaning they are kept to fill the two new columns.
The first pattern is ([a-zA-Z]{4})
- it is brackets because we want to keep it and put it in
cell_type
-
[a-zA-Z]
means any lower (a-z
) or upper case letter (A-Z
). - The square brackets means any of the characters in the square brackets will be matched
-
{4}
means 4 of them.
So the first pattern inside the first (β¦) will match exactly 4 upper or lower case letters (like Prog or HSPC)
The second pattern is _
to match the underscore in every cell id that separates the cell type from the number. It is not in brackets because we do not want to keep it.
The third pattern is ([0-9]{3})
-
[0-9]
means any number -
{3}
means 3 of them.
So the second pattern inside the second (β¦) will match exactly 3 numbers (like 001 or 851).
Important: Prog and HPSC have 4 letters. The column names, LT.HSPC_ have 6 characters and includes a dot. You will need to adjust the regex when make comparison between LT-HSC and other cell types. The pattern to match the LT.HSC
as well as the Prog
and HSPC
is ([a-zA-Z.]{4, 6})
. Note the dot inside the square brackets and numbers meaning 4 or 6 of. The pattern to match the underscore and the cell number is the same.
The top of the dataframe should look like this (but with more decimal places)
PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | cell_id | cell_type | cell_number | |
---|---|---|---|---|---|---|---|---|---|---|---|
HSPC_001 | -2.62 | -15.39 | -7.94 | -4.87 | -2.52 | 4.29 | -1.69 | 4.45 | HSPC_001 | HSPC | 001 |
HSPC_002 | 0.70 | -15.22 | 0.13 | 3.74 | -6.47 | -1.70 | 0.83 | -3.35 | HSPC_002 | HSPC | 002 |
HSPC_003 | -15.52 | 7.80 | 0.53 | -4.04 | 0.10 | -1.48 | -0.32 | -7.40 | HSPC_003 | HSPC | 003 |
HSPC_004 | -4.11 | -7.04 | 0.45 | 10.58 | -3.09 | -8.30 | -6.01 | 2.27 | HSPC_004 | HSPC | 004 |
HSPC_006 | -7.89 | 5.15 | -2.95 | 12.45 | 4.77 | 4.63 | 5.46 | 1.97 | HSPC_006 | HSPC | 006 |
HSPC_008 | -10.29 | -8.06 | -8.33 | -2.69 | 1.86 | -2.42 | 0.03 | -2.01 | HSPC_008 | HSPC | 008 |
The next task is to plot PC2 against PC1 and colour by cell type. This is just a scatterplot so we can use geom_point()
. We will use colour to indicate the cell type.
π¬ Plot PC2 against PC1 and colour by cell type:
pca_labelled |>
ggplot(aes(x = PC1, y = PC2,
colour = cell_type)) +
geom_point(alpha = 0.4) +
theme_classic()
There is a good clustering of cell types but plenty of overlap. You can also try plotting PC3 or PC4 (or others).
I prefer to customise the colours. I especially like the viridis colour scales which provide colour scales that are perceptually uniform in both colour and black-and-white. They are also designed to be perceived by viewers with common forms of colour blindness. See Introduction to viridis for more information.
ggplot
provides functions to access the viridis scales. Here I use scale_fill_viridis_d()
. The d stands for discrete used because cell type is a discrete variable. The function scale_fill_viridis_c()
would be used for continuous data. Iβve used the default βviridisβ (or βDβ) option (do ?scale_fill_viridis_d for all the options) and used the begin
and end
arguments to control the range of colour - I have set the range to be from 0.15 to 0.95 the avoid the strongest contrast. I have also set the name
argument to NULL because that the legend refers to cell types is obvious.
π¬ Plot PC2 against PC1 and colour by cell type:
pca_labelled |>
ggplot(aes(x = PC1, y = PC2,
colour = cell_type)) +
geom_point(alpha = 0.4) +
scale_colour_viridis_d(end = 0.95, begin = 0.15,
name = NULL) +
theme_classic()
Now go to Volcano plots for π Stem cells
Visualise all the results with a volcano plot
A volcano plot is a scatterplot that shows statistical significance (p-value) versus magnitude of change (fold change). As the independent study explained, we plot -log10(adjusted p-value) on the y-axis so that the most significant genes are uppermost.
πΈ Frog development
We will add a column to the results dataframe that contains the -log10(padj). You could perform this transformation within the plot command without adding a column to the data if you prefer.
π¬ Add a column to the results dataframe that contains the -log10(padj):
π¬ Create a volcano plot of the results:
s30_results |>
ggplot(aes(x = log2FoldChange,
y = log10_padj)) +
geom_point() +
geom_hline(yintercept = -log10(0.05),
linetype = "dashed") +
geom_vline(xintercept = 2,
linetype = "dashed") +
geom_vline(xintercept = -2,
linetype = "dashed") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
theme_classic() +
theme(legend.position = "none")
Our dashed lines are at -log10(0.05) and log2(2) and log2(-2) to make more clear which genes (points) are significantly different between the control and the FGF-treated samples and have a fold change of at least 2.
In most cases, people colour the points to show that the quadrants. I like to add columns to the dataframe to indicate if the gene is significant and if the fold change is large and use those variables in the plot.
π¬ Add columns to the results dataframe to indicate if the gene is significant and if the fold change is large:
The use of abs()
(absolute) means genes with a fold change of at least 2 in either direction will be considered to have a large fold change.
Now we can colour the points by these new columns. I use interaction()
to create four categories:
- not significant and not large fold change (FF)
- significant and not large fold change (TF)
- not significant and large fold (FT)
- significant and large fold change (TT)
And I use scale_colour_manual()
to set the colours for these categories.
π¬ Create a volcano plot of the results with the points coloured by significance and fold change:
s30_results |>
ggplot(aes(x = log2FoldChange,
y = log10_padj,
colour = interaction(sig, bigfc))) +
geom_point() +
geom_hline(yintercept = -log10(0.05),
linetype = "dashed") +
geom_vline(xintercept = 2,
linetype = "dashed") +
geom_vline(xintercept = -2,
linetype = "dashed") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
scale_colour_manual(values = c("gray",
"pink",
"gray30",
"deeppink")) +
theme_classic() +
theme(legend.position = "none")
For exploring the data, I like add labels to all the significant genes with a large fold change so I can very quickly identity them. The ggrepel
package has a function geom_text_repel()
that is useful for adding labels so that they donβt overlap.
π¬ Load the package:
π¬ Add labels to the significant genes with a large fold change:
s30_results |>
ggplot(aes(x = log2FoldChange,
y = log10_padj,
colour = interaction(sig, bigfc))) +
geom_point() +
geom_hline(yintercept = -log10(0.05),
linetype = "dashed") +
geom_vline(xintercept = 2,
linetype = "dashed") +
geom_vline(xintercept = -2,
linetype = "dashed") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
scale_colour_manual(values = c("gray",
"pink",
"gray30",
"deeppink")) +
geom_text_repel(data = s30_results |>
filter(bigfc == TRUE, sig == TRUE),
aes(label = xenbase_gene_symbol),
size = 3,
max.overlaps = 50) +
theme_classic() +
theme(legend.position = "none")
Notice that I have used filter()
label only the genes that are both significant and have a large fold change. In systems you are familiar with, this labelling is very informative and can help you quickly identify common themes. Key to interpreting the volcano plot is to remember that positive fold changes means the gene is up-regulated in the FGF-treated samples and negative fold changes means the gene is down-regulated (i.e., higher in the control). This was determined by the order of the treatments in the contrast used in the DESeq2 analysis
If you do forget which way round you did the comparison, you can always examine the results dataframe to see which of the treatments seem to be higher for the positive fold changes.
Please note that Betsy doesnβt like graphs like this in the report!
When you have a gene of interest, you may wish to label it, and only it, on the plot.
This is done in the same way except that you filter the data to only include the gene of interest. I have used and then use geom_label_repel()
rather than geom_text_repel()
to put the label in a box and nudged itβs position to get a line connecting the point and the label. I have also increased the size of the point.
π¬ Add a label to one gene of interest (hoxb9.S) and :
s30_results |>
ggplot(aes(x = log2FoldChange,
y = log10_padj,
colour = interaction(sig, bigfc))) +
geom_point() +
geom_hline(yintercept = -log10(0.05),
linetype = "dashed") +
geom_vline(xintercept = 2,
linetype = "dashed") +
geom_vline(xintercept = -2,
linetype = "dashed") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
scale_colour_manual(values = c("gray",
"pink",
"gray30",
"deeppink")) +
geom_label_repel(data = s30_results |>
filter(xenbase_gene_symbol == "hoxb9.S"),
aes(label = xenbase_gene_symbol),
size = 4,
nudge_x = .5,
nudge_y = 1.5) +
geom_point(data = s30_results |>
filter(xenbase_gene_symbol == "hoxb9.S"),
size = 3) +
theme_classic() +
theme(legend.position = "none")
Should you want to label more than one gene, you will need to use (for example): filter(xenbase_gene_symbol %in% c("hoxb9.S", "fzd7.S"))
Now go to Save your plots
π Arabidopsis
We will add a column to the results dataframe that contains the -log10(padj). You could perform this transformation within the plot command without adding a column to the data if you prefer.
π¬ Add a column to the results dataframe that contains the -log10(padj):
π¬ Create a volcano plot of the results:
wild_results |>
ggplot(aes(x = log2FoldChange,
y = log10_padj)) +
geom_point() +
geom_hline(yintercept = -log10(0.05),
linetype = "dashed") +
geom_vline(xintercept = 2,
linetype = "dashed") +
geom_vline(xintercept = -2,
linetype = "dashed") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
theme_classic() +
theme(legend.position = "none")
Our dashed lines are at -log10(0.05) and log2(2) and log2(-2) to make more clear which genes (points) are significantly different between the copper sufficient and deficient conditions samples and have a fold change of at least 2.
In most cases, people colour the points to show that the quadrants. I like to add columns to the dataframe to indicate if the gene is significant and if the fold change is large and use those variables in the plot.
π¬ Add columns to the results dataframe to indicate if the gene is significant and if the fold change is large:
The use of abs()
(absolute) means genes with a fold change of at least 2 in either direction will be considered to have a large fold change.
Now we can colour the points by these new columns. I use interaction()
to create four categories:
- not significant and not large fold change (FF)
- significant and not large fold change (TF)
- not significant and large fold (FT)
- significant and large fold change (TT)
And I use scale_colour_manual()
to set the colours for these categories.
NOTE: there are no βsignificant and not large fold change (TF)β in this case.This means we do not need four colours. I have put the βpinkβ colour in the code but commented it out.
π¬ Create a volcano plot of the results with the points coloured by significance and fold change:
wild_results |>
ggplot(aes(x = log2FoldChange,
y = log10_padj,
colour = interaction(sig, bigfc))) +
geom_point() +
geom_hline(yintercept = -log10(0.05),
linetype = "dashed") +
geom_vline(xintercept = 2,
linetype = "dashed") +
geom_vline(xintercept = -2,
linetype = "dashed") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
scale_colour_manual(values = c("gray",
# "pink",
"gray30",
"deeppink")) +
theme_classic() +
theme(legend.position = "none")
For exploring the data, I like add labels to all the significant genes with a large fold change so I can very quickly identity them. The ggrepel
package has a function geom_text_repel()
that is useful for adding labels so that they donβt overlap.
π¬ Load the package:
π¬ Add labels to the significant genes with a large fold change:
wild_results |>
ggplot(aes(x = log2FoldChange,
y = log10_padj,
colour = interaction(sig, bigfc))) +
geom_point() +
geom_hline(yintercept = -log10(0.05),
linetype = "dashed") +
geom_vline(xintercept = 2,
linetype = "dashed") +
geom_vline(xintercept = -2,
linetype = "dashed") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
scale_colour_manual(values = c("gray",
# "pink",
"gray30",
"deeppink")) +
geom_text_repel(data = wild_results |>
filter(bigfc == TRUE, sig == TRUE),
aes(label = external_gene_name),
size = 3,
max.overlaps = 50) +
theme_classic() +
theme(legend.position = "none")
Notice that I have used filter()
label only the genes that are both significant and have a large fold change. In systems you are familiar with, this labelling is very informative and can help you quickly identify common themes. Key to interpreting the volcano plot is to remember that positive fold changes means the gene is up-regulated in the sufficient conditions and negative fold changes means the gene is down-regulated (i.e., higher in the deficient). This was determined by the order of the treatments in the contrast used in the DESeq2 analysis
If you do forget which way round you did the comparison, you can always examine the results dataframe to see which of the treatments seem to be higher for the positive fold changes.
When you have a gene of interest, you may wish to label it, and only it, on the plot. This is done in the same way except that you filter the data to only include the gene of interest. I have used and then use geom_label_repel()
rather than geom_text_repel()
to put the label in a box and nudged itβs position to get a line connecting the point and the label. I have also increased the size of the point.
π¬ Add a label to one gene of interest (hoxb9.S) and :
wild_results |>
ggplot(aes(x = log2FoldChange,
y = log10_padj,
colour = interaction(sig, bigfc))) +
geom_point() +
geom_hline(yintercept = -log10(0.05),
linetype = "dashed") +
geom_vline(xintercept = 2,
linetype = "dashed") +
geom_vline(xintercept = -2,
linetype = "dashed") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
scale_colour_manual(values = c("gray",
# "pink",
"gray30",
"deeppink")) +
geom_label_repel(data = wild_results |>
filter(external_gene_name == "FRO4"),
aes(label = external_gene_name),
size = 4,
nudge_x = .5,
nudge_y = 1.5) +
geom_point(data = wild_results |>
filter(external_gene_name == "FRO4"),
size = 3) +
theme_classic() +
theme(legend.position = "none")
Should you want to label more than one gene, you will need to use (for example): filter(external_gene_name %in% c("FRO4", "FRO5"))
Now go to Save your plots
π Leishmania
We will add a column to the results dataframe that contains the -log10(padj). You could perform this transformation within the plot command without adding a column to the data if you prefer.
π¬ Add a column to the results dataframe that contains the -log10(padj):
π¬ Create a volcano plot of the results:
pro_meta_results |>
ggplot(aes(x = log2FoldChange,
y = log10_padj)) +
geom_point() +
geom_hline(yintercept = -log10(0.05),
linetype = "dashed") +
geom_vline(xintercept = 2,
linetype = "dashed") +
geom_vline(xintercept = -2,
linetype = "dashed") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
theme_classic() +
theme(legend.position = "none")
Our dashed lines are at -log10(0.05) and log2(2) and log2(-2) to make more clear which genes (points) are significantly different between the life stages and have a fold change of at least 2. Whilst we have very many genes that are significant, we have fewer that are significant and have a large fold change.
In most cases, people colour the points to show that the quadrants. I like to add columns to the dataframe to indicate if the gene is significant and if the fold change is large and use those variables in the plot.
π¬ Add columns to the results dataframe to indicate if the gene is significant and if the fold change is large:
The use of abs()
(absolute) means genes with a fold change of at least 2 in either direction will be considered to have a large fold change.
Now we can colour the points by these new columns. I use interaction()
to create four categories:
- not significant and not large fold change (FF)
- significant and not large fold change (TF)
- not significant and large fold (FT)
- significant and large fold change (TT)
And I use scale_colour_manual()
to set the colours for these categories.
π¬ Create a volcano plot of the results with the points coloured by significance and fold change:
pro_meta_results |>
ggplot(aes(x = log2FoldChange,
y = log10_padj,
colour = interaction(sig, bigfc))) +
geom_point() +
geom_hline(yintercept = -log10(0.05),
linetype = "dashed") +
geom_vline(xintercept = 2,
linetype = "dashed") +
geom_vline(xintercept = -2,
linetype = "dashed") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
scale_colour_manual(values = c("gray",
"pink",
"gray30",
"deeppink")) +
theme_classic() +
theme(legend.position = "none")
For exploring the data, I like add labels to all the significant genes with a large fold change so I can very quickly identity them. The ggrepel
package has a function geom_text_repel()
that is useful for adding labels so that they donβt overlap.
π¬ Load the package:
π¬ Add labels to the significant genes with a large fold change:
pro_meta_results |>
ggplot(aes(x = log2FoldChange,
y = log10_padj,
colour = interaction(sig, bigfc))) +
geom_point() +
geom_hline(yintercept = -log10(0.05),
linetype = "dashed") +
geom_vline(xintercept = 2,
linetype = "dashed") +
geom_vline(xintercept = -2,
linetype = "dashed") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
scale_colour_manual(values = c("gray",
"pink",
"gray30",
"deeppink")) +
geom_text_repel(data = pro_meta_results |>
filter(bigfc == TRUE, sig == TRUE),
aes(label = description),
size = 3,
max.overlaps = 50) +
theme_classic() +
theme(legend.position = "none")
Notice that I have used filter()
label only the genes that are both significant and have a large fold change. In systems you are familiar with, this labelling is very informative and can help you quickly identify common themes. However, in this case, we do not have good annotation for the genes. We can label only with the gene id or the description. Many of the descriptions are
π¬ Add labels to the significant genes with a large fold change only where description doesnβt contain βhypotheticalβ or βunspecified :
pro_meta_results |>
ggplot(aes(x = log2FoldChange,
y = log10_padj,
colour = interaction(sig, bigfc))) +
geom_point() +
geom_hline(yintercept = -log10(0.05),
linetype = "dashed") +
geom_vline(xintercept = 2,
linetype = "dashed") +
geom_vline(xintercept = -2,
linetype = "dashed") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
scale_colour_manual(values = c("gray",
"pink",
"gray30",
"deeppink")) +
geom_text_repel(data = pro_meta_results |>
filter(bigfc == TRUE, sig == TRUE,
!str_detect(description, "hypothetical|unspecified")),
aes(label = description),
size = 3,
max.overlaps = 50) +
theme_classic() +
theme(legend.position = "none")
Key to interpreting the volcano plot is to remember that positive fold changes means the gene is up-regulated in the procyclic stage and negative fold changes means the gene is down-regulated (i.e., higher in the metacyclic). This was determined by the order of the treatments in the contrast used in the DESeq2 analysis
If you do forget which way round you did the comparison, you can always examine the results dataframe to see which of the treatments seem to be higher for the positive fold changes.
When you have a gene of interest, you may wish to label it, and only it, on the plot. This is done in the same way except that you filter the data to only include the gene of interest. I have used and then use geom_label_repel()
rather than geom_text_repel()
to put the label in a box and nudged itβs position to get a line connecting the point and the label. I have also increased the size of the point.
π¬ Add a label to one gene of interest (elongation factor 1-alpha) and :
pro_meta_results |>
ggplot(aes(x = log2FoldChange,
y = log10_padj,
colour = interaction(sig, bigfc))) +
geom_point() +
geom_hline(yintercept = -log10(0.05),
linetype = "dashed") +
geom_vline(xintercept = 2,
linetype = "dashed") +
geom_vline(xintercept = -2,
linetype = "dashed") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
scale_colour_manual(values = c("gray",
"pink",
"gray30",
"deeppink")) +
geom_label_repel(data = pro_meta_results |>
filter(description == "elongation factor 1-alpha"),
aes(label = description),
size = 4,
nudge_x = .5,
nudge_y = 1.5) +
geom_point(data = pro_meta_results |>
filter(description == "elongation factor 1-alpha"),
size = 3) +
theme_classic() +
theme(legend.position = "none")
Should you want to label more than one gene, you will need to use (for example): filter(description %in% c("elongation factor 1-alpha", "ADP/ATP translocase 1 putative"))
Now go to Save your plots
π Stem cells
We will add a column to the results dataframe that contains the -log10(FDR). You could perform this transformation within the plot command without adding a column to the data if you prefer.
π¬ Add a column to the results dataframe that contains the -log10(FDR):
π¬ Create a volcano plot of the results:
hspc_prog_results |>
ggplot(aes(x = summary.logFC,
y = log10_FDR)) +
geom_point() +
geom_hline(yintercept = -log10(0.05),
linetype = "dashed") +
geom_vline(xintercept = 2,
linetype = "dashed") +
geom_vline(xintercept = -2,
linetype = "dashed") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
theme_classic() +
theme(legend.position = "none")
Our dashed lines are at -log10(0.05) and log2(2) and log2(-2) to make more clear which genes (points) are significantly different between the cell types and have a fold change of at least 2.
In most cases, people colour the points to show that the quadrants. I like to add columns to the dataframe to indicate if the gene is significant and if the fold change is large and use those variables in the plot.
π¬ Add columns to the results dataframe to indicate if the gene is significant and if the fold change is large:
The use of abs()
(absolute) means genes with a fold change of at least 2 in either direction will be considered to have a large fold change.
Now we can colour the points by these new columns. I use interaction()
to create four categories:
- not significant and not large fold change (FF)
- significant and not large fold change (TF)
- not significant and large fold (FT)
- significant and large fold change (TT)
And I use scale_colour_manual()
to set the colours for these categories.
NOTE: there are no βnot significant and large fold change (FT)β in this case. This means we do not need four colours. I have put the βgray30β colour in the code but commented it out.
π¬ Create a volcano plot of the results with the points coloured by significance and fold change:
hspc_prog_results |>
ggplot(aes(x = summary.logFC,
y = log10_FDR,
colour = interaction(sig, bigfc))) +
geom_point() +
geom_hline(yintercept = -log10(0.05),
linetype = "dashed") +
geom_vline(xintercept = 2,
linetype = "dashed") +
geom_vline(xintercept = -2,
linetype = "dashed") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
scale_colour_manual(values = c("gray",
"pink",
# "gray30",
"deeppink")) +
theme_classic() +
theme(legend.position = "none")
For exploring the data, I like add labels to all the significant genes with a large fold change so I can very quickly identity them. The ggrepel
package has a function geom_text_repel()
that is useful for adding labels so that they donβt overlap.
π¬ Load the package:
π¬ Add labels to the significant genes with a large fold change:
hspc_prog_results |>
ggplot(aes(x = summary.logFC,
y = log10_FDR,
colour = interaction(sig, bigfc))) +
geom_point() +
geom_hline(yintercept = -log10(0.05),
linetype = "dashed") +
geom_vline(xintercept = 2,
linetype = "dashed") +
geom_vline(xintercept = -2,
linetype = "dashed") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
scale_colour_manual(values = c("gray",
"pink",
# "gray30",
"deeppink")) +
geom_text_repel(data = hspc_prog_results |>
filter(bigfc == TRUE, sig == TRUE),
aes(label = external_gene_name),
size = 3,
max.overlaps = 50) +
theme_classic() +
theme(legend.position = "none")
Notice that I have used filter()
label only the genes that are both significant and have a large fold change. In systems you are familiar with, this labelling is very informative and can help you quickly identify common themes. Key to interpreting the volcano plot is to remember that positive fold changes means the gene is up-regulated in the Prog and negative fold changes means the gene is down-regulated (i.e., higher HSPC). This was determined by us choosing the results_hspc_prog$prog dataframe from the list object
If you do forget which way round you did the comparison, you can always examine the gene summary dataframe to see which of the treatments seem to be higher for the positive fold changes.
When you have a gene of interest, you may wish to label it on the plot. This is done in the same way except that you filter the data to only include the gene of interest. I have used and then use geom_label_repel()
rather than geom_text_repel()
to put the label in a box and nudged itβs position to get a line connecting the point and the label. I have also increased the size of the point.
π¬ Add a label to one gene of interest (Procr) and :
hspc_prog_results |>
ggplot(aes(x = summary.logFC,
y = log10_FDR,
colour = interaction(sig, bigfc))) +
geom_point() +
geom_hline(yintercept = -log10(0.05),
linetype = "dashed") +
geom_vline(xintercept = 2,
linetype = "dashed") +
geom_vline(xintercept = -2,
linetype = "dashed") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
scale_colour_manual(values = c("gray",
"pink",
# "gray30",
"deeppink")) +
geom_label_repel(data = hspc_prog_results |>
filter(external_gene_name == "Procr"),
aes(label = external_gene_name),
size = 4,
nudge_x = .5,
nudge_y = 1.5) +
geom_point(data = hspc_prog_results |>
filter(external_gene_name == "Procr"),
size = 3) +
theme_classic() +
theme(legend.position = "none")
Should you want to label more than one gene, you will need to use (for example): filter(external_gene_name %in% c("Procr", "Emb"))
Now go to Save your plots
Save your plots
Once you have finished designing your plots, you should save them using ggsave()
. You will want to assign the plot to a variable and use code similar to this:
ggsave("figures/my-informative-file-name.png",
plot = vol,
height = 4.5,
width = 4.5,
units = "in",
device = "png")
I recommend saving your figure in the approximate size it will appear in your report, that is, donβt resize it in word/google docs. This because the font will then be an appropriate size for the report. For journals, we often have to creat figure files of very high resolution. You can do this by increasing the height
and width
and then resizing the figure in the document. however, this requires resizing the all font and lines in the figure. I also recommend saving it as a png file as this is a lossless format.
π€ Look after future you!
π¬ Go through your script and tidy up. I would suggest restarting R and trying to run the full pipeline from start to finish. You might need to:
- collect together library statements at the top of the script
- remove code that you needed to start today but which wouldnβt be needed running the script from the top (e.g., importing the results)
- edit your comments for clarity
- rename variables for consistency or clarity
- remove house keeping or exploratory code
- restyle code, add code section headers etc
π₯³ Finished
Well Done!
Independent study following the workshop
The Code file
These contain all the code needed in the workshop even where it is not visible on the webpage.
The workshop.qmd file is the file I use to compile the practical. Qmd stands for Quarto markdown. It allows code and ordinary text to be interleaved to produce well-formatted reports including webpages. Right-click on the link and choose Save-As to download. You will be able to open the Qmd file in RStudio. Alternatively, View in Browser. Coding and thinking answers are marked with #---CODING ANSWER---
and #---THINKING ANSWER---
Pages made with R (R Core Team 2024), Quarto (Allaire et al. 2024), knitr
(Xie 2024, 2015, 2014), kableExtra
(Zhu 2021)