Transcriptomics 3: Visualising
6 November, 2024
In these slides we will:
Check where you are
Learn some concepts used in omics visualisation
Find out what packages to install before the workshop
After the Transcriptomics 2: 👋 Statistical Analysis Workshop, including the Independent Study to consolidate, you should have:
Carried out differential expression analysis
Found genes not expressed at all, or expressed in one group only
Saved results files
An RStudio Project called frogs-88H which contains:
xlaevis_counts_S14.csv, xlaevis_counts_S20.csv, xlaevis_counts_S30.csv
s30_filtered.csv, s20_filtered.csv
s30_fgf_only.csv (there were no control only genes in s30), s30_results.csv, and equivalent for S20
cont-fgf-s30.R, cont-fgf-s20.R
An RStudio Project called arab-88H which contains:
arabidopsis-wild.csv, arabidopsis-spl7.csv
wild_filtered.csv, spl7_filtered.csv
wild_suf_only.csv, wild-_def_only.csv, wild_results.csv, and equivalents for spl7
suff-def-wild.R, suff-def-spl7.R
An RStudio Project called leish-88H which contains:
leishmania-mex-ama.csv, leishmania-mex-pro.csv, leishmania-mex-meta.csv
pro_meta_filtered.csv, pro_ama_filtered.csv
pro_meta_results.csv, pro_ama_results.csv
pro_meta.R, pro_ama.R
An RStudio Project called mice-88H which contains:
surfaceome_hspc.csv, surfaceome_prog.csv, surfaceome_lthsc.csv
hspc_prog.csv, hspc_lthsc.csv
hspc_prog_results.csv, hspc-lthsc_results.csv
hspc-prog.R, hspc-lthsc.R
Files should be organised into folders. Code should be well commented and easy to read. You should have curated your code to remove commands that were useful for troubleshooting or for understanding objects in your environment but which are not needed for the final analysis.
If you are missing files, go through:
Remind yourself of the key columns in any of the results files:
FDR or padj: the p-value adjusted for multiple testing
baseMean: the mean of the normalised counts for the gene across all samples
lfcSE: the standard error of the fold change
stat: the test statistic (the Wald statistic)
summary.logFC and logFC.hspc give the same value (in this case, since we are comparing two cell types)
In general, we plot data to help us summarise and understand it
This is especially important for transcriptomics data, where we have a very large number of variables and often a large number of observations
We will look at two plots very commonly used in transcriptomics analysis: the Principal Component Analysis (PCA) plot and the volcano plot
Principal Component Analysis is an unsupervised machine learning technique
Unsupervised methods are unsupervised in that they do not use, or optimise towards, a particular output. The goal is to uncover structure. They do not test hypotheses
It is often used to visualise high dimensional data because it is a dimension reduction technique
Takes a large number of continuous variables (like gene expression) and reduces them to a smaller number of variables (called principal components) that explain most of the variation in the data
The principal components can be plotted to see how samples cluster together
Plotting the samples/cells against a single pair of genes gives us some insight into how they cluster. But we have a lot of genes (even for the stem cells) to consider. How do we know if the pair we use is typical? How can we consider all the genes at once?
We have done PCA after differential expression, but PCA is often one of the first exploratory steps because it gives you an idea of whether to expect general patterns in gene expression that distinguish groups.
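The idea can be sketched in base R using prcomp(). This is a minimal illustration on simulated data, not the workshop analysis: the object names (expr, pca, scores), the group labels and the simulated values are all assumptions made for the example.

```r
# Simulated data for illustration only: 6 samples (rows) x 50 genes (columns)
set.seed(1)
n_genes <- 50
control <- matrix(rnorm(3 * n_genes, mean = 5), nrow = 3)
treated <- matrix(rnorm(3 * n_genes, mean = 5), nrow = 3)
# shift the first 10 genes in the treated group so the groups differ
treated[, 1:10] <- treated[, 1:10] + 3
expr <- rbind(control, treated)
rownames(expr) <- c(paste0("control_", 1:3), paste0("treated_", 1:3))
colnames(expr) <- paste0("gene_", 1:n_genes)

# PCA: prcomp() expects samples in rows and variables (genes) in columns
pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# proportion of the total variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained[1:2], 2)

# scores on the first two components, one row per sample
scores <- data.frame(pca$x[, 1:2],
                     group = rep(c("control", "treated"), each = 3))

# plot the samples on PC1 and PC2, coloured by group
plot(scores$PC1, scores$PC2,
     pch = 19,
     col = ifelse(scores$group == "control", "grey40", "tomato"),
     xlab = "PC1", ylab = "PC2")
```

Each point is a sample, positioned using information from all 50 genes at once rather than a single pair. With real results files the genes are usually in rows, so the data would need transposing before being passed to prcomp().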
Volcano plots are often used to visualise the results of differential expression analysis
They are almost just a scatter plot of the adjusted p-value against the fold change
In fact, we plot the negative log of the adjusted p-value against the log fold change
This is because small probabilities are important and large ones are not. An untransformed axis is counterintuitive because small p-values (i.e., significant values) are at the bottom of the axis
And since p-values range from 1 to very tiny, the important points are all squashed at the bottom of the axis
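The effect of the transformation can be seen with a few made-up values. This sketch assumes a base 10 log, which is the usual convention for volcano plots, and the padj values are hypothetical.

```r
# Hypothetical adjusted p-values, for illustration only
padj <- c(1, 0.5, 0.05, 0.001, 1e-10)

# On the raw scale the smallest (most significant) values are squashed near 0.
# Taking the negative log10 spreads them out and puts them at the top of the axis.
neg_log10_padj <- -log10(padj)

data.frame(padj, neg_log10_padj)
```

A padj of 1 becomes 0, a padj of 0.001 becomes 3 and a padj of 1e-10 becomes 10, so the most significant genes are the highest points on the plot.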
Visualisations should be done on normalised data so meaningful comparisons can be made
The 🐭 stem cell data were already log2 normalised
The other datasets were normalised by the DE method and we saved the values to the results files. We will log transform them in the workshop
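A log transformation of normalised counts might look like the sketch below. The data frame and its values are hypothetical, and adding 1 before taking the log is an assumed pseudo-count to avoid taking the log of zero. The workshop may use a different offset.

```r
# Hypothetical normalised counts: 3 genes (rows) x 2 samples (columns)
norm_counts <- data.frame(sample_1 = c(0, 15, 1023),
                          sample_2 = c(3, 31, 511))

# log2 transform; adding 1 first (an assumed pseudo-count) avoids log2(0) = -Inf
log2_counts <- log2(norm_counts + 1)
log2_counts
```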
The ggrepel package is on the University computers, which you can access on campus or remotely using the VDS
If you want to use your own machine, you will need to install the package.
Install ggrepel from CRAN in the normal way:
install.packages("ggrepel")
This package allows you to label points on a plot without them overlapping.
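As a rough sketch of how this works, geom_text_repel() from ggrepel can be added to a ggplot2 scatter plot in place of geom_text(). The results data frame, its column names (gene, log2fc, padj) and the 0.05 threshold are hypothetical values chosen for illustration.

```r
library(ggplot2)
library(ggrepel)

# Hypothetical results, for illustration only
results <- data.frame(
  gene = paste0("gene_", 1:6),
  log2fc = c(-3.1, -0.4, 0.2, 0.3, 2.8, 3.0),
  padj = c(1e-6, 0.6, 0.8, 0.7, 2e-5, 1e-5)
)

p <- ggplot(results, aes(x = log2fc, y = -log10(padj))) +
  geom_point() +
  # label only the significant genes; geom_text_repel() moves labels
  # away from each other and from the points so they do not overlap
  geom_text_repel(data = subset(results, padj < 0.05),
                  aes(label = gene))
p
```

Labelling only a subset of the points keeps the plot readable when there are thousands of genes.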
Transcriptomics 1: Hello data. Getting to know the data. Checking the distributions of values overall, and across rows and columns, to check things are as we expect and to detect rows/columns that need to be removed
Transcriptomics 2: Statistical Analysis. Identifying which genes are differentially expressed between treatments. This is the main analysis step. We will use different methods for bulk and single cell data.
Transcriptomics 3: Visualising. Principal Component Analysis (PCA) and volcano plots to visualise the results of the