class: center, middle, inverse, title-slide # An introduction to Machine Learning: Unsupervised methods ## Data Science option of BIO00058M Data Analysis. ### Emma Rand ### University of York, UK --- <style> div.blue { background-color:#b0cdef; border-radius: 5px; padding: 20px;} div.grey { background-color:#d3d3d3; border-radius: 0px; padding: 0px;} </style> # Outline This slide show covers two commonly used unsupervised methods: Principal Components Analysis (PCA) and *t*-Distributed Stochastic Neighbour Embedding (*t*-SNE, pronounced tea-snee or tisne). You'll also get a little more practice with data filtering and tidying and meet the **`GGally`** package and some penguins. 🐧 🐧 -- PCA and *t*-SNE are often used for 'dimension reduction' so that high dimensional data can be visualised more easily and clusters of observations can be seen. --- # Outline: PCA PCA is used on **continuous variables**. It creates a new, smaller, set of variables - known as principal components - from linear combinations of the original variables. -- The first principal component, PC1, is chosen to capture the maximum variance in the data. The PC2 is orthogonal (uncorrelated) to PC1 and captures the maximum amount of the remaining variance. And so on. -- PCA is a fast, linear, parametric method. It is very widely used and often the first method applied. It works less well when when there are polynomial relationships between variables. -- Linear dimension reduction methods concentrate on putting dissimilar observations far apart. --- # Outline: *t*-SNE *t*-SNE is a non-parametric, non-linear method that prioritises placing similar observations near each other. -- It is a probabilistic method and is computationally intensive. -- It is fairly common to use PCA to reduce the number of dimensions and then carry out *t*-SNE on the top principal components. --- # Outline In these slides show you will apply: * PCA to some penguin data which has relatively few variables. This will make it a bit easier to understand what PCA does. -- * PCA and *t*-SNE (separately) to some single-cell RNAseq data --- # Outline You should be able to code along with the examples. When you see the film clapper it is 🎬 an instruction to do something!! I suggest having a different RStudio Project for each dataset we use. Note that you can have multiple instances of RStudio running to allow you to work on more than one RStudio Project. -- Create directory structure for each RStudio Project, write your analysis in R Markdown with named chunks which are well organised. -- Load the tidyverse for each RStudio Project. --- class: inverse # Principal Components Analysis --- background-image: url(../pics/lter_penguins.png) background-position: 50% 70% background-size: 500px # Principal Components Analysis We will first use PCA on a dataset with relatively few dimensions. The data are available in the **`palmerpenguins`** (Horst Hill, et al., 2020) package which is based on data in (Gorman Williams, et al., 2014) --- # About Palmer penguins The Palmer penguins data contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica. You can read more about this dataset in the [Introduction to palmerpenguins](https://allisonhorst.github.io/palmerpenguins/articles/intro.html) -- 🎬 Load the package to get the data: ```r library(palmerpenguins) ``` -- You will not yet see the dataframes in your workspace --- # Tidy Penguins We will use the `penguins_raw` dataframe. You can read the full set of variables [here](https://allisonhorst.github.io/palmerpenguins/reference/penguins_raw.html) -- 🎬 Clean the variable names: ```r penguin <- penguins_raw %>% janitor::clean_names() ``` -- I cleaned the variable names so I didn't have to use backticks around variable names with spaces or special characters. -- Do `str(penguins_raw)` if you'd like to see the original names. --- # Tidy Penguins 🎬 Take a look at the four variables we will use: .code70[ ```r penguin %>% select(body_mass_g, ends_with("_mm")) %>% summary() ``` ``` ## body_mass_g culmen_length_mm culmen_depth_mm flipper_length_mm ## Min. :2700 Min. :32.10 Min. :13.10 Min. :172.0 ## 1st Qu.:3550 1st Qu.:39.23 1st Qu.:15.60 1st Qu.:190.0 ## Median :4050 Median :44.45 Median :17.30 Median :197.0 ## Mean :4202 Mean :43.92 Mean :17.15 Mean :200.9 ## 3rd Qu.:4750 3rd Qu.:48.50 3rd Qu.:18.70 3rd Qu.:213.0 ## Max. :6300 Max. :59.60 Max. :21.50 Max. :231.0 ## NA's :2 NA's :2 NA's :2 NA's :2 ``` ] -- Notice we have selected `body_mass_g` three other variables ending in `_mm` with the helper function `ends_with()`. --- # Tidy Penguins We have rows with missing values which we will exclude from our analysis. 🎬 Filter out the rows with missing values: ```r penguin <- penguin %>% filter(!is.na(body_mass_g)) %>% filter(!is.na(sex)) ``` -- Dealing with missing data is a much discussed are in ML. Several data imputation methods exist --- # Tidy Penguins Our species variable contains the common name followed by the scientific name in brackets. This is quite long, and will make legends a bit big. I will split the column into two. 🎬 Split `species` into `common_name` and `scientific_name`: .code80[ ```r penguin <- penguin %>% extract(species, c("common_name", "scientific_name"), "([a-zA-Z]+\\s[a-zA-Z]+)\\s\\(([a-zA-Z]+\\s[a-zA-Z]+)\\)") ``` ] --- # Tidy Penguins `"([a-zA-Z]+\\s[a-zA-Z]+)\\s\\(([a-zA-Z]+\\s[a-zA-Z]+)\\)"` is made of 2 patterns to be saved and two patterns we don't want to save. -- The two patterns we want to save are in brackets: `([a-zA-Z]+\\s[a-zA-Z]+)` `([a-zA-Z]+\\s[a-zA-Z]+)` The two patterns are the same and comprised of: -- - `+` one or more of - `[a-zA-Z]` or lower or uppercase letters - `\s` a space - The second `\` 'escapes' the first - meaning it is recognised to be part of the whitespace pattern. - `[a-zA-Z]` or lower or uppercase letters --- # Tidy Penguins `"([a-zA-Z]+\\s[a-zA-Z]+)\\s\\(([a-zA-Z]+\\s[a-zA-Z]+)\\)"` is made of 2 patterns to be saved and two patterns we don't want to save. The two patterns we don't want to save are not in brackets and comprised of: - `\\s\\(` a whitespace followed by an opening bracket (between the patterns) - `\\)` a closing bracket (at the end) --- # Explore Penguins **`GGally`** is a package which is very use for quickly getting an overview of a dataset. It has a function `ggpairs()` which outputs a matrix of pairwise plots of the variables. It works well for datasets with up to 15 or so variables but will be slow and less useful a higher number of dimensions. 🎬 Load the package: ```r library(GGally) ``` --- # Explore Penguins 🎬 Select variables of interest and pipe in to `ggpairs()`: .code80[ ```r penguin %>% select(common_name, sex, island, body_mass_g, ends_with("_mm")) %>% ggpairs(aes(color = common_name), upper = list(continuous = wrap("cor", size = 1))) + theme(text = element_text(size = 6)) ``` ] `upper = list(continuous = wrap("cor", size = 1))` controls the size of the font on the graph, giving the correlations `theme(text = element_text(size = 6))` controls the size of axis labels and text --- # Explore Penguins <img src="05_intro_to_ML_unsupervised_files/figure-html/unnamed-chunk-7-1.png" width="70%" /> --- # Explore Penguins - each plot shows three variables - the vertical, the horizontal and the species by colour - the leading diagonal is each variable 'against' itself - for continuous variables you get the distribution for each species in that variable. - for discrete variable you get the counts in each category - for two continuous variables - you get scatter plots below the diagonal - and correlation coefficients for those catter plots above the diagonal --- # Principal Components Analysis Now to run the PCA. Remember, we can only include numeric variables. 🎬 Select the four variables and pipe into the `prcomp()` function which does the PCA: ```r pca <- penguin %>% select(body_mass_g, ends_with("_mm")) %>% prcomp(scale. = TRUE) ``` **Scaling**: prevents the variables with the biggest values dominating the analysis. -- We have saved the result to a list object called `pca` --- # Principal Components Analysis The maximum number of PCs possible is the same as the number of variables included. But a useful PCA would capture much of the variation in the data in fewer PCs. -- 🎬 See the variance accounted for by each component .code70[ ```r summary(pca) ``` ``` ## Importance of components: ## PC1 PC2 PC3 PC4 ## Standard deviation 1.6569 0.8821 0.60716 0.32846 ## Proportion of Variance 0.6863 0.1945 0.09216 0.02697 ## Cumulative Proportion 0.6863 0.8809 0.97303 1.00000 ``` ] -- 68.634% of the variation is captured in the first PC; 88.087% in the first 2 together; and 97.303% in the first 2 together --- # Principal Components Analysis 🎬 See the importance (loadings) of each variable in each component: .code80[ ```r pca$rotation ``` ``` ## PC1 PC2 PC3 PC4 ## body_mass_g 0.5496747 0.07646366 -0.5917374 -0.5846861 ## culmen_length_mm 0.4537532 0.60019490 0.6424951 -0.1451695 ## culmen_depth_mm -0.3990472 0.79616951 -0.4258004 0.1599044 ## flipper_length_mm 0.5768250 0.00578817 -0.2360952 0.7819837 ``` ] `\(PC1=\)` 0.5496747 `\(body\_mass\_g +\)` 0.4537532 `\(culmen\_length\_mm +\)` -0.3990472 `\(culmen\_depth\_mm +\)` 0.576825 `\(flipper\_length\_mm\)` -- This what we mean when we say the PCs are linear combinations of the original variables --- # Principal Components Analysis To plot, we might want to use the scores on each of the new axes (PCs) and colour them by species. The scores are in a variable called `$x` 🎬 Extract the scores into a dataframe with the species names: ```r pca_labelled <- data.frame(pca$x, common_name = penguin$common_name) # a then to do a scatterplot ``` 🎬 Create a scatter plot: ```r pca_labelled %>% ggplot(aes(x = PC1, y = PC2, color = common_name)) + geom_point() ``` --- # Principal Components Analysis <img src="05_intro_to_ML_unsupervised_files/figure-html/unnamed-chunk-14-1.png" width="700px" height="550px" /> --- # Principal Components Analysis Our species separate a little better on the first two PC than any pairwise comparison. Compare to the `ggpairs()` plot. -- We have coloured our observations by species but PCA is an unsupervised method because PCA itself takes no account of the species in creating the PCs. -- There's a recording with a demo of this PCA. --- class: inverse # PCA on single-cell RNASeq data --- # PCA on scRNASeq data The data in [scrna_data.csv](../data-raw/scrna_data.csv) are some single-cell RNASeq data. Each row is a cell (an observation) and each column is a gene (a variable / feature). The values are gene expression values. 🎬 Import the data: ```r file <- "../data-raw/scrna_data.csv" rna <- read_csv(file) ``` -- The expression of 1838 genes has been measured for 2638 cells. -- You do **not** want to use GGally on this data!! In fact, it is difficult to get any sort of overview for so many rows and columns. This is why dimension reduction with PCA is useful. --- # PCA on scRNAseq data 🎬 Carry out a PCA: ```r pca <- rna %>% prcomp(scale. = TRUE) ``` 🎬 Consider the variance in the first ten PC: .code60[ ```r summary(pca)[["importance"]][,1:10] ``` ``` ## PC1 PC2 PC3 PC4 PC5 PC6 ## Standard deviation 5.738281 5.125134 4.365527 3.686102 2.383895 2.277447 ## Proportion of Variance 0.017920 0.014290 0.010370 0.007390 0.003090 0.002820 ## Cumulative Proportion 0.017920 0.032210 0.042570 0.049970 0.053060 0.055880 ## PC7 PC8 PC9 PC10 ## Standard deviation 2.169628 2.051966 1.979966 1.940448 ## Proportion of Variance 0.002560 0.002290 0.002130 0.002050 ## Cumulative Proportion 0.058440 0.060730 0.062870 0.064910 ``` ] --- # PCA on scRNAseq data The amount of variation in the first 10 components combined is still quite small (6.49%) -- but as there are 1838 variables, perhaps not that small. There is 20.71% in the first 100. --- # PCA on single-cell RNAseq data Let's plot PC1 against PC2. 🎬 Put the PC scores in a dataframe: ```r dat <- data.frame(pca$x) ``` 🎬 Plot PC1 against PC2. ```r ggplot(dat, aes(x = PC1, y = PC2)) + geom_point() ``` --- # PCA on scRNAseq data <img src="05_intro_to_ML_unsupervised_files/figure-html/unnamed-chunk-20-1.png" width="580px" height="580px" /> It's difficult to discern and clusters of cell types from this figure. That PCA maximises the distance between dissimilar cells is clear here. --- # PCA on single-cell RNAseq data Since the first two components don't capture much of the variation in the cells, it's worth looking at some other pairwise comparisons. -- Any two will still only capture a small amount of variance but clusters may be seen better in some comparisons than others. 🎬 Select the first 10 PCs and pipe in to `ggpairs()`: ```r dat %>% select(PC1:PC10) %>% ggpairs(upper = list(continuous = wrap("cor", size = 1))) ``` --- # PCA on single-cell RNAseq data <img src="05_intro_to_ML_unsupervised_files/figure-html/unnamed-chunk-21-1.png" width="580px" height="580px" /> Some comparisons suggest we might have clusters of cell types but PCA isn't helping us much. --- class: inverse # *t*-SNE on scRNASeq data --- # *t*-SNE on scRNASeq data We will now use *t*-SNE on this scRNASeq data. Will it visualise different cell types more effectively? 🎬 Load the **`Rtsne`** package: ```r library(Rtsne) ``` --- # *t*-SNE on scRNASeq data 🎬 Perform *t*-SNE with the `Rtsne()` function: ```r tsne <- rna %>% Rtsne(perplexity = 40, check_duplicates = FALSE) ``` *t*-SNE is computationally demanding so expect it to take a minute or two. It is a stochastic method - the results will differ each time you run it even if the arguments are the same. `perplexity` is one of the arguments than can be altered - it is a smoothing of the number of neighbours. --- # *t*-SNE on scRNASeq data 🎬 Put the *t*-SNE scores in a dataframe: ```r dat <- data.frame(tsne$Y) ``` 🎬 Plot the first *t*-SNE dimension against the second: ```r dat %>% ggplot(aes(x = X1, y = X2)) + geom_point(size = 0.5) ``` Yours will look a little different but if the cell type patterns are real, we would expect to see similar groupings. --- # *t*-SNE on scRNASeq data <img src="05_intro_to_ML_unsupervised_files/figure-html/unnamed-chunk-25-1.png" width="700px" height="550px" /> How many cell types do you think there are? --- # *t*-SNE on scRNASeq data I would expect you to see at least three or four cell types and possible 6 - it looks like the large cluster could be three clusters and one of the other clusters is two or three. -- A cluster analysis (a different unsupervised method) has been performed on these data and the cell types identified and verified by mapping the expression of markers on the clusters. We can import this labelling and colour our *t*-SNE plot by cell type. --- # *t*-SNE on scRNASeq data The labelling data are in [scrna_meta.csv](../data-raw/scrna_meta.csv) 🎬 Import the metadata ```r file <- "../data-raw/scrna_meta.csv" meta <- read_csv(file) ``` -- There is a row for every cell and one of the columns `louvain`, gives the cell types. Louvain is the name of the clustering algorithm that was used. --- # *t*-SNE on scRNASeq data There are 8 cell types. .code80[ ```r unique(meta$louvain) ``` ``` ## [1] "CD4 T" "B CELLS" "CD14+ Monocytes" ## [4] "NK CELLS" "CD8 T" "FCGR3A+ Monocytes" ## [7] "Dendritic" "Megakaryocytes" ``` ] --- # *t*-SNE on scRNASeq data 🎬 Add the cell type to the *t*-SNE scores dataframe: ```r dat <- data.frame(dat, type = meta$louvain) ``` 🎬 Replot the *t*-SNE scores coloured by cell type: ```r dat %>% ggplot(aes(x = X1, y = X2, colour = type)) + geom_point(size = 0.5) ``` --- # *t*-SNE on scRNASeq data <img src="05_intro_to_ML_unsupervised_files/figure-html/unnamed-chunk-29-1.png" width="700px" height="550px" /> --- # Summary .font80[ * PCA and *t*-SNE are unsupervised methods * PCA and *t*-SNE are dimension/data reduction and visualisation methods * They are applied when you have very many continuous variables and allow you to visualised the data in 2 dimensions, and thus see group/patterns more easily. * PCA is a fast, linear, parametric method that optimises putting dissimilar observations far apart * *t*-SNE is a non-parametric, non-linear method that prioritises placing similar observations near each other. * *t*-SNE is a stochastic method and is computational intensive but works better on most biological/expression data where the relationships between variables are often non-linear ] --- # Reading ## Strongly recommended * Identifying cell populations with scRNASeq (Andrews and Hemberg, 2018) A good overview of different experimental protocols for Single-cell RNASeq and the most popular methods for facilitating the computational analysis. --- # References .font50[ .footnote[ Slides made with xaringan (Xie, 2021), xaringanExtra (Aden-Buie, 2020), xaringanthemer (Aden-Buie, 2021) and RefManageR (McLean, 2017) ] Aden-Buie, G. (2020). _xaringanExtra: Extras And Extensions for Xaringan Slides_. R package version 0.2.3.9000. URL: [https://github.com/gadenbuie/xaringanExtra](https://github.com/gadenbuie/xaringanExtra). Aden-Buie, G. (2021). _xaringanthemer: Custom 'xaringan' CSS Themes_. R package version 0.4.0. URL: [https://CRAN.R-project.org/package=xaringanthemer](https://CRAN.R-project.org/package=xaringanthemer). Andrews, T. S. and M. Hemberg (2018). "Identifying cell populations with scRNASeq". En. In: _Mol. Aspects Med._ 59, pp. 114-122. Gorman, K. B., T. D. Williams, et al. (2014). "Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis)". In: _PLOS ONE_ 9.3, pp. 1-14. DOI: [10.1371/journal.pone.0090081](https://doi.org/10.1371%2Fjournal.pone.0090081). URL: [https://doi.org/10.1371/journal.pone.0090081](https://doi.org/10.1371/journal.pone.0090081). Horst, A. M., A. P. Hill, et al. (2020). _palmerpenguins: Palmer Archipelago (Antarctica) penguin data_. R package version 0.1.0. URL: [https://allisonhorst.github.io/palmerpenguins/](https://allisonhorst.github.io/palmerpenguins/). McLean, M. W. (2017). "RefManageR: Import and Manage BibTeX and BibLaTeX References in R". In: _The Journal of Open Source Software_. DOI: [10.21105/joss.00338](https://doi.org/10.21105%2Fjoss.00338). Xie, Y. (2021). _xaringan: Presentation Ninja_. R package version 0.22. URL: [https://CRAN.R-project.org/package=xaringan](https://CRAN.R-project.org/package=xaringan). ] --- Emma Rand <br> [emma.rand@york.ac.uk](mailto:emma.rand@york.ac.uk) <br> Twitter: [@er13_r](https://twitter.com/er13_r) <br> GitHub: [3mmaRand](https://github.com/3mmaRand) <br> blog: https://buzzrbeeline.blog/ <br> <br> <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">Data Science strand of BIO00058M</span> by <span xmlns:cc="http://creativecommons.org/ns#" property="cc:attributionName">Emma Rand</span> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.