An introduction to Machine Learning: Unsupervised methods

class: center, middle, inverse, title-slide

# An introduction to Machine Learning: Unsupervised methods
## Data Science option of BIO00058M Data Analysis.
### Emma Rand
### University of York, UK

---

# Outline

This slide show covers two commonly used unsupervised methods: Principal Components Analysis (PCA) and *t*-Distributed Stochastic Neighbour Embedding (*t*-SNE, pronounced tea-snee or tisne).

You'll also get a little more practice with data filtering and tidying and meet the **`GGally`** package and some penguins. 🐧 🐧

PCA and *t*-SNE are often used for 'dimension reduction' so that high dimensional data can be visualised more easily and clusters of observations can be seen.

---
# Outline: PCA

PCA is used on **continuous variables**. It creates a new, smaller, set of variables - known as principal components - from linear combinations of the original variables.

The first principal component, PC1, is chosen to capture the maximum variance in the data. The PC2 is orthogonal (uncorrelated) to PC1 and captures the maximum amount of the remaining variance. And so on.

PCA is a fast, linear, parametric method. It is very widely used and often the first method applied.

It works less well when when there are polynomial relationships between variables.

Linear dimension reduction methods concentrate on putting dissimilar observations far apart.

---
# Outline: *t*-SNE

*t*-SNE is a non-parametric, non-linear method that prioritises placing similar observations near each other.

It is a probabilistic method and is computationally intensive.

It is fairly common to use PCA to reduce the number of dimensions and then carry out *t*-SNE on the top principal components.

---
# Outline

In these slides show you will apply:

* PCA to some penguin data which has relatively few variables. This will make it a bit easier to understand what PCA does.

* PCA and *t*-SNE (separately) to some single-cell RNAseq data

---
# Outline

You should be able to code along with the examples. When you see the film clapper it is

🎬 an instruction to do something!!

I suggest having a different RStudio Project for each dataset we use. Note that you can have multiple instances of RStudio running to allow you to work on more than one RStudio Project.

Create directory structure for each RStudio Project, write your analysis in R Markdown with named chunks which are well organised.

Load the tidyverse for each RStudio Project.

---
class: inverse

#  Principal Components Analysis

---
background-image: url(../pics/lter_penguins.png)
background-position: 50% 70%
background-size: 500px

# Principal Components Analysis

We will first use PCA on a dataset with relatively few dimensions. The data are available in the  **`palmerpenguins`** (Horst Hill, et al., 2020) package which is based on data in (Gorman Williams, et al., 2014)

---
#  About Palmer penguins

The Palmer penguins data contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.

You can read more about this dataset in the [Introduction to palmerpenguins](https://allisonhorst.github.io/palmerpenguins/articles/intro.html)

🎬 Load the package to get the data:

```r
library(palmerpenguins)
```

You will not yet see the dataframes in your workspace

---
# Tidy Penguins

We will use the `penguins_raw` dataframe.

You can read the full set of variables [here](https://allisonhorst.github.io/palmerpenguins/reference/penguins_raw.html)

🎬 Clean the variable names:

```r
penguin <- penguins_raw %>%
  janitor::clean_names()
```

I cleaned the variable names so I didn't have to use backticks around variable names with spaces or special characters.

Do `str(penguins_raw)` if you'd like to see the original names.

---
# Tidy Penguins

🎬 Take a look at the four variables we will use:

.code70[

```r
penguin %>% 
  select(body_mass_g,
         ends_with("_mm")) %>% 
  summary()
```

```
##   body_mass_g   culmen_length_mm culmen_depth_mm flipper_length_mm
##  Min.   :2700   Min.   :32.10    Min.   :13.10   Min.   :172.0    
##  1st Qu.:3550   1st Qu.:39.23    1st Qu.:15.60   1st Qu.:190.0    
##  Median :4050   Median :44.45    Median :17.30   Median :197.0    
##  Mean   :4202   Mean   :43.92    Mean   :17.15   Mean   :200.9    
##  3rd Qu.:4750   3rd Qu.:48.50    3rd Qu.:18.70   3rd Qu.:213.0    
##  Max.   :6300   Max.   :59.60    Max.   :21.50   Max.   :231.0    
##  NA's   :2      NA's   :2        NA's   :2       NA's   :2
```
]

Notice we have selected `body_mass_g` three other variables ending in `_mm` with the helper function `ends_with()`.

---
# Tidy Penguins

We have rows with missing values which we will exclude from our analysis.

🎬 Filter out the rows with missing values:

```r
penguin <- penguin %>% 
  filter(!is.na(body_mass_g)) %>% 
  filter(!is.na(sex))
```

Dealing with missing data is a much discussed are in ML. Several data imputation methods exist

---
# Tidy Penguins

Our species variable contains the common name followed by the scientific name in brackets. This is quite long, and will make legends a bit big. I will split the column into two.

🎬 Split `species` into `common_name` and `scientific_name`: 
.code80[

```r
penguin <- penguin %>% 
  extract(species, 
          c("common_name", "scientific_name"),
          "([a-zA-Z]+\\s[a-zA-Z]+)\\s\$([a-zA-Z]+\\s[a-zA-Z]+)\$")
```
]

---
# Tidy Penguins

`"([a-zA-Z]+\\s[a-zA-Z]+)\\s\$([a-zA-Z]+\\s[a-zA-Z]+)\$"` is made of 2 patterns to be saved and two patterns we don't want to save.

--
The two patterns we want to save are in brackets:

`([a-zA-Z]+\\s[a-zA-Z]+)`  
`([a-zA-Z]+\\s[a-zA-Z]+)`

The two patterns are the same and comprised of:

-  `+` one or more of
-  `[a-zA-Z]` or lower or uppercase letters 
-  `\s` a space 
-  The second `\` 'escapes' the first - meaning it is recognised to be part of the whitespace pattern. 
-  `[a-zA-Z]` or lower or uppercase letters

---
# Tidy Penguins

`"([a-zA-Z]+\\s[a-zA-Z]+)\\s\$([a-zA-Z]+\\s[a-zA-Z]+)\$"` is made of 2 patterns to be saved and two patterns we don't want to save.

The two patterns we don't want to save are not in brackets and comprised of:
-  `\\s\$` a whitespace followed by an opening bracket (between the patterns)
-  `\$` a closing bracket (at the end)

---
#  Explore Penguins

**`GGally`** is a package which is very use for quickly getting an overview of a dataset. It has a function `ggpairs()` which outputs a matrix of pairwise plots of the variables.

It works well for datasets with up to 15 or so variables but will be slow and less useful a higher number of dimensions.

🎬 Load the package:

```r
library(GGally)
```

---
#  Explore Penguins

🎬 Select variables of interest and pipe in to `ggpairs()`: 
.code80[

```r
penguin %>%
  select(common_name, 
         sex, 
         island,
         body_mass_g,
         ends_with("_mm")) %>%
  ggpairs(aes(color = common_name), 
          upper = list(continuous = wrap("cor", size = 1))) +
  theme(text = element_text(size = 6))
```
]

`upper = list(continuous = wrap("cor", size = 1))` controls the size of the font on the graph, giving the correlations

`theme(text = element_text(size = 6))` controls the size of axis labels and text

---
#  Explore Penguins
<img src="05_intro_to_ML_unsupervised_files/figure-html/unnamed-chunk-7-1.png" width="70%" />

---
#  Explore Penguins

-  each plot shows three variables - the vertical, the horizontal and the species by colour
-  the leading diagonal is each variable 'against' itself
   -  for continuous variables you get the distribution for each species in that variable.
   -  for discrete variable you get the counts in each category
-  for two continuous variables
   -  you get scatter plots below the diagonal
   -  and correlation coefficients for those catter plots above the diagonal

---
# Principal Components Analysis

Now to run the PCA. Remember, we can only include numeric variables.

🎬 Select the four variables and pipe into the `prcomp()` function which does the PCA:

```r
pca <- penguin %>% 
  select(body_mass_g,
         ends_with("_mm")) %>%
  prcomp(scale. = TRUE)
```

**Scaling**: prevents the variables with the biggest values dominating the analysis.

We have saved the result to a list object called `pca`
---
# Principal Components Analysis

The maximum number of PCs possible is the same as the number of variables included.

But a useful PCA would capture much of the variation in the data in fewer PCs.

🎬 See the variance accounted for by each component

.code70[

```r
summary(pca)
```

```
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.6569 0.8821 0.60716 0.32846
## Proportion of Variance 0.6863 0.1945 0.09216 0.02697
## Cumulative Proportion  0.6863 0.8809 0.97303 1.00000
```

]

68.634% of the variation is captured in the first PC; 88.087% in the first 2 together; and 97.303% in the first 2 together

---
# Principal Components Analysis

🎬 See the importance (loadings) of each variable in each component:
.code80[

```r
pca$rotation 
```

```
##                          PC1        PC2        PC3        PC4
## body_mass_g        0.5496747 0.07646366 -0.5917374 -0.5846861
## culmen_length_mm   0.4537532 0.60019490  0.6424951 -0.1451695
## culmen_depth_mm   -0.3990472 0.79616951 -0.4258004  0.1599044
## flipper_length_mm  0.5768250 0.00578817 -0.2360952  0.7819837
```
]

`$PC1=$` 0.5496747 `$body\_mass\_g +$` 0.4537532 `$culmen\_length\_mm +$` -0.3990472 `$culmen\_depth\_mm +$` 0.576825 `$flipper\_length\_mm$`

This what we mean when we say the PCs are linear combinations of the original variables

---
# Principal Components Analysis

To plot, we might want to use the scores on each of the new axes (PCs) and colour them by species. The scores are in a variable called `$x`

🎬 Extract the scores into a dataframe with the species names:

```r
pca_labelled <- data.frame(pca$x, common_name = penguin$common_name)
# a then to do a scatterplot
```

🎬 Create a scatter plot:

```r
pca_labelled %>% 
  ggplot(aes(x = PC1, y = PC2, color = common_name)) +
  geom_point() 
```

---
# Principal Components Analysis

---
# Principal Components Analysis

Our species separate a little better on the first two PC than any pairwise comparison.

Compare to the `ggpairs()` plot.

We have coloured our observations by species but PCA is an unsupervised method because PCA itself takes no account of the species in creating the PCs.

There's a recording with a demo of this PCA.

---
class: inverse

#  PCA on single-cell RNASeq data

---
#  PCA on scRNASeq data

The data in [scrna_data.csv](../data-raw/scrna_data.csv) are some single-cell  RNASeq data. Each row is a cell (an observation) and each column is a gene (a variable / feature). The values are gene expression values.

🎬 Import the data:

```r
file <- "../data-raw/scrna_data.csv"
rna <- read_csv(file)
```

The expression of 1838 genes has been measured for 2638 cells.

You do **not** want to use GGally on this data!!

In fact, it is difficult to get any sort of overview for so many rows and columns. This is why dimension reduction with PCA is useful.

---
#  PCA on scRNAseq data

🎬 Carry out a PCA:

```r
pca <- rna %>% 
  prcomp(scale. = TRUE)
```

🎬 Consider the variance in the first ten PC:

.code60[

```r
summary(pca)[["importance"]][,1:10]
```

```
##                             PC1      PC2      PC3      PC4      PC5      PC6
## Standard deviation     5.738281 5.125134 4.365527 3.686102 2.383895 2.277447
## Proportion of Variance 0.017920 0.014290 0.010370 0.007390 0.003090 0.002820
## Cumulative Proportion  0.017920 0.032210 0.042570 0.049970 0.053060 0.055880
##                             PC7      PC8      PC9     PC10
## Standard deviation     2.169628 2.051966 1.979966 1.940448
## Proportion of Variance 0.002560 0.002290 0.002130 0.002050
## Cumulative Proportion  0.058440 0.060730 0.062870 0.064910
```
]

---
#  PCA on scRNAseq data

The amount of variation in the first 10 components combined is still quite small (6.49%)

but as there are 1838 variables, perhaps not that small. There is 20.71% in the first 100.

---
#  PCA on single-cell RNAseq data

Let's plot PC1 against PC2.

🎬 Put the PC scores in a dataframe:

```r
dat <-  data.frame(pca$x)
```
🎬 Plot PC1 against PC2.

```r
ggplot(dat, aes(x = PC1, y = PC2)) +
  geom_point()
```

---
#  PCA on scRNAseq data

It's difficult to discern and clusters of cell types from this figure. That PCA maximises the distance between dissimilar cells is clear here.

---
#  PCA on single-cell RNAseq data

Since the first two components don't capture much of the variation in the cells, it's worth looking at some other pairwise comparisons.

Any two will still only capture a small amount of variance but clusters may be seen better in some comparisons than others.

🎬 Select the first 10 PCs and pipe in to `ggpairs()`:

```r
dat %>%
  select(PC1:PC10) %>% 
  ggpairs(upper = list(continuous = wrap("cor", size = 1)))
```

---
#  PCA on single-cell RNAseq data

Some comparisons suggest we might have clusters of cell types but PCA isn't helping us much.

---
class: inverse

# *t*-SNE on scRNASeq data