Tidying data and the tidyverse including the pipe.

class: center, middle, inverse, title-slide

.title[
# Tidying data and the tidyverse including the pipe.
]
.subtitle[
## White Rose BBSRC DTP Training: An Introduction to Reproducible Analyses in R.
]
.author[
### Emma Rand
]
.institute[
### University of York, UK
]

---

# Outline

The aim of this section is to introduce you to the tidyverse (Wickham, Averick et al., 2019), the pipe `|>`, and some commonly applied data tidying operations.

---
# Learning outcomes

The successful student will be able to:

* use the tidyverse
* use pipes to link operations together
* carry out some common data tidying tasks such as reshaping, renaming and recoding variables, and cleaning cell contents

---
# Outline

You should be able to code along with the examples. When you see the film clapper it is ..

🎬 .. an instruction to do something!!

🎬 Start a new RStudio Project called `module-04-tidy`

🎬 Add directories called `scripts`, `data-raw`, and `data-processed`.
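If you prefer to create the directories from code rather than through the Files pane, a minimal sketch using base R is shown here. It assumes your working directory is the top level of the Project; the vector name `dirs` is just illustrative.

```r
# Assumes the working directory is the top level of the RStudio Project
dirs <- c("scripts", "data-raw", "data-processed")

for (d in dirs) {
  # Only create a directory if it is not already there
  if (!dir.exists(d)) {
    dir.create(d)
  }
}

# Check that all three directories now exist
dir.exists(dirs)
```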

🎬 Use a new script for each example.

---
class: inverse

#  What is tidy data?

---
# What is tidy data?

Tidy data adhere to a consistent structure, which makes them easier to manipulate, model and visualize.

1. Each variable has its own column.  
2. Each observation has its own row.  
3. Each value has its own cell.

Tidy data are closely allied to relational algebra (Codd, 1990).  
The same idea underlies the enforced rectangular formatting in SPSS, Stata and R's data frames.  
The term 'tidy data' was popularised by Wickham (2014).

Note: There may be more than one potential tidy structure.

---
# Tidy format

Suppose we had just 3 individuals in each of two populations:

.pull-left[
**NOT TIDY!**
<table>
 <thead>
  <tr>
   <th style="text-align:right;"> A </th>
   <th style="text-align:right;"> B </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 12.4 </td>
   <td style="text-align:right;"> 12.6 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 11.2 </td>
   <td style="text-align:right;"> 11.3 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 11.6 </td>
   <td style="text-align:right;"> 12.1 </td>
  </tr>
</tbody>
</table>
]

.pull-right[

**TIDY!**
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> population </th>
   <th style="text-align:right;"> distance </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 12.4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 11.2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 11.6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:right;"> 12.6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:right;"> 11.3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:right;"> 12.1 </td>
  </tr>
</tbody>
</table>
]
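
One possible way to get from the left-hand layout to the right-hand one is sketched below. It assumes the **`tidyr`** function `pivot_longer()`; the object names `wide` and `tidy` are illustrative only and the values are those in the tables above.

```r
library(tibble)
library(tidyr)
library(dplyr)

# The "not tidy" layout: one column per population
wide <- tibble(A = c(12.4, 11.2, 11.6),
               B = c(12.6, 11.3, 12.1))

# Gather the two columns into a population column and a distance column
tidy <- pivot_longer(wide,
                     cols = c(A, B),
                     names_to = "population",
                     values_to = "distance")

# Sort so that all the A rows come before the B rows
tidy <- arrange(tidy, population)
tidy
```

The result has one row per individual (six rows) and one column per variable (two columns).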

---
class: inverse

#  Tidyverse

---
# Tidyverse

The [tidyverse](https://www.tidyverse.org/) (Wickham, Averick et al., 2019) is both a paradigm for coding in R and a metapackage (a collection of packages).

Contributors describe it as "an opinionated collection of R packages designed for data science".

The key feature is that **`tidyverse`** packages share an underlying design philosophy, grammar, and data structures.

This means they work well together and learning new tidyverse packages is quick. This consistency is intended to make data work more efficient.

---
# Tidyverse

**`tidyverse`** packages have a reputation for producing code that is easy for humans to read and write, and for connecting tools together into reproducible workflows.

The R Views [article by Joe Rickert](https://rviews.rstudio.com/2017/06/08/what-is-the-tidyverse/) gives a good overview.

---
# Tidyverse

There are many other extremely useful packages that are not part of the tidyverse but which also use a common design across packages.

An example is the [Bioconductor Project](https://www.bioconductor.org/).

However, Bioconductor packages that take a tidyverse approach are beginning to accumulate. For example:

* **`tidybulk`** (Mangiola, 2020)
* **`tidyHeatmap`** (Mangiola and Papenfuss, 2020)
* See also [A Tidy Transcriptomics introduction to RNA sequencing analyses](https://stemangiola.github.io/rpharma2020_tidytranscriptomics/articles/tidytranscriptomics.html)
* **`biobroom`** (Bass, Robinson et al., 2020)

---
# Tidyverse

You should already have `tidyverse` installed. Packages need installing only once (unless you wish to update them) but must be loaded every session.

🎬 Load the core tidyverse packages with:

```r
library(tidyverse)
```

--
[Core packages](https://www.tidyverse.org/packages/) are: **`ggplot2`**, **`dplyr`**, **`tidyr`**, **`readr`**, **`purrr`**, **`tibble`**, **`stringr`** and **`forcats`** (joined by **`lubridate`** from tidyverse 2.0.0).

**`ggplot2`** predates the tidyverse, which is why it uses `+` rather than a pipe to link functions together. It uses the same sort of grammar and works very well within the tidyverse.

---
# Tidyverse

The tidyverse also includes many other packages with more specialised usage. They are not loaded automatically with `library(tidyverse)` and need their own `library()` calls.

Examples are `readxl`, `haven` and `rvest` (and, before tidyverse 2.0.0, `lubridate`).
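
As a small sketch of what this means in practice, the code below loads the core packages, then loads `readxl` separately, and finally lists every package that belongs to the tidyverse. It assumes `tidyverse` and its dependencies are installed.

```r
library(tidyverse)  # attaches the core packages only
library(readxl)     # a non-core tidyverse package needs its own call

# Is a core package attached? Is readxl attached now?
"package:dplyr" %in% search()
"package:readxl" %in% search()

# List all the packages in the tidyverse, core and non-core
tidyverse_packages()
```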

---
class: inverse

#  The pipe |>

---
# The pipe |>

The base R pipe operator `|>`, introduced in R 4.1.0, was inspired by the original tidyverse pipe `%>%` from the `magrittr` package.

It is the feature that allows us to connect tools together in a readable way.

It can improve code readability by:

* structuring sequences of data operations left-to-right or top-to-bottom (as opposed to from inside-out),

* minimizing the need for intermediates,

* making it easy to add steps anywhere in the sequence of operations.

---
# The pipe |>

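The pipe means that instead of using:

```r
# some_function and object are placeholders, not real names
some_function(object)
```

we can use:

```r
# some_function and object are placeholders, not real names
object |> some_function()
```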
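
A concrete instance of this pattern, using base R's `sqrt()` and `round()`, might look like the sketch below. The vectors `x` and `y` are made up for illustration.

```r
x <- c(1, 4, 9)

# The conventional call and the piped call do the same thing
sqrt(x)
x |> sqrt()

# The piped object becomes the first argument;
# any other arguments are written inside the brackets as usual
y <- c(1.234, 5.678)
round(y, digits = 1)
y |> round(digits = 1)
```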

This is useful when you have multiple functions to apply.

---
# The pipe |>

Instead of:

```r
function2(function1(object)) 
```

We can use

```r
object |> function1() |> function2()
```

---
# The pipe |>

As a simple example, suppose we want to apply a log-square-root transformation<sup>1</sup> to some proportion data.

.font70[
.footnote[1. A transformation commonly applied to proportion data to make it less platykurtic: `$x^{t} = \log(\sqrt{x})$`
]
]

🎬 Generate a random sample of ten proportions to work with:

```r
nums <- sample((1:100)/100, size = 10, replace = FALSE)
```
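
Because `sample()` draws at random, your ten values will differ from anyone else's and from one run to the next. If you want the same values every time, one option is to set the random seed first, as in this sketch. The seed value `2024` is arbitrary and `nums_a` and `nums_b` are illustrative names.

```r
# Setting the same seed before each draw gives the same sample
set.seed(2024)
nums_a <- sample((1:100) / 100, size = 10, replace = FALSE)

set.seed(2024)
nums_b <- sample((1:100) / 100, size = 10, replace = FALSE)

identical(nums_a, nums_b)  # TRUE
```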

---
# The pipe |>

Two ways we *could* apply the log-square-root transformation are to:

1. nest the square root and log functions  
2. create intermediate variables

---
# The pipe |>

🎬 Nest the `sqrt()` and `log()` functions:

```r
tnums <- log(sqrt(nums))
```

The code must be read from the inside out.

It becomes increasingly difficult to read as the number of functions increases.

It also makes simple debugging harder.

---
# The pipe |>

🎬 Create intermediate variables:

```r
sqrtnums <- sqrt(nums)
tnums <- log(sqrtnums)
```

Easier to read than nesting.

But you end up with extra variables that you don't need and that become increasingly difficult to name appropriately. This clutters both your code and your workspace.

---
# The pipe |>

Using the pipe avoids both problems by passing the output of one operation as the input to the next.

The pipe has long been used by Unix operating systems (where the pipe operator is `|`).

The R pipe operator is `|>`

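🎬 The keyboard shortcut in RStudio is Ctrl+Shift+M (Cmd+Shift+M on macOS). It inserts `|>` when "Use native pipe operator" is ticked under Tools > Global Options > Code; otherwise it inserts `%>%`.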

---
# The pipe |>

🎬 Use pipes to code the functions in sequence:

```r
tnums <- nums |> 
  sqrt() |> 
  log()
```

This can be read as: take `nums` *and then* take its square root *and then* log it.
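
As a quick check, the sketch below confirms that nesting, intermediate variables and the pipe all give the same answer. The seed and the names `nested`, `stepwise` and `piped` are illustrative assumptions.

```r
set.seed(1)  # arbitrary seed so the example is repeatable
nums <- sample((1:100) / 100, size = 10, replace = FALSE)

# 1. nested functions
nested <- log(sqrt(nums))

# 2. intermediate variable
sqrtnums <- sqrt(nums)
stepwise <- log(sqrtnums)

# 3. the pipe
piped <- nums |>
  sqrt() |>
  log()

identical(nested, piped)    # TRUE
identical(stepwise, piped)  # TRUE

# R rewrites the piped code as the nested call before running it
quote(nums |> sqrt() |> log())  # log(sqrt(nums))
```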

---
class: inverse

#  Data tidying tasks

---
# Data tidying tasks

Tidying data includes reshaping it into 'tidy' format but also other tasks such as:

* renaming variables for consistency  
* recoding variables  
* cleaning content for consistency with respect to valid values, missing values and NA
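
To give a flavour of these tasks, here is a minimal sketch using **`dplyr`**. The data, the column names and the `site_1`/`site_2` codes are entirely made up for illustration.

```r
library(dplyr)

# Hypothetical messy data: awkward names, inconsistent codes,
# and the word "missing" used in place of NA
messy <- tibble(`Pop ID` = c("a", "A ", "b", "B"),
                Distance = c("12.4", "11.2", "missing", "12.1"))

clean <- messy |>
  # rename variables for consistency
  rename(population = `Pop ID`, distance = Distance) |>
  mutate(
    # clean content: strip stray spaces and use one case throughout
    population = toupper(trimws(population)),
    # turn the text "missing" into NA, then convert to numbers
    distance = as.numeric(na_if(distance, "missing")),
    # recode a variable into new (hypothetical) labels
    site = case_when(population == "A" ~ "site_1",
                     population == "B" ~ "site_2")
  )

clean
```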

👀 Key point!

* Keep the raw data exactly as they came to you; do not alter or edit the files.
* Script and document all tidying tasks.
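
One way this might look in a script is sketched below: read from `data-raw`, tidy in code, and write the result to `data-processed`, leaving the raw file untouched. The file names are hypothetical, and a temporary folder stands in for the Project directory so that the example can be run anywhere.

```r
library(readr)
library(tidyr)

# A temporary folder stands in for the Project directory (assumption)
project <- file.path(tempdir(), "module-04-tidy")
dir.create(file.path(project, "data-raw"),
           recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "data-processed"), showWarnings = FALSE)

# Stand-in for a raw file that someone has given you
raw_file <- file.path(project, "data-raw", "distances.csv")
write_csv(data.frame(A = c(12.4, 11.2), B = c(12.6, 11.3)), raw_file)

# Read the raw data; the file itself is never edited
raw <- read_csv(raw_file, show_col_types = FALSE)

# All tidying happens in code, so it is documented and repeatable
processed <- pivot_longer(raw,
                          cols = c(A, B),
                          names_to = "population",
                          values_to = "distance")

# Save the tidied version in a different directory
out_file <- file.path(project, "data-processed", "distances-tidy.csv")
write_csv(processed, out_file)
```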

---
class: inverse
# Converting "wide" to "long"