class: center, middle, inverse, title-slide .title[ # Tidying data and the tidyverse including the pipe. ] .subtitle[ ## White Rose BBSRC DTP Training: An Introduction to Reproducible Analyses in R. ] .author[ ### Emma Rand ] .institute[ ### University of York, UK ] --- <style> div.blue { background-color:#b0cdef; border-radius: 5px; padding: 20px;} div.grey { background-color:#d3d3d3; border-radius: 0px; padding: 0px;} </style> # Outline The aim of this section is to introduce you to the tidyverse (Wickham Averick et al., 2019), the pipe `|>`, and some commonly applied data tidying operations. --- # Learning outcomes The successful student will be able to: * use the tidyverse * use pipes to link operations together. * carry out some common data tidying tasks such as reshaping, renaming and recoding variables, and cleaning cell contents --- # Outline You should be able to code along with the examples. When you see the film clapper it is .. π¬ .. an instruction to do something!! -- π¬ Start a new RStudio Project called `module-04-tidy` -- π¬ Add directories called `scripts`, `data-raw`, and `data-processed`. -- π¬ Use a new script for each example. --- class: inverse # What is tidy data? --- # What is tidy data? Adhere to a consistent structure which makes it easier to manipulate, model and visualize them. 1. Each variable has its own column. 2. Each observation has its own row. 3. Each value has its own cell. -- Closely allied to the relational algebra (Codd, 1990). Underlies the enforced rectangular formatting in SPSS, STATA and R's dataframe. The term 'tidy data' was popularised by Wickham (2014). Note: There may be more than one potential tidy structure. --- # Tidy format Suppose we had just 3 individuals in each of two populations: .pull-left[ **NOT TIDY!** <table> <thead> <tr> <th style="text-align:right;"> A </th> <th style="text-align:right;"> B </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 12.4 </td> <td style="text-align:right;"> 12.6 </td> </tr> <tr> <td style="text-align:right;"> 11.2 </td> <td style="text-align:right;"> 11.3 </td> </tr> <tr> <td style="text-align:right;"> 11.6 </td> <td style="text-align:right;"> 12.1 </td> </tr> </tbody> </table> ] .pull-right[ **TIDY!** <table> <thead> <tr> <th style="text-align:left;"> population </th> <th style="text-align:right;"> distance </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 12.4 </td> </tr> <tr> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 11.2 </td> </tr> <tr> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 11.6 </td> </tr> <tr> <td style="text-align:left;"> B </td> <td style="text-align:right;"> 12.6 </td> </tr> <tr> <td style="text-align:left;"> B </td> <td style="text-align:right;"> 11.3 </td> </tr> <tr> <td style="text-align:left;"> B </td> <td style="text-align:right;"> 12.1 </td> </tr> </tbody> </table> ] --- class: inverse # Tidyverse --- # Tidyverse The [tidyverse](https://www.tidyverse.org/) (Wickham Averick et al., 2019) is both a paradigm for coding in R and a metapackage (a collection of packages). Contributors describe it as "an opinionated" collection of R packages designed for data science. -- The key feature is that **`tidyverse`** packages share an underlying design philosophy, grammar, and data structures. This means they work well together and learning new tidyverse packages is quick. This consistency is intended to make data work more efficient. --- # Tidyverse **`tidyverse`** packages have a reputation for making code which is easy to read and write for humans, and for the connection of tools together into reproducible workflows. The R Views [article by Joe Rickert](https://rviews.rstudio.com/2017/06/08/what-is-the-tidyverse/) gives a good overview. --- # Tidyverse The are many other extremely useful packages that are not part of the tidyverse but which also use a common design across packages. An example is the [Bioconductor Project](https://www.bioconductor.org/). -- However, Bioconductor packages that take a tidyverse approach are beginning to accumulate. For example: * **`tidybulk`** (Mangiola, 2020). * **`tidyHeatmap`** (Mangiola and Papenfuss, 2020) * See also [A Tidy Transcriptomics introduction to RNA sequencing analyses](https://stemangiola.github.io/rpharma2020_tidytranscriptomics/articles/tidytranscriptomics.html) * **`biobroom`** (Bass Robinson et al., 2020) --- # Tidyverse You should already have `tidyverse` installed. Packages need installing only once (unless you wish to update them) but must be loaded every session. π¬ Load the core tidyverse packages with: ```r library(tidyverse) ``` -- [Core packages](https://www.tidyverse.org/packages/) are: **`ggplot2`**, **`dplyr`**, **`tidyr`**, **`readr`**, **`purr`**, **`tibble`**, **`stringr`** and **`forcats`** -- Actually, **`ggplot2`** predates the tidyverse which is why it uses `+` to link functions together. It does use the same sort of grammar and works very well within the tidyverse. --- # Tidyverse The tidyverse also includes many other packages with more specialised usage. They are not loaded automatically with `library(tidyverse)`and need their own `library()` calls. -- Examples are `readxl`, `haven`, `rvest` and `lubridate.` --- class: inverse # The pipe |> --- # The pipe |> The base pipe `|>` operator was based on the original tidyverse pipe `%>%` from the `magrittr` package . It is the feature that allows us to connect tools together in a readable way. It can improve code readability by: -- * structuring sequences of data operations left-to-right or top-to-bottom (as opposed to from inside-out), -- * minimizing the need for intermediates, -- * making it easy to add steps anywhere in the sequence of operations. --- # The pipe |> The pipe means that instead of using: ```r function(object) ``` We can use ```r object |> function() ``` -- This is useful when you have multiple functions to apply. --- # The pipe |> Instead of: ```r function2(function1(object)) ``` We can use ```r object |> function1() |> function2() ``` --- # The pipe |> As a simple example, suppose we want to apply a log-squareroot transformation<sup>1</sup> to some proportion data. .font70[ .footnote[1. a transformation commonly applied to proportion data to make it less platykurtic `\(x^{t} = log(\sqrt{x})\)` ] ] -- π¬ Generate a random sample of ten proportions to work with: ```r nums <- sample((1:100)/100, size = 10, replace = FALSE) ``` --- # The pipe |> Two ways we *could* apply the log-squareroot transformation are to: 1. nest the squareroot and log functions 2. create intermediate variables --- # The pipe |> π¬ Nest the `sqrt()` and `log()` functions: ```r tnums <- log(sqrt(nums)) ``` -- Code must read from inside to outside. Increasingly difficult to read as number of functions increases. Also makes simple debugging harder. --- # The pipe |> π¬ Create intermediate variables: ```r sqrtnums <- sqrt(nums) tnums <- log(sqrtnums) ``` -- Easier to read than nesting. But you have extra variables you don't need and which become increasingly difficult to name appropriately creating code and workspace clutter. --- # The pipe |> Using the pipe avoids these by taking the output of one operation as the input of the next. The pipe has long been used by Unix operating systems (where the pipe operator is `|`). The R pipe operator is `|>` -- π¬ The keyboard short cut is ctrl-shift-M. --- # The pipe |> π¬ Use pipes to code the functions in sequence: ```r tnums <- nums |> sqrt() |> log() ``` -- This can be read as: take `nums` *and then* squareroot it *and then* log it. --- class: inverse # Data tidying tasks --- # Data tidying tasks Tidying data includes reshaping it in to 'tidy' format but also other tasks such as: * renaming variables for consistency * recoding variables * cleaning content for consistency with respect to valid values, missing values and NA -- π Key point! * Keep the raw data exactly as it came to you and do not alter/edit. * Script and document all tidying tasks. --- class: inverse # Converting "wide" to "long" --- # Converting "wide" to "long" Data commonly need to be reshaped from a format with more than one observation per row. -- The data given in [biomass.txt](data-raw/biomass.txt) are taken from an experiment in which the insect pest biomass (g) was measured on plots sprayed with water (control) or one of five different insecticides. Also in the data file are variables indicating the replicate number and the identity of the tray in which the plant was grown. --- # Converting "wide" to "long" π¬ Save a copy of this file to your `data-raw` folder and read it in. ```r file <- "data-raw/biomass.txt" biomass <- read_table(file) ``` `read_table()` is a function from the tidyverse's **`readr`** package. It works a lot like base R's `read.table()` but with some useful defaults (e.g., header assumed) and extra functionality (e.g.,it's faster and makes better column type decisions). --- # Converting "wide" to "long" π¬ View the dataframe. .scroll-output-height[ .code70[ ``` ## # A tibble: 10 Γ 7 ## WaterControl A B C D E rep_tray ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 350. 159. 150. 80.0 267. 350. 1_x ## 2 324. 146. 154. 266. 110. 320. 2_x ## 3 359. 116. 69.5 161. 221. 359. 3_x ## 4 255. 135. 151. 161. 160. 255. 4_y ## 5 208. 137. 213. 51.2 198. 208. 5_y ## 6 326. 81.8 144. 184. 270. 326. 6_y ## 7 295. 116. 150. 176. 224. 295. 7_z ## 8 285. 114. 134. 136. 141. 285. 8_z ## 9 383. 155. 153. 159. 290. 383. 9_z ## 10 279. 144. 198. 126. 189. 279. 10_z ``` ] ] A tibble is the tidyverse's dataframe. In fact, tibbles are dataframes as well as tibbles. --- # Converting "wide" to "long" These data are in "wide" format and can be converting to "long" format using the `tidyr` package function `pivot_longer()`. -- `pivot_longer()` collects the values from specified columns (`cols`) into a single column (`values_to`) and creates a column to indicate the group (`names_to`). -- We want to gather the first 6 columns but the `rep_tray` column contents should be repeated. --- # Converting "wide" to "long" π¬ Gather all the columns of biomass except `rep_tray`: ```r biomass2 <- biomass |> pivot_longer(names_to = "spray", values_to = "mass", cols = -rep_tray) ``` The values will be in a column called `mass`, the treatments in a column called `spray`. --- # Converting "wide" to "long" π¬ View the dataframe. .scroll-output-height[ .code70[ ``` ## # A tibble: 60 Γ 3 ## rep_tray spray mass ## <chr> <chr> <dbl> ## 1 1_x WaterControl 350. ## 2 1_x A 159. ## 3 1_x B 150. ## 4 1_x C 80.0 ## 5 1_x D 267. ## 6 1_x E 350. ## 7 2_x WaterControl 324. ## 8 2_x A 146. ## 9 2_x B 154. ## 10 2_x C 266. ## # βΉ 50 more rows ``` ] ] --- # Converting "wide" to "long" <span style=" font-weight: bold; color: #fdf9f6 !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #25496b !important;" >Extra exercise:</span> Write this dataframe to file to your `data-processed` folder. See the result: [biomass2-tidyverse.txt](data-processed/biomass2-tidyverse.txt) --- class: inverse # Splitting column contents --- # Splitting column contents We sometimes have single columns which contain more than one type of encoded information. -- For example, a column might contain a full species name and you might want to separate the genus and species names to perform a by-genus analysis. -- Another example arises when you read in multiple files with names that encode a date, experiment or treatment. Typically, we add the file name to a column in the dataframe before combining the dataframes for analysis and then split the name into columns for the date, experiment of treatment. --- # Splitting column contents For the `biomass2` data we might wish to separate the replicate number from the tray identity and put them in two separate columns. -- We can do this with a 'regular expression' or [regex](https://en.wikipedia.org/wiki/Regular_expression). A regex defines a pattern for matching text. -- It's a big topic and there are many tutorials. I remember a few bits and google "how to match ... regex". [A quick reference](https://www.rexegg.com/regex-quickstart.html) --- # Splitting column contents The `extract()` function from the `tidyr` helps us achieve this. We give: * the names of the new columns we want to create * the patterns matching the part of the `rep_tray` value we want to go in each column. --- # Splitting column contents π¬ Extract parts of `rep_tray` and put in new columns `replicate_number`, `tray_id`: ```r biomass3 <- biomass2 |> extract(rep_tray, c("replicate_number", "tray_id"), "([0-9]{1,2})\\_([a-z])") ``` -- .font80[ * The patterns to save into columns are inside `( )`. * The pattern going into `replicate_number`, `[0-9]{1,2}`, means 1 or 2 numbers. * The pattern going into `tray_id`, `[a-z]` means one lowercase letter. * the bit between the two `( )`, `\_` is a pattern that matches what is in `rep_tray` but is not to be saved. ] --- class: inverse # Case study --- # Overivew You will now work through an example of some real data from [The Genever Group](https://www.geneverlab.info/). The arrangement and format of these data are typical of many protein and gene expression datasets so the processing is representative of that needed in a variety of situations. -- The data are mass spectrometry data of the soluble protein fraction from five immortalised mesenchymal stromal cell (MSC) lines. The data are normalised protein abundances. Each row is a protein. --- # Data description π¬ Save a copy of [Y101_Y102_Y201_Y202_Y101-5.csv](data-raw/Y101_Y102_Y201_Y202_Y101-5.csv) file to your `data-raw` folder. π¬ You may wish to view it in excel while reading the information on the next page. --- # Data description The cells lines are Y101, Y102, Y201, Y202 and Y101.5 and there are three replicates for each cell line arranged in columns. Also in the file are columns for: .font70[ * the protein accession * the number of peptides used to identify the protein * the number of unique peptides used to identify the protein * a measure of confidence in that identification * the maximum fold change between the mean abundances of two cell lines (i.e., highest mean / lowest mean) * a p value for a comparison test between the highest mean and lowest mean * a q value (corrected p value) * a measure of the power of that test * the cell line with the highest mean * the cell line with the lowest mean * the protein mass * whether at least two peptides were detected for a protein. ] --- # Data Import Column names are spread over three rows but are primarily in the third row. -- We can read in from the third row by skipping the first two. We can also use the `clean()` function from the `janitor` package to improve the column names. --- # Data Import π¬ Read in the file using `read_csv()` from the `tidyverse`'s `readr` package. ```r # define file name filesol <- "data-raw/Y101_Y102_Y201_Y202_Y101-5.csv" # skip first two lines sol <- read_csv(filesol, skip = 2) |> janitor::clean_names() ``` --- # Data Import π the `::` notation gives you access to a package's functions without first using the `library()` command. This is useful when you want to use a single function from a package, or you need to specify which package when a function name is used in two loaded packages. --- # Filtering rows This dataset includes bovine serum proteins from the medium on which the cells were grown which need to be filtered out. -- We also filter out proteins for which fewer than 2 peptides were detected since we can not be confident about their identity. This is common practice for such proteomic data. --- # Filtering rows **`dplyr`** (yep, a tidyverse package) has the `filter()` function. It is easier to use than selection : `dataframe[rows, cols]` π¬ Keep rows of human proteins identified by more than one peptide: ```r sol <- sol |> filter(str_detect(description, "OS=Homo sapiens")) |> filter(x1pep == "x") ``` --- # Case study π `str_detect(string, pattern)` returns a logical vector according to whether 'pattern' is found in 'string'. π Notice that we have applied `filter()` twice using the pipe. --- # Processing cells contents It would be useful to extract the genename from the description and put it in a column. One entry from the description column looks like this: .scroll-output-width[ .code60[ ```r sol$description[1] ``` ``` ## [1] "Neuroblast differentiation-associated protein AHNAK OS=Homo sapiens GN=AHNAK PE=1 SV=2" ``` ] ] -- The genename is after `GN=`. We need to extract the part of the string with the genename and put it in a new column. --- # Processing cells contents A way to problem-solve your way through this is work with *one* value carrying out one operation at a time until you've worked out what to do before implementing on an entire column. The pipe makes it especially easy to break problems down like this. --- # Processing cells contents ### One step at a time on one value π¬ Extract the first value of the description to work with: ```r one_description <- sol$description[1] ``` -- π¬ Extract the part of the string after `GN=` using a regex: ```r str_extract(one_description,"GN=[^\\s]+") ``` ``` ## [1] "GN=AHNAK" ``` Explanation on next slide. --- # Processing cells contents ### One step at a time on one value * `[ ]` means some characters * `^` means 'not' when inside `[ ]` * `\s` means white space * the `\` before is an escape character to indicate that the next character, `\` should not be taken literally (because it's part of `\s`) * `+` means one or more -- So `GN=[^\\s]+` means `GN=` followed by one or more characters that are not whitespace. This means the pattern *stops* matching at the first white space after "GN=". --- # Processing cells contents ### One step at a time on one value We're close. Now we will drop the `GN=` part by replacing it with nothing: π¬ Add replacing `GN=` with an empty string, `""`, to the pipeline: ```r str_extract(one_description, "GN=[^\\s]+") |> str_replace("GN=", "") ``` ``` ## [1] "AHNAK" ``` -- .font140[ π ] --- # Processing cells contents ### Creating a new column Now we know how to get the result for one value, we need to apply the same process to the whole column -- `mutate()` is the **`dplyr`** function that adds new variables and preserves existing ones. It takes `name` = `value` pairs of expressions where: * `name` is the name for the new variable and * `value` is the value it takes. This is usually an expression. --- # Processing cells contents ### Creating a new column π¬ Add a variable `genename` which contains the processed string from the `description` variable: ```r sol <- sol |> mutate(genename = str_extract(description,"GN=[^\\s]+") |> str_replace("GN=", "")) ``` --- class: inverse # Extra exercises to try --- # Extra exercises to try <span style=" font-weight: bold; color: #fdf9f6 !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #25496b !important;" >Extra exercise:</span> Add a column, `protid`, for the top protein identifier. This is the first Uniprot ID after the "1::" in the `accession` column. <span style=" font-weight: bold; color: #fdf9f6 !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #25496b !important;" >Extra exercise:</span> Create a second dataframe, `sol2` in which the protein abundances are in a single column, `abundance` and the cell lineage and replicate, `lineage_rep`, is indicated in another. All the other variables should also be in the new data frame. You might find [`starts_with()`](https://www.rdocumentation.org/packages/tidyselect/versions/1.0.0/topics/select_helpers) useful. <span style=" font-weight: bold; color: #fdf9f6 !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #25496b !important;" >Extra exercise:</span> Create separate columns in `sol2` for the cell lineage and the replicate. --- # Answers <span style=" font-weight: bold; color: #fdf9f6 !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #25496b !important;" >Extra exercise:</span> Add a column, `protid`, for the top protein identifier. This is the first Uniprot ID after the "1::" in the `accession` column. ```r # trying it out on one value accession <- sol$accession[2] str_extract(accession, "1::[^;]+") |> str_replace("1::", "") ``` ``` ## [1] "Q15149" ``` ```r # adding a new column sol <- sol |> mutate(protid = str_extract(accession, "1::[^;]+") |> str_replace("1::", "")) ``` --- # Answers <span style=" font-weight: bold; color: #fdf9f6 !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #25496b !important;" >Extra exercise:</span> Create a second dataframe, `sol2` in which the protein abundances are in a single column, `abundance` and the cell lineage and replicate, `lineage_rep`, is indicated in another. All the other variables should also be in the new data frame. ```r sol2 <- sol |> pivot_longer(names_to = "lineage_rep", values_to = "abundance", cols = starts_with("y")) ``` --- # Answers <span style=" font-weight: bold; color: #fdf9f6 !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #25496b !important;" >Extra exercise:</span> Create separate columns in `sol2` for the cell lineage and the replicate. ```r sol2 <- sol2 |> extract(lineage_rep, c("line", "rep"), "(y[0-9]{3,4})\\_([a-c])") ``` --- # Summary .font80[ * `tidyverse` is a collection of packages with a common design; many excellent packages are not part of it * `library(tidyverse)` loads the core packages; other `tidyverse` packages need their own `library()` call * the pipe `|>` is key to connecting `tidyverse` tools together to create highly readable code * reshaping data from wide to long format is common: `pivot_longer()` * splitting cell contents is common: `extract()` * regular expressions allow you to match text patterns; expect to have to google and do a lot of trial and error * `clean_names()` is well useful! * `::` gives you access to a package's functions without using `library()` * use particular rows `filter()` ] --- # Reading * [article by Joe Rickert](https://rviews.rstudio.com/2017/06/08/what-is-the-tidyverse/) gives a good overview. * Tidy Data (Wickham, 2014) sections 1 and 2 * Welcome to the Tidyverse * Tidy Data (Wickham, 2014) sections 3 - 6 * R for Data Science (Wickham Γetinkaya-Rundeland et al., 2023) The second edition is really great. --- # References .footnote[ .font60[ Slides made with with xaringan (Xie, 2024) and xaringanExtra (Aden-Buie, 2020) ] ] .font60[ Aden-Buie, G. (2020). _xaringanExtra: Extras And Extensions for Xaringan Slides_. R package version 0.2.3.9000. URL: [https://github.com/gadenbuie/xaringanExtra](https://github.com/gadenbuie/xaringanExtra). Bass, A. J., D. G. Robinson, et al. (2020). _biobroom: Turn Bioconductor objects into tidy data frames_. R package version 1.20.0. URL: [https://github.com/StoreyLab/biobroom](https://github.com/StoreyLab/biobroom). Codd, E. F. (1990). _The Relational Model for Database Management: Version 2_. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc. Mangiola, S. (2020). _tidybulk: Brings transcriptomics to the tidyverse_. R package version 1.0.2. URL: [https://bioconductor.org/packages/release/bioc/html/tidybulk.html](https://bioconductor.org/packages/release/bioc/html/tidybulk.html). Mangiola, S. and A. Papenfuss (2020). "tidyHeatmap: an R package for modular heatmap production based on tidy principles". In: _Journal of Open Source Software_ 5.52, p. 2472. DOI: [10.21105/joss.02472](https://doi.org/10.21105%2Fjoss.02472). URL: [https://doi.org/10.21105/joss.02472](https://doi.org/10.21105/joss.02472). Wickham, H. (2014). "Tidy Data". In: _Journal of Statistical Software, Articles_ 59.10, pp. 1-23. Wickham, H., M. Averick, et al. (2019). "Welcome to the tidyverse". In: _Journal of Open Source Software_ 4.43, p. 1686. DOI: [10.21105/joss.01686](https://doi.org/10.21105%2Fjoss.01686). ] --- # References .font60[ Wickham, H., M. Γetinkaya-Rundeland, et al. (2023). _R for Data Science (2e)_. 2nd. O'Reilly Media, Inc. URL: [https://r4ds.hadley.nz/](https://r4ds.hadley.nz/). Xie, Y. (2024). _xaringan: Presentation Ninja_. R package version 0.29. URL: [https://CRAN.R-project.org/package=xaringan](https://CRAN.R-project.org/package=xaringan). ] --- # Intro to Repro in R Emma Rand [emma.rand@york.ac.uk](mailto:emma.rand@york.ac.uk) Twitter: [@er13_r](https://twitter.com/er13_r) GitHub: [3mmaRand](https://github.com/3mmaRand) blog: https://buzzrbeeline.blog/ <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">Licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>. Rand, E. (2023). White Rose BBSRC DTP Training: An Introduction to Reproducible Analyses in R (Version v1.2). https://doi.org/10.5281/zenodo.3859818