class: center, middle, inverse, title-slide

.title[
# Project-oriented workflow
]
.subtitle[
## White Rose BBSRC DTP Training: An Introduction to Reproducible Analyses in R.
]
.author[
### Emma Rand
]
.institute[
### University of York, UK
]

---

<style>
div.blue { background-color:#b0cdef; border-radius: 5px; padding: 20px;}
div.grey { background-color:#d3d3d3; border-radius: 0px; padding: 0px;}
</style>

# Outline

The aim of this section is to train you to take a project-oriented approach to analysis.

The majority of the ideas apply in any analysis environment (not just R).

---
# Learning outcomes

The successful student will be able to:

* use a project-oriented workflow
* explain what is meant by a working directory, absolute and relative paths
* list files in their current working directory
* use RStudio projects to appropriately organise a piece of work
* write code with relative paths appropriate to their project organisation
* write dataframes and figures to file

🎬 An instruction to do something!!

---
class: inverse

# Organising analyses

---
# Organising analyses

What does it mean to be organised?

At least:

* Use a project based approach, e.g., RStudio project or similar (most IDEs have them)
* Have a hierarchical folder structure
* Have a consistent and informative naming system that 'plays nice'
* Document code with comments and analyses with README

More advanced:

* Generalise with functions and packages
* Use version control
* Use [pipeline and workflow tools](https://github.com/pditommaso/awesome-pipeline)

---
# What is a project?

A project is a discrete piece of work which has a number of files associated with it, such as the data and scripts for an analysis and the reports it produces.

Using a project-oriented workflow means having a hierarchical folder structure with everything needed to reproduce an analysis.
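
As a rough illustration of such a hierarchical folder structure, the sketch below builds a small project skeleton with base R. The project name `my_project` and the folder names are hypothetical, chosen only for this example, and everything is created in a temporary directory so none of your own files are touched.

```r
# Hypothetical project skeleton, built in a temporary directory
project_dir <- file.path(tempdir(), "my_project")

# Folder names are illustrative; use whatever suits your analysis
folders <- c("data-raw", "data-processed", "figures", "scripts")

for (folder in folders) {
  # recursive = TRUE also creates project_dir itself if it is missing
  dir.create(file.path(project_dir, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# A README at the top level documents the project
writeLines("# my_project", file.path(project_dir, "README.md"))

# List everything that was created, relative to the project directory
created <- list.files(project_dir, recursive = TRUE, include.dirs = TRUE)
created
```

Everything needed for the analysis then lives under one top-level folder, which is what makes it possible to move or share the work as a single unit.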

One research project might have several organisational projects associated with it, for example:

* data files and metadata (which may be made into a package)
* analysis and reporting
* a package developed for the analysis
* an app for allowing data to be explored by others

---
# Example

.pull-left[

* structured
* systematic
* repeatable

**Naming**

* human and machine readable
* no spaces
* use snake/kebab case
* ordering: numbers (zero left padded), dates
* file extensions

]

.pull-right[
.code40[
```
-- stem_cell_rna_2019
   |__stem_cell_rna_2019.Rproj
   |__raw_data
      |__2019-03-21_donor_1.csv
      |__2019-03-21_donor_2.csv
      |__2019-03-21_donor_3.csv
      |__2019-05-14_donor_1.csv
      |__2019-05-14_donor_2.csv
      |__2019-05-14_donor_3.csv
   |__processed_data
      |__all_long.txt
   |__figures
      |__01_volcano_donor_1_vs_donor_2.eps
      |__02_volcano_donor_1_vs_donor_3.eps
   |__functions
      |__normalise.R
      |__theme_pca.R
      |__theme_volcano.R
   |__pics
      |__01_image.png
      |__02_image.png
   |__README.md
   |__refs
      |__r_refs.bib
      |__proj_refs.bib
   |__analyses
      |__01_import.R
      |__02_data_processing.R
      |__03_exploratory.R
      |__04_modelling.R
      |__05_figures.R
      |__report.Rmd
```
]
]

---
class: inverse

# What is a path?

---
# What is a path?

A path gives the address - or location - of a filesystem object, such as a file or directory.

Paths appear in the address bar of your browser or file explorer.

--

We need to know a file path whenever we want to read, write or refer to a file using code rather than interactively pointing and clicking to navigate.

--

A path can be **absolute** or **relative**

---
# Absolute paths

An absolute path is given from the "root directory" of the object.

The root directory of a file system is the first or top directory in the hierarchy.

For example, `C:\` or `M:\` on Windows or `/` on a Mac, which is displayed as Macintosh HD in Finder.

---
# Absolute paths

The absolute path for a file, `pigeon.txt`, could be:

* Windows: `C:/Users/er13/Desktop/pigeons/data-raw/pigeon.txt` <sup>1</sup>

.footnote[
.font60[
1.
this appears as `C:\Users\er13\Desktop\pigeons\data-raw\pigeon.txt` in Windows Explorer because Microsoft DOS didn't have directories in 1981 when it was released. At the time it used the `/` character for 'switches' (instead of the existing convention `-` 🙄) so when it did start using directories it couldn't use `/`
]
]

--

* Mac/Unix systems: `/Users/er13/Desktop/pigeons/data-raw/pigeon.txt`

--

* web: `http://www-users.york.ac.uk/~er13/58M_BDS_2019/data/pigeon.txt`

---
class: inverse

# What is a working directory?

---
# What is a directory?

Directory is the old word for what many now call a folder 📂.

Commands that act on directories in most programming languages and environments reflect this.

For example, all of these mean "tell me my working directory":

* `getwd()` **get** **w**orking **d**irectory in R
* `pwd` **p**rint **w**orking **d**irectory in Unix systems
* `os.getcwd()` **get** **c**urrent **w**orking **d**irectory in Python

---
# What is a working directory?

The working directory is the default location a program is using. It is where the program will read and write files by default. You have only one working directory at a time.

The terms 'working directory', 'current working directory' and 'current directory' all mean the same thing.

🎬 Find your current working directory with:

.scroll-output-width[

```r
getwd()
```

```
## [1] "C:/Users/er13/OneDrive - University of York/Desktop/Desktop/PGR/pgr_reproducibility/slides"
```

]

---
# Relative paths

A relative path gives the location of a filesystem object *relative* to the working directory (i.e., that returned by `getwd()`).
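
One way to see this relationship is to build the absolute path that a relative path stands for. This is only a sketch: the path `data-raw/pigeon.txt` is used for illustration and the file does not need to exist for the code to run.

```r
# A relative path (illustrative; the file need not exist)
rel_path <- "data-raw/pigeon.txt"

# The working directory is the starting point for every relative path
wd <- getwd()

# Joining the two gives the absolute path the relative path refers to
abs_path <- file.path(wd, rel_path)
abs_path
```

The value of `abs_path` differs from one computer to the next, whereas `rel_path` stays the same.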

When `pigeon.txt` is in the working directory, the relative path is just the file name: `pigeon.txt`

--

If there is a folder in the working directory called `data-raw` and `pigeon.txt` is in there, then the relative path is `data-raw/pigeon.txt`

---
# Paths: moving up the hierarchy

`../` allows you to look in the directory above the working directory.

When `pigeon.txt` is in the folder above the working directory, the relative path is `../pigeon.txt`

--

And if it is in a folder called `data-raw` which is in the directory above the working directory, then the relative path is `../data-raw/pigeon.txt`

---
# What's in my directory?

You can list the contents of a directory using the `dir()` command:

* `dir()` list the contents of the working directory
* `dir("..")` list the contents of the directory above the working directory
* `dir("../..")` list the contents of the directory two directories above the working directory
* `dir("data-raw")` list the contents of a folder called `data-raw` which is in the working directory.

---
# Relative or absolute

Most of the time you should use relative paths because that makes your work portable.

🥳 The tab key is your friend!

--

You only need to use absolute paths when you are referring to a filesystem outside the one you are using.

---
class: inverse

# Example

---
# Example

🎬 Download and unzip [pigeons.zip](../pigeons.zip) which has the following structure:

```
-- pigeons
   |__data-processed
      |__pigeon_long.txt
   |__data-raw
      |__pigeon.txt
   |__figures
      |__fig1.tiff
   |__scripts
      |__analysis.R
      |__import_reshape.R
   |__pigeons.Rproj
```

---
class: inverse

# RStudio Projects

---
# RStudio Projects

Project is obviously a commonly used word. When I am referring to an [RStudio Project](https://support.posit.co/hc/en-us/articles/200526207-Using-Projects) I will use the capitalised words 'RStudio Project' or 'Project'. In other cases, I will use 'project'.

An RStudio Project is a directory with an `.Rproj` file in it.

The name of the RStudio Project is the same as the name of the top level directory, which is referred to as the Project directory.

---
# RStudio Projects

For example, if you create an RStudio Project `stem_cell_rna` your folder structure would look something like this:

.pull-left[
.code50[
```
-- stem_cell_rna
   |__stem_cell_rna.Rproj
   |__raw_data
      |__2019-03-21_donor_1.csv
      |__2019-03-21_donor_2.csv
      |__2019-05-14_donor_1.csv
      |__2019-05-14_donor_2.csv
   |__figures
      |__01_volcano_donor_1_vs_donor_2.eps
   |__functions
      |__normalise.R
      |__theme_pca.R
      |__theme_volcano.R
   |__README.md
   |__refs
      |__r_refs.bib
      |__proj_refs.bib
   |__analyses
      |__01_import.R
      |__02_data_processing.R
      |__03_exploratory.R
      |__04_modelling.R
      |__05_figures.R
      |__report.Rmd
```
]
]

---
# RStudio Projects

.pull-left[
.code50[
```
*-- stem_cell_rna
   |__stem_cell_rna.Rproj
   |__raw_data
      |__2019-03-21_donor_1.csv
      |__2019-03-21_donor_2.csv
      |__2019-05-14_donor_1.csv
      |__2019-05-14_donor_2.csv
   |__figures
      |__01_volcano_donor_1_vs_donor_2.eps
   |__functions
      |__normalise.R
      |__theme_pca.R
      |__theme_volcano.R
   |__README.md
   |__refs
      |__r_refs.bib
      |__proj_refs.bib
   |__analyses
      |__01_import.R
      |__02_data_processing.R
      |__03_exploratory.R
      |__04_modelling.R
      |__05_figures.R
      |__report.Rmd
```
]
]

.pull-right[
the Project directory
]

---
# RStudio Projects

.pull-left[
.code50[
```
-- stem_cell_rna
*  |__stem_cell_rna.Rproj
   |__raw_data
      |__2019-03-21_donor_1.csv
      |__2019-03-21_donor_2.csv
      |__2019-05-14_donor_1.csv
      |__2019-05-14_donor_2.csv
   |__figures
      |__01_volcano_donor_1_vs_donor_2.eps
   |__functions
      |__normalise.R
      |__theme_pca.R
      |__theme_volcano.R
   |__README.md
   |__refs
      |__r_refs.bib
      |__proj_refs.bib
   |__analyses
      |__01_import.R
      |__02_data_processing.R
      |__03_exploratory.R
      |__04_modelling.R
      |__05_figures.R
      |__report.Rmd
```
]
]

.pull-right[
the `.Rproj` file which is the defining feature of an RStudio Project
]

---
# RStudio Projects

When you open an RStudio Project, the working directory is set to the Project directory (i.e., the location of the `.Rproj`
file).

This makes your work portable. You can zip up the project folder and send it to any person, including future you, or any computer.

They will be able to unzip, open the project and have all the code just work. 🎂

---
class: inverse

# Directory structure

---
# Directory structure

You are aiming for structured, systematic and repeatable. For example, the Project directory might contain:

* `.Rproj` file
* README - tell people what the project is and how to use it
* License - tell people what they are allowed to do with your project
* Directories
   * data-raw/
   * images/
   * scripts/
   * functions/
   * figures/

---
# README

READMEs are a form of documentation which have been widely used for a long time. They contain information about the other files in a directory. They can be extensive.

* Wikipedia [README page](https://en.wikipedia.org/wiki/README)
* GitHub Docs [About READMEs](https://docs.github.com/en/free-pro-team@latest/github/creating-cloning-and-archiving-repositories/about-readmes)

---
# README

A minimal README might give:

* Title
* Description, 50 words or so on what the project is
* Technical Description of the project
   * What software and packages are needed including versions
   * Any instructions needed to run the analysis/use the software
   * Any issues that a user might face in running the analysis/using the software
* Instructions on how to use the work

---
# License

A license tells others what they can and can't do with your work.

[choosealicense.com](https://choosealicense.com/) is a useful explainer.

I typically use:

* [MIT License](https://choosealicense.com/licenses/mit/) for software
* [CC-BY-SA-4.0](https://choosealicense.com/licenses/cc-by-sa-4.0/) for other work

---
# Reading

Suggested reading: [Chapter 2 Project-oriented workflow](https://whattheyforgot.org/project-oriented-workflow.html) of What They Forgot to Teach You About R (Bryan and Hester)

---
class: inverse

# Your turn!

---
# Your turn!

You are going to create an RStudio Project with some directories and use it to organise a very simple analysis.

The analysis will import a data file, reformat it and write the new format to file. It will then create a figure and write the image to file.

You'll get practice with tidying data and plotting data.

---
# RStudio Project infrastructure

🎬 Create a new Project called `seals` by clicking on the drop-down menu on top right where it says **Project: (None)** and choosing New Project, then New Directory, then New Project. Name the RStudio Project `seals`.

--

🎬 Create folders in `seals` called `data-raw`, `data-processed` and `figures`.

--

🎬 Start new scripts called `01-import.R`, `02-tidy.R`, and `03-figures.R`

---
# Save and Import

🎬 Save a copy of [seal.txt](data/seal.txt) to your `data-raw` folder. These data give the muscle myoglobin concentration for three species of seal.

--

🎬 In your `01-import.R` script, load the tidyverse set of packages.

```r
library(tidyverse)
```

---
# Save and Import

🎬 Add the command to import the data:

```r
seal <- read_table("data-raw/seal.txt")
```

--

The relative path is `data-raw/seal.txt` because your working directory is the Project directory, `seals`.

---
# Reformat the data

This dataset has three observations in a row - it is not 'tidy'.

🎬 Open your `02-tidy.R` script, and reshape the data using:

```r
seal <- pivot_longer(data = seal,
                     cols = everything(),
                     names_to = "species",
                     values_to = "myoglobin")
```

--

This reformats the dataframe in R but does not overwrite the text file of the data.

---
class: inverse

# Writing files

---
# Writing files

Often we want to write to files.

My main reasons for doing so are to save copies of data that have been processed and to save manuscripts and graphics.
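
Writing fails if the destination folder does not exist, so it can help to check for it first. The sketch below uses base R's `write.csv()` rather than the readr functions used elsewhere in these slides, purely so that it runs without extra packages. The dataframe `processed`, its values and the file name `processed.csv` are made up for illustration, and a temporary directory stands in for a Project folder such as `data-processed`.

```r
# Made-up processed data; in a real analysis this is your own dataframe
processed <- data.frame(group = c("a", "b"), value = c(1.5, 2.5))

# A temporary directory is used so the sketch runs anywhere; in a
# Project you would use a relative path such as "data-processed"
out_dir <- file.path(tempdir(), "data-processed")

# Create the folder only if it is not already there
if (!dir.exists(out_dir)) {
  dir.create(out_dir, recursive = TRUE)
}

# Keep the path in a variable so it can be reused
out_file <- file.path(out_dir, "processed.csv")
write.csv(processed, out_file, row.names = FALSE)

# Read the file back in to confirm that it was written
check <- read.csv(out_file)
check
```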

--

🎬 Write your dataframe `seal` to a csv file named `seal-long.csv` in your `data-processed` folder:

```r
file <- "data-processed/seal-long.csv"
write_csv(seal, file)
```

--

Putting file paths into variables often makes your code easier to read, especially when file paths are long or used multiple times.

---
# Create a plot

🎬 Open your `03-figures.R` script and create a simple plot of these data with:

```r
fig1 <- ggplot(data = seal, aes(x = species, y = myoglobin)) +
  geom_boxplot() +
  scale_x_discrete(name = "Species") +
  scale_y_continuous(name = "Myoglobin",
                     expand = c(0, 0),
                     limits = c(0, 80)) +
  theme_classic()
```

---
# View plot

🎬 View plot with:

```r
fig1
```

<img src="03_rstudio_projects_files/figure-html/unnamed-chunk-7-1.png" width="288" />

---
# Write ggplot figure to file

A useful function for saving ggplot figures is `ggsave()`.

It has arguments for the size, resolution and device for the image. See the [`ggsave()` reference page](https://ggplot2.tidyverse.org/reference/ggsave.html).

--

Since I often make more than one figure, I might set these arguments first.

---
# Write ggplot figure to file

🎬 Assign `ggsave` argument values to variables:

```r
# figure saving settings
units <- "in"
fig_w <- 3.2
fig_h <- fig_w
dpi <- 600
device <- "tiff"
```

.footnote[
"tiff" is a format often required by journals; you may want png or jpg.]

---
# Write ggplot figure to file

🎬 Save the figure to your figures directory:

```r
ggsave("figures/fig1.tiff",
       plot = fig1,
       device = device,
       width = fig_w,
       height = fig_h,
       units = units,
       dpi = dpi)
```

🎬 Check it is there!

---
# Summary

* Use a project based approach
* Have a hierarchical folder structure
* Make your code portable by using relative paths
* Have a consistent and informative naming system that 'plays nice'
* Make your analysis easy to understand by breaking it down into different scripts and by using variables for repeatedly used or long values.
* Document code with comments and analyses with README

---
# Reading

**Strongly recommended**

* Chapter 2 Project-oriented workflow | What They Forgot to Teach You About R (Bryan and Hester)

**Further**

* Ten simple rules for reproducible computational research (Sandve, Nekrutenko et al., 2013)
* Good enough practices in scientific computing (Wilson, Bryan et al., 2017)
* Excuse Me, Do You Have a Moment to Talk About Version Control? (Bryan, 2018)

---
# References

.footnote[
.font60[
Slides made with xaringan (Xie, 2019) and xaringanExtra (Aden-Buie, 2020)
]
]

.font60[
Aden-Buie, G. (2020). _xaringanExtra: Extras And Extensions for Xaringan Slides_. R package version 0.2.3.9000. URL: [https://github.com/gadenbuie/xaringanExtra](https://github.com/gadenbuie/xaringanExtra).

Bryan, J. (2018). "Excuse Me, Do You Have a Moment to Talk About Version Control?" In: _Am. Stat._ 72.1, pp. 20-27.

Bryan, J. and J. Hester. _Chapter 2 Project-oriented workflow | What They Forgot to Teach You About R_. <https://whattheyforgot.org/project-oriented-workflow.html>. Accessed: 2019-9-26.

Sandve, G. K., A. Nekrutenko, et al. (2013). "Ten simple rules for reproducible computational research". In: _PLoS Comput. Biol._ 9.10, p. e1003285.

Wilson, G., J. Bryan, et al. (2017). "Good enough practices in scientific computing". In: _PLoS Comput. Biol._ 13.6, p. e1005510.

Xie, Y. (2019). _xaringan: Presentation Ninja_. R package version 0.12. URL: [https://CRAN.R-project.org/package=xaringan](https://CRAN.R-project.org/package=xaringan).
]

---
# Intro to Repro in R

Emma Rand

[emma.rand@york.ac.uk](mailto:emma.rand@york.ac.uk)

Twitter: [@er13_r](https://twitter.com/er13_r)

GitHub: [3mmaRand](https://github.com/3mmaRand)

blog: https://buzzrbeeline.blog/

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">Licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.</span>

Rand, E. (2023). White Rose BBSRC DTP Training: An Introduction to Reproducible Analyses in R (Version v1.2). https://doi.org/10.5281/zenodo.3859818