class: center, middle, inverse, title-slide

# Introduction and Principles of reproducibility.
## White Rose BBSRC DTP Training: An Introduction to Reproducible Analyses in R.
### Emma Rand
### University of York, UK

---

<style>
div.blue { background-color:#b0cdef; border-radius: 5px; padding: 20px;}
div.grey { background-color:#d3d3d3; border-radius: 0px; padding: 0px;}
</style>

---
class: inverse

# Programme Overview

---
# What this training *is* and *is not*

Chosen topics are: foundational, widely applicable, and transferable conceptually.

.pull-left[
.font90[
**It is**

* An introduction to R for those without previous experience
* About using RStudio projects and good practice for code and project documentation and organisation
* An introduction to the tidyverse, R Markdown and some more advanced data import
]
]

--

.pull-right[
.font90[
**It is not**

* An introduction to statistics
* Magic
]
]

---
# Programme overview

Modules, tutor-led or supported study. The selection of modules you undertake will depend on your previous experience.

.font60[
[1. Introduction and Principles of reproducibility](01_intro_and_principles_of_repro.html)
Audience: Everyone

* Rationale for scripting
* Why R?
* Organisation of data
* Organisation of analyses

[2. A. Introduction to R and working with data](02_intro_to_r_and_working_with_data.html)
Audience: Those without previous experience of R

* Finding your way round RStudio
* Typing in data, doing some calculations on it, plotting it
* Understanding the manual
* Importing data: working directories and paths
* Summarising and visualising with the [`tidyverse`](https://www.tidyverse.org/)
* Installing and loading packages
]

---
# Have experience?

.font60[
[2. B. Tidying data and the tidyverse including the pipe](04_tidying_data_and_the_tidyverse.html)
Audience: For those with previous experience of R but little of 'tidy data' and the tidyverse

* Using the tidyverse including the pipe to link operations together.
* Carrying out some common data tidying tasks such as reshaping, renaming and recoding variables and cleaning cell contents

[2. C. Advanced data import](05_advanced_data_import.html)
Audience: For those with previous experience of R and the tidyverse including the pipe.

* Understanding what matters in data import
* Importing plain text and proprietary data formats stored locally and on the web
* Carrying out some simple web scraping
* Packages available for importing publicly accessible data from APIs
]

---
# Programme overview

.font60[
[3. RStudio Projects](03_rstudio_projects.html)
Audience: Those without experience of RStudio projects

* Organising your work in a logical, consistent and reproducible way using RStudio Projects
* Writing code with relative paths appropriate to your project organisation
* Writing dataframes and figures to file

[4. R Markdown for Reproducible Reports](06_r_markdown_for_reproducible_reports.html)
Audience: For those with previous experience of R such as having done "Introduction to R and working with data."

* Making more advanced figures
* Creating reproducible reports in a variety of output formats.
]

---
class: inverse

# What is your previous experience? <br>Survey Results

---
# Survey results 1/3

The distribution of ratings you (n = 34) gave in the survey was:

<img src="01_intro_and_principles_of_repro_files/figure-html/unnamed-chunk-2-1.png" width="864" />

---
# Survey results 2/3

<img src="01_intro_and_principles_of_repro_files/figure-html/unnamed-chunk-3-1.png" width="864" />

---
# Survey results 3/3

Please rate your level of comfort with...

<img src="01_intro_and_principles_of_repro_files/figure-html/unnamed-chunk-4-1.png" width="864" />

---
class: inverse

# Data analysis

---
# Data analysis

How much of data analysis is using statistics?
Less than you probably think.

~80% of your time is spent on getting data, cleaning data, aggregating data, reshaping data, and exploring data using exploratory data analysis and data visualization.

---
# Reproducibility is key!

One definition

*"... obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis. This definition is synonymous with “computational reproducibility”"*. (National Academies of Sciences, Engineering, Medicine, et al., 2019)

Also see

- The Turing Way "Definitions for Reproducibility"
- National Science Foundation (Bollen, Cacioppo, et al., 2015)

---
# Reproducibility is key!

<img src="../pics/reproducible-matrix.jpg" width="700px" />

.font60[
How the Turing Way defines reproducible research
]

---
# Who cares?

* Many high-profile cases of work which did not reproduce, e.g. Anil Potti, unravelled by Baggerly and Coombes (2009)
* Five selfish reasons to work reproducibly (Markowetz, 2015). Alternatively, see the [talk](https://youtu.be/yVT07Sukv9Q)
* **Will** become standard in Science and publishing, e.g. the OECD Global Science Forum report *Building digital workforce capacity and skills for data-intensive science* (OECD Global Science Forum, 2020)

---
# Open Science

<img src="../pics/Foster.png" width="700px" />

.font60[
By Petr Knoth and Nancy Pontika - https://en.wikipedia.org/wiki/Open_science#/media/File:Os_taxonomy.png, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=61125075
]

FAIR - Findable, Accessible, Interoperable, Reusable (Wilkinson, Dumontier, et al., 2016)

---
class: inverse

# Rationale for scripting analysis

---
# Rationale for scripting analysis

Science is the generation of ideas, designing work to test them and reporting the results.
<img src="../pics/rationale1.png" width="600px" />

.pull-left[
Generating the results
]

.pull-right[
Analysing and reporting them
]

---
# Rationale for scripting analysis

We ensure reproducibility of laboratory and field work by planning and recording in lab books and using standard protocols.

<img src="../pics/rationale2.png" width="600px" />

Even so, replicating results can be hard.

---
# Rationale for scripting analysis

We ensure reproducibility of laboratory and field work by planning and recording in lab books and using standard protocols.

<img src="../pics/rationale3.png" width="600px" />

Workflows for computational projects and the data analysis and reporting of other work can, and should, be 100% reproducible!

Scripting is the way to achieve this.

---
# Rationale for scripting analysis

That reproducibility applies to all aspects of the data workflow.

<img src="../pics/rationale5.png" width="600px" />

--

From importing or collecting the data, through processing it for analysis and building statistical models, to communicating the methods and results.

--

These are usually iterative and that process of iteration (the development of the analysis) should also be captured.

---
class: inverse

# Why R?

---
# Why R?

Open source and free

--

.......But so is Python

--

R has a reputation for catering to users who do not see themselves as programmers, and allowing them to slide gradually into programming.

--

Designed for data analysis and graphics - which means it is often easier to achieve those tasks in R than in a general-purpose programming language.

---
# Why R?

The R community is one of R's greatest assets, being vibrant, inclusive and supportive of users at all levels.

.pull-left[
* [#rstats](https://twitter.com/hashtag/rstats?lang=en) on Twitter is very active
* [RForwards](https://forwards.github.io/about/) the widening participation task force <sup>1</sup>
* [RLadies](https://rladies.org/) gender diversity promotion
* [Hey! You there! You are welcome here](https://ropensci.org/blog/2017/06/23/community/)

.font70[
.footnote[
1. I am a member of the Core Team for Forwards
]
]
]

.pull-right[
<img src="../pics/welcome_to_rstats_twitter.png" width="400px" />

.font70[
.footnote[
Artwork by @allison_horst
]
]
]

---
# Why R?

R Markdown is sometimes called R's "killer feature".

It turns analyses into fully reproducible, high-quality reports, presentations and dashboards.

It can run Python and other languages in R Markdown documents.

<img src="../pics/rmarkdown_wizards.png" width="500px" />

.font70[
.footnote[
Artwork by @allison_horst
]
]

---
class: inverse

# Aspects of Reproducibility

---
# Organisation of data

Data within files should be 'tidy'.

Tidy data adhere to a consistent structure which makes it easier to manipulate, model and visualize them.

1. Each variable has its own column.
2. Each observation has its own row.
3. Each value has its own cell.

Closely allied to the relational algebra of relational databases (Codd, 1990). Underlies the enforced rectangular formatting in SPSS, STATA and R's dataframe.

The term 'tidy data' was popularised by Wickham (2014).

There may be more than one potential tidy structure.
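---
# Organisation of data

As a rough sketch of what reshaping into a tidy structure can look like in R, the example below assumes a small illustrative table `wide` with one column of distances per population (`A` and `B`) and stacks it so that `population` and `distance` each get their own column. The object names are hypothetical, and base R's `stack()` is only one of several ways to do this; `tidyr::pivot_longer()` is a tidyverse alternative.

```r
# Illustrative wide layout: one column per population (not tidy)
wide <- data.frame(
  A = c(12.4, 11.2, 11.6),
  B = c(12.6, 11.3, 12.1)
)

# stack() (base R) puts all measurements in `values` and
# records the source column name in the factor `ind`
long <- stack(wide)

# Give each variable its own, meaningfully named column;
# stringsAsFactors = FALSE keeps `population` as character in older R versions
tidy <- data.frame(
  population = as.character(long$ind),
  distance   = long$values,
  stringsAsFactors = FALSE
)

tidy
```

Each row of `tidy` is then one observation (one individual), which is the kind of layout that functions such as `lm()` and `ggplot2::ggplot()` tend to work with most naturally.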
---
# Tidy format

Suppose we had just 3 individuals in each of two populations:

.pull-left[
**Not tidy**
<table>
 <thead>
  <tr>
   <th style="text-align:right;"> A </th>
   <th style="text-align:right;"> B </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 12.4 </td>
   <td style="text-align:right;"> 12.6 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 11.2 </td>
   <td style="text-align:right;"> 11.3 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 11.6 </td>
   <td style="text-align:right;"> 12.1 </td>
  </tr>
</tbody>
</table>
]

.pull-right[
**Tidy!**
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> population </th>
   <th style="text-align:right;"> distance </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 12.4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 11.2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 11.6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:right;"> 12.6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:right;"> 11.3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:right;"> 12.1 </td>
  </tr>
</tbody>
</table>
]

---
# Organisation of analyses

* Project-based approach, e.g., RStudio project or similar
* directory structure
* naming
* commenting
* readme
* version control

---
# Organisation of analyses

.pull-left[
**Directories**

* structured
* systematic
* repeatable

**Naming**

* human and machine readable
* no spaces
* use snake/kebab case
* ordering: numbers (zero left padded), dates
* file extensions
]

.pull-right[
.code40[
```
-- stem_cell_rna_2019
   |__stem_cell_rna_2019.Rproj
   |__raw_data
      |__2019-03-21_donor_1.csv
      |__2019-03-21_donor_2.csv
      |__2019-03-21_donor_3.csv
      |__2019-05-14_donor_1.csv
      |__2019-05-14_donor_2.csv
      |__2019-05-14_donor_3.csv
   |__processed_data
      |__all_long.txt
   |__figures
      |__01_volcano_donor_1_vs_donor_2.eps
      |__02_volcano_donor_1_vs_donor_3.eps
   |__functions
      |__01_file_import.R
      |__02_normalise.R
      |__theme_pca.R
      |__theme_volcano.R
   |__pics
      |__01_image.png
      |__02_image.png
   |__README.md
   |__refs
      |__r_refs.bib
      |__proj_refs.bib
   |__analyses
      |__01_data_processing.R
      |__02_exploratory.R
      |__03_modelling.R
      |__04_figures.R
      |__05_report.Rmd
```
]
]

---
# Organisation of analyses

## Further Reading

* "Ten simple rules for reproducible computational research" (Sandve, Nekrutenko, et al., 2013)
* "Best practices for scientific computing" (Wilson, Aruliah, et al., 2014)
* "Good enough practices in scientific computing" (Wilson, Bryan, et al., 2017)
* "Excuse Me, Do You Have a Moment to Talk About Version Control?" (Bryan, 2018)

---
# Summary

* The course is:
  * an introduction to reproducible analyses rather than statistics
  * not enough, you need to practise!
  * made up of modules so you can opt out where you already have the skills
* Scripting makes your work reproducible
* Focus is on R but principles are widely applicable; use Python if you prefer
* Recognising structure in your data and organising in 'tidy' format will pay dividends
* The structured, systematic and consistent organisation of analyses will pay dividends

---
# References

.footnote[
.font60[
Slides made with xaringan (Xie, 2019) and xaringanExtra (Aden-Buie, 2020)
]
]

.font60[
Aden-Buie, G. (2020). _xaringanExtra: Extras And Extensions for Xaringan Slides_. R package version 0.2.3.9000. URL: [https://github.com/gadenbuie/xaringanExtra](https://github.com/gadenbuie/xaringanExtra).

Baggerly, K. A. and K. R. Coombes (2009). "Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology". In: _Ann. Appl. Stat._ 3.4, pp. 1309-1334.

Bollen, K., J. T. Cacioppo, et al. (2015). _Social, Behavioral, and Economic Sciences Perspectives on Robust and Reliable Science_. National Science Foundation.

Bryan, J. (2018). "Excuse Me, Do You Have a Moment to Talk About Version Control?"
In: _Am. Stat._ 72.1, pp. 20-27.

Codd, E. F. (1990). _The Relational Model for Database Management: Version 2_. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.

Markowetz, F. (2015). "Five selfish reasons to work reproducibly". In: _Genome Biol._ 16, p. 274.

National Academies of Sciences, Engineering, Medicine, et al. (2019). _Understanding Reproducibility and Replicability_. National Academies Press (US).
]

---
# References

.font60[
OECD Global Science Forum (2020). _Building digital workforce capacity and skills for data-intensive science_. OECD.

Sandve, G. K., A. Nekrutenko, et al. (2013). "Ten simple rules for reproducible computational research". In: _PLoS Comput. Biol._ 9.10, p. e1003285.

Wickham, H. (2014). "Tidy Data". In: _Journal of Statistical Software, Articles_ 59.10, pp. 1-23.

Wilkinson, M. D., M. Dumontier, et al. (2016). "The FAIR Guiding Principles for scientific data management and stewardship". In: _Sci Data_ 3, p. 160018.

Wilson, G., D. A. Aruliah, et al. (2014). "Best practices for scientific computing". In: _PLoS Biol._ 12.1, p. e1001745.

Wilson, G., J. Bryan, et al. (2017). "Good enough practices in scientific computing". In: _PLoS Comput. Biol._ 13.6, p. e1005510.

Xie, Y. (2019). _xaringan: Presentation Ninja_. R package version 0.12. URL: [https://CRAN.R-project.org/package=xaringan](https://CRAN.R-project.org/package=xaringan).
]

---
# Intro to Repro in R

Emma Rand
[emma.rand@york.ac.uk](mailto:emma.rand@york.ac.uk)
Twitter: [@er13_r](https://twitter.com/er13_r)
GitHub: [3mmaRand](https://github.com/3mmaRand)
blog: https://buzzrbeeline.blog/

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">Licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.</span>

Rand, E. (2021). White Rose BBSRC DTP Training: An Introduction to Reproducible Analyses in R (Version v1.1). https://doi.org/10.5281/zenodo.4701167