Introduction and Principles of reproducibility

Published

14 July, 2025

Introduction

Programme Overview

What this training is and is not

The chosen topics are foundational, widely applicable and conceptually transferable.

It is

✅ An introduction to R for those without previous experience
✅ About using RStudio Projects and good practice for code and project documentation and organisation
✅ An introduction to the tidyverse

It is not

❌ An introduction to statistics
❌ Magic

Learning Objectives

After this workshop the successful learner will be able to:

Explain the rationale for scripting analysis
Find their way around the RStudio windows
Create and plot data using ggplot2
Load packages
Explain what is meant by the working directory, absolute paths and relative paths, and apply these concepts to data import
Summarise data in a single group or in multiple groups
Develop highly organised analyses including well-commented scripts that can be understood by future you and others

Principles of reproducibility

What is reproducibility?

Figure: The Turing Way’s definitions of reproducible research, shown as a two-by-two matrix. Columns are Data (same or different); rows are Analysis (same or different). Each cell contains one of the four definitions below.

Reproducible: Same data + same analysis = identical results

“…obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis. This definition is synonymous with ‘computational reproducibility’” (National Academies of Sciences et al. 2019). This is what we are concentrating on in the Supporting Information.

Replicable: Different data + same analysis = qualitatively similar results. The work is not dependent on the specificities of the data.
Robust: Same data + different analysis = qualitatively similar or identical results. The work is not dependent on the specificities of the analysis.
Generalisable: Different data + different analysis = qualitatively similar results and same conclusions.
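
As a minimal sketch of "same data + same analysis = identical results", the R code below runs the same scripted analysis twice on the same made-up values. The function name run_analysis, the example values and the seed are assumptions made for illustration; the point is only that fixing every input, including the random seed, makes the result repeat exactly.

```r
# Hypothetical analysis: the mean of one bootstrap resample.
# Fixing the seed makes the random resampling repeatable.
run_analysis <- function(values, seed = 123) {
  set.seed(seed)
  resampled <- sample(values, replace = TRUE)
  mean(resampled)
}

# Made-up data for illustration only
values <- c(4.1, 5.3, 6.0, 4.8, 5.5)

first_run <- run_analysis(values)
second_run <- run_analysis(values)

# Same data and same analysis give identical results
identical(first_run, second_run)
```

Without the call to set.seed() the two runs would usually differ, which is one small way an analysis can fail to reproduce.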

Why does reproducibility matter?

There are many high-profile cases of work that did not reproduce, e.g. the work of Anil Potti, unravelled by Baggerly and Coombes (2009)
Five selfish reasons to work reproducibly (Markowetz 2015). Alternatively, see the very entertaining talk
Reproducibility will become standard in science and publishing, e.g. the OECD Global Science Forum report “Building digital workforce capacity and skills for data-intensive science” (OECD Global Science Forum 2020)

How to achieve reproducibility

Reproducibility is a continuum. Some is better than none!
Script everything
Organisation: Project-oriented workflows with file and folder structure, naming things
Code: follow a consistent style, organise into sections and scripts (be modular), code algorithmically
Documentation: readme files, code comments, metadata
More advanced: version control, continuous integration and testing
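
The points about code organisation might look something like the following sketch of a short, sectioned and commented R script. The data values, the column names and the helper standard_error are made up for illustration. The section headers use the comment style that RStudio recognises as code sections (a comment ending in four or more dashes).

```r
# Packages ----------------------------------------------------------------
# None needed: this sketch uses base R only.

# Data --------------------------------------------------------------------
# Made-up values for illustration only
heights <- data.frame(
  sex = c("f", "f", "m", "m"),
  height_cm = c(162, 168, 175, 181)
)

# Functions ---------------------------------------------------------------
# Standard error of the mean (hypothetical helper)
standard_error <- function(x) {
  sd(x) / sqrt(length(x))
}

# Summarise ---------------------------------------------------------------
# "Code algorithmically": n is calculated from the data rather than typed,
# so the script still works if rows are added or removed.
overall <- data.frame(
  n = nrow(heights),
  mean_cm = mean(heights$height_cm),
  se_cm = standard_error(heights$height_cm)
)
overall
```

Keeping each step in its own labelled section makes it easier to move a step into a separate script later if the analysis grows.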

Rationale for scripting

Science is the generation of ideas, designing work to test them and reporting the results.
We ensure laboratory and field work is replicable, robust and generalisable by planning and recording in lab books and using standard protocols. Repeating results is still hard.
Workflows for computational projects, and the data analysis and reporting of other work, can and should be 100% reproducible!
Scripting is the way to achieve this.

Project-oriented workflow

Use folders to organise your work
You are aiming for work that is structured, systematic and repeatable
Inputs and outputs should be clearly identifiable from structure and/or naming
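
One possible folder layout is sketched below in base R. The folder names (data-raw, data-processed, scripts, figures), the file names and the data values are all assumptions chosen for illustration, and the project is created in a temporary directory so the sketch can be run anywhere without touching your own files.

```r
# Hypothetical project root, created in a temporary directory
project_root <- file.path(tempdir(), "my-analysis")

# Illustrative folder names: inputs and outputs are kept apart
folders <- c("data-raw", "data-processed", "scripts", "figures")
for (folder in folders) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# An input: a small made-up data set saved under data-raw/
raw_path <- file.path(project_root, "data-raw", "example_measurements.csv")
example <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1.2, 1.4, 2.1, 2.3)
)
write.csv(example, raw_path, row.names = FALSE)

# An output: group means written under data-processed/
measurements <- read.csv(raw_path)
group_means <- aggregate(value ~ group, data = measurements, FUN = mean)
out_path <- file.path(project_root, "data-processed", "group_means.csv")
write.csv(group_means, out_path, row.names = FALSE)

# List everything in the project, relative to its root
list.files(project_root, recursive = TRUE)
```

When work is done inside an RStudio Project, the working directory is the project folder, so the same input could be read with the relative path "data-raw/example_measurements.csv" and the script would keep working after the whole folder is moved or shared.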

References

Allaire, J. J., Charles Teague, Carlos Scheidegger, Yihui Xie, Christophe Dervieux, and Gordon Woodhull. 2024. “Quarto.” https://doi.org/10.5281/zenodo.5960048.

Baggerly, Keith A, and Kevin R Coombes. 2009. “Deriving Chemosensitivity from Cell Lines: Forensic Bioinformatics and Reproducible Research in High-Throughput Biology.” Ann. Appl. Stat. 3 (4): 1309–34. https://doi.org/10.2307/27801549.

Broman, Karl W, and Kara H Woo. 2018. “Data Organization in Spreadsheets.” Am. Stat. 72 (1): 2–10. https://doi.org/10.1080/00031305.2017.1375989.

Bryan, Jennifer. 2018. “Excuse Me, Do You Have a Moment to Talk about Version Control?” Am. Stat. 72 (1): 20–27.

Markowetz, Florian. 2015. “Five Selfish Reasons to Work Reproducibly.” Genome Biol. 16 (December): 274. https://doi.org/10.1186/s13059-015-0850-7.

National Academies of Sciences, Engineering, Medicine, Policy, Global Affairs, Engineering, Medicine Committee on Science, Public Policy, Board on Research Data, et al. 2019. Understanding Reproducibility and Replicability. National Academies Press (US). https://www.ncbi.nlm.nih.gov/books/NBK547546/.

OECD Global Science Forum. 2020. “Building Digital Workforce Capacity and Skills for Data-Intensive Science.” http://www.oecd.org/officialdocuments/publicdisplaydocumentpdf/?cote=DSTI/STP/GSF(2020)6/FINAL&docLanguage=En.

R Core Team. 2025. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Sandve, Geir Kjetil, Anton Nekrutenko, James Taylor, and Eivind Hovig. 2013. “Ten Simple Rules for Reproducible Computational Research.” PLoS Comput. Biol. 9 (10): e1003285.

Wilson, Greg, D A Aruliah, C Titus Brown, Neil P Chue Hong, Matt Davis, Richard T Guy, Steven H D Haddock, et al. 2014. “Best Practices for Scientific Computing.” PLoS Biol. 12 (1): e1001745.

Wilson, Greg, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy K Teal. 2017. “Good Enough Practices in Scientific Computing.” PLoS Comput. Biol. 13 (6): e1005510.

Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman and Hall/CRC.

———. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman and Hall/CRC. https://yihui.org/knitr/.

———. 2024. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.

Zhu, Hao. 2024. kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.