Workshop

Organising Reproducible Data Analyses

Author

Emma Rand

Published

18 September, 2024

Introduction

Session overview

In this workshop we will discuss why reproducibility matters and how to organise your work to make it reproducible. We will cover scripting, organisation and documentation.

Reproducibility

What is reproducibility?

  • Reproducible: Same data + same analysis = identical results. “… obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis. This definition is synonymous with ‘computational reproducibility’” (National Academies of Sciences et al. 2019).

  • Replicable: Different data + same analysis = qualitatively similar results. The work is not dependent on the specificities of the data.

  • Robust: Same data + different analysis = qualitatively similar or identical results. The work is not dependent on the specificities of the analysis.

  • Generalisable: Different data + different analysis = qualitatively similar results and same conclusions. The findings can be generalised.

Two-by-two matrix. Columns are Data, either same or different. Rows are Analysis, either same or different. Each cell contains one of the four definitions above: reproducible, replicable, robust or generalisable.

The Turing Way's definitions of reproducible research

Why does it matter?

Person working at a computer with an offstage person asking 'How is the analysis going?' The person at the computer replies 'Can't understand the data...and the data collector does not answer my emails or calls.' Person offstage: 'That's terrible! So cruel! Who did collect the data? I will sack them!' Person at the computer: 'um...I did, 3 years ago.'

futureself, CC-BY-NC, by Julen Colomb
  • Five selfish reasons to work reproducibly (Markowetz 2015). Alternatively, see the very entertaining talk

  • There are many high-profile cases of work that did not reproduce, e.g., the work of Anil Potti, unravelled by Baggerly and Coombes (2009).

  • Reproducibility will become standard in science and publishing; see, e.g., the OECD Global Science Forum report “Building digital workforce capacity and skills for data-intensive science” (OECD Global Science Forum 2020).

How to achieve reproducibility

  • Scripting

  • Organisation: Project-oriented workflows with file and folder structure, naming things

  • Documentation: Readme files, code comments, metadata, version control

Scripting

Rationale for scripting?

  • Science is the generation of ideas, designing work to test them and reporting the results.

  • We ensure laboratory and field work is replicable, robust and generalisable by planning and recording in lab books and using standard protocols. Repeating results is still hard.

  • Workflows for computational projects, and the data analysis and reporting of other work can, and should, be 100% reproducible!

  • Scripting is the way to achieve this.
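As a minimal sketch of what scripting buys you, the Python script below carries out every step from input data to summary in code, so rerunning it on the same input gives the same result. The data values, the column names and the `summarise` function are hypothetical and only there for illustration.

```python
import csv
import io
import statistics

# Hypothetical raw data. In a real project this would be read from a file
# in data/raw/ rather than typed into the script.
RAW_CSV = """gene,expression
gene_a,10.0
gene_b,12.0
gene_c,8.0
"""


def summarise(csv_text):
    """Return the number of rows and the mean expression from CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    values = [float(row["expression"]) for row in rows]
    return {"n": len(values), "mean": statistics.mean(values)}


# Every step is recorded in code, so nothing depends on remembered
# mouse clicks or manual edits.
summary = summarise(RAW_CSV)
print(summary)
```

Because no step is done by hand, anyone with the script and the data can regenerate the summary.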

Organisation

Project-oriented workflow

  • use folders to organise your work

  • you are aiming for structured, systematic and repeatable.

  • inputs and outputs should be clearly identifiable from structure and/or naming

Examples

-- liver_transcriptome/
   |__data/
      |__raw/
      |__processed/
   |__images/
   |__code/
   |__reports/
   |__figures/
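The structure above can itself be created with a script, which makes it easy to start every project the same way. This is only a sketch: the `create_project` function is a hypothetical helper, and the demonstration runs in a temporary directory so that nothing is left on disk.

```python
import tempfile
from pathlib import Path

# Sub-folders taken from the example tree above.
SUBFOLDERS = ["data/raw", "data/processed", "images", "code", "reports", "figures"]


def create_project(root):
    """Create the project folder and its sub-folders; return the created paths."""
    root = Path(root)
    created = []
    for sub in SUBFOLDERS:
        folder = root / sub
        # parents=True creates data/ on the way to data/raw/;
        # exist_ok=True makes the function safe to rerun.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


# Demonstrate in a temporary directory that is deleted afterwards.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "liver_transcriptome"
    create_project(project)
    found = sorted(
        p.relative_to(project).as_posix() for p in project.rglob("*") if p.is_dir()
    )
    print(found)
```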

Naming things

A comic figure is looking over the shoulder of another and is shocked by a list of files with names like 'Untitled 138 copy.docx' and 'Untitled 243.doc'. Caption: 'Protip: Never look in someone else's documents folder'

documents, CC-BY-NC, https://xkcd.com/1459/

Guiding principle - Have a convention! Good file names are:

  • machine readable

  • human readable

  • compatible with default sorting

I suggest

  • no spaces in names

  • use snake_case or kebab-case rather than CamelCase or dot.case

  • use all lower case except very occasionally where convention is otherwise, e.g., README, LICENSE

  • ordering: use left-padded numbers, e.g., 01, 02, …, 99 or 001, 002, …, 999

  • dates: use ISO 8601 format (YYYY-MM-DD), e.g., 2020-10-16

  • write down your conventions
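One way to keep to a convention is to build file names in code instead of typing them. The sketch below assumes the suggestions above (no spaces, snake_case, left-padded numbers, ISO 8601 dates). The helper names `to_snake_case`, `dated_name` and `numbered_name` are hypothetical.

```python
import re
from datetime import date


def to_snake_case(text):
    """Lower-case the text and replace runs of other characters with '_'."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def dated_name(day, description, extension):
    """Prefix a snake_case description with an ISO 8601 date."""
    return f"{day.isoformat()}_{to_snake_case(description)}.{extension}"


def numbered_name(index, description, extension, width=2):
    """Prefix a snake_case description with a left-padded number."""
    return f"{index:0{width}d}_{to_snake_case(description)}.{extension}"


print(dated_name(date(2022, 3, 21), "Donor 1", "csv"))
print(numbered_name(2, "Exploratory", "R"))

# Left-padding means alphabetical order matches numerical order.
padded = sorted(numbered_name(i, "step", "py") for i in (10, 2, 1))
unpadded = sorted(f"{i}_step.py" for i in (10, 2, 1))
print(padded)
print(unpadded)
```

Without padding, `10_step.py` sorts before `2_step.py`, which is why the left-padded form is suggested.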

-- liver_transcriptome/
   |__data/
      |__raw/
         |__2022-03-21_donor_1.csv
         |__2022-03-21_donor_2.csv
         |__2022-03-21_donor_3.csv
         |__2022-05-14_donor_1.csv
         |__2022-05-14_donor_2.csv
         |__2022-05-14_donor_3.csv
      |__processed/
   |__images/
   |__code/
      |__functions/
         |__summarise.R
         |__normalise.R
         |__theme_volcano.R
      |__01_data_processing.py
      |__02_exploratory.R
      |__03_modelling.R
      |__04_figures.R
   |__reports/
      |__01_report.qmd
      |__02_supplementary.qmd
   |__figures/
      |__01_volcano_donor_1_vs_donor_2.eps
      |__02_volcano_donor_1_vs_donor_3.eps

Documentation

Readme files

READMEs are a form of documentation that has been widely used for a long time. They contain the information needed to understand the other files in a directory. They can be extensive but need not be. Concise is good. Bullet points are good. A README should:

  • Give a brief project title and description

  • Give the start date, last updated date and contact information

  • Outline the folder structure

  • Give software requirements: programs and versions used or required. There are packages that give session information in R (Wickham et al. 2021) and Python (Ostblom 2019).

R:

sessioninfo::session_info()

Python:

import session_info

session_info.show()

  • Give instructions for how to run the code, build reports, and reproduce the figures, etc.

  • Say where to find the data and outputs

  • Include any other information that is needed to understand and recreate the work

  • Ideally, a summary of changes with the date
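Some README contents, such as dates and software versions, can be generated so that they are not mistyped. The sketch below builds a README skeleton as a string. The `build_readme` function, the folder descriptions and the contact address are hypothetical, and only the Python version is recorded, using the standard library.

```python
import platform
from datetime import date


def build_readme(title, description, contact, started, folders):
    """Return the text of a README skeleton in markdown."""
    lines = [
        f"# {title}",
        "",
        description,
        "",
        f"- Started: {started.isoformat()}",
        f"- Last updated: {date.today().isoformat()}",
        f"- Contact: {contact}",
        "",
        "## Folder structure",
        "",
    ]
    lines += [f"- `{name}/`: {purpose}" for name, purpose in folders.items()]
    # Record the version of Python that generated the file.
    lines += ["", "## Software", "", f"- Python {platform.python_version()}"]
    return "\n".join(lines) + "\n"


readme = build_readme(
    title="liver_transcriptome",
    description="Analysis of liver transcriptome data.",  # hypothetical
    contact="name@example.org",  # hypothetical address
    started=date(2022, 3, 21),
    folders={"data/raw": "data as collected", "code": "analysis scripts"},
)
print(readme)
```

The returned text could then be written to `README.md` at the top level of the project and edited by hand to add instructions and a summary of changes.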

-- liver_transcriptome/
   |__data/
      |__raw/
         |__2022-03-21_donor_1.csv
         |__2022-03-21_donor_2.csv
         |__2022-03-21_donor_3.csv
         |__2022-05-14_donor_1.csv
         |__2022-05-14_donor_2.csv
         |__2022-05-14_donor_3.csv
      |__processed/
   |__images/
   |__code/
      |__functions/
         |__summarise.R
         |__normalise.R
         |__theme_volcano.R
      |__01_data_processing.py
      |__02_exploratory.R
      |__03_modelling.R
      |__04_figures.R
   |__README.md
   |__reports/
      |__01_report.qmd
      |__02_supplementary.qmd
   |__figures/
      |__01_volcano_donor_1_vs_donor_2.eps
      |__02_volcano_donor_1_vs_donor_3.eps

Code comments

  • Comments are notes in the code which are not executed. They are ignored by the computer but are read by humans. They are used to explain what the code is doing and why. They are also used to temporarily remove code from execution.
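The short Python sketch below shows both uses of comments: explaining why the code does something, and temporarily removing a line from execution. The `normalise` function and the counts are hypothetical.

```python
def normalise(values):
    """Scale values so that they sum to 1."""
    total = sum(values)
    # Guard against dividing by zero: an all-zero sample has no
    # meaningful proportions, so it is returned unchanged.
    if total == 0:
        return list(values)
    return [value / total for value in values]


counts = [2, 6, 8]
proportions = normalise(counts)

# The next line is commented out, so it is not executed.
# print(counts)
print(proportions)
```

Note that the comment inside the function explains why the check exists, which is harder to recover from the code alone than what the code does.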

GitHub Copilot demo

Quarto demo

Useful exercises

You’re finished!

🥳 Well Done! 🎉

Independent study following the workshop

Consolidate

Pages made with R (R Core Team 2024), Quarto (Allaire et al. 2024), knitr (Xie 2024), kableExtra (Zhu 2021)

References

Allaire, J. J., Charles Teague, Carlos Scheidegger, Yihui Xie, and Christophe Dervieux. 2024. “Quarto.” https://doi.org/10.5281/zenodo.5960048.
Baggerly, Keith A, and Kevin R Coombes. 2009. “Deriving Chemosensitivity from Cell Lines: Forensic Bioinformatics and Reproducible Research in High-Throughput Biology.” Ann. Appl. Stat. 3 (4): 1309–34. https://doi.org/10.2307/27801549.
Markowetz, Florian. 2015. “Five Selfish Reasons to Work Reproducibly.” Genome Biol. 16 (December): 274. https://doi.org/10.1186/s13059-015-0850-7.
National Academies of Sciences, Engineering, Medicine, Policy, Global Affairs, Engineering, Medicine Committee on Science, Public Policy, Board on Research Data, et al. 2019. Understanding Reproducibility and Replicability. National Academies Press (US). https://www.ncbi.nlm.nih.gov/books/NBK547546/.
OECD Global Science Forum. 2020. “Building Digital Workforce Capacity and Skills for Data-Intensive Science.” http://www.oecd.org/officialdocuments/publicdisplaydocumentpdf/?cote=DSTI/STP/GSF(2020)6/FINAL&docLanguage=En.
Ostblom, Joel. 2019. Session_info. https://gitlab.com/joelostblom/session_info.
R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Wickham, Hadley, Winston Chang, Robert Flight, Kirill Müller, and Jim Hester. 2021. Sessioninfo: R Session Information. https://github.com/r-lib/sessioninfo#readme.
Xie, Yihui. 2024. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.
Zhu, Hao. 2021. “kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax.” https://CRAN.R-project.org/package=kableExtra.