Workshop
Organising Reproducible Data Analyses
Introduction
Session overview
In this workshop we will discuss why reproducibility matters and how to organise your work to make it reproducible. We will cover:
Reproducibility
What is reproducibility?
Reproducible: Same data + same analysis = identical results. “… obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis. This definition is synonymous with ‘computational reproducibility’” (National Academies of Sciences et al. 2019)
Replicable: Different data + same analysis = qualitatively similar results. The work is not dependent on the specificities of the data.
Robust: Same data + different analysis = qualitatively similar or identical results. The work is not dependent on the specificities of the analysis.
Generalisable: Different data + different analysis = qualitatively similar results and same conclusions. The findings can be generalised.
Why does it matter?
Five selfish reasons to work reproducibly (Markowetz 2015). Alternatively, see the very entertaining talk
Many high-profile cases of work that did not reproduce, e.g., the work of Anil Potti, unravelled by Baggerly and Coombes (2009)
Working reproducibly will become standard in science and publishing, e.g., OECD Global Science Forum, Building digital workforce capacity and skills for data-intensive science (OECD Global Science Forum 2020)
How to achieve reproducibility
Scripting
Organisation: Project-oriented workflows with file and folder structure, naming things
Documentation: Readme files, code comments, metadata, version control
Scripting
Rationale for scripting?
Science is the generation of ideas, designing work to test them and reporting the results.
We ensure laboratory and field work is replicable, robust and generalisable by planning and recording in lab books and using standard protocols. Repeating results is still hard.
Workflows for computational projects, and the data analysis and reporting of other work can, and should, be 100% reproducible!
Scripting is the way to achieve this.
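As a minimal sketch of what scripting gives you, the Python function below reads a raw data file, summarises it, and writes the result, so the whole analysis can be rerun unchanged. The column names (`donor`, `expression`) and the function name `summarise` are hypothetical, chosen only for illustration.

```python
import csv
import statistics
from collections import defaultdict
from pathlib import Path


def summarise(raw_file: Path, out_file: Path) -> dict:
    """Compute the mean expression per donor and write it to out_file."""
    values = defaultdict(list)
    with raw_file.open(newline="") as handle:
        for row in csv.DictReader(handle):
            # Hypothetical column names: 'donor' and 'expression'.
            values[row["donor"]].append(float(row["expression"]))

    # Sort the donors so the output file is identical on every run.
    means = {donor: statistics.mean(vals) for donor, vals in sorted(values.items())}

    out_file.parent.mkdir(parents=True, exist_ok=True)
    with out_file.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["donor", "mean_expression"])
        for donor, mean in means.items():
            writer.writerow([donor, mean])
    return means
```

Calling the function twice on the same raw file should produce byte-identical output files, which is the "same data + same analysis = identical results" property in practice.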
Organisation
Project-oriented workflow
Use folders to organise your work
Aim for a structure that is systematic and repeatable
Inputs and outputs should be clearly identifiable from structure and/or naming
Examples
-- liver_transcriptome/
   |__data
      |__raw/
      |__processed/
   |__images/
   |__code/
   |__reports/
   |__figures/
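A folder skeleton like this can itself be created by a script, so every project starts the same way. The following is a minimal sketch: the folder names come from the example tree, the nesting of `raw/` and `processed/` under `data/` is an assumption, and the function name `create_project` is hypothetical.

```python
from pathlib import Path

# Folder names from the example tree; the nesting under data/ is assumed.
FOLDERS = ["data/raw", "data/processed", "images", "code", "reports", "figures"]


def create_project(root: Path) -> list[Path]:
    """Create the project folders under root and return their paths."""
    created = []
    for folder in FOLDERS:
        path = root / folder
        # exist_ok=True makes the function safe to rerun on an existing project.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

For example, `create_project(Path("liver_transcriptome"))` would build the skeleton in the current working directory.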
Naming things
Guiding principle - Have a convention! Good file names:
are machine readable
are human readable
play nicely with sorting
I suggest
no spaces in names
use snake_case or kebab-case rather than CamelCase or dot.case
use all lower case except very occasionally where convention is otherwise, e.g., README, LICENSE
ordering: use left-padded numbers, e.g., 01, 02, …, 99 or 001, 002, …, 999
dates: use ISO 8601 format, e.g., 2020-10-16
write down your conventions
-- liver_transcriptome/
   |__data
      |__raw/
         |__2022-03-21_donor_1.csv
         |__2022-03-21_donor_2.csv
         |__2022-03-21_donor_3.csv
         |__2022-05-14_donor_1.csv
         |__2022-05-14_donor_2.csv
         |__2022-05-14_donor_3.csv
      |__processed/
   |__images/
   |__code/
      |__functions/
         |__summarise.R
         |__normalise.R
         |__theme_volcano.R
      |__01_data_processing.py
      |__02_exploratory.R
      |__03_modelling.R
      |__04_figures.R
   |__reports/
      |__01_report.qmd
      |__02_supplementary.qmd
   |__figures/
      |__01_volcano_donor_1_vs_donor_2.eps
      |__02_volcano_donor_1_vs_donor_3.eps
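One way to stick to a naming convention is to generate file names with small helper functions rather than typing them by hand. The sketch below follows the suggestions in this section (lower case, snake_case, ISO 8601 dates, left-padded numbers); the function names are hypothetical.

```python
import re
from datetime import date


def to_snake_case(text: str) -> str:
    """Lower-case text and replace runs of non-alphanumerics with underscores."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def dated_name(day: date, description: str, ext: str) -> str:
    """Build a name that starts with an ISO 8601 date."""
    return f"{day.isoformat()}_{to_snake_case(description)}.{ext}"


def numbered_name(index: int, description: str, ext: str, width: int = 2) -> str:
    """Build a name that starts with a left-padded number."""
    return f"{index:0{width}d}_{to_snake_case(description)}.{ext}"
```

For instance, `dated_name(date(2022, 3, 21), "Donor 1", "csv")` gives `2022-03-21_donor_1.csv` and `numbered_name(1, "data processing", "py")` gives `01_data_processing.py`. Names built this way sort correctly as plain text, because ISO dates and padded numbers order the same way alphabetically as they do chronologically or numerically.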
Documentation
Readme files
READMEs are a form of documentation which have been widely used for a long time. They contain all the information about the other files in a directory. They can be extensive but need not be. Concise is good. Bullet points are good.
Give a brief project title and description
Give the start date, last updated date and contact information
Outline the folder structure
Give software requirements: programs and versions used or required. There are packages that give session information in R (Wickham et al. 2021) and Python (Ostblom 2019)
R:
sessioninfo::session_info()
Python:
import session_info
session_info.show()
Instructions for how to run the code, build reports, and reproduce the figures, etc.
Where to find the data and outputs
Any other information that is needed to understand and recreate the work
Ideally, a summary of changes with the date
-- liver_transcriptome/
   |__data
      |__raw/
         |__2022-03-21_donor_1.csv
         |__2022-03-21_donor_2.csv
         |__2022-03-21_donor_3.csv
         |__2022-05-14_donor_1.csv
         |__2022-05-14_donor_2.csv
         |__2022-05-14_donor_3.csv
      |__processed/
   |__images/
   |__code/
      |__functions/
         |__summarise.R
         |__normalise.R
         |__theme_volcano.R
      |__01_data_processing.py
      |__02_exploratory.R
      |__03_modelling.R
      |__04_figures.R
   |__README.md
   |__reports/
      |__01_report.qmd
      |__02_supplementary.qmd
   |__figures/
      |__01_volcano_donor_1_vs_donor_2.eps
      |__02_volcano_donor_1_vs_donor_3.eps
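A README skeleton with the sections listed above can be generated by a short script and then filled in by hand. This is only a sketch: the function name `readme_skeleton`, the section headings and the example arguments are assumptions, and only the Python version is recorded automatically.

```python
import platform
from datetime import date


def readme_skeleton(title: str, description: str, contact: str) -> str:
    """Return README text with placeholder sections to fill in by hand."""
    today = date.today().isoformat()  # ISO 8601, e.g. 2020-10-16
    lines = [
        f"# {title}",
        "",
        description,
        "",
        f"- Start date: {today}",
        f"- Last updated: {today}",
        f"- Contact: {contact}",
        "",
        "## Folder structure",
        "",
        "## Software requirements",
        f"- Python {platform.python_version()}",
        "",
        "## How to run the code and build the reports",
        "",
        "## Where to find the data and outputs",
        "",
        "## Summary of changes",
        f"- {today}: project created",
    ]
    return "\n".join(lines) + "\n"
```

The returned text could be written to the project root with, for example, `pathlib.Path("README.md").write_text(...)`.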
Code comments
- Comments are notes in the code which are not executed. They are ignored by the computer but are read by humans. They are used to explain what the code is doing and why. They are also used to temporarily remove code from execution.
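The short Python example below illustrates both uses of comments: explaining why the code does something, and temporarily removing a line from execution. The function `log2_fold_change` and its pseudocount are hypothetical, used only to have something to comment on.

```python
import math


def log2_fold_change(treated: float, control: float) -> float:
    """Return the log2 ratio of treated to control expression."""
    # Why: a pseudocount of 1 avoids taking the log of zero when a
    # gene has no reads in one of the samples.
    pseudocount = 1.0
    return math.log2((treated + pseudocount) / (control + pseudocount))


# The next line is commented out, so it is not executed:
# print(log2_fold_change(7, 1))
```

The comment inside the function explains the reason for the pseudocount, which the code alone cannot tell the reader.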
GitHub Copilot demo
Quarto demo
Useful exercises
Want GitHub Copilot?
🎬 Create a GitHub account
🎬 Apply for student benefits
Update R and RStudio
🎬 Update R
🎬 Update RStudio. You will need the pre-release Desert Sunflower for GitHub Copilot integration
Install package building tools
🎬 Windows: install Rtools
🎬 Mac: install Xcode from the Mac App Store
Update packages:
🎬 devtools, tidyverse, BiocManager, readxl
Install Quarto
Install Zotero
🎬 Install Zotero
You’re finished!
🥳 Well Done! 🎉
Independent study following the workshop
Pages made with R (R Core Team 2024), Quarto (Allaire et al. 2024), knitr (Xie 2024), kableExtra (Zhu 2021)