1 About this book – Computational Analysis for Bioscientists

1.1 Who is this book for?

This book is primarily intended to support Bioscience students at the University of York. The ultimate aim is to support the full spectrum of computational skills that a bioscience undergraduate or postgraduate at York - and elsewhere - might need. This is a live book with ideas being added and material drafted – and hopefully improved – over time. Each page is labelled with its status as one of the following:

Incomplete

You are reading a live document. This page is a dumping ground for one or more ideas. Sections may be missing or in bullet form, and code may not be explained.

Draft

You are reading a live document. This page is a draft but is mainly complete and should be readable.

Complete

You are reading a live document. This page is complete but suggestions for improvements are welcome. Follow the link to ‘Make a suggestion’ to suggest improvements.

The content included so far is described in the Overview of contents section below.

It is being written in the open so that it can be used by anyone who finds it useful. It is also being written in the open so that anyone can contribute to it.

1.2 Approach of this book

Explanations are followed by worked examples.

1.3 Overview of contents

The book is organised into sections:

Section 1 What they forgot to teach you about computers

This chapter teaches the computer skills that you might have missed if you have mainly used mobile devices. I focus on the knowledge gaps that often appear when people are learning computational data analysis. These are primarily to do with finding and organising files and folders in the file system.

Section 2 Getting started with data analysis in R

This section covers the first steps into analysing data with R. The first chapter covers important concepts about data: whether they are discrete or continuous and how we summarise them using descriptive statistics. These ideas are not specific to analysing data with R; they apply whatever analysis tool you use. The second chapter introduces you to R and RStudio. It explains what R, RStudio and R packages are and where to find them, and then explores the layout and appearance of RStudio. In the third chapter you will start coding and learn the most common data types and data structures used for data analysis. The fourth chapter describes some useful workflow patterns and tools for organising your work in RStudio. Using these will make learning R easier. Finally, we will go through a complete workflow from importing data from a file to saving a figure for reporting.

Section 3 Statistical Analysis in R - Part 1

This section is a first course in statistical inference, which is the process of inferring the characteristics of populations from samples using data analysis. In this first course we take what is called a frequentist - or classical - approach to statistical inference. This is the approach that is most commonly taught in introductory statistics courses. We will learn about the logic of hypothesis testing and confidence intervals. You will also get an introduction to statistical models: what a statistical model is and, in particular, what a linear model is.

Section 4 Statistical Analysis in R - Part 2

Section 5 Working with Gel data in R

1.4 Conventions used in the book

I use some conventions, most of which I hope are intuitive. I have tried to articulate them here. If you recognise conventions I have used that are not listed here, please let me know.

Code and any output appears in blocks formatted like this:

# import the chaff data
chaff <- read_table("data-raw/chaff.txt")
glimpse(chaff)
## Rows: 40
## Columns: 2
## $ subspecies <chr> "coelebs", "coelebs", "coelebs", "coelebs", "coelebs", "coe…
## $ mass       <dbl> 18.3, 22.1, 22.4, 18.5, 22.2, 19.3, 17.8, 20.2, 22.1, 16.6,…

Lines of output start with ## to distinguish them from code comments, which begin with a single #. You will learn more about comments in the Using Scripts section in First Steps in RStudio.
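As a small illustration of this convention, the sketch below uses only base R and the first three mass values visible in the output above. The object name mass_subset is made up for this example and is not part of the book's data.

```r
# store the first three masses from the glimpse() output above
mass_subset <- c(18.3, 22.1, 22.4)
# calculate their mean
mean(mass_subset)
## [1] 20.93333
```

The lines beginning with a single # are comments written by a person; the line beginning with ## is what R prints when the code is run.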

Within the text:

packages are indicated in bold code font like this: ggplot2
functions are indicated in code font with brackets after their name like this: ggplot()
R objects are indicated in code font like this: stag

The content of a code block can be copied using the icon in its top right corner.

I use packages from the tidyverse (Wickham et al. 2019) including ggplot2 (Wickham 2016), dplyr (Wickham et al. 2023), tidyr (Wickham, Vaughan, and Girlich 2024) and readr (Wickham, Hester, and Bryan 2024) throughout the book. All the code assumes you have loaded the core tidyverse packages with:

library(tidyverse)

If you run examples and get an error like this:

# Error in read_table("data-raw/stag.txt") : 
#  could not find function "read_table"

It is likely you need to load the tidyverse as shown above.

All other packages will be loaded explicitly with library() statements where needed.
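If you are unsure whether a package is installed before you try to load it, a check along these lines may help. This is a minimal sketch rather than something the book requires: is_installed() is a hypothetical helper name, built on base R's requireNamespace().

```r
# Hypothetical helper (not from the book): returns TRUE if a package
# is installed and can be loaded, FALSE otherwise.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Load the tidyverse only if it is available; otherwise print a hint.
if (is_installed("tidyverse")) {
  library(tidyverse)
} else {
  message("tidyverse is not installed; try install.packages(\"tidyverse\")")
}
```

The same helper can be used with any other package name, for example is_installed("patchwork").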

When you see “🎬 Your turn!”, it indicates that you might want to code along with examples or that there is an opportunity to check your understanding by answering a question. Questions are answered in words or with a piece of code. The answers are given in collapsed sections so you can try to answer them before checking the answer. For example, a question answered in words looks like this:

🎬 Your turn! Use the file system above to answer these questions.

What is the absolute path for the document doc4.txt on a Mac computer?

📖

/home/user1/docs/data/doc4.txt

And a question answered with a piece of code looks like this:

🎬 Your turn! Assign the value of 4 to a variable called y:

Code

y <- 4

1.5 Annotating this book

This page has annotation with Hypothesis enabled. Hypothesis allows you to annotate this book with your own private notes or make notes shared with friends. You need to create a free personal account. You can make annotations that are public, private only to you or shared with a private group. Please follow the code of conduct in your annotations.

1.6 Code of Conduct

We are dedicated to providing a welcoming and supportive learning environment for all readers, regardless of background or identity. As such, we do not tolerate comments that are disrespectful to fellow learners or that exclude, intimidate, or cause discomfort to others. The following bullet points set out explicitly what we hope you will consider to be appropriate community guidelines:

Be respectful of different viewpoints and experiences. Do not use homophobic, racist, transphobic, ageist, ableist, sexist, or otherwise exclusionary language.
Use welcoming and inclusive language. Do not address others in an angry, intimidating, or demeaning manner. Be considerate of the ways the words you choose may impact others. Be patient and respectful of the fact that English is a second (or third or fourth!) language for many.
Respect the privacy and safety of others. Do not share their information without their express permission.
As an overriding general rule, please be intentional in your actions and humble in your mistakes.

1.7 Contributing

This book is being written in the open so that anyone can contribute to it. If you find a mistake or have a suggestion for improvement, you can create an issue.

1.8 License

This work is licensed under CC BY-NC 4.0. This license requires that reusers give credit to the creator. It allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, for noncommercial purposes only.

1.9 Please cite as

Please cite this book as:

Rand, E. (2025). Computational Analysis for Bioscientists (Version 0.2) https://3mmarand.github.io/comp4biosci/

1.10 Credits

This book is written with R (R Core Team 2025), Quarto (Allaire et al. 2022), knitr (Xie 2025, 2015, 2014) and kableExtra (Zhu 2024). My R session information is shown below:

sessionInfo()
## R version 4.5.0 (2025-04-11 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8 
## [2] LC_CTYPE=English_United Kingdom.utf8   
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.utf8    
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] patchwork_1.3.0 lubridate_1.9.4 forcats_1.0.0   stringr_1.5.1  
##  [5] dplyr_1.1.4     purrr_1.0.4     readr_2.1.5     tidyr_1.3.1    
##  [9] tibble_3.3.0    ggplot2_3.5.2   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       jsonlite_2.0.0     crayon_1.5.3       compiler_4.5.0    
##  [5] renv_0.17.0        tidyselect_1.2.1   scales_1.4.0       yaml_2.3.10       
##  [9] fastmap_1.2.0      R6_2.6.1           generics_0.1.4     knitr_1.50        
## [13] pillar_1.10.2      RColorBrewer_1.1-3 tzdb_0.5.0         rlang_1.1.6       
## [17] stringi_1.8.7      xfun_0.52          timechange_0.3.0   cli_3.6.5         
## [21] withr_3.0.2        magrittr_2.0.3     digest_0.6.37      grid_4.5.0        
## [25] rstudioapi_0.17.1  hms_1.1.3          lifecycle_1.0.4    vctrs_0.6.5       
## [29] evaluate_1.0.3     glue_1.8.0         farver_2.1.2       rmarkdown_2.29    
## [33] tools_4.5.0        pkgconfig_2.0.3    htmltools_0.5.8.1