52M Data Analysis in R
Introduction
This module introduces you to data analysis in R. The first 4 weeks cover core concepts about scientific computing, types of variable, the role of variables in analysis and how to use RStudio to organise analysis and import, summarise and plot data. In weeks 5 to 9, you will learn about the logic of hypothesis testing, confidence intervals, what is meant by a statistical model, two-sample tests and one-way analysis of variance (ANOVA). In week 10 we will introduce the format of the assessment, and in week 11 there will be a dedicated drop-in session to consolidate learning and ask any further questions about the module or assessment.
This module complements the work you will do in BIO00070M Research, Professional and Team Skills, where you will learn how to organise reproducible data analyses using a project-oriented workflow and analyse RNA sequence data. It will be important to take the skills and tools you learn in 52M and apply them in 70M.
Module Learning Objectives
The module learning objectives are:
- Explain the purpose of data analysis and the rationale for scripting analysis in the biosciences 
- Recognise when statistics such as t-tests, one-way ANOVA, correlation and regression can be applied, and use R to perform these analyses on data in a variety of formats 
- Summarise data in single or multiple groups, recognise tidy data formats, and carry out some typical data tidying tasks 
- Understand what parts of your results are important to report 
How 52M is organised
A key feature of 52M is that you really do learn as you go along and you should not need to revise very much. To support this learning, every week is structured in the same way with contact time and well-guided independent study to prepare you for the contact time and consolidate what you have learned.
Each week has:
- An overview on the “About” page which gives the Learning Objectives, a topic summary and the instructions for the week. You should read this first. 
- Some independent study on the “Prepare!” page to prepare you for the workshop. This will be reading from the course book Computational Analysis for Bioscientists (Rand 2023), watching a video, or doing some coding or set up. It is designed to take about 30-45 mins on average. You will most likely learn best if you can find people to study with. 
- A two-hour workshop using R. This will usually start with me doing a short demonstration of one or more of the examples that were in “Prepare!” but you will spend most of the session going through some exercises. Anything you have not done before is explained and guided but you will also have to use the skills gained in previous workshops. I often remind you to take care of future you by making notes so you can look up your previous work but you can also search the Data Analysis in R site (search is top right). Talking to other people in the workshop about the exercises and working together will really help you understand more. There will be plenty of help from me and my demonstrators. 
- Some independent study on the “Consolidate!” page to give you more practice. The exercises are usually similar to those in the workshop but with less guidance. Occasionally, there will be reading to do. It is designed to take about 30-45 mins on average but may be quicker if you understood the workshop very well or slower if you need to revisit the workshop. 
Learning Data Analysis in R is like learning to speak a new language or play an instrument or a technical sport - you can’t really rush it or cram for it. You need regular practice.
- a little bit of engagement and practice is always better than none 
- if you get behind, just pick up where you left off rather than jumping in at the current week. It is fine to work on a previous week’s workshop 
Content
Week 1: Understanding file systems
This week you will carry out some independent study to ensure you have some understanding of computer file systems. We will introduce you to the concepts of paths and working directories.
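To make paths and working directories a little more concrete, here is a minimal sketch in base R. The folder and file names ("data-raw", "beetles.csv") are made up for illustration and are not part of the module materials.

```r
# The working directory is where R looks for files by default
wd <- getwd()
print(wd)

# A relative path is given relative to the working directory.
# "data-raw" and "beetles.csv" are hypothetical names.
rel_path <- file.path("data-raw", "beetles.csv")
print(rel_path)

# An absolute path is the working directory plus the relative path
abs_path <- file.path(wd, rel_path)
print(abs_path)

# Does a file exist at that location? (FALSE unless such a file exists)
file.exists(rel_path)
```

Building paths with `file.path()` rather than typing separators by hand means the same code works on Windows, macOS and Linux.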
Week 2: Introduction to R and project organisation
This week you will start writing R code in RStudio and will create your first graph! You will learn about data types such as “numerics” and “characters” and some of the different types of objects in R such as “vectors” and “dataframes”. These are the building blocks for the rest of your R journey. You will also learn a workflow and about the layout of RStudio and using RStudio Projects.
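As a rough illustration of the data types and objects mentioned above, the sketch below builds two vectors and combines them into a dataframe. The variable names and values are invented for this example.

```r
# A numeric vector and a character vector (invented values)
mass <- c(2.1, 3.4, 2.8, 3.9)
species <- c("A", "A", "B", "B")

class(mass)     # "numeric"
class(species)  # "character"

# Vectors of equal length can be combined into a dataframe,
# with one column per vector and one row per observation
birds <- data.frame(species = species, mass = mass)
class(birds)    # "data.frame"

# Look at the structure: column names, types and first values
str(birds)
```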
Week 3: Types of variable, summarising and plotting data
This week you will learn about the different types of variable and how to summarise and plot data sets with one variable.
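For the summarising and plotting named in this week's title, a minimal sketch might look like the following. It assumes base R functions and invented data; the workshop itself may use different packages and data sets.

```r
# Invented measurements for illustration
height <- c(12.1, 13.4, 11.8, 14.2, 12.9, 13.1)            # continuous variable
colour <- c("red", "blue", "red", "red", "blue", "green")  # categorical variable

# Continuous variable: summarise with a measure of centre and of spread
mean_height <- mean(height)
sd_height <- sd(height)

# Categorical variable: summarise with the count in each category
colour_counts <- table(colour)
print(colour_counts)

# The type of variable guides the type of plot
hist(height)            # histogram for a continuous variable
barplot(colour_counts)  # bar chart of counts for a categorical variable
```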
Week 4: Summarising data with several variables
Last week you summarised and plotted data sets with one variable. This week you will start plotting data sets with more than one variable. This means you need to be able to determine which variable is the response and which is the explanatory. You will find out what is meant by “tidy” data and how to perform a simple data tidying task. Finally you will discover how to save your figures to file.
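One possible sketch of a simple tidying task, followed by saving a figure, is given below. It uses base R (`stack()` and `pdf()`) and invented data as assumptions; the module may well use other tools for the same job.

```r
# "Wide" data: one column per treatment group (invented values)
wide <- data.frame(
  control = c(5.1, 4.8, 5.6),
  treated = c(6.3, 6.9, 6.1)
)

# Tidy format: one row per observation, one column per variable
tidy <- stack(wide)                       # gives columns "values" and "ind"
names(tidy) <- c("growth", "treatment")   # response, then explanatory
print(tidy)

# Save a figure to file; a temporary folder is used here for illustration
fig_file <- file.path(tempdir(), "growth-boxplot.pdf")
pdf(fig_file)
boxplot(growth ~ treatment, data = tidy)  # response ~ explanatory
dev.off()
```

In the tidy version, `growth` is the response and `treatment` is the explanatory variable, which is what the formula `growth ~ treatment` expresses.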
Week 5: The logic of hypothesis testing and CI
This week we will consider the logic of hypothesis testing and type 1 and type 2 errors. We will also find out what the sampling distribution of the mean and the standard error are, and how to calculate confidence intervals.
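As a hedged sketch of the standard error and a 95% confidence interval for a mean, the following uses invented data and the t distribution. The variable names are chosen for this example only.

```r
# Invented sample
x <- c(4.2, 5.1, 4.8, 5.5, 4.9, 5.3, 4.6, 5.0)

n <- length(x)
m <- mean(x)

# Standard error: standard deviation of the sampling distribution of the mean
se <- sd(x) / sqrt(n)

# Critical t value for a 95% interval with n - 1 degrees of freedom
t_crit <- qt(0.975, df = n - 1)

# Confidence interval: mean plus or minus t times the standard error
ci <- c(lower = m - t_crit * se, upper = m + t_crit * se)
print(ci)

# The built-in one-sample t-test reports the same interval
t.test(x)$conf.int
```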
Week 6: Introduction to statistical models: Single regression
This week, you’ll learn about statistical models, which are mathematical representations of data relationships. Specifically, you’ll explore the general linear model (GLM), a broad framework for analysing data patterns.
Your first GLM will be simple linear regression, which fits a straight line to data to predict a response variable (outcome) based on an explanatory variable (predictor). We’ll examine the two key parameters estimated in this model: the slope (how much the outcome changes for each unit increase in the predictor) and the intercept (the value of the outcome when the predictor is zero). We’ll also assess whether these values are significantly different from zero.
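A minimal sketch of fitting a simple linear regression with `lm()` is shown below. The data and the names `conc` and `resp` are invented for illustration.

```r
# Invented data: a continuous predictor and a continuous response
dat <- data.frame(
  conc = c(0, 1, 2, 3, 4, 5),
  resp = c(1.1, 2.9, 5.2, 6.8, 9.1, 11.0)
)

# Fit the model: response ~ explanatory
mod <- lm(resp ~ conc, data = dat)

# The two estimated parameters: "(Intercept)" and the slope, named "conc"
params <- coef(mod)
print(params)

# The summary includes a test of whether each parameter differs from zero
summary(mod)
```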
Week 7: Two-sample tests
This week you will learn how to use and interpret the general linear model when the explanatory (or x) variable is categorical with two possible values. These tests are also known as t-tests. Just as with single linear regression, the response variable is continuous, and the model puts a line of best fit through the data and has two parameters called the intercept and the slope. These have the same interpretation as they do in linear regression. The intercept is one of the group means, and the slope is the difference between that mean and the other group mean. You will also learn about the non-parametric tests we use when the assumptions of the general linear model are not met.
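The link between the two-sample t-test and the general linear model can be sketched as follows, using invented data. The Wilcoxon rank-sum test is included here as one example of a non-parametric alternative; that choice is an assumption, not a statement of what the workshop covers.

```r
# Invented data: two groups of five observations
dat <- data.frame(
  group = factor(rep(c("A", "B"), each = 5)),
  y = c(10.2, 9.8, 11.1, 10.5, 9.9, 12.3, 12.9, 11.8, 13.1, 12.5)
)

# General linear model with a two-level categorical explanatory variable
mod <- lm(y ~ group, data = dat)
params <- coef(mod)
print(params)

# Intercept = mean of group A; "groupB" = mean of B minus mean of A
group_means <- tapply(dat$y, dat$group, mean)
print(group_means)

# The equal-variance two-sample t-test gives the same p-value as the slope test
p_lm <- summary(mod)$coefficients["groupB", "Pr(>|t|)"]
p_t <- t.test(y ~ group, data = dat, var.equal = TRUE)$p.value

# A non-parametric alternative for two independent groups
wilcox.test(y ~ group, data = dat)
```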
Week 8: One-way ANOVA and Kruskal-Wallis
Last week you learnt how to use and interpret the general linear model when the x variable was categorical with two groups. You will now extend that to situations when there are more than two groups. This is often known as the one-way ANOVA (analysis of variance). You will also learn about the Kruskal-Wallis test which can be used when the assumptions of the general linear model are not met.
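Extending the model to three groups might look like the sketch below. The data and the names `diet` and `mass` are invented for this example.

```r
# Invented data: three diets, four observations each
dat <- data.frame(
  diet = factor(rep(c("a", "b", "c"), each = 4)),
  mass = c(20.1, 21.3, 19.8, 20.7,
           23.4, 22.9, 24.1, 23.0,
           26.2, 25.8, 27.0, 26.5)
)

# General linear model with a three-level categorical explanatory variable
mod <- lm(mass ~ diet, data = dat)

# One-way ANOVA table: is there variation among the group means?
aov_table <- anova(mod)
print(aov_table)

# Kruskal-Wallis test: a non-parametric alternative
kw <- kruskal.test(mass ~ diet, data = dat)
print(kw)
```

With three groups the model has three parameters: the intercept (the mean of the first group) and two differences from that mean.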
Week 9: Two-way ANOVA
This week you will learn how to use and interpret a two-way ANOVA. Last week you learnt how to interpret a general linear model with one categorical explanatory variable and this week we will extend this to include two categorical explanatory variables. It allows us to design experiments which test whether a response is influenced by two variables and whether those variables act independently or interact.
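A two-way design with an interaction can be sketched like this. The data and the names `temp`, `food` and `growth` are invented for illustration only.

```r
# Invented data: two explanatory variables, three replicates per combination
dat <- data.frame(
  temp = factor(rep(c("low", "high"), each = 6)),
  food = factor(rep(rep(c("poor", "rich"), each = 3), times = 2)),
  growth = c(3.1, 2.8, 3.4, 4.9, 5.2, 4.7,
             4.0, 4.3, 3.8, 8.1, 7.6, 8.4)
)

# temp * food fits both main effects and their interaction
mod <- lm(growth ~ temp * food, data = dat)

# Two-way ANOVA table with rows for temp, food and temp:food
aov_table <- anova(mod)
print(aov_table)
```

The `temp:food` row addresses whether the two variables interact, that is, whether the effect of one depends on the level of the other.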
Week 10: Introduction to the assessment
This week, we will introduce you to the assessment for this module. We will look at a specimen assessment using techniques you have already learned and apply them to analysing a dataset. Your assessment will use a different dataset but you will apply the same principles. We will be covering what your assessment submission should contain using this example.
Week 11: Drop-in
This session contains no set material. However, we will cover topics that people have had difficulty with during the course and any material from the workshops you may still be struggling with. This will be our last timetabled session before the assessment release, so it is a good opportunity to ask any outstanding questions. No questions are silly questions!