ggplot(plant, aes(x = day, y = mass)) +
geom_point()
Workshop
Introduction to statistical models: Single linear regression
Introduction
Session overview
In this session you will carry out, interpret and report a single linear regression.
Philosophy
Workshops are not a test. It is expected that you often don’t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips
- don’t worry about making mistakes
- don’t let what you can not do interfere with what you can do
- discussing code with your neighbours will help
- look things up in the independent study material
- look things up in your own code from earlier
- there are no stupid questions
These four symbols are used at the beginning of each instruction so you know where to carry out the instruction.
Something you need to do on your computer. It may be opening programs or documents or locating a file.
Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.
Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.
A question for you to think about and answer. Record your answers in your script for future reference.
Getting started
Start RStudio from the Start menu.
Make an RStudio project for this workshop by clicking on the drop-down menu on top right where it says
Project: (None)
and choosing New Project, then New Directory, then New Project. Navigate to the data-analysis-in-r-2
folder and name the RStudio Project week-2
.
Make new folders called
data-raw
and figures
. You can do this on the Files Pane by clicking New Folder and typing into the box that appears.
Make a new script then save it with a name like
single-linear-regression.R
to carry out the rest of the work.
Add a comment to the script:
# Introduction to statistical models: Single linear-regression
and load the tidyverse
(Wickham et al. 2019) package
Exercises
Linear Regression
The data in plant.xlsx is a set of observations of plant growth over two months. The researchers planted the seeds and harvested, dried and weighed a plant each day from day 10 so all the data points are independent of each other.
Save a copy of plant.xlsx to your
data-raw
folder and import it.
What type of variables do you have? Which is the response and which is the explanatory? What is the null hypothesis?
Exploring
Do a quick plot of the data:
What are the assumptions of linear regression? Do these seem to be met?
Applying, interpreting and reporting
We now carry out a regression assigning the result of the
lm()
procedure to a variable and examining it with summary()
.
Call:
lm(formula = mass ~ day, data = plant)
Residuals:
Min 1Q Median 3Q Max
-32.810 -11.253 -0.408 9.075 48.869
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.6834 6.4729 -1.342 0.186
day 1.6026 0.1705 9.401 1.5e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 17.92 on 49 degrees of freedom
Multiple R-squared: 0.6433, Adjusted R-squared: 0.636
F-statistic: 88.37 on 1 and 49 DF, p-value: 1.503e-12
The Estimates in the Coefficients table give the intercept (first line) and the slope (second line) of the best fitting straight line. The p-values on the same line are tests of whether that coefficient is different from zero.
The F value and p-value in the last line are a test of whether the model as a whole explains a significant amount of variation in the dependent variable. For a single linear regression this is exactly equivalent to the test of the slope against zero.
What is the equation of the line? What do you conclude from the analysis?
Does the line go through (0,0)?
What percentage of variation is explained by the line?
It might be useful to assign the slope and the intercept to variables in case we need them later. The can be accessed in the mod$coefficients
variable:
mod$coefficients
(Intercept) day
-8.683379 1.602606
Assign
mod$coefficients[1]
to b0
and mod$coefficients[1]
to b1
:
I also rounded the values to two decimal places.
Checking assumptions
We need to examine the residuals. Very conveniently, the object which is created by lm()
contains a variable called $residuals
. Also conveniently, the R’s plot()
function can used on the output objects of lm()
. The assumptions of the GLM demand that each y is drawn from a normal distribution for each x and these normal distributions have the same variance. Therefore, we plot the residuals against the fitted values to see if the variance is the same for all the values of x. The fitted - or predicted - values are the values on the line of best fit. Each residual is the difference between the fitted values and the observed value.
Plot the model residuals against the fitted values like this:
plot(mod, which = 1)
What to you conclude?
To examine normality of the model residuals we can plot them as a histogram and do a normality test on them.
Plot a histogram of the residuals:
Use the
shapiro.test()
to test the normality of the model residuals
shapiro.test(mod$residuals)
Shapiro-Wilk normality test
data: mod$residuals
W = 0.96377, p-value = 0.1208
Usually, when we are doing statistical tests we would like the the test to be significant because it means we have evidence of a biological effect. However, when doing normality tests we hope it will not be significant. A non-significant result means that there is no significant difference between the distribution of the residuals and a normal distribution and that indicates the assumptions are met.
What to you conclude?
Illustrating
We want a figure with the points and the statistical model, i.e., the best fitting straight line.
Create a scatter plot using
geom_point()
The
geom_smooth()
function will add a variety of fitted lines to a plot. We want a straight line so we need to specify method = "lm"
:
ggplot(plant, aes(x = day, y = mass)) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE,
colour = "black") +
theme_classic()
What do the
se
and colour
arguments do? Try changing them.
Let’s add the equation of the line to the figure using
annotate()
:
ggplot(plant, aes(x = day, y = mass)) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE,
colour = "black") +
annotate("text", x = 20, y = 110,
label = "mass = 1.61 * day - 8.68") +
theme_classic()
We have to tell annotate()
what type of geom we want - text
in this case, - where to put it, and the text we want to appear.
Instead of hardcoding the equation, we can use the values of b0
and b1
that we assigned earlier. This is a neat trick that makes or work more reproducible. If the data were updated, we would not need to change the equation in the figure.
ggplot(plant, aes(x = day, y = mass)) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE,
colour = "black") +
annotate("text", x = 20, y = 110,
label = glue::glue("mass = { b1 } * day {if (b0<0) \"-\" else \"+\"} { abs(b0) }") ) +
theme_classic()
- The
glue::glue()
function allows us to use variables in the text - the variables go inside curly braces -
{ b1 }
and{ abs(b0) }
are the variables we assigned earlier and they will be replaced with their values in the text. - The
if (b0<0) \"-\" else \"+\"
part is a conditional that adds a minus sign if the intercept is negative and a plus sign if it is positive. Theabs(b0)
part makes sure we only show the absolute value of the intercept.
Improve the axes. You may need to refer back Changing the axes from the Week 7 workshop in BABS1 (Rand 2023)
Save your figure to your
figures
folder. Make sure you script figure saving with ggsave()
.
Linear Regression for estimating molecular weights from gels
In this section we will use a linear regression to estimate the molecular weights of proteins from a gel. Gels are run with a marker lane containing proteins of known molecular weights. The positions of these marker proteins are used to create a standard curve that allows us to estimate the molecular weights of other proteins in the gel.
This is a different use for linear regression than the previous example where we wanted to statistically test whether the an x-variable explained the variation in a y-variable.
Here we already know there is a very tight linear relationship between the position of the marker proteins on the gel and their molecular weights. We are using the linear regression to find the equation of the line that describes this relationship so we can use it to estimate the molecular weights of other proteins in the gel.
We will use this gel image (Figure 1) as an example: sds-page-gel.jpg. This gel has two sets of results (one each side) – since only 4 lanes were for each practical group, we put two groups on one gel.
Lane 1 (and 10) is the protein ladder which are the proteins of known molecular weights.
Lane 2 (and 9) is Uninduced E. coli lysate
Lane 3 (and 8) is Induced E. coli lysate
Lane 4 (and 7) is ShPatB
You don’t need worry about the details of the gel, only that our aim is to estimate the molecule weight of the ShPatB protein in lane 4 from its position on the gel and the standard curve created from the marker proteins in lane 1. Figure 1 illustrates the measurements needed from the gel image.
We will cover two options:
Where you have measured the length of the gel, the positions of ShPatB and the marker proteins on the gel manually – by opening the gel image in Powerpoint for example – and have a file containing the molecular weights and positions of the marker proteins. The measures can be in centimetres or pixels, it does not matter as long as they are all in the same units.
Where you have a file containing the molecular weights of the marker proteins and use R to measure the length of the gel, the positions of ShPatB and the marker proteins by importing the gel image. This method relies on the
locator()
which stores the coordinate positions when you click on a plot! Magic!
Make a new script then save it with a name like
mw-from-gel-regression.R
to carry out the rest of the work.
Option 1: Manual measurements
You have measured the positions of the marker proteins and added them to a file containing the molecular weights. Your file is: standard-mw-with-positions.csv. The molecular weights of the marker proteins are in kilodaltons (kDa) and the positions of the marker proteins are in pixels. The length of the gel is 810 pixels and ShPatB is at 394 pixels from the top of the gel.
Save standard-mw-with-positions.csv to
data-raw/
and import it.
mw_positions <- read_csv("data-raw/standard-mw-with-positions.csv")
Assign the position of ShPatB and the length of the gel to variables:
gel_length <- 810
pos_patB <- 394
We need to calculate \(R_f\) values for the marker proteins:
\[R_f = \frac{L - d}{L}\]
where \(L\) is the length of the gel and \(d\) is the distance to the band.
We also need to log the molecular weights of the marker proteins to make a linear relationship. We can add these two new variables to the data frame using the mutate()
.
Calculate Rf values for the marker proteins:
mw_positions <- mw_positions |>
mutate(Rf = (gel_length - dist_to_band) / gel_length,
log_kda = log(kda))
Plot the data with
geom_point()
and add a linear regression line with geom_smooth(method = "lm")
:
ggplot(mw_positions, aes(x = Rf, y = log_kda)) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE) +
theme_classic()
Fit a linear model so we have the equation of the line
mod <- lm(log_kda ~ Rf, data = mw_positions)
Print the model:
mod
Call:
lm(formula = log_kda ~ Rf, data = mw_positions)
Coefficients:
(Intercept) Rf
1.091 5.231
We only need to print the coefficients – we don’t care about the statistical tests here. You can tell from the plot that the relationship is very tight and linear. The equation of the line is: \(MW\)= 5.231 * \(R_f\) + 1.091
You can substitute the values of the coefficients and the \(R_f\) of ShpatB to find the log molecular weight of ShPatB. Or you can use the predict()
function to do this for you.
Calculate the \(R_f\) of ShPatB:
patB_Rf = (gel_length - pos_patB) / gel_length
Predict the molecular weight ShPatB:
patB_kda <- predict(mod, newdata = data.frame(Rf = patB_Rf)) |>
exp()
patB_kda
1
43.69469
Option 2: Measurements in R
You have a file containing the molecular weights of the marker proteins standard-mw.txt and the image of the gel sds-page-gel.jpg. The molecular weights of the marker proteins are in kilodaltons (kDa)
Make a folder call
data-image
and save standard-mw.txt and sds-page-gel.jpg to it
Imort the molecular weights of the marker proteins from standard-mw.txt
mw <- read_csv("data-image/standard-mw.txt")
Load the
imager
package
Import the gel image:
gel <- load.image("data-image/sds-page-gel.jpg")
Base R’s generic plot()
function can handle image files and plot axes default. These are in pixels and will help us mark the top and bottom of the gel.
Plot the gel image:
plot(gel)
The imager
package has a function that will crop the edges of the image. This is certainly not essential but can have two benefits:
- the image is smaller which means plotting is quicker – this is especially useful when your images are many pixels
- it makes it a little easier determine where the axes numbers are on the gel
Crop the image:
# crop
gel_cropped <- crop.borders(gel, nx = 300, ny = 150)
crop.borders()
removes ny
from the top and the bottom and nx
from each side. You may wish to adjust the numbers.
Plot the cropped image:
plot(gel_cropped)
To make sure we measure distances from the same place we need to add lines to mark the top and the bottom of the gel. Notice that the y-axis is 0 at the top. I think the top is at about 180 and the bottom is about 990. We will assign these values to variables because they will be needed in calculations later. We will also need the gel length (bottom position - top position).
Assign values for the top and bottom of the gel to variables and calculate the length of the gel:
gel_top <- 180
gel_bottom <- 990
gel_length <- gel_bottom - gel_top
Plot the cropped gel with lines marking the top and bottom of the gel:
Make sure you run all theses commands. The base plotting system works a little differently that ggplot
. We do not use +
but we have to make sure we have recently run the plot()
command before the abline()
(and other functions that modify plots) will work. You will get Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...) :
plot.new has not been called yet
At this point you have check that you are happy with the numbers used for the top and bottom of the gel. Adjust and replot if needed.
I like to add a vertical line in the marker lane to help guide my later measures.
Add a vertical line in the marker lane. Again, make sure you run all of the plotting code:
We are now ready to measure the band positions using the locator()
function. We need to measure the position of shPatB and the all the marker proteins. We will start with shPatB:
Run the
locator()
command and click on the shPatB in lane 4:
dist_to_patB <- locator(n = 1)
The (x,y) position of shPatB is now stored in an R object called pos_patB
.
The distance between the shPatB band and the top of the gel will be the y0value in dist_to_patB
minus the distance to the top of the gel.
Calculate the distance from the top of the gel to shPatB:
pos_patB <- dist_to_patB$y - gel_top
We have 9 bands so pass the argument n = 9
to the function. This means you will need to click on the graph 9 times. Start at the top – the heaviest band – and work your way down. You only need to click once on each band. The R cursor will disappear until you have clicked 9 times. Don’t worry if you make a mistake, just click until the cursor is returned and run the locator command again to start afresh.
Run the
locator()
command and click on the 9 bands in order from top to bottom:
# Number of bands in your marker lane
# click from the top or gel to the bottom
# i.e., high MW to low
marker_positions <- locator(n = 9)
Magic! The (x,y) position of each band is now stored in an R object called marker_positions
.
We need to
- combine the y positions with molecular weights in
mw
- calculate the distance to each band by substracting
gel_top
- calculate \(R_f\) values for the marker proteins using \(R_f = \frac{L - d}{L}\) where \(L\) is the length of the gel and \(d\) is the distance to the band.
- log the molecular weights of the marker proteins to make a linear relationship.
We can put all these new columns in a dataframe called mw_positions
frame using the mutate()
.
Create
mw_positions
from mw
and the y positions in marker_positions
:
mw_positions <- mw |>
mutate(y = marker_positions$y,
dist_to_band = y - gel_top,
Rf = (gel_length - dist_to_band) / gel_length,
log_kda = log(kda))
The process is now exactly the same as it was for Option 1.
Plot the data with
geom_point()
and add a linear regression line with geom_smooth(method = "lm")
:
ggplot(mw_positions, aes(x = Rf, y = log_kda)) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE) +
theme_classic()
Fit a linear model so we have the equation of the line
mod <- lm(log_kda ~ Rf, data = mw_positions)
Print the model:
mod
Call:
lm(formula = log_kda ~ Rf, data = mw_positions)
Coefficients:
(Intercept) Rf
1.091 5.231
We only need to print the coefficients – we don’t care about the statistical tests here. You can tell from the plot that the relationship is very tight and linear. The equation of the line is: \(MW\)= 5.231 * \(R_f\) + 1.091
You can substitute the values of the coefficients and the \(R_f\) of ShpatB to find the log molecular weight of ShPatB. Or you can use the predict()
function to do this for you.
Calculate the \(R_f\) of ShPatB:
patB_Rf = (gel_length - pos_patB) / gel_length
Predict the molecular weight ShPatB:
patB_kda <- predict(mod, newdata = data.frame(Rf = patB_Rf)) |>
exp()
patB_kda
1
43.69469
If you would like to practice Option 2 again, you could repeat the the steps using the set of results on the other side of the gel.
Look after future you!
Future you will submit scripts for the assessment and these need to be organised, well-formatted, concise and well-commented so others can understand what you did and why.
You might need to:
- collect together library statements at the beginning of the code
- edit your comments for clarity and include a paragraph explaining what the analysis is about add code section headers (Ctrl+Shift+R)
- rename variables for consistency or clarity
- remove house keeping or exploratory code or mark it for later removal
- restyle code indentation (Ctrl+i) if needed
- check whitespace and line length.
You will find it useful to change some options in RStudio to make your life easier. Revisit Changing some defaults to make life easier.
If you need to make additional notes that do not belong in the script, you can add them a text file called README.txt
that future you – and anyone else will know exactly what to do with!
You’re finished!
🥳 Well Done! 🎉
Independent study following the workshop
The Code file
This contains all the code needed in the workshop even where it is not visible on the webpage.
The workshop.qmd
file is the file I use to compile the practical. Qmd stands for Quarto markdown. It allows code and ordinary text to be interweaved to produce well-formatted reports including webpages. View the source code for this workshop using the </> Code
button at the top of the page. Coding and thinking answers are marked with #---CODING ANSWER---
and #---THINKING ANSWER---
Pages made with R (R Core Team 2024), Quarto (Allaire et al. 2022), knitr
(Xie 2024, 2015, 2014), kableExtra
(Zhu 2024)