Workshop

The logic of hypothesis testing and confidence intervals

Introduction

A little monster flying a biplane wearing aviator glasses, pulling a banner that says 'Fully expecting to hate this class.' Below, a teacher wearing a cheerleading outfit labeled 'STATS' with a bullhorn labeled 'CODE' cheering desperately with pom-poms, trying to help students believe stats is actually going to be awesomely life-changing.

Artwork by Horst (2023): “love this class”

Session overview

In this session you will remind yourself how to import files, and calculate confidence intervals on large and small samples.

Philosophy

Workshops are not a test. It is expected that you often don’t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding.

Tips

  • don’t worry about making mistakes
  • don’t let what you cannot do interfere with what you can do
  • discussing code with your neighbours will help
  • look things up in the independent study material
  • look things up in your own code from earlier
  • there are no stupid questions

Key

These four symbols are used at the beginning of each instruction so you know where to carry out the instruction.

Something you need to do on your computer. It may be opening programs or documents or locating a file.

Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.

Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.

A question for you to think about and answer. Record your answers in your script for future reference.

Getting started

Start RStudio from the Start menu.

Go to the Files tab in the lower right pane and click on the ... on the right. This will open a “Go to folder” window. Navigate to a place on your computer where you keep your work. Click Open.

Also on the Files tab click on New Folder. Type “data-analysis-in-r-2” into the box. This will be the folder that we work in throughout the Data Analysis in R part of BABS2.

Make an RStudio project for this workshop by clicking on the drop-down menu on top right where it says Project: (None) and choosing New Project, then New Directory, then New Project. Name the RStudio Project ‘week-1’.

Make a new script then save it with a name like analysis.R to carry out the rest of the work.

Add a comment to the script: # confidence intervals and load the tidyverse (Wickham et al. 2019) package

Make a new folder called data-raw.

Exercises

Remind yourself how to import files!

Importing data from files was covered in BABS 1 (Rand 2023) if you need to remind yourself.

Check your settings

Changing some defaults to make life easier

Some useful settings

Confidence intervals (large samples)

The data in beewing.txt are left wing widths of 100 honey bees (mm). The confidence interval for large samples is given by:

\(\bar{x} \pm 1.96 \times s.e.\)

Where 1.96 is the quantile for 95% confidence.

Save beewing.txt to your data-raw folder.

Read in the data and check the structure of the resulting dataframe.
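A minimal sketch of this step, assuming beewing.txt is a whitespace-delimited text file with a header row and a single column named wing_mm (the column name matches the code that follows; the file layout is an assumption). So that the sketch runs anywhere, it first writes a tiny stand-in file of made-up values to a temporary folder; in the workshop you would point at data-raw/beewing.txt instead. Base R's read.table() is used here; read_table() from the tidyverse is an alternative.

```r
# Stand-in file with made-up values (NOT the real beewing data),
# written to a temporary folder so the example is self-contained
demo_file <- file.path(tempdir(), "beewing-demo.txt")
writeLines(c("wing_mm", "4.5", "4.6", "4.4", "4.7"), demo_file)

# Read the file; header = TRUE takes the column name from the first line.
# In the workshop the path would be "data-raw/beewing.txt"
beewing_demo <- read.table(demo_file, header = TRUE)

# Check the structure: one numeric column, one row per bee
str(beewing_demo)
```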

Calculate and assign to variables: the mean, standard deviation and standard error:

# mean
m <- mean(beewing$wing_mm)

# standard deviation
sd <- sd(beewing$wing_mm)

# sample size (needed for the se)
n <- length(beewing$wing_mm)

# standard error
se <- sd / sqrt(n)

To calculate the 95% confidence interval we need to look up the quantile (multiplier) using qnorm():

q <- qnorm(0.975)

This should be about 1.96.
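Why 0.975 rather than 0.95? A 95% interval leaves 5% outside it, split equally between the two tails, so the upper cut-off sits at a cumulative probability of 1 - 0.05/2 = 0.975. The sketch below checks this with pnorm(), which is the inverse of qnorm(); the variable names are only illustrative.

```r
# Multiplier for a 95% interval: 2.5% is left in each tail
q_demo <- qnorm(0.975)
q_demo

# Proportion of the standard normal distribution lying between -q and +q;
# this should come back as 0.95
central_area <- pnorm(q_demo) - pnorm(-q_demo)
central_area
```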

Now we can use it in our confidence interval calculation:

lcl <- m - q * se
ucl <- m + q * se

Print the values

lcl
[1] 4.473176
ucl
[1] 4.626824

This means we are 95% confident the population mean lies between 4.47 mm and 4.63 mm. The usual way of expressing this is that the mean is 4.55 \(\pm\) 0.08 mm.
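The \(\pm\) form can be recovered from the two limits: the mean is their midpoint and the margin is half their width. A short check using the limits printed above, typed in by hand here, so the variable names are illustrative:

```r
# Limits as printed in the workshop output
lcl_printed <- 4.473176
ucl_printed <- 4.626824

# The midpoint of the interval is the sample mean
midpoint <- (lcl_printed + ucl_printed) / 2

# Half the width is the margin, i.e. quantile * standard error
margin <- (ucl_printed - lcl_printed) / 2

midpoint
margin
```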

Between what values would you be 99% confident of the population mean being?
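As a pointer for this question, only the multiplier changes with the confidence level. A sketch of the general rule, where conf_level, cum_prob and q99 are illustrative names rather than anything from the workshop code:

```r
# Confidence level wanted
conf_level <- 0.99

# Cumulative probability to give qnorm(): everything below the upper tail
cum_prob <- 1 - (1 - conf_level) / 2

# Multiplier for a 99% interval; it is larger than the 95% multiplier
q99 <- qnorm(cum_prob)
q99
```

The interval itself then follows from the same calculation as before, with this multiplier in place of the 95% one.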

Confidence intervals (small samples)

The confidence interval for small samples is given by:

\(\bar{x} \pm \sf t_{[d.f]} \times s.e.\)

The only difference between the calculation for small and large samples is the multiplier. For large samples we use the “standard normal distribution” accessed with qnorm(); for small samples we use the “t distribution” accessed with qt(). The value returned by qt() is larger than that returned by qnorm(), which reflects the greater uncertainty we have in estimates of population means based on small samples.
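To see the difference in multipliers, the sketch below compares qt() at several degrees of freedom with qnorm(). The degrees of freedom chosen are arbitrary illustrations.

```r
# 95% multiplier from the standard normal distribution
q_norm <- qnorm(0.975)

# 95% multipliers from the t distribution for a range of degrees of freedom
dfs <- c(4, 7, 29, 99)
q_t <- qt(0.975, df = dfs)

# Side-by-side comparison: the t multipliers shrink towards the
# normal value as the degrees of freedom increase
data.frame(df = dfs,
           t_multiplier = round(q_t, 3),
           normal_multiplier = round(q_norm, 3))
```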

The fatty acid docosahexaenoic acid (DHA) is a major component of membrane phospholipids in nerve cells and deficiency leads to many behavioural and functional deficits. The cross-sectional area of neurons in the CA1 region of the hippocampus of normal rats is 155 \(\mu m^2\). A DHA deficient diet was fed to 8 animals and the cross-sectional area (csa) of neurons is given in neuron.txt.

Save neuron.txt to your data-raw folder.

Read in the data and check the structure of the resulting dataframe.

Assign the mean to m.

Calculate and assign the standard error to se.

To work out the confidence interval for our sample mean we need to use the t distribution because it is a small sample. This means we need to determine the degrees of freedom (the number in the sample minus one).

We can assign this to a variable, df, using:

df <- length(neur$csa) - 1

The t value is found by:

t <- qt(0.975, df = df)

Note that we are using qt() rather than qnorm() but that the probability, 0.975, used is the same. Finally, we need to put our mean, standard error and t value into the equation \(\bar{x} \pm \sf t_{[d.f]} \times s.e.\).

The upper confidence limit is:

(m + t * se) |> round(2)
[1] 151.95

The first part of the command, (m + t * se), calculates the upper limit. This is ‘piped’ into the round() function to round the result to two decimal places.
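The pipe, |>, passes the value on its left as the first argument of the function on its right, so the piped command and the ordinary nested call give the same result. A small sketch with an arbitrary number (the native pipe needs R 4.1 or later):

```r
# An arbitrary value to round
x <- 151.94567

# Piped form: x becomes the first argument of round()
piped <- x |> round(2)

# Equivalent nested form
nested <- round(x, 2)

# TRUE if the two forms give exactly the same value
identical(piped, nested)
```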

Calculate the lower confidence limit:

Given the upper and lower confidence values for the estimate of the population mean, what do you think about the effect of the DHA deficient diet?
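One way to check both limits is to compare the hand calculation with t.test(), which reports a 95% confidence interval for a single sample by default. The sketch below uses made-up values rather than the neuron data, and names such as ci_by_hand are illustrative.

```r
# Made-up sample of eight values (NOT the neuron data)
x <- c(12.1, 9.8, 11.4, 10.6, 13.0, 10.2, 11.9, 9.5)

# Hand calculation following the steps in this section
m_x  <- mean(x)
se_x <- sd(x) / sqrt(length(x))
t_x  <- qt(0.975, df = length(x) - 1)
ci_by_hand <- c(m_x - t_x * se_x, m_x + t_x * se_x)

# Built-in calculation; conf.int holds the lower and upper limits
ci_builtin <- as.numeric(t.test(x)$conf.int)

# TRUE if the two approaches agree
all.equal(ci_by_hand, ci_builtin)
```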

Look after future you!

Have a look at the Assessment Overview on the VLE. Part of the assessment for BIO00026C Becoming a Bioscientist: Grand Challenges is the RStudio Project which supports the figures and analysis in your report. You will zip up the RStudio Project folder and submit it to the VLE. That folder should contain all the data, code and figures you have used in your report, and all of the results should be reproducible. Reproducible means that if someone downloads that zipped folder and unzips it, they should be able to understand what the analysis was, what you did and why, and be able to run all the code to reproduce the figures and results in your report without issue.

You can practise this every week!

Make sure your script is saved. Close down the “week-1” RStudio project using either the File menu or the menu on the top right where the Project name appears.

Locate the week-1 folder in Windows Explorer. Right click on the folder and select “Send to” and then “Compressed (zipped) folder”. This will create a file called week-1.zip. Email this file to someone near you and have them email you theirs. Your neighbour should be able to download week-1.zip, unzip it and then open the project in RStudio and run the code to reproduce all your work. Note: Save the downloaded week-1.zip somewhere that is NOT your “data-analysis-in-r-2” folder to avoid naming conflicts. Also do not save it in any RStudio project folder.

You’re finished!

🥳 Well Done! 🎉

Header text 'R learners' above five friendly monsters holding up signs that together read 'we believe in you.'

Artwork by Horst (2023): “We believe in you!”

Independent study following the workshop

Consolidate

The Code file

This contains all the code needed in the workshop even where it is not visible on the webpage.

The workshop.qmd file is the file I use to compile the practical. Qmd stands for Quarto markdown. It allows code and ordinary text to be interwoven to produce well-formatted reports including webpages. View the Qmd in Browser. Coding and thinking answers are marked with #---CODING ANSWER--- and #---THINKING ANSWER---

Pages made with R (R Core Team 2023), Quarto (Allaire et al. 2022), knitr (Xie 2022), kableExtra (Zhu 2021)

References

Allaire, J. J., Charles Teague, Carlos Scheidegger, Yihui Xie, and Christophe Dervieux. 2022. Quarto. https://doi.org/10.5281/zenodo.5960048.
Horst, Allison. 2023. “Data Science Illustrations.” https://allisonhorst.com/allison-horst.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Rand, Emma. 2023. Data Analysis in R for Becoming a Bioscientist. https://3mmarand.github.io/R4BABS/.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Xie, Yihui. 2022. “Knitr: A General-Purpose Package for Dynamic Report Generation in R.” https://yihui.org/knitr/.
Zhu, Hao. 2021. “kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax.” https://CRAN.R-project.org/package=kableExtra.