Workshop

The logic of hypothesis testing and confidence intervals

Introduction

A little monster flying a biplane wearing aviator glasses, pulling a banner that says 'Fully expecting to hate this class.' Below, a teacher wearing a cheerleading outfit labeled 'STATS' with a bullhorn labeled 'CODE' cheering desperately with pom-poms, trying to help students believe stats is actually going to be awesomely life-changing.

Artwork by Horst (2023): “love this class”

Session overview

In this session you will remind yourself how to import files, and calculate confidence intervals on large and small samples. You will also learn how to create a zip file - this is what you will submit for the assessment.

Philosophy

Workshops are not a test. It is expected that you often don’t know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips

  • don’t worry about making mistakes
  • don’t let what you can not do interfere with what you can do
  • discussing code with your neighbours will help
  • look things up in the independent study material
  • look things up in your own code from earlier
  • there are no stupid questions
Key

These four symbols are used at the beginning of each instruction so you know where to carry out the instruction.

Something you need to do on your computer. It may be opening programs or documents or locating a file.

Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.

Something you should do in your browser on the internet. It may be searching for information, going to the VLE or a file.

A question for you to think about and answer. Record your answers in your script for future reference.

Getting started

Start RStudio from the Start menu.

Go the Files tab in the lower right pane and click on the ... on the right. This will open a “Go to folder” window. Navigate to a place on your computer where you keep your work. Click Open.

Also on the Files tab click on New Folder. Type “data-analysis-in-r-2” in to the box. This will be the folder that we work in throughout the Data Analysis in R part BABS2.

Make an RStudio project for this workshop by clicking on the drop-down menu on top right where it says Project: (None) and choosing New Project, then New Directory, then New Project. Name the RStudio Project ‘week-1’.

Make a new script then save it with a name like analysis.R to carry out the rest of the work.

Add a comment to the script: # confidence intervals and load the tidyverse (Wickham et al. 2019) package

Make a new folder called data-raw.

Exercises

Remind yourself how to import files!

Importing data from files was covered in BABS 1 (Rand 2023) if you need to remind yourself.

Check your settings

Changing some defaults to make life easier

Some useful settings

Confidence intervals (large samples)

The data in beewing.txt are left wing widths of 100 honey bees (mm). The confidence interval for large samples is given by:

\(\bar{x} \pm 1.96 \times s.e.\)

This means the upper limit is \(\bar{x} + 1.96 \times s.e.\) and the lower limit is \(\bar{x} - 1.96 \times s.e.\). 1.96 is the “quantile” for 95% confidence and can be found using the qnorm() function.

Save beewing.txt to your data-raw folder.

Read in the data and check the structure of the resulting dataframe.

Use the summarise() function to calculate: the mean, standard deviation, sample size and standard error and save them in a dataframe called beewing_summary:

beewing_summary <- beewing |> 
  summarise(mean = mean(wing_mm),
            n = length(wing_mm),
            sd = sd(wing_mm),
            se = sd / sqrt(n))

To see the values in the beewing_summary dataframe you can either type and run beewing_summary or click on the dataframe in the Environment tab.

Where did you do this before?

In the BABS 1 week 8 workshop we summarised the mass of cats using this method. You can look back at that if you need to.

To calculate the 95% confidence interval we need to look up the quantile (multiplier) using qnorm():

qnorm(0.975)
[1] 1.959964

This should be about 1.96. We can use this piece of code along with the mean and se columns to add the upper and lower confidence limits to the beewing_summary dataframe using mutate():

Add the upper and lower confidence limits to the beewing_summary dataframe:

beewing_summary <- beewing_summary |> 
  mutate(lcl95 = mean - qnorm(0.975) * se,
         ucl95 = mean + qnorm(0.975) * se)

I used the names lcl95 and ucl95 to stand for “95% lower confidence limit” and “95% upper confidence limit” respectively.

This means we are 95% confident the population mean lies between 4.47 mm and 4.63 mm.

How would you write this up in a report?

Between what values would you be 99% confident of the population mean being?

Hint

You will need to think about what probability to give qnorm(). For 95% confidence intervals we had to give 0.975. This is because 95% is 2.5% (0.025) in each tail of the distribution.

Confidence intervals (small samples)

The confidence interval for small samples is given by:

\(\bar{x} \pm t_{[d.f]} \times s.e.\)

The only difference between the C.I. calculation for small samples and the C.I. calculation for large sample is the multiplier. For large samples we use the “the standard normal distribution” accessed with qnorm(); for small samples we use the “t distribution” assessed with qt().

The value returned by q(t) is larger than that by qnorm() which reflects the greater uncertainty we have on estimations of population means based on small samples.

The fatty acid Docosahexaenoic acid (DHA) is a major component of membrane phospholipids in nerve cells and deficiency leads to many behavioural and functional deficits. The cross sectional area of neurons in the CA 1 region of the hippocampus of normal rats is 155 \(\mu m^2\). A DHA deficient diet was fed to 8 animals and the cross sectional area (csa) of neurons is given in neuron.txt

Save neuron.txt to your data-raw folder

Read in the data and check the structure of the resulting dataframe

Calculate: the mean, standard deviation, sample size and standard error and save them in a dataframe called neuron_summary:

neuron_summary <- neuron |> 
  summarise(mean = mean(csa),
            n = length(csa),
            sd = sd(csa),
            se = sd / sqrt(n))

To work out the confidence interval for our sample mean we need to use the t distribution because it is a small sample. This means we need to determine the degrees of freedom (the number in the sample minus one).

The t value is found using: qt(0.975, df = n - 1). Note that we are using qt() rather than qnorm() but that the probability, 0.975, used is the same.

This should be about 2.36. This is bigger than 1.96 to reflect the lower confidence we have in a mean from a small sample. We can use this piece of code along with the mean and se columns to add the upper and lower confidence limits to the neuron_summary dataframe using mutate():

Add the upper and lower confidence limits to the neuron_summary dataframe:

neuron_summary <- neuron_summary |> 
  mutate(lcl95 = mean - qt(0.975, df = n - 1) * se,
         ucl95 = mean + qt(0.975, df = n - 1) * se)

Given the upper and lower confidence values for the estimate of the population mean, what do you think about the effect of the DHA deficient diet?

Look after future you!

Have a look at the Assessment Overview on the VLE. Part of the assessment for BIO00026C Becoming a Bioscientist: Grand Challenges is to submit the RStudio Project which supports the figures and analysis in your report. You will zip up the RStudio Project folder and submit it to the VLE. That folder should contain all the data, code and figures you have used in your report and all of the results should be reproducible. Reproducible means that if someone downloads that zipped folder and unzips it they should be able to understand what the analysis was, what you did and why and be able to run all the code to reproduce the figures and results in your report without issue.

You can practice this every week!

Make sure your script is saved. Close down the “week-1” RStudio project using either the file menu or the menu on the top right where the Project name appears.

Locate the week-1 folder in Windows explorer. Right click on the folder and select “Send to” and then “Compressed (zipped) folder”. This will create a file called week-1.zip. Email this file to someone near you have have them email you with theirs. Your neighbour should be able to download week-1.zip, unzip it and then open the project in RStudio and run the code to reproduce all your work. ** Note: Save the downloaded week-1.zip some where that is NOT your “data-analysis-in-r-2” to avoid naming conflicts.** Also do not save it in any RStudio project folder.

You’re finished!

🥳 Well Done! 🎉

Header text 'R learners' above  five friendly monsters holding up signs that together read 'we believe  in you.'

Artwork by Horst (2023): “We belive in you!”

Independent study following the workshop

Consolidate

The Code file

This contains all the code needed in the workshop even where it is not visible on the webpage.

The workshop.qmd file is the file I use to compile the practical. Qmd stands for Quarto markdown. It allows code and ordinary text to be interweaved to produce well-formatted reports including webpages. View the Qmd in Browser. Coding and thinking answers are marked with #---CODING ANSWER--- and #---THINKING ANSWER---

Pages made with R (R Core Team 2024), Quarto (Allaire et al. 2022), knitr (Xie 2024, 2015, 2014), kableExtra (Zhu 2024)

References

Allaire, J. J., Charles Teague, Carlos Scheidegger, Yihui Xie, and Christophe Dervieux. 2022. Quarto. https://doi.org/10.5281/zenodo.5960048.
Horst, Allison. 2023. “Data Science Illustrations.” https://allisonhorst.com/allison-horst.
R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Rand, Emma. 2023. Data Analysis in r for Becoming a Bioscientist. https://3mmarand.github.io/R4BABS/.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse 4: 1686. https://doi.org/10.21105/joss.01686.
Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC.
———. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. https://yihui.org/knitr/.
———. 2024. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.
Zhu, Hao. 2024. kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.