9 From importing to reporting

Complete

You are reading a live document. This page is compete but suggestions for improvements are welcome. Follow the link to ‘Make a suggestion’ to suggest improvements.

In Chapter 7 we typed data into R. This is not very practical when you have a lot of data! Instead, we much more commonly import data from a file. In this chapter we go through the workflow from importing, through summarising and plotting to saving a saving the figure.

9.1 Importing data from files

There are two things you need to know before you can import data from a file.

What format the data are in

The format of the data determines what function you will use to import it. Often the file extension indicates format.
- .txt a plain text file¹, where the columns are often separated by a space but might also be separated by a tab, a backslash or forward slash, or some other character
- .csv a plain text file where the columns are separated by commas
- .xlsx an Excel file More detail on file types was covered in Section 3.1
However, you should always check the file to make sure it is in the format you expect because there is little to force a match between a file’s contents and the extension in its name. You can check by opening the file in a text editor (e.g., Notepad on Windows, TextEdit on Mac) or in RStudio (see below).
Where the file is relative to your working directory

R can only read in a file if you say where it is, i.e., you give its relative path. More detail on relative file paths and working directories was covered in Section 3.8

🎬 Your turn! If you want to code along you will need to start a new (see Section 8.1.2) then a new script.

This chapter covers reading .txt files and .csv files using tidyverse (Wickham et al. 2019) functions and excel files using the readxl (Wickham and Bryan 2025) package. We will demonstrate what needs to be done differently if the file is not in your working directory.

library(tidyverse)
library(readxl)

9.1.1 Importing data from `.txt` file

The data in adipocytes.txt give the concentration of a hormone called adiponectin in some cells. There are two columns: the first gives the adiponectin concentration and the second, treatment, indicates whether the cells were treated with nicotinic acid or not. Save this file to the project folder.

A .txt extension suggests this is plain text file with columns separated by spaces. However, before we attempt to read it in, when should take a look at it. We can do this from RStudio by clicking on the file in the Files pane. Any plain text file will open in the top left pane.

**Figure** 9.1: **A screenshot of RStudio showing the adipocytes.txt data file open over the script.** We can see the columns are separated by spaces which means we can use the `read_table()` function to read it in.

The files are separated by spaces as we suspected. We use the read_table() command to read in plain text files of single columns, or where the columns are separated by spaces:

adipo <- read_table("adipocytes.txt")

The data from the file has been read into a dataframe called adipo. You will and you will be able to see it in the Environment window. Clicking on it in the Environment window will open a spreadsheet-like view of the dataframe.

9.1.2 Importing a from a`.csv` file

The data seal.csv give the myoglobin concentration of skeletal muscle for three species of seal. There are two columns: the first gives the myoglobin concentration and the second indicates species.

The .csv extension suggests this is plain text file with columns separated by commas. We will again check this before we attempt to read it in. Click on the file in the Files pane - a pop-up will appear for files ending .csv or .xlsx. Choose View File².

**Figure** 9.2: **Screenshot of the Rstudio Files pane**. When you click on a file in the Files pane a pop-up will appear for files ending `.csv` or `.xlsx`. Choose *View File*. `.csv` files, as plain text files, will open on the left over the script.

CSV files will open in the top left pane (Excel files will launch Excel). You should be able to see that the file does contain comma separated values. There is aread_csv() function which works very like `read_table():

seal <- read_csv("seals.csv")

9.1.3 Importing a from a`.xlsx` file

The data in blood.xlsx are measurements of several blood parameters from fifty people with Crohn’s disease, a lifelong condition where parts of the digestive system become inflamed. Twenty-five of people are in the early stages of diagnosis and 25 have started treatment.

blood <- read_excel("blood.xlsx")

9.1.4 Importing data from a file not in your working directory

When using an RStudio project, your working directory is the project folder (the folder containing the .Rproj file. Suppose our file adipocytes.txt is in a folder, data-raw, in an Rstudio project called myproject. The folder myproject will be our working directory.

-- myproject
   |__myproject.Rproj
   |__import.R
   |__data-raw
      |__adipocytes.txt
      |__blood.xlsx
      |__seal.csv

From myproject/ we need to go first into data-raw/ before finding adipocytes.txt. This means the relative path to adipocytes.txt is data-raw/adipocytes.txt:

adipo <- read_table("data-raw/adipocytes.txt")

The Section 3.10 for a file is the path from the working directory to the file. In this case, the working directory is myproject and the relative path is data-raw/adipocytes.txt.

🎬 Your turn! Create a folder called data-raw inside the project folder and move the data files to it. Now modify the data import code to import seal.csv from data-raw.

Code

seal <- read_csv("data-raw/seal.csv")

9.2 Summarising data

We summarise data using the the dplyr (Wickham et al. 2023) package, which provides a set of functions designed for efficient data manipulation. This is a tidyverse (Wickham et al. 2019) package which means it was loaded when you used library(tidyverse). The approach replies on the data being organised in a tidy format, meaning each column represents a variable, each row represents an observation, and each cell contains a single value. The pipeline to summarise data is:

Group the data: If you want to summarize your data based on certain groups, you can use the group_by()
Summarise: Once your data grouped (if necessary), you use summarise() with functions likemean(), median(), sd(), min()and max() within it.

We will demonstrate summarising using the adipo dataframe. adiponectin is the response and is continuous and treatment is an explanatory with categorical with two levels (groups).

The most useful summary statistics for a continuous variable like adiponectin are the means, standard deviations, sample sizes and standard errors. We use the group_by() and summarise() functions along with the functions that do the calculations.

To create a data frame called adipo_summary that contains the means, standard deviations, sample sizes and standard errors for the control and nicotinic acid treated samples:

adipo_summary <- adipo  |> 
  group_by(treatment) |>
  summarise(mean = mean(adiponectin),
            std = sd(adiponectin),
            n = length(adiponectin),
            se = std/sqrt(n))

To print the contents of adipo_summary you can type its name in the console:

adipo_summary
## # A tibble: 2 × 5
##   treatment  mean   std     n    se
##   <chr>     <dbl> <dbl> <int> <dbl>
## 1 control    5.55  1.48    15 0.381
## 2 nicotinic  7.51  1.79    15 0.463

Or click on adipo_summary listed in the environment to open a spreadsheet-like view of the dataframe.

9.2.1 Visualise data

Most commonly, we put the explanatory variable on the x axis and the response variable on the y axis. In my opinion, you should also show all the raw data points along with some summary values. A continuous response, particularly one that follows the normal distribution, is best summarised with the mean and the standard error.

We are going to create a figure like this:

Each dots represents a single measurement of adiponectin in a standardised sample of cells – these are the raw data points. Within each treatment group, the data points have been jittered – this is adding a small amount of random noise in the x direction so points with the same y-value do not overlap. The error bars show the mean and standard error of the mean for each treatment group.

9.2.1.1 `ggplot2`

ggplot2 (Wickham 2016) is a powerful data visualisation package in R that is part of the tidyverse. It provides a flexible and layered approach to creating high-quality and customizable graphics.

The core concept of ggplot2 is to build a plot layer by layer. The basic structure consists of three main components:

Data: the data frame to be used for plotting.
Aesthetic mappings (aes): how variables in the dataset map to visual elements such as x and y positions, colours, shapes, etc.
Geometric objects (geoms): the actual graphical elements used to visualize the data, such as points, lines, bars, etc.

To create a basic plot, you start with the ggplot() function and provide the data and aesthetic mappings.

ggplot(data = adipo, aes(x = treatment, y = adiponectin))

You can add geometric layers to the plot using specific functions such as geom_point(), geom_line(), geom_bar(), etc. These functions define the type of plot you want to create:

ggplot(data = adipo, aes(x = treatment, y = adiponectin)) +
  geom_point()

In the figure we are aiming for, we are plotting two dataframes:

the adipo dataframe which contains the data points themselves
the adipo_summary dataframe containing and the means and standard errors.

The dataframes and aesthetics for ggplot can be specified within a geom_xxxx (rather than in the ggplot()). This is very useful if the geom only applies to some of the data you want to plot.

Tip: ggplot()

You put the data argument and aes() inside ggplot() if you want all the geoms to use that dataframe and variables. If you want a different dataframe for a geom, put the data argument and aes() inside the geom_xxxx()

I will build the plot up in small steps. You do not need to repeat the code at each step - simply edit your existing ggplot() command as we go.

We will plot the data points first. Notice that we have given the data argument and the aesthetics inside the geom_point(). This is because it is only in this geom that we want to use the dataframe with the raw data. The variables treatment and adiponectin are in the adipo dataframe.

ggplot() +
  geom_point(data = adipo, 
             aes(x = treatment, y = adiponectin))

So that the data points don’t overlap, we can add some random jitter in the x direction (edit your existing code):

ggplot() +
  geom_point(data = adipo, 
             aes(x = treatment, y = adiponectin),
             position = position_jitter(width = 0.1, height = 0))

Note that position = position_jitter(width = 0.1, height = 0) is inside the geom_point() parentheses, after the aes() and a comma.

We’ve set the vertical jitter to 0 because, in contrast to the categorical x-axis, movement on the y-axis has meaning (the adiponectin levels).

Let’s make the points a light grey (edit your existing code):

ggplot() +
  geom_point(data = adipo, 
             aes(x = treatment, y = adiponectin),
             position = position_jitter(width = 0.1, height = 0),
             colour = "grey30")

Now to add the error bars. These go from one standard error below the mean to one standard error above the mean.

Add a geom_errorbar() for error bars (edit your existing code):

ggplot() +
  geom_point(data = adipo, aes(x = treatment, y = adiponectin),
             position = position_jitter(width = 0.1, height = 0),
             colour = "grey30") +
  geom_errorbar(data = adipo_summary, 
                aes(x = treatment, ymin = mean - se, ymax = mean + se),
                width = 0.3)

We have specified the adipo_summary dataframe and the variables three variables given –treatment, mean and se – are in that dataframe.

There are several ways you could add the mean. You could use geom_point(). I like to use geom_errorbar() again with the ymin and ymax both set to the mean so that I just get a line at the mean.

Add a geom_errorbar() for the mean (edit your existing code):

ggplot() +
  geom_point(data = adipo, aes(x = treatment, y = adiponectin),
             position = position_jitter(width = 0.1, height = 0),
             colour = "grey30") +
  geom_errorbar(data = adipo_summary, 
                aes(x = treatment, ymin = mean - se, ymax = mean + se),
                width = 0.3) +
  geom_errorbar(data = adipo_summary, 
                aes(x = treatment, ymin = mean, ymax = mean),
                width = 0.2)

Alter the axis labels and limits using scale_y_continuous() and scale_x_discrete() (edit your existing code):

ggplot() +
  geom_point(data = adipo, aes(x = treatment, y = adiponectin),
             position = position_jitter(width = 0.1, height = 0),
             colour = "grey30") +
  geom_errorbar(data = adipo_summary, 
                aes(x = treatment, ymin = mean - se, ymax = mean + se),
                width = 0.3) +
  geom_errorbar(data = adipo_summary, 
                aes(x = treatment, ymin = mean, ymax = mean),
                width = 0.2) +
  scale_y_continuous(name = "Adiponectin (pg/mL)", 
                     limits = c(0, 12), 
                     expand = c(0, 0)) +
  scale_x_discrete(name = "Treatment", 
                   labels = c("Control", "Nicotinic acid"))

You only need to use scale_y_continuous() and scale_x_discrete() to use labels that are different from those in the dataset. Often this is to use proper terminology or to captialise.

Format the figure in a way that is more suitable for including in a report using theme_classic() (edit your existing code):

ggplot() +
  geom_point(data = adipo, aes(x = treatment, y = adiponectin),
             position = position_jitter(width = 0.1, height = 0),
             colour = "gray30") +
  geom_errorbar(data = adipo_summary, 
                aes(x = treatment, ymin = mean - se, ymax = mean + se),
                width = 0.3) +
  geom_errorbar(data = adipo_summary, 
                aes(x = treatment, ymin = mean, ymax = mean),
                width = 0.2) +
  scale_y_continuous(name = "Adiponectin (pg/mL)", 
                     limits = c(0, 12), 
                     expand = c(0, 0)) +
  scale_x_discrete(name = "Treatment", 
                   labels = c("Control", "Nicotinic acid")) +
  theme_classic()

The ggsave() function is used to save a ggplot object as an image file. It provides a convenient way to export your plots to various file formats, such as PNG, PDF, SVG, or JPEG.

The basic syntax of ggsave() is as follows:

ggsave(filename,
       plot,
       device,
       width,
       height,
       units,
       dpi)

You must give a file name for the output file but all the other options have defaults.

plot: The ggplot object you want to save. Defaults to the last created plot.
device: one of “png”, “eps”, “ps”, “tex”, “pdf”, “jpeg”, “tiff”, “png”, “bmp”, “svg” or “wmf” (windows only). Defaults to the format given by the file extension in filename
width, height units: Plot size in units (“in”, “cm”, “mm”, or “px”). Defaults to the size of the plot in the Plots window.
dpi: Plot resolution.

It is valuable to give explicit values even where there are defaults to ensure you can reproduce the figure.

We can save the figure we just created as a 3 inch x 3 inch png file as follows:

ggsave("adipocytes.png",
       device = "png",
       width = 3,
       height = 3,
       units = "in",
       dpi = 300)

It is often a good idea to explicitly assign the ggplot object to a variable and use that in the ggsave():

fig1 <- ggplot() +
  geom_point(data = adipo, aes(x = treatment, y = adiponectin),
             position = position_jitter(width = 0.1, height = 0),
             colour = "gray30") +
  geom_errorbar(data = adipo_summary, 
                aes(x = treatment, ymin = mean - se, ymax = mean + se),
                width = 0.3) +
  geom_errorbar(data = adipo_summary, 
                aes(x = treatment, ymin = mean, ymax = mean),
                width = 0.2) +
  scale_y_continuous(name = "Adiponectin (pg/mL)", 
                     limits = c(0, 12), 
                     expand = c(0, 0)) +
  scale_x_discrete(name = "Treatment", 
                   labels = c("Control", "Nicotinic acid")) +
  theme_classic()

ggsave("adipocytes.png",
       plot = fig1,
       device = "png",
       width = 3,
       height = 3,
       units = "in",
       dpi = 300)

9.3 Summary

To import data you need to match the file format (e.g., .txt, .csv, .xlsx) to the appropriate import function (e.g., read_table(), read_csv(), or read_excel() and give the relative path to the file.
Opening files in RStudio or Excel before trying to import will help you determine the format and delimiter used.
To summarising data in more than one group, we use group_by() then summarise() with functions like mean(), sd() and length().
We build plots using ggplot() and add layers such as geom_point() and geom_errorbar(). Axes can be customised axes using scale_x_####() or scale_y_####()and theme_classic() makes the plot look more those in a scientific report.
Save plots made with ggplot2 with ggsave().

Plain text files can be opened in notepad or other similar editor and still be readable.↩︎
There is also and option to import the dataset. Do not be tempted to import data this way! Unless you are careful and know what you are doing, your data import will not be scripted or will not be scripted correctly.↩︎

# From importing to reporting {#sec-import-to-report} ```{r} #| results: "asis" #| echo: false source("_common.R") status("complete") ``` In [Chapter -@sec-first-steps-rstudio] we typed data into R. This is not very practical when you have a lot of data! Instead, we much more commonly import data from a file. In this chapter we go through the workflow from importing, through summarising and plotting to saving a saving the figure. ## Importing data from files There are two things you need to know before you can import data from a file. 1. What format the data are in The format of the data determines what *function* you will use to import it. Often the file extension indicates format. - `.txt` a plain text file[^import-to-report-1], where the columns are often separated by a space but might also be separated by a tab, a backslash or forward slash, or some other character - `.csv` a plain text file where the columns are separated by commas - `.xlsx` an Excel file More detail on file types was covered in [@sec-file-systems-files] However, you should always check the file to make sure it is in the format you expect because there is little to force a match between a file's contents and the extension in its name. You can check by opening the file in a text editor (e.g., Notepad on Windows, TextEdit on Mac) or in RStudio (see below). 2. Where the file is relative to your working directory `R` can only read in a file if you say where it is, *i.e.*, you give its **relative path**. More detail on relative file paths and working directories was covered in [@sec-file-systems-files-working-directories] [^import-to-report-1]: Plain text files can be opened in notepad or other similar editor and still be readable. 🎬 Your turn! If you want to code along you will need to start a new (see [@sec-workflow-rstudio-creating-rsp]) then a new script. This chapter covers reading `.txt` files and `.csv` files using **`tidyverse`** [@tidyverse] functions and excel files using the **`readxl`** [@readxl] package. We will demonstrate what needs to be done differently if the file is not in your working directory. ```{r} library(tidyverse) library(readxl) ``` ### Importing data from `.txt` file The data in [adipocytes.txt](data-raw/adipocytes.txt) give the concentration of a hormone called adiponectin in some cells. There are two columns: the first gives the adiponectin concentration and the second, treatment, indicates whether the cells were treated with nicotinic acid or not. Save this file to the project folder. A `.txt` extension suggests this is plain text file with columns separated by spaces. However, before we attempt to read it in, when should take a look at it. We can do this from RStudio by clicking on the file in the Files pane. Any plain text file will open in the top left pane. ::: {#fig-text-data-file-open} ![](images/text-data-file-open.png) **A screenshot of RStudio showing the adipocytes.txt data file open over the script.** We can see the columns are separated by spaces which means we can use the `read_table()` function to read it in. ::: The files are separated by spaces as we suspected. We use the `read_table()` command to read in plain text files of single columns, or where the columns are separated by spaces: ```{r} #| eval: false adipo <- read_table("adipocytes.txt") ``` The data from the file has been read into a dataframe called `adipo`. You will and you will be able to see it in the Environment window. Clicking on it in the Environment window will open a spreadsheet-like view of the dataframe. ### Importing a from a`.csv` file The data [seal.csv](data-raw/seal.csv) give the myoglobin concentration of skeletal muscle for three species of seal. There are two columns: the first gives the myoglobin concentration and the second indicates species. The `.csv` extension suggests this is plain text file with columns separated by commas. We will again check this before we attempt to read it in. Click on the file in the Files pane - a pop-up will appear for files ending `.csv` or `.xlsx`. Choose View File[^import-to-report-2]. [^import-to-report-2]: There is also and option to import the dataset. **Do not** be tempted to import data this way! Unless you are careful and know what you are doing, your data import will not be scripted or will not be scripted correctly. ::: {#fig-csv-data-file-view} ![](images/csv-data-file-view.png) **Screenshot of the Rstudio Files pane**. When you click on a file in the Files pane a pop-up will appear for files ending `.csv` or `.xlsx`. Choose *View File*. `.csv` files, as plain text files, will open on the left over the script. ::: CSV files will open in the top left pane (Excel files will launch Excel). You should be able to see that the file does contain comma separated values. There is a`read_csv()` function which works very like \`read_table(): ```{r} #| eval: false seal <- read_csv("seals.csv") ``` ### Importing a from a`.xlsx` file The data in [blood.xlsx](data-raw/blood.xlsx) are measurements of several blood parameters from fifty people with Crohn's disease, a lifelong condition where parts of the digestive system become inflamed. Twenty-five of people are in the early stages of diagnosis and 25 have started treatment. ```{r} #| eval: false blood <- read_excel("blood.xlsx") ``` ### Importing data from a file *not* in your working directory When using an RStudio project, your working directory is the project folder (the folder containing the `.Rproj` file. Suppose our file [adipocytes.txt](data-raw/adipocytes.txt) is in a folder, `data-raw`, in an Rstudio project called `myproject`. The folder `myproject` will be our working directory. ``` -- myproject |__myproject.Rproj |__import.R |__data-raw |__adipocytes.txt |__blood.xlsx |__seal.csv ``` From `myproject/` we need to go first into `data-raw/` before finding `adipocytes.txt`. This means the *relative path* to `adipocytes.txt` is `data-raw/adipocytes.txt`: ```{r} adipo <- read_table("data-raw/adipocytes.txt") ``` The [@sec-file-systems-files-relative-paths] for a file is the path from the working directory to the file. In this case, the working directory is `myproject` and the relative path is `data-raw/adipocytes.txt`. 🎬 Your turn! Create a folder called `data-raw` inside the project folder and move the data files to it. Now modify the data import code to import seal.csv from `data-raw`. ```{r} #| code-fold: true seal <- read_csv("data-raw/seal.csv") ``` ## Summarising data We summarise data using the the **`dplyr`** [@dplyr] package, which provides a set of functions designed for efficient data manipulation. This is a **`tidyverse`** [@tidyverse] package which means it was loaded when you used `library(tidyverse)`. The approach replies on the data being organised in a tidy format, meaning each column represents a variable, each row represents an observation, and each cell contains a single value. The pipeline to summarise data is: - Group the data: If you want to summarize your data based on certain groups, you can use the `group_by()` - Summarise: Once your data grouped (if necessary), you use `summarise()` with functions like`mean()`, `median()`, `sd()`, `min()`and `max()` within it. We will demonstrate summarising using the `adipo` dataframe. `adiponectin` is the response and is continuous and `treatment` is an explanatory with categorical with two levels (groups). The most useful summary statistics for a continuous variable like `adiponectin` are the means, standard deviations, sample sizes and standard errors. We use the `group_by()` and `summarise()` functions along with the functions that do the calculations. To create a data frame called `adipo_summary` that contains the means, standard deviations, sample sizes and standard errors for the control and nicotinic acid treated samples: ```{r} #| include: false adipo <- read_table("data-raw/adipocytes.txt") ``` ```{r} adipo_summary <- adipo |> group_by(treatment) |> summarise(mean = mean(adiponectin), std = sd(adiponectin), n = length(adiponectin), se = std/sqrt(n)) ``` To print the contents of `adipo_summary` you can type its name in the console: ```{r} adipo_summary ``` Or click on `adipo_summary` listed in the environment to open a spreadsheet-like view of the dataframe. ### Visualise data Most commonly, we put the explanatory variable on the *x* axis and the response variable on the *y* axis. In my opinion, you should also show all the raw data points along with some summary values. A continuous response, particularly one that follows the normal distribution, is best summarised with the mean and the standard error. We are going to create a figure like this: ```{r} #| echo: false ggplot() + geom_point(data = adipo, aes(x = treatment, y = adiponectin), position = position_jitter(width = 0.1, height = 0), colour = "gray50") + geom_errorbar(data = adipo_summary, aes(x = treatment, ymin = mean - se, ymax = mean + se), width = 0.3) + geom_errorbar(data = adipo_summary, aes(x = treatment, ymin = mean, ymax = mean), width = 0.2) + scale_y_continuous(name = "Adiponectin (pg/mL)", limits = c(0, 12), expand = c(0, 0)) + scale_x_discrete(name = "Treatment", labels = c("Control", "Nicotinic acid")) + theme_classic() ``` Each dots represents a single measurement of adiponectin in a standardised sample of cells -- these are the raw data points. Within each treatment group, the data points have been `jittered` -- this is adding a small amount of random noise in the *x* direction so points with the same *y*-value do not overlap. The error bars show the mean and standard error of the mean for each treatment group. #### **`ggplot2`** **`ggplot2`** [@ggplot2] is a powerful data visualisation package in R that is part of the tidyverse. It provides a flexible and layered approach to creating high-quality and customizable graphics. The core concept of **`ggplot2`** is to build a plot layer by layer. The basic structure consists of three main components: 1. *Data*: the data frame to be used for plotting. 2. *Aesthetic mappings (aes)*: how variables in the dataset map to visual elements such as *x* and *y* positions, colours, shapes, etc. 3. *Geometric objects (geoms)*: the actual graphical elements used to visualize the data, such as points, lines, bars, etc. To create a basic plot, you start with the `ggplot()` function and provide the data and aesthetic mappings. ``` ggplot(data = adipo, aes(x = treatment, y = adiponectin)) ``` You can add geometric layers to the plot using specific functions such as `geom_point()`, `geom_line()`, `geom_bar()`, etc. These functions define the type of plot you want to create: ``` ggplot(data = adipo, aes(x = treatment, y = adiponectin)) + geom_point() ``` In the figure we are aiming for, we are plotting two dataframes: - the `adipo` dataframe which contains the data points themselves - the `adipo_summary` dataframe containing and the means and standard errors. The dataframes and aesthetics for ggplot can be specified *within a `geom_xxxx`* (rather than in the `ggplot()`). This is very useful if the geom only applies to some of the data you want to plot. ::: callout-tip ## Tip: `ggplot()` You put the `data` argument and `aes()` inside `ggplot()` if you want all the `geoms` to use that dataframe and variables. If you want a *different* dataframe for a `geom`, put the `data` argument and `aes()` inside the `geom_xxxx()` ::: I will build the plot up in small steps. You do not need to repeat the code at each step - simply edit your *existing* `ggplot()` command as we go. We will plot the data points first. Notice that we have given the data argument and the aesthetics *inside* the `geom_point()`. This is because it is only in this geom that we want to use the dataframe with the raw data. The variables `treatment` and `adiponectin` are in the `adipo` dataframe. ```{r} ggplot() + geom_point(data = adipo, aes(x = treatment, y = adiponectin)) ``` So that the data points don't overlap, we can add some random jitter in the *x* direction (edit your existing code): ```{r} ggplot() + geom_point(data = adipo, aes(x = treatment, y = adiponectin), position = position_jitter(width = 0.1, height = 0)) ``` Note that `position = position_jitter(width = 0.1, height = 0)` is inside the `geom_point()` parentheses, after the `aes()` and a comma. We've set the vertical jitter to 0 because, in contrast to the categorical *x*-axis, movement on the *y*-axis has meaning (the adiponectin levels). Let's make the points a light grey (edit your existing code): ```{r} ggplot() + geom_point(data = adipo, aes(x = treatment, y = adiponectin), position = position_jitter(width = 0.1, height = 0), colour = "grey30") ``` Now to add the error bars. These go from one standard error *below* the mean to one standard error *above* the mean. Add a `geom_errorbar()` for error bars (edit your existing code): ```{r} ggplot() + geom_point(data = adipo, aes(x = treatment, y = adiponectin), position = position_jitter(width = 0.1, height = 0), colour = "grey30") + geom_errorbar(data = adipo_summary, aes(x = treatment, ymin = mean - se, ymax = mean + se), width = 0.3) ``` We have specified the `adipo_summary` dataframe and the variables three variables given –`treatment`, `mean` and `se` – are in that dataframe. There are several ways you could add the mean. You could use `geom_point()`. I like to use `geom_errorbar()` again with the `ymin` and `ymax` both set to the mean so that I just get a line at the mean. Add a `geom_errorbar()` for the mean (edit your existing code): ```{r} ggplot() + geom_point(data = adipo, aes(x = treatment, y = adiponectin), position = position_jitter(width = 0.1, height = 0), colour = "grey30") + geom_errorbar(data = adipo_summary, aes(x = treatment, ymin = mean - se, ymax = mean + se), width = 0.3) + geom_errorbar(data = adipo_summary, aes(x = treatment, ymin = mean, ymax = mean), width = 0.2) ``` Alter the axis labels and limits using `scale_y_continuous()` and `scale_x_discrete()` (edit your existing code): ```{r} ggplot() + geom_point(data = adipo, aes(x = treatment, y = adiponectin), position = position_jitter(width = 0.1, height = 0), colour = "grey30") + geom_errorbar(data = adipo_summary, aes(x = treatment, ymin = mean - se, ymax = mean + se), width = 0.3) + geom_errorbar(data = adipo_summary, aes(x = treatment, ymin = mean, ymax = mean), width = 0.2) + scale_y_continuous(name = "Adiponectin (pg/mL)", limits = c(0, 12), expand = c(0, 0)) + scale_x_discrete(name = "Treatment", labels = c("Control", "Nicotinic acid")) ``` You only need to use `scale_y_continuous()` and `scale_x_discrete()` to use labels that are different from those in the dataset. Often this is to use proper terminology or to captialise. Format the figure in a way that is more suitable for including in a report using `theme_classic()` (edit your existing code): ```{r} ggplot() + geom_point(data = adipo, aes(x = treatment, y = adiponectin), position = position_jitter(width = 0.1, height = 0), colour = "gray30") + geom_errorbar(data = adipo_summary, aes(x = treatment, ymin = mean - se, ymax = mean + se), width = 0.3) + geom_errorbar(data = adipo_summary, aes(x = treatment, ymin = mean, ymax = mean), width = 0.2) + scale_y_continuous(name = "Adiponectin (pg/mL)", limits = c(0, 12), expand = c(0, 0)) + scale_x_discrete(name = "Treatment", labels = c("Control", "Nicotinic acid")) + theme_classic() ``` The `ggsave()` function is used to save a `ggplot` object as an image file. It provides a convenient way to export your plots to various file formats, such as PNG, PDF, SVG, or JPEG. The basic syntax of `ggsave()` is as follows: ``` ggsave(filename, plot, device, width, height, units, dpi) ``` You *must* give a file name for the output file but all the other options have defaults. - plot: The ggplot object you want to save. Defaults to the last created plot. - device: one of "png", "eps", "ps", "tex", "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only). Defaults to the format given by the file extension in filename - width, height units: Plot size in units ("in", "cm", "mm", or "px"). Defaults to the size of the plot in the Plots window. - dpi: Plot resolution. It is valuable to give explicit values even where there are defaults to ensure you can reproduce the figure. We can save the figure we just created as a 3 inch x 3 inch png file as follows: ```{r} ggsave("adipocytes.png", device = "png", width = 3, height = 3, units = "in", dpi = 300) ``` It is often a good idea to explicitly assign the ggplot object to a variable and use that in the `ggsave()`: ```{r} fig1 <- ggplot() + geom_point(data = adipo, aes(x = treatment, y = adiponectin), position = position_jitter(width = 0.1, height = 0), colour = "gray30") + geom_errorbar(data = adipo_summary, aes(x = treatment, ymin = mean - se, ymax = mean + se), width = 0.3) + geom_errorbar(data = adipo_summary, aes(x = treatment, ymin = mean, ymax = mean), width = 0.2) + scale_y_continuous(name = "Adiponectin (pg/mL)", limits = c(0, 12), expand = c(0, 0)) + scale_x_discrete(name = "Treatment", labels = c("Control", "Nicotinic acid")) + theme_classic() ggsave("adipocytes.png", plot = fig1, device = "png", width = 3, height = 3, units = "in", dpi = 300) ``` ## Summary 1. To import data you need to match the file format (e.g., `.txt`, `.csv`, `.xlsx`) to the appropriate import function (e.g., `read_table()`, `read_csv()`, or `read_excel()` and give the *relative path* to the file. 2. Opening files in RStudio or Excel before trying to import will help you determine the format and delimiter used. 3. To summarising data in more than one group, we use `group_by()` then `summarise()` with functions like `mean()`, `sd()` and `length()`. 4. We build plots using `ggplot()` and add layers such as `geom_point()` and `geom_errorbar()`. Axes can be customised axes using `scale_x_####()` or `scale_y_####()`and `theme_classic()` makes the plot look more those in a scientific report. 5. Save plots made with **`ggplot2`** with `ggsave()`.

9.1 Importing data from files

9.1.1 Importing data from .txt file

9.1.2 Importing a from a.csv file

9.1.3 Importing a from a.xlsx file

9.1.4 Importing data from a file not in your working directory

9.2 Summarising data

9.2.1 Visualise data

9.2.1.1 ggplot2

9.3 Summary

9.1.1 Importing data from `.txt` file

9.1.2 Importing a from a`.csv` file

9.1.3 Importing a from a`.xlsx` file

9.2.1.1 `ggplot2`