cats |>
summarise(mean(mass))
# A tibble: 1 Γ 1
`mean(mass)`
<dbl>
1 4.51
14 July, 2025
We can use summarise()
to get summary statistics for a whole column.
π¬ Find the mean with:
cats |>
summarise(mean(mass))
# A tibble: 1 Γ 1
`mean(mass)`
<dbl>
1 4.51
The result is a dataframe (not a vector). This is one of the very useful features of dplyr
and the tidyverse. Dataframes are the main data structure in R, and most functions in the tidyverse return dataframes. This makes it easy to chain together multiple operations using the pipe operator (|>
).
By default the name of the column is the function name. This is not very informative, so we can give it a more meaningful name.
To name the column, we can use the name =
argument in the summarise()
function.
π¬ Give a name to the new column containing the mean:
cats |>
summarise(mean = mean(mass))
# A tibble: 1 Γ 1
mean
<dbl>
1 4.51
Another useful feature of dataframe outputs is that it is easy to add more columns.
To add additional summary statistics, we can just add more arguments to the summarise()
function. These are separated by commas.
π¬ Find the mean, standard deviation and sample size:
# A tibble: 1 Γ 3
mean sd n
<dbl> <dbl> <int>
1 4.51 1.20 62
There is no function in R for the standard error of the mean, but we can calculate it from the standard deviation and sample size. The formula is:
π Add a column for the standard error which is given by \(\frac{s.d.}{\sqrt{n}}\). The function for the square root is sqrt()
.
# A tibble: 1 Γ 4
mean sd n se
<dbl> <dbl> <int> <dbl>
1 4.51 1.20 62 0.152
So far we have only summarised the whole column. We can also summarise by groups in another column.
The pipe operator (|>
) is a great way to chain together multiple operations. We can use it to group the data by a column and then summarise it.
π¬ Find the mean, standard deviation, sample size and standard error for each coat colour:
cats |>
group_by(coat) |>
summarise(mean = mean(mass),
sd = sd(mass),
n = length(mass),
se = sd / sqrt(n))
# A tibble: 6 Γ 5
coat mean sd n se
<chr> <dbl> <dbl> <int> <dbl>
1 black 4.63 1.33 23 0.278
2 calico 2.19 NA 1 NA
3 ginger 4.46 1.12 10 0.353
4 tabby 4.86 0.444 8 0.157
5 tortoiseshell 4.50 0.929 5 0.416
6 white 4.34 1.34 15 0.345
We can save the results of the summarise operation to a new dataframe using the assignment operator (<-
). This is useful if we want to use the results later.
π¬ Assign the summarised data to a new dataframe called cats_summary
:
We often want to work with a subset of the rows in our data. This might be to focus on one group or to remove data through quality control. The filter()
function is used to filter rows in a dataframe. It takes a logical expression as an argument and returns only the rows that match the expression.
π¬ Filter to only the black cats
cats_black <- cats |>
filter(coat == "black")
π¬ Or only the not black cats:
cats |>
filter(coat != "black")
# A tibble: 39 Γ 2
coat mass
<chr> <dbl>
1 white 2.8
2 tabby 4.96
3 tortoiseshell 5.86
4 white 4.3
5 tabby 4.92
6 white 3.23
7 white 6.53
8 tortoiseshell 3.53
9 tortoiseshell 4.52
10 ginger 5.18
# βΉ 29 more rows
!
is the NOT operator. It negates whatever comes next. So coat != "black"
returns all rows where the coat is not black.
We can use the pipe operator to chain together multiple operations. For example, we can filter the data and then summarise it in one step. This means we donβt have to create a new dataframe for the filtered data. For exploratory analysis, this is often the best approach to avoid clutter.
π¬ Summarise the black cats:
cats |>
filter(mass > 6,
coat == "ginger")
# A tibble: 2 Γ 2
coat mass
<chr> <dbl>
1 ginger 6.37
2 ginger 6.09
cats |>
filter(coat != "calico")
# A tibble: 61 Γ 2
coat mass
<chr> <dbl>
1 white 2.8
2 tabby 4.96
3 tortoiseshell 5.86
4 black 5.74
5 white 4.3
6 tabby 4.92
7 black 6.23
8 white 3.23
9 black 5.91
10 black 5.25
# βΉ 51 more rows
cats_summary |>
filter(coat != "calico")
# A tibble: 5 Γ 5
coat mean sd n se
<chr> <dbl> <dbl> <int> <dbl>
1 black 4.63 1.33 23 0.278
2 ginger 4.46 1.12 10 0.353
3 tabby 4.86 0.444 8 0.157
4 tortoiseshell 4.50 0.929 5 0.416
5 white 4.34 1.34 15 0.345
# A tibble: 5 Γ 5
coat mean sd n se
<chr> <dbl> <dbl> <int> <dbl>
1 black 4.63 1.33 23 0.278
2 ginger 4.46 1.12 10 0.353
3 tabby 4.86 0.444 8 0.157
4 tortoiseshell 4.50 0.929 5 0.416
5 white 4.34 1.34 15 0.345
cats_summary |>
drop_na(sd)
# A tibble: 5 Γ 5
coat mean sd n se
<chr> <dbl> <dbl> <int> <dbl>
1 black 4.63 1.33 23 0.278
2 ginger 4.46 1.12 10 0.353
3 tabby 4.86 0.444 8 0.157
4 tortoiseshell 4.50 0.929 5 0.416
5 white 4.34 1.34 15 0.345
cats <- cats |>
mutate(mass_g = mass * 1000)
cats <- cats |>
mutate(is_ginger_chonk = mass > 6 & coat == "ginger")
cats <- cats |>
mutate(size = case_when(mass < 3.5 ~ "small",
mass > 5 ~ "large",
.default = "medium" ))
cats <- cats |>
mutate(size2 = case_when(
mass < 3.5 ~ "small",
mass >= 3.5 & mass <= 5 ~ "medium",
mass > 5 ~ "large"))
write_csv(cats, "cats-added.csv")
Pages made with R (R Core Team 2025), Quarto (Allaire et al. 2024), knitr
(Xie 2024, 2015, 2014), kableExtra
(Zhu 2024)