Data manipulation

Published

14 July, 2025

Introduction

  • Summarising data
    • Whole column
    • By groups in another column
  • Filtering rows
  • Adding columns
  • Writing data to a file

Summarising

Whole column

We can use summarise() to get summary statistics for a whole column.

🎬 Find the mean with:

cats |> 
  summarise(mean(mass))
# A tibble: 1 Γ— 1
  `mean(mass)`
         <dbl>
1         4.51

The result is a dataframe (not a vector). This is one of the very useful features of dplyr and the tidyverse. Dataframes are the main data structure in R, and most functions in the tidyverse return dataframes. This makes it easy to chain together multiple operations using the pipe operator (|>).

By default the name of the column is the function name. This is not very informative, so we can give it a more meaningful name.

Naming the column

To name the column, we can use the name = argument in the summarise() function.

🎬 Give a name to the new column containing the mean:

cats |> 
  summarise(mean = mean(mass))
# A tibble: 1 Γ— 1
   mean
  <dbl>
1  4.51

Another useful feature of dataframe outputs is that it is easy to add more columns.

Adding other summary stats

To add additional summary statistics, we can just add more arguments to the summarise() function. These are separated by commas.

🎬 Find the mean, standard deviation and sample size:

cats |>
  summarise(mean = mean(mass),
            sd = sd(mass),
            n = length(mass))
# A tibble: 1 Γ— 3
   mean    sd     n
  <dbl> <dbl> <int>
1  4.51  1.20    62

There is no function in R for the standard error of the mean, but we can calculate it from the standard deviation and sample size. The formula is:

πŸ› Add a column for the standard error which is given by \(\frac{s.d.}{\sqrt{n}}\). The function for the square root is sqrt().

Answer - don’t look until you have tried!
cats |> 
  summarise(mean = mean(mass),
            sd = sd(mass),
            n = length(mass),
            se = sd / sqrt(n))
# A tibble: 1 Γ— 4
   mean    sd     n    se
  <dbl> <dbl> <int> <dbl>
1  4.51  1.20    62 0.152

So far we have only summarised the whole column. We can also summarise by groups in another column.

By groups given in another column

The pipe operator (|>) is a great way to chain together multiple operations. We can use it to group the data by a column and then summarise it.

🎬 Find the mean, standard deviation, sample size and standard error for each coat colour:

cats |> 
  group_by(coat) |> 
  summarise(mean = mean(mass),
            sd = sd(mass),
            n = length(mass),
            se = sd / sqrt(n))
# A tibble: 6 Γ— 5
  coat           mean     sd     n     se
  <chr>         <dbl>  <dbl> <int>  <dbl>
1 black          4.63  1.33     23  0.278
2 calico         2.19 NA         1 NA    
3 ginger         4.46  1.12     10  0.353
4 tabby          4.86  0.444     8  0.157
5 tortoiseshell  4.50  0.929     5  0.416
6 white          4.34  1.34     15  0.345

Save the results with assignment

We can save the results of the summarise operation to a new dataframe using the assignment operator (<-). This is useful if we want to use the results later.

🎬 Assign the summarised data to a new dataframe called cats_summary:

cats_summary <- cats |> 
  group_by(coat) |> 
  summarise(mean = mean(mass),
            sd = sd(mass),
            n = length(mass),
            se = sd / sqrt(n))

Filtering rows

We often want to work with a subset of the rows in our data. This might be to focus on one group or to remove data through quality control. The filter() function is used to filter rows in a dataframe. It takes a logical expression as an argument and returns only the rows that match the expression.

Filter to only the black cats

🎬 Filter to only the black cats

cats_black <- cats |> 
  filter(coat == "black")

🎬 Or only the not black cats:

cats |> 
  filter(coat != "black")
# A tibble: 39 Γ— 2
   coat           mass
   <chr>         <dbl>
 1 white          2.8 
 2 tabby          4.96
 3 tortoiseshell  5.86
 4 white          4.3 
 5 tabby          4.92
 6 white          3.23
 7 white          6.53
 8 tortoiseshell  3.53
 9 tortoiseshell  4.52
10 ginger         5.18
# β„Ή 29 more rows

! is the NOT operator. It negates whatever comes next. So coat != "black" returns all rows where the coat is not black.

Piping is great

We can use the pipe operator to chain together multiple operations. For example, we can filter the data and then summarise it in one step. This means we don’t have to create a new dataframe for the filtered data. For exploratory analysis, this is often the best approach to avoid clutter.

🎬 Summarise the black cats:

cats |> 
  filter(coat == "black") |> 
  summarise(mean = mean(mass),
            n = length(mass))
# A tibble: 1 Γ— 2
   mean     n
  <dbl> <int>
1  4.63    23

Filtering on more than one condition

cats |> 
  filter(mass > 6,
         coat == "ginger")
# A tibble: 2 Γ— 2
  coat    mass
  <chr>  <dbl>
1 ginger  6.37
2 ginger  6.09

filter the raw data to remove calico

cats |> 
  filter(coat != "calico")
# A tibble: 61 Γ— 2
   coat           mass
   <chr>         <dbl>
 1 white          2.8 
 2 tabby          4.96
 3 tortoiseshell  5.86
 4 black          5.74
 5 white          4.3 
 6 tabby          4.92
 7 black          6.23
 8 white          3.23
 9 black          5.91
10 black          5.25
# β„Ή 51 more rows
cats_summary |> 
  filter(coat != "calico")
# A tibble: 5 Γ— 5
  coat           mean    sd     n    se
  <chr>         <dbl> <dbl> <int> <dbl>
1 black          4.63 1.33     23 0.278
2 ginger         4.46 1.12     10 0.353
3 tabby          4.86 0.444     8 0.157
4 tortoiseshell  4.50 0.929     5 0.416
5 white          4.34 1.34     15 0.345
cats_summary |> 
  filter(!is.na(sd))
# A tibble: 5 Γ— 5
  coat           mean    sd     n    se
  <chr>         <dbl> <dbl> <int> <dbl>
1 black          4.63 1.33     23 0.278
2 ginger         4.46 1.12     10 0.353
3 tabby          4.86 0.444     8 0.157
4 tortoiseshell  4.50 0.929     5 0.416
5 white          4.34 1.34     15 0.345
cats_summary |> 
  drop_na(sd)
# A tibble: 5 Γ— 5
  coat           mean    sd     n    se
  <chr>         <dbl> <dbl> <int> <dbl>
1 black          4.63 1.33     23 0.278
2 ginger         4.46 1.12     10 0.353
3 tabby          4.86 0.444     8 0.157
4 tortoiseshell  4.50 0.929     5 0.416
5 white          4.34 1.34     15 0.345

Adding columns

cats <- cats |> 
  mutate(mass_g = mass * 1000)
cats <- cats |> 
  mutate(is_ginger_chonk = mass > 6 & coat == "ginger")
cats <- cats |> 
  mutate(size = case_when(mass < 3.5 ~ "small",
                          mass > 5 ~ "large",
                          .default = "medium"  ))
cats <- cats |> 
  mutate(size2 = case_when(
    mass < 3.5 ~ "small",
    mass >= 3.5 & mass <= 5 ~ "medium",
    mass > 5 ~ "large"))

Writing data to a file

write_csv(cats, "cats-added.csv")

Pages made with R (R Core Team 2025), Quarto (Allaire et al. 2024), knitr (Xie 2024, 2015, 2014), kableExtra (Zhu 2024)

References

Allaire, J. J., Charles Teague, Carlos Scheidegger, Yihui Xie, Christophe Dervieux, and Gordon Woodhull. 2024. β€œQuarto.” https://doi.org/10.5281/zenodo.5960048.
R Core Team. 2025. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Xie, Yihui. 2014. β€œKnitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC.
β€”β€”β€”. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. https://yihui.org/knitr/.
β€”β€”β€”. 2024. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.
Zhu, Hao. 2024. kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.