class: center, middle, inverse, title-slide

# Introduction to R and working with data.
## White Rose BBSRC DTP Training: An Introduction to Reproducible Analyses in R.
### Emma Rand
### University of York, UK

---

<style>
div.blue { background-color:#b0cdef; border-radius: 5px; padding: 20px;}
div.grey { background-color:#d3d3d3; border-radius: 0px; padding: 0px;}
</style>

# Overview

* Finding your way round RStudio
* Typing in data, doing some calculations on it, plotting it
* Understanding the manual
* Working with data
* Importing data: working directories and paths
* Summarising and visualising with the [`tidyverse`](https://www.tidyverse.org/)

<img src="../pics/tidyverse_logo.png" width="160px" style="display: block; margin: auto 0 auto auto;" />

---
class: inverse

# Finding your way round RStudio

---

# RStudio: live demonstration

Overview [Larger](http://www-users.york.ac.uk/~er13/RStudio%20Anatomy.svg). **Will be followed by a recap**

<img src="http://www-users.york.ac.uk/~er13/RStudio%20Anatomy.svg" width="600px" />

There is an [RStudio cheatsheet](http://www-users.york.ac.uk/~er13/rstudio-ide.pdf) which covers more advanced RStudio features.
---

# RStudio: Recap

* the panels
* making yourself comfortable
* typing in the console and sending commands
* using R as a calculator
* assigning values
* where to see objects
* using a script - make sure to execute
* comments \#
* data types and structures
* functions `c()`, `class()` and `str()`
* types of R files: .R, .RData, .Rhistory

---

# RStudio: Recap

.pull-left[
Top left Panel

* Script - write and edit code and comments to keep

***

Bottom left Panel

* Console - where commands get executed and can be typed
]

.pull-right[
Top right Panel

* Environment - see your objects
* History - of commands

***

Bottom right Panel

* Files - a file explorer
* Packages - those installed and a method of installing
* Help - the manual
* Plots
]

---

# RStudio: Recap

Type of file

* .R: a script file. Code and comments
* .RData: an environment file, also known as a workspace. Objects but no code
* .Rhistory: everything you typed, mostly wrong!

Using a script

* any R code can be executed from a script
* code can be (should be!) commented
* comments start with a `#`

---

# RStudio: Recap

Data types and structures

These are the most commonly needed but there are others

.pull-left[
Types

* numeric
* integer
* logical
* character
]

.pull-right[
Structures

* vectors
* factors
* dataframes
]

---
class: inverse

# Typing in data, doing some calculations on it, plotting it

---

# Typing in data, calcs, plots

## The goal

We will work with some data on the number of males in 64 bird nests with a clutch size of 5.

You are going to type the data into R, summarise it and plot it.

--

.pull-left[
<table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:right;"> No. males </th> <th style="text-align:right;"> No.
nests </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 4 </td> </tr>
  <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 13 </td> </tr>
  <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 14 </td> </tr>
  <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 15 </td> </tr>
  <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 13 </td> </tr>
  <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 5 </td> </tr>
</tbody>
</table>
]

.pull-right[
<img src="02_intro_to_r_and_working_with_data_files/figure-html/unnamed-chunk-4-1.png" width="288" />
]

---

# Typing in data, calcs, plots

## Getting set up

In RStudio do File | New project | New directory

Be purposeful about where you create it and name it. I suggest `birds`

--

Make a new script and save it as analysis.R to carry out the rest of the work.

---

# Typing in data, calcs, plots

## Data structures

Make a vector `n` that holds the numbers 0 to 5.

```r
# the number of males in a clutch of five
n <- c(0, 1, 2, 3, 4, 5)
```

* Write your command in analysis.R

--

* Notice I have used a comment

--

* Put the cursor on the line you want to execute

--

* Execute with the Run button or Ctrl+Enter

---

# Typing in data, calcs, plots

## Data structures

Let's take a look at it using `str()` (structure) and `class()`

--

```r
class(n)
```

```
## [1] "numeric"
```

--

```r
str(n)
```

```
##  num [1:6] 0 1 2 3 4 5
```

It's a numeric vector.

---

# Typing in data, calcs, plots

## Data structures

Create a vector, `freq`, containing the numbers of nests with 0 to 5 males.

```r
# the number of nests with 0 to 5 males
freq <- c(4, 13, 14, 15, 13, 5)
```

---

# Typing in data, calcs, plots

## Total number of nests

We can use `sum(freq)` to check the total number of nests is 64.
```r
# the total number of nests
sum(freq)
```

```
## [1] 64
```

---

# Typing in data, calcs, plots

## Finding the mean

We have frequencies so to find the mean number of males per nest we need the total number of males:

.pull-left[
<table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:left;"> No. males </th> <th style="text-align:left;"> No. nests </th> <th style="text-align:left;"> No. males *No. nests </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 0 </td> </tr>
  <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 13 </td> <td style="text-align:left;"> 13 </td> </tr>
  <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 14 </td> <td style="text-align:left;"> 28 </td> </tr>
  <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 15 </td> <td style="text-align:left;"> 45 </td> </tr>
  <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 13 </td> <td style="text-align:left;"> 52 </td> </tr>
  <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 25 </td> </tr>
  <tr> <td style="text-align:left;font-weight: bold;background-color: #C0C2C9 !important;"> Total </td> <td style="text-align:left;font-weight: bold;background-color: #C0C2C9 !important;"> 64 </td> <td style="text-align:left;font-weight: bold;background-color: #C0C2C9 !important;"> 163 </td> </tr>
</tbody>
</table>
]

.pull-right[
`$$\frac{163}{64} = 2.55$$`
]

---

# Typing in data, calcs, plots

## Finding the mean

Total number of nests:

```r
total_nests <- sum(freq)
```

--

Total number of males

```r
total_males <- sum(n * freq)
```

---

# Typing in data, calcs, plots

## Finding the mean

Mean number of males per nest:

```r
total_males/total_nests
```

```
## [1] 2.546875
```

---

# Why it works

R works 'elementwise' unlike most
programming languages.

`n * freq` gives

`$$\begin{bmatrix}0\\1\\2\\3\\4\\5\end{bmatrix}\times\begin{bmatrix}4\\13\\14\\15\\13\\5\end{bmatrix}=\begin{bmatrix}0\\13\\28\\45\\52\\25\end{bmatrix}$$`

It was designed to make it easy to work with data.

---
class: inverse

# Plotting the data with ggplot()

---
background-image: url(../pics/ggplot2.png)
background-position: 90% 75%
background-size: 200px

# Typing in data, calcs, plots

Commands like `c()`, `sum()`, and `str()` are part of the 'base' R system.

Base packages (collections of commands) always come with R.

--

Other packages, such as `ggplot2` (Wickham, 2016) need to be added.

`ggplot2` is one of the `tidyverse` packages.

---
background-image: url(../pics/tidyverse.png)
background-position: 90% 75%
background-size: 200px

# Typing in data, calcs, plots

You should have already installed `tidyverse` but we need to load it from our library before we can use it in a session.

```r
library(tidyverse)
```

--

Later we will also use functions from `dplyr` and `tidyr`, which are also `tidyverse` packages.

--

`ggplot2` is the name of the package

`ggplot()` is its most important command

---

# Plotting using ggplot2

## Data structure for `ggplot()`

`ggplot()` takes a dataframe as an argument

We can make a dataframe of the two vectors, `n` and `freq`:

```r
nest_data <- data.frame(n = factor(n), freq)
```

`n` was made into a factor (a categorical variable).

---

# Plotting using ggplot2

## Data structure for `ggplot()`

Check:

```r
str(nest_data)
```

```
## 'data.frame':	6 obs. of  2 variables:
##  $ n   : Factor w/ 6 levels "0","1","2","3",..: 1 2 3 4 5 6
##  $ freq: num  4 13 14 15 13 5
```

--

`nest_data` is the name we have given the dataframe

Click on `nest_data` in the Environment.
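---

# Plotting using ggplot2

## Data structure for `ggplot()`

As well as `str()`, a few other base R commands can be used to check a dataframe. This is only a sketch and not part of the original exercise: it rebuilds `nest_data` so that it runs on its own, and the choice of `dim()`, `names()` and `levels()` is just one reasonable set of checks.

```r
# rebuild the vectors and dataframe from the previous slides
n <- c(0, 1, 2, 3, 4, 5)
freq <- c(4, 13, 14, 15, 13, 5)
nest_data <- data.frame(n = factor(n), freq)

# number of rows and columns
dim(nest_data)

# column names
names(nest_data)

# the categories (levels) of the factor n
# the $ picks out one column; it is explained later
levels(nest_data$n)
```

The levels are text labels ("0" to "5") because a factor stores categories rather than numbers.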
---

# Plotting using ggplot2

## A barplot

Create a simple barplot using `ggplot` like this:

```r
ggplot(data = nest_data, aes(x = n, y = freq)) +
  geom_col()
```

<img src="02_intro_to_r_and_working_with_data_files/figure-html/unnamed-chunk-17-1.png" width="288" />

---

# Plotting using ggplot2

## A barplot

`ggplot()` alone creates a blank plot.

--

`ggplot(data = nest_data)` looks the same.

--

`aes()` gives the 'Aesthetic mappings'. How variables (columns) are mapped to visual properties (aesthetics) e.g., axes, colour, shapes.

Thus...

---

# Plotting using ggplot2

## A barplot

`ggplot(data = nest_data, aes(x = n, y = freq))` produces a plot with axes

--

`geom_col` A 'Geom' (Geometric object) gives the visual representations of the data: points, lines, bars, boxplots etc.

---
class: inverse

# Using the help manual

---

# Using the help manual

'Arguments' can be added to the `geom_col()` command.

Commands do something. Their arguments go in brackets and can specify:

* what object to do it to
* how exactly to do it

--

Many commands have defaults so you need only supply an object.

--

Open the manual page using:

```r
?geom_col()
```

## Demonstration

---

# Using the help manual: Recap

* **Description** an overview of what the command does
* **Usage** lists the arguments
  * form: argument name = default value
  * some arguments MUST be supplied, others have defaults
  * ... means etc
* **Arguments** gives the detail about the arguments
* **Details** describes how the command works in more detail
* **Value** gives the output of the command
* Don't be too perturbed by not fully understanding the information

---

# Using manual: Alter a ggplot

Change the fill of the bars using `fill`:

```r
ggplot(data = nest_data, aes(x = n, y = freq)) +
  geom_col(fill = "lightblue")
```

.pull-left[
<img src="02_intro_to_r_and_working_with_data_files/figure-html/unnamed-chunk-19-1.png" width="288" />
]

--

.pull-right[
Colours can be given by their name, "lightblue" or code, "#ADD8E6".
Look up by [name](../pics/colournames.pdf) or [code](../pics/colourhex.pdf)
]

---

# Using manual: Alter a ggplot

`fill` is an aesthetic.

We can set the fill aesthetic to a particular colour inside `geom_col()` or map it to a variable inside `aes()` instead

---

# Using manual: Alter a ggplot

```r
ggplot(data = nest_data, aes(x = n, y = freq, fill = n)) +
  geom_col()
```

.pull-left[
<img src="02_intro_to_r_and_working_with_data_files/figure-html/unnamed-chunk-20-1.png" width="288" />
]

--

.pull-right[
Mapping fill to a variable means the colour varies for each value of n.
]

---

# Using manual: Alter a ggplot

Can you use the manual to put the bars next to each other?

.footnote[
<br>
<span style=" font-weight: bold; color: #f6fafd !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #25496b !important;" >Extra exercise:</span> Change the colour of the lines around each bar to black.]

--

```r
ggplot(data = nest_data, aes(x = n, y = freq)) +
  geom_col(fill = "lightblue", width = 1)
```

<img src="02_intro_to_r_and_working_with_data_files/figure-html/unnamed-chunk-21-1.png" width="288" />

---

# Using manual: Alter a ggplot

<span style=" font-weight: bold; color: #f6fafd !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #25496b !important;" >Extra exercise:</span> Change the colour of the lines around each bar to black.

.pull-left[
```r
ggplot(data = nest_data, aes(x = n, y = freq)) +
  geom_col(fill = "lightblue",
           width = 1,
           colour = "black")
```
]

.pull-right[
<img src="02_intro_to_r_and_working_with_data_files/figure-html/unnamed-chunk-22-1.png" width="288" />
]

---

# Top Tip

<div class = "blue">
.font100[
Make your code easier to read by using white space and new lines

* put spaces around `=`, `<-` and after `,`
* use a newline after every comma in a command with lots of arguments
]
</div>

---

# Alter a ggplot: axes

Changes to a discrete x axis: `scale_x_discrete()`.

Changes to a continuous y axis: `scale_y_continuous()`.

`ggplot` automatically extends the axes slightly. You can turn this behaviour off with the expand argument.

--

Each 'layer' is added to the ggplot() command with a `+`

---

# Alter a ggplot: axes

```r
ggplot(data = nest_data, aes(x = n, y = freq)) +
  geom_col(fill = "lightblue",
           width = 1,
           colour = "black") +
* scale_x_discrete(expand = c(0, 0)) +
* scale_y_continuous(expand = c(0, 0))
```

<img src="02_intro_to_r_and_working_with_data_files/figure-html/unnamed-chunk-23-1.png" width="288" />

.pull-right[
.footnote[
<span style=" font-weight: bold; color: #fdf9f6 !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #25496b !important;" >Extra exercise:</span> Look up `scale_x_discrete` in the manual and work out how to change "n" to "Number of Males"]
]

---

# Alter a ggplot: axes

```r
ggplot(data = nest_data, aes(x = n, y = freq)) +
  geom_col(fill = "lightblue",
           width = 1,
           colour = "black") +
  scale_x_discrete(expand = c(0, 0),
*                  name = "Number of Males") +
  scale_y_continuous(expand = c(0, 0),
                     name = "Number of Nests")
```

<img src="02_intro_to_r_and_working_with_data_files/figure-html/unnamed-chunk-24-1.png" width="288" />

---
class: inverse

# Working with imported data

---

# The goal

.pull-left[
Summarise

<table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <thead>
<tr> <th style="empty-cells: hide;border-bottom:hidden;" colspan="2"></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; "> Interorbital distance</div></th> </tr>
  <tr> <th style="text-align:left;"> Population </th> <th style="text-align:right;"> N </th> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> SE </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 40 </td> <td
style="text-align:right;"> 11.24 </td> <td style="text-align:right;"> 0.12 </td> </tr>
  <tr> <td style="text-align:left;"> B </td> <td style="text-align:right;"> 40 </td> <td style="text-align:right;"> 11.70 </td> <td style="text-align:right;"> 0.09 </td> </tr>
</tbody>
</table>
]

.pull-right[
Plot

<img src="02_intro_to_r_and_working_with_data_files/figure-html/unnamed-chunk-26-1.png" width="288" />
]

---
background-image: url(../pics/interorbital.png)
background-position: 100% 0%
background-size: 180px

# Working with Data

## Importing

This section will teach you about three concepts:

--

1. 'working directories', 'paths' and 'relative paths'

--

2. Tidy data

--

3. dealing with data in more than one group

--

We will work with the interorbital distances of domestic pigeons in two different populations: A and B

---

# Working with Data

## Importing

It is good practice to organise your files into a folder structure.

For example, I often use:

---

# Example organisation

.pull-left[
Directories

* systematic
* repeatable

Naming: human and machine readable

* no spaces
* use snake/kebab case
* ordering: numbers (zero left padded), dates
* file extensions
]

.pull-right[
.code40[
```
-- stem_cell_rna_2019
   |__stem_cell_rna_2019.Rproj
   |__raw_data
      |__2019-03-21_donor_1.csv
      |__2019-03-21_donor_2.csv
      |__2019-03-21_donor_3.csv
   |__processed_data
      |__all_long.txt
      |__all_wide.txt
   |__figures
      |__01_volcano_donor_1_vs_donor_2.eps
      |__02_volcano_donor_1_vs_donor_3.eps
   |__functions
      |__01_file_import
      |__02_normalise.R
      |__theme_pca.R
      |__theme_volcano.R
   |__pics
      |__01_image.png
      |__01_image.png
   |__README.md
   |__refs
      |__r_refs.bib
      |__proj_refs.bib
   |__analyses
      |__01_data_processing.R
      |__02_exploratory.R
      |__03_modelling.R
      |__04_figures.R
      |__05_report.Rmd
```
]
]

---

# Working with Data

## Organising

Make a folder called 'data' in your Project directory (this is also your working directory)

The easiest way to do this is in RStudio - see the bottom right Files panel

--

Save a copy of
[pigeon.txt](../data/pigeon.txt) to the 'data' folder

---

# Working with Data

## Start coding

Make a new script called 'analysis.R'

--

Add this code:

```r
# load packages
library(tidyverse)
```

We need to load the `tidyverse` packages for several of the commands we will use.

---

# Working with Data

## Importing

To read the data into R you need to use the 'relative path' to the file in the `read_table()` command:

```r
pigeon <- read_table("data/pigeon.txt")
```

--

The `data/` part is the 'relative path' to the file.

--

It says where the file is **relative to your working directory**

pigeon.txt is inside a folder (directory) called 'data' which is in your working directory.

---

# Working with Data

Check all is well by looking at the structure of the dataframe pigeon using `str()`

```r
str(pigeon)
```

```
## spec_tbl_df [40 x 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ A: num [1:40] 12.4 11.2 11.6 12.3 11.8 10.7 11.3 11.6 12.3 10.5 ...
##  $ B: num [1:40] 12.6 11.3 12.1 12.2 11.8 11.5 11.2 11.9 11.2 12.1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   A = col_double(),
##   ..   B = col_double()
##   .. )
```

---

# Working with Data

## Understanding the dataframe

A dataframe is made of columns and rows

The columns are the variables; the rows are the observations

--

To refer to a column inside a dataframe (a single column of data) we need to use the dollar notation: `dataframe$columnname`

---

# Working with Data

## Understanding the dataframe

So to output all of the values of one column we can use:

.scroll-output-width[
```r
pigeon$A
```

```
##  [1] 12.4 11.2 11.6 12.3 11.8 10.7 11.3 11.6 12.3 10.5 12.1 10.4 10.8 11.9 10.9
## [16] 10.8 10.4 12.0 11.7 11.3 11.5 11.8 10.3 10.3 11.5 10.7 11.3 11.6 13.3 10.7
## [31] 12.1 10.2 10.8 11.4 10.9 10.3 10.4 10.0 11.2 11.3
```
]

---

# Working with Data

## Summarising multi-group data

Thus to find the mean of a single column of data:

.pull-left[
```r
mean(pigeon$A)
```

```
## [1] 11.24
```

```r
mean(pigeon$B)
```

```
## [1] 11.7025
```
]

.pull-right[
This relies on the groups being in different columns!
]

---

# Working with Data

## Tidy format

Instead of having a population in each column, we often have, **and want**, all measurements in one column with a second column giving the group.

--

This format is described as 'tidy' (Wickham, Averick, et al., 2019).

--

Has one variable in each column and only one observation (case) per row.

--

Captures the structure of data and allows you to specify the role of variables in analyses and visualisations.

---

# Data Organisation

## What is tidy data?

One response per row.

Tidy data adhere to a consistent structure which makes it easier to manipulate, model and visualize them. The structure is defined by:

1. Each variable has its own column.
2. Each observation has its own row.
3. Each value has its own cell.

---

# Data Organisation

## What is tidy data?

The term 'tidy data' was popularised by Wickham (2014).

Closely allied to the relational algebra of relational databases (Codd, 1990).

Underlies the enforced rectangular formatting in SPSS, STATA and R's dataframe.

--

There may be more than one potential tidy structure.

---

# Working with Data

## Tidy format

Suppose we had just 3 individuals in each of two populations:

.pull-left[
NOT TIDY!

<table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:right;"> A </th> <th style="text-align:right;"> B </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:right;"> 12.4 </td> <td style="text-align:right;"> 12.6 </td> </tr>
  <tr> <td style="text-align:right;"> 11.2 </td> <td style="text-align:right;"> 11.3 </td> </tr>
  <tr> <td style="text-align:right;"> 11.6 </td> <td style="text-align:right;"> 12.1 </td> </tr>
</tbody>
</table>
]

.pull-right[
TIDY!

<table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:left;"> population </th> <th style="text-align:right;"> distance </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 12.4 </td> </tr>
  <tr> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 11.2 </td> </tr>
  <tr> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 11.6 </td> </tr>
  <tr> <td style="text-align:left;"> B </td> <td style="text-align:right;"> 12.6 </td> </tr>
  <tr> <td style="text-align:left;"> B </td> <td style="text-align:right;"> 11.3 </td> </tr>
  <tr> <td style="text-align:left;"> B </td> <td style="text-align:right;"> 12.1 </td> </tr>
</tbody>
</table>
]

---

# Working with Data

We can make the data tidy with `pivot_longer()`<sup>1</sup>.

.footnote[
[1] `pivot_longer()` is a function from a package called `tidyr` which is one of the `tidyverse` packages.
]

--

`pivot_longer()` collects the values from specified columns (`cols`) into a single column (`values_to`) and creates a column to indicate the group (`names_to`).
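---

# Working with Data

As a sketch of what `pivot_longer()` does, we can try it on the three-individual example first. The names `toy` and `toy_tidy` are made up for illustration and are not part of the pigeon analysis.

```r
# tidyr provides pivot_longer(); library(tidyverse) loads it too
library(tidyr)

# the untidy example: one column per population
toy <- data.frame(A = c(12.4, 11.2, 11.6),
                  B = c(12.6, 11.3, 12.1))

# collect both columns into a population column and a distance column
toy_tidy <- pivot_longer(data = toy,
                         cols = everything(),
                         names_to = "population",
                         values_to = "distance")
toy_tidy
```

The rows come out in the order A, B, A, B, ... because `pivot_longer()` works across each row of the original in turn.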
---

# Working with Data

.scroll-output-width[
```r
pigeon2 <- pivot_longer(data = pigeon,
                        cols = everything(),
                        names_to = "population",
                        values_to = "distance")
str(pigeon2)
```

```
## tibble [80 x 2] (S3: tbl_df/tbl/data.frame)
##  $ population: chr [1:80] "A" "B" "A" "B" ...
##  $ distance  : num [1:80] 12.4 12.6 11.2 11.3 11.6 12.1 12.3 12.2 11.8 11.8 ...
```
]

A 'tibble' `\(\approx\)` dataframe

---

# Working with Data

Now we have a dataframe in tidy format which *will* make it easier to summarise, analyse and visualise.

--

To summarise data in this format we use the `group_by()` and `summarise()` functions.

--

We will also use the pipe operator: ` %>% `

---

# Working with Data

To summarise multiple group data in tidy form:

```r
pigeon2 %>%
  group_by(population) %>%
  summarise(mean = mean(distance))
```

--

This can be read as:

- take pigeon2 *and then*
- group it by population *and then*
- summarise it by calculating the mean

i.e., the mean is done for each population.

--

The `mean` before the `=` is just a name.

---

# Working with Data

The result:

```
## # A tibble: 2 x 2
##   population  mean
##   <chr>      <dbl>
## 1 A           11.2
## 2 B           11.7
```

---

# Working with Data

We can add the number of pigeons in each group to the summary using the `length()` function.
```r
pigeon2 %>%
  group_by(population) %>%
  summarise(mean = mean(distance),
*           n = length(distance))
```

.footnote[
<span style=" font-weight: bold; color: #fdf9f6 !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #25496b !important;" >Extra exercise:</span> Add a column for the standard deviation

<span style=" font-weight: bold; color: #fdf9f6 !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #25496b !important;" >Extra exercise:</span> Add a column for the standard error given by `\(\frac{s.d.}{\sqrt{n}}\)`
]

---

# Working with Data

The result:

```
## # A tibble: 2 x 3
##   population  mean     n
##   <chr>      <dbl> <int>
## 1 A           11.2    40
## 2 B           11.7    40
```

---

# Working with Data

```r
pigeon2 %>%
  group_by(population) %>%
  summarise(mean = mean(distance),
            n = length(distance),
*           sd = sd(distance),
*           se = sd/sqrt(n))
```

```
## # A tibble: 2 x 5
##   population  mean     n    sd     se
##   <chr>      <dbl> <int> <dbl>  <dbl>
## 1 A           11.2    40 0.740 0.117
## 2 B           11.7    40 0.573 0.0906
```

---

# Working with Data

To plot this data as a histogram:

```r
ggplot(data = pigeon2, aes(x = distance)) +
* geom_histogram(bins = 10, col = "black") +
  scale_x_continuous(name = "Interorbital distance (mm)")
```

.pull-left[
<img src="02_intro_to_r_and_working_with_data_files/figure-html/unnamed-chunk-44-1.png" width="288" />
]

--

.pull-right[
`geom_histogram()` is the `geom` and `bins` gives the number of bars.

This is the whole data set, not separated by population!
]

---

# Working with Data

To plot multiple group data in tidy form we map the population variable to the `fill` aesthetic

```r
*ggplot(data = pigeon2, aes(x = distance, fill = population)) +
  geom_histogram(bins = 10, col = "black") +
  scale_x_continuous(name = "Interorbital distance (mm)")
```

<img src="02_intro_to_r_and_working_with_data_files/figure-html/unnamed-chunk-45-1.png" width="288" />

.pull-right[
.footnote[
<span style=" font-weight: bold; color: #fdf9f6 !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #25496b !important;" >Extra exercise:</span> Make the axes cross at (0,0)]
]

---

# Working with Data

<span style=" font-weight: bold; color: #fdf9f6 !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #25496b !important;" >Extra exercise:</span> Make the axes cross at (0,0)

```r
ggplot(data = pigeon2, aes(x = distance, fill = population)) +
  geom_histogram(bins = 10, col = "black") +
  scale_x_continuous(name = "Interorbital distance (mm)",
*                    expand = c(0, 0)) +
  scale_y_continuous(name = "Frequency",
*                    expand = c(0, 0))
```

result on next slide.

---

# Working with Data

The result:

<img src="02_intro_to_r_and_working_with_data_files/figure-html/unnamed-chunk-46-1.png" width="288" />

---

# Working with Data

`geom_density()` can also be used when `distance` is mapped to `x` and `y` gives a measure of occurrence.
```r
ggplot(data = pigeon2, aes(x = distance, fill = population)) +
* geom_density(col = "black") +
  scale_x_continuous(name = "Interorbital distance (mm)")
```

<img src="02_intro_to_r_and_working_with_data_files/figure-html/unnamed-chunk-47-1.png" width="288" />

---

# Working with Data

Alter the transparency using `alpha`:

```r
ggplot(data = pigeon2, aes(x = distance, fill = population)) +
* geom_density(col = "black", alpha = 0.3) +
  scale_x_continuous(name = "Interorbital distance (mm)")
```

<img src="02_intro_to_r_and_working_with_data_files/figure-html/unnamed-chunk-48-1.png" width="288" />

---

# Working with Data

Formatting figures for inclusion in reports?

All [elements can be customised individually](https://ggplot2.tidyverse.org/reference/theme.html) but `theme_classic()` takes care of many options you are likely to desire.

---

# Working with Data

```r
ggplot(data = pigeon2, aes(x = distance, fill = population)) +
  geom_density(col = "black", alpha = 0.3) +
  scale_x_continuous(name = "Interorbital distance (mm)") +
* theme_classic()
```

<img src="02_intro_to_r_and_working_with_data_files/figure-html/unnamed-chunk-49-1.png" width="288" />

---

# Working with Data

A different kind of plot:

.pull-left[
<img src="02_intro_to_r_and_working_with_data_files/figure-html/unnamed-chunk-50-1.png" width="288" />
]

.pull-right[
Note: We need to change the `aes()` as well as the `geom` because this figure has population on the x axis.
]

---

# Working with Data

```r
ggplot(data = pigeon2, aes(x = population, y = distance)) +
  geom_boxplot() +
  scale_x_discrete(name = "Population") +
  scale_y_continuous(name = "Interorbital distance (mm)",
                     expand = c(0, 0),
                     limits = c(0, 15)) +
  theme_classic()
```

.footnote[
<span style=" font-weight: bold; color: #fdf9f6 !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #25496b !important;" >Extra exercise:</span> Can you (gratuitously) colour the boxes by population too?
]

---

# Working with Data

<img src="02_intro_to_r_and_working_with_data_files/figure-html/unnamed-chunk-51-1.png" width="288" />

---

# Summary

* Use a script and comment it
* organise analyses and use relative paths
* shortcuts: `<-` is Alt+minus; `%>%` is Ctrl+Shift+M
* objects are seen in the Environment window
* data is read into R from files into dataframes
* the dataframe is a common data structure
* you'll eventually get used to the manual!
* a `ggplot` has a `data` argument and an `aesthetic` argument; layers are added with a `+`; `geoms` determine how the data are plotted

---
class: inverse

# 🎈 Congratulations! Keep practising! 🎂

---

# References

.footnote[
.font60[
Slides made with xaringan (Xie, 2019) and xaringanExtra (Aden-Buie, 2020)
]
]

.font60[
Aden-Buie, G. (2020). _xaringanExtra: Extras And Extensions for Xaringan Slides_. R package version 0.2.3.9000. URL: [https://github.com/gadenbuie/xaringanExtra](https://github.com/gadenbuie/xaringanExtra).

Codd, E. F. (1990). _The Relational Model for Database Management: Version 2_. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.

Wickham, H. (2014). "Tidy Data". In: _Journal of Statistical Software, Articles_ 59.10, pp. 1-23.

Wickham, H. (2016). _ggplot2: Elegant Graphics for Data Analysis_. Springer-Verlag New York. ISBN: 978-3-319-24277-4. URL: [https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org).

Wickham, H., M. Averick, et al. (2019). "Welcome to the Tidyverse". In: _JOSS_ 4.43, p. 1686.

Xie, Y. (2019). _xaringan: Presentation Ninja_. R package version 0.12. URL: [https://CRAN.R-project.org/package=xaringan](https://CRAN.R-project.org/package=xaringan).
]

---

# Intro to Repro in R

Emma Rand
[emma.rand@york.ac.uk](mailto:emma.rand@york.ac.uk)
Twitter: [@er13_r](https://twitter.com/er13_r)
GitHub: [3mmaRand](https://github.com/3mmaRand)
blog: https://buzzrbeeline.blog/

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">Licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.</span>

Rand, E. (2021). White Rose BBSRC DTP Training: An Introduction to Reproducible Analyses in R (Version v1.1). https://doi.org/10.5281/zenodo.4701167