class: center, middle, inverse, title-slide

# Advanced data import.
## White Rose BBSRC DTP Training: An Introduction to Reproducible Analyses in R.
### Emma Rand
### University of York, UK

---

<style>
div.blue { background-color:#b0cdef; border-radius: 5px; padding: 20px;}
div.grey { background-color:#d3d3d3; border-radius: 0px; padding: 0px;}
</style>

# Aims

The aim of this session is to strengthen your ability to import your own data files regardless of the formatting, and to introduce you to some of the other ways to import data such as through web scraping and via APIs.

---

# Learning outcomes

The successful student will be able to:

* use an understanding of what matters in their data import
* import multiple files
* import plain text and proprietary data formats stored locally and on the web
* carry out some simple web scraping
* describe the process of accessing data through an API
* start to appreciate the number of packages available for importing publicly accessible data

🎬 An instruction to do something!!

---

# Four Aspects

There are four main considerations for data import:

.font80[

1. **Where are the data?** Are they stored locally (on your own computer) or remotely (on another computer/server)? Are they distributed rather than in a single repository or file?

2. **What format are they in?** Are they in plain text format, structured text such as XML or JSON, or in a proprietary software format?

3. **How?** Should we use base R functions or a specialised package? Are the data available through an API and, if so, is there already a package to access it?

4. **What data structure will be created in R?** Often this will be dataframes or dataframe-like structures (e.g., tibbles) but there are many specialised data structures for biological data.

]

---
class: inverse

# Locally stored text files

---

# Overview

This section may be familiar to those with experience. You may want to skip ahead.

Plain text files:

- can be opened in notepad or another text editor and make sense
- often have file extensions that indicate the format
- have columns 'delimited' by a particular character or at fixed positions.

---

# Overview

They can be read into R using:

- the base package with functions such as `read.table()` and `read.csv()`

--

- Or with the `readr` package functions such as `read_table()`, `read_table2()`, `read_csv()` and `read_delim()`, which are usually my preference because they are a little faster and output tibbles.

--

- Or with the even faster `fread()` from the `data.table` package <a name=cite-Dowle_Srinivasan_2019></a>([Dowle and Srinivasan, 2019](#bib-Dowle_Srinivasan_2019)).

---

# Overview

We will focus on `readr` package functions such as `read_table()`, `read_table2()`, `read_csv()` and `read_delim()`.

--

You might want to note those down.
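---

# Overview

As a quick illustration, the same comma separated values file could be read with any of the three approaches. This is a minimal sketch only - `mydata.csv` is a hypothetical file name used for illustration:

```r
mydata <- read.csv("data-raw/mydata.csv")            # base R, returns a data.frame
mydata <- readr::read_csv("data-raw/mydata.csv")     # readr, returns a tibble
mydata <- data.table::fread("data-raw/mydata.csv")   # data.table, returns a data.table
```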
---

# Exercise 1

Data file: [structurepred.txt](data-raw/structurepred.txt)

These data are root mean square deviations (RMSD) in `\(\unicode{x212B}\)`ngstrom between predicted and actual protein structures. Three programs were used to make predictions for 30 proteins.

--

There are three columns: rmsd, prog and prot

--

🎬 Save a copy of the file to your `data-raw` folder and read it in with:

--

```r
file <- "data-raw/structurepred.txt"
protein_struc <- read_table(file)
```

---

# Exercise 1

🎬 Examine the result:

```r
glimpse(protein_struc)
```

```
## Rows: 90
## Columns: 3
## $ rmsd <dbl> 9.038, 14.952, 17.734, 3.117, 11.284, 12.761, 9.759, 9.715, 0.684~
## $ prog <chr> "Abstruct", "Abstruct", "Abstruct", "Abstruct", "Abstruct", "Abst~
## $ prot <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19~
```

😒 The command runs but the result looks wrong. All the variables have been read into one column.

---

# Exercise 1

🎬 Look up `read_table()` using `?read_table()`. Can you work out how to read the three variables into three columns?

.footnote[
<span style=" font-weight: bold; color: #fdf9f6 !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #25496b !important;" >Extra exercise:</span> If you have worked that out try reading in these data: [csativa.txt](data-raw/csativa.txt).]

---

# Exercise 1 Answer

```r
protein_struc <- read_table2(file)
```

> `read_table2()` ... allows any number of whitespace characters between columns, and the lines can be of different lengths.

> `read_table()` is more strict, each line must be the same length, and each field is in the same position in every line.

--

Each line in [structurepred.txt](data-raw/structurepred.txt) is not the same length (i.e., it is *not* a fixed width file), which is one of the requirements of `read_table()`.

---

# Extra Exercise answer

<span style=" font-weight: bold; color: #fdf9f6 !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #25496b !important;" >Extra exercise:</span> If you have worked that out try reading in these data: [csativa.txt](data-raw/csativa.txt).

```r
file2 <- "data-raw/csativa.txt"
csat <- read_table2(file2, skip = 4)
```

The first 4 lines of the data file give information about the data. We only need to read data from line 5.

---

# Exercise 1 addendum

A function that can save you having to open a text file to look, or waiting a long time to discover an import failure, is `readLines()`:

```r
readLines(file, n = 5)
```

```
## [1] "rmsd prog prot"    "9.038 Abstruct 1"  "14.952 Abstruct 2"
## [4] "17.734 Abstruct 3" "3.117 Abstruct 4"
```

It helps you identify that each line is not the same length.

--

This is especially useful when the file is very big and takes ages to open or import.

---

# Exercise 2

Data file: [Icd10Code.csv](data-raw/Icd10Code.csv)

ICD-10 codes are alphanumeric codes used by doctors, health insurance companies, and public health agencies across the world to represent diagnoses. This file contains a list of the codes and the diseases they represent.

--

🎬 Save a copy of the file to your `data-raw` folder and read it with the most appropriate `readr` function.

.footnote[
<span style=" font-weight: bold; color: #fdf9f6 !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #25496b !important;" >Extra exercise:</span> If you have worked that out try reading in these data: [grape.txt](data-raw/grape.txt).]

---

# Exercise 2 Answer

```r
file <- "data-raw/Icd10Code.csv"
icd <- read_table2(file)
```

--

You may have quite naturally assumed that a file with a .csv extension would contain comma separated values and used `read_csv()`.
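--

In fact the values are separated by tabs (the next slide shows how to check this), so `read_tsv()` is another reasonable choice. A minimal sketch, assuming the file is in `data-raw`:

```r
icd <- read_tsv(file)
```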
---

# Exercise 2 Answer

The use of `readLines()` reveals it is actually tab, `\t`, separated.

```r
readLines(file, n = 2)
```

```
## [1] "Code\tDescription" "A00\t\"Cholera\""
```

It is not uncommon for people to use file extensions that are misleading, so don't assume a file is what it says on the tin 🕵️♂️

---

# Extra Exercise answer

```r
file2 <- "data-raw/grape.txt"
readLines(file2, n = 2)
```

```
## [1] "linoleic/regime/variety" "26.51/hot/giro"
```

```r
grape <- read_delim(file2, delim = "/")
glimpse(grape)
```

```
## Rows: 48
## Columns: 3
## $ linoleic <dbl> 26.51, 21.56, 21.39, 22.41, 25.04, 26.47, 21.31, 13.79, 14.50~
## $ regime   <chr> "hot", "hot", "hot", "hot", "hot", "hot", "hot", "hot", "warm~
## $ variety  <chr> "giro", "giro", "giro", "giro", "giro", "giro", "giro", "giro~
```

---
class: inverse

# Locally stored, special formats.

---

# Overview

Files in some special formats:

* Cannot usually be opened in notepad.

--

* Are often specific to proprietary software, e.g., SPSS, STATA, Matlab.

--

If you have that software you may be able to export in a plain text format.

--

But usually there is a package or function that allows you to script the steps.

---

# Overview

Favourites of mine are:

* `haven` <a name=cite-haven></a>([Wickham and Miller, 2018](#bib-haven)) for importing and exporting SPSS, STATA and SAS files
* `readxl` <a name=cite-readxl></a>([Wickham and Bryan, 2019](#bib-readxl)) for Excel files

Both of these packages are installed when you install `tidyverse` but are not loaded with `library(tidyverse)`. Therefore we will need to load them.

--

* Google is your friend.

---

# Example 1 SPSS

Data file: [periwinkle.sav](data-raw/periwinkle.sav).

These data give a measure of the gut parasite load in two species of rough periwinkle in the Spring and Summer.

🎬 Save a copy of the file to your `data-raw` folder and read it with:

```r
library(haven)
file <- "data-raw/periwinkle.sav"
periwinkle <- read_sav(file)
```

---

# Example 1 SPSS

.scroll-output-height[

🎬 Examine the result.

```r
str(periwinkle)
```

```
## tibble [100 x 3] (S3: tbl_df/tbl/data.frame)
## $ para   : num [1:100] 58 51 54 39 65 67 60 54 47 66 ...
##  ..- attr(*, "label")= chr "Number of Parasites"
##  ..- attr(*, "format.spss")= chr "F8.0"
## $ season : dbl+lbl [1:100] 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
##  ..@ label      : chr "Season"
##  ..@ format.spss: chr "F8.0"
##  ..@ labels     : Named num [1:2] 1 2
##  .. ..- attr(*, "names")= chr [1:2] "Spring" "Summer"
## $ species: dbl+lbl [1:100] 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
##  ..@ label        : chr "Species"
##  ..@ format.spss  : chr "F8.0"
##  ..@ display_width: int 14
##  ..@ labels       : Named num [1:2] 1 2
##  .. ..- attr(*, "names")= chr [1:2] "Littorina saxatilis" "Littorina nigrolineata"
```

]

The SPSS variables `season` and `species` are labelled integers, i.e., the actual values are integers, which is a common data structure in other statistical environments. `read_sav()` has made them variables of type "haven_labelled".

---

# Example 1 SPSS

The labels of these can be viewed in R using:

```r
attr(periwinkle$season, "labels")
```

```
## Spring Summer 
##      1      2
```

```r
attr(periwinkle$species, "labels")
```

```
##    Littorina saxatilis Littorina nigrolineata 
##                      1                      2
```

---

# Example 1 SPSS

🎬 We can turn the variables of type "haven_labelled" into factors with:

```r
periwinkle <- periwinkle %>%
  mutate(season = as_factor(season),
         species = as_factor(species))
```

--

🎬 Write the dataframe to a plain text format file in your `data-processed` folder.
.footnote[
<span style=" font-weight: bold; color: #fdf9f6 !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #25496b !important;" >Extra exercise:</span> If you have worked that out try reading in these data: [carsdata.dta](data-raw/carsdata.dta).
]

---

# Example 1 SPSS - Write Answer

You can write the dataframe to a plain text format file in your `data-processed` folder like this:

```r
write_delim(periwinkle, "data-processed/periwinkle.txt")
```

---

# Extra Exercise answer

<span style=" font-weight: bold; color: #fdf9f6 !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #25496b !important;" >Extra exercise:</span> If you have worked that out try reading in these data: [carsdata.dta](data-raw/carsdata.dta).

👀 There are `read_dta()` and `read_sas()` functions for STATA and SAS files respectively, so:

```r
file <- "data-raw/carsdata.dta"
cars <- read_dta(file)
```

---
class: inverse

# Reading multiple files

---

# Overview

We often have multiple files of the same format to import. Rather than copying and pasting the same data import command and editing the file name, it is better to import them in a single statement.

--

**Two methods**

I will demonstrate two ways you can import multiple files.

---

# Overview

1. Using a `for` loop in base R

  * create a list with a length equal to the number of files.
  * use a `for` loop to iterate through each filename, reading it into a dataframe which occupies a place in the list.
  * combine the list of dataframes into a single dataframe using `rbind()`.

--

2. Using `purrr` from the `tidyverse`

  * use `map()` to apply an import function to each element of a vector.
  * pipe the output into `bind_rows()` to combine the list of dataframes into a single dataframe.

---

# The data

[specdata.zip](data-raw/specdata.zip) is a compressed folder of csv files containing air quality measures taken from twenty locations.

The files all have the same structure of four columns: "Date", "sulfate", "nitrate" and "ID". "ID" gives the location ID and is the same as in the file name.

--

👀 Both methods require us to download the data and create a vector of file names.

---

# Both methods: download

🎬 Save a copy of [specdata.zip](data-raw/specdata.zip) to `data-raw` and unzip it with:

```r
unzip("data-raw/specdata.zip", exdir = "data-raw")
```

--

This should create a directory in `data-raw` called `specdata` which contains twenty .csv files. Examine the results in the Files pane.

---

# Both methods: file names

🎬 Create a vector of file names:

```r
files <- dir("data-raw/specdata/", pattern = ".csv", full.names = TRUE)
```

--

`dir()` returns a character vector containing the names of the files in the specified directory which match a regex pattern (".csv" in this case).

--

`full.names = TRUE` means the directory path is prepended to the file names to give a relative file path.

---
class: inverse

# Reading multiple files: Method 1
## Using a `for` loop in base R

---

# Method 1 base R

We need to:

- Pre-allocate a list with a length equal to the number of files
- Iterate through each filename in `files`, reading it with `read_csv()`
- Combine the list of data frames into a single dataframe (optional)

---

# Method 1 base R

🎬 Pre-allocate a list with a length equal to the number of files.
```r
df_list <- vector("list", length(files))
```

--

🎬 Iterate through each filename in `files`, reading it with `read_csv()` and assigning it to a place in the list:

```r
for (i in 1:length(files)) {
  df_list[[i]] <- read_csv(files[[i]])
}
```

---

# Method 1 base R

🎬 To combine the list of data frames into a single dataframe we can use:

```r
airquality <- do.call(rbind, df_list)
```

--

`do.call()` applies `rbind()` to all the dataframes in the list.

--

🙋 You now have a dataframe containing all the data in the twenty files.

---
class: inverse

# Reading multiple files: Method 2
## Using `purrr` from the tidyverse

---

# Using purrr

The two main differences in this method are that `map()` dispenses with the need for a `for` loop (and for a pre-allocated vector) and we can use the pipe to link operations.

--

🎬 Read the files and combine them:

```r
airquality2 <- files %>%
  map(read_csv) %>%
  bind_rows()
```

--

I replaced the base R function `rbind()` with the tidyverse function `bind_rows()`.

---

# Using purrr

There's actually a `tidyverse` shortcut for `map(read_csv) %>% bind_rows()`:

```r
airquality3 <- files %>%
  map_dfr(read_csv)
```

Even shorter!

---
class: inverse

# Files on the internet

---

# Overview

It may be preferable to read data files from the internet, rather than download the file, if the data are likely to be updated.

--

We simply use the URL rather than the local file path. Your options for import functions and arguments are the same as for local files.

--

For example, these data are from a buoy (buoy #44025) off the coast of New Jersey at http://www.ndbc.noaa.gov/view_text_file.php?filename=44025h2011.txt.gz&dir=data/historical/stdmet/

---

# Example

🎬 Use `readLines()` to examine the data format.

```r
file <- "http://www.ndbc.noaa.gov/view_text_file.php?filename=44025h2011.txt.gz&dir=data/historical/stdmet/"
```

.scroll-output-width[
.code60[

```r
readLines(file, n = 4)
```

```
## [1] "#YY  MM DD hh mm WDIR WSPD GST  WVHT   DPD   APD MWD   PRES  ATMP  WTMP  DEWP  VIS  TIDE"
## [2] "#yr  mo dy hr mn degT m/s  m/s     m   sec   sec degT   hPa  degC  degC  degC   mi    ft"
## [3] "2010 12 31 23 50 222  7.2  8.5  0.75  4.55  3.72 203 1022.2   6.9   6.7   3.5 99.0 99.00"
## [4] "2011 01 01 00 50 233  6.0  6.8  0.76  4.76  3.77 196 1022.2   6.7   6.7   3.7 99.0 99.00"
```

]
]

--

The first line gives the column names, the second line gives the units, and the data themselves begin on line 3.

---

# Example

🎬 Read in the data.

```r
buoy44025 <- read_table(file, col_names = FALSE, skip = 2)
```

--

<span style=" font-weight: bold; color: #fdf9f6 !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #25496b !important;" >Extra exercise:</span> Use `scan()` to read in the appropriate lines then tidy the results and name the columns measure_units, for example. [Tidying data and the tidyverse including the pipe](04_tidying_data_and_the_tidyverse.html) might help.
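---

# Example

`col_names = FALSE` gives default column names (`X1`, `X2`, ...). A simpler, more manual alternative to the `scan()` approach on the next slide is to type the names from the header line yourself. A sketch only, with the names taken from the first line shown earlier:

```r
buoy44025 <- read_table(file,
                        col_names = c("YY", "MM", "DD", "hh", "mm", "WDIR",
                                      "WSPD", "GST", "WVHT", "DPD", "APD",
                                      "MWD", "PRES", "ATMP", "WTMP", "DEWP",
                                      "VIS", "TIDE"),
                        skip = 2)
```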
---

# Extra Exercise Answer

<span style=" font-weight: bold; color: #fdf9f6 !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #25496b !important;" >Extra exercise:</span> Use `scan()` to read in the appropriate lines then tidy the results and name the columns measure_units, for example.

.code60[

```r
# read in the variable names from the first line, removing the hash
measure <- scan(file, nlines = 1, what = character()) %>%
  str_remove("#")
# read in the units from the second line, removing the hash and
# replacing the / with _per_ as / is a special character
units <- scan(file, skip = 1, nlines = 1, what = character()) %>%
  str_remove("#") %>%
  str_replace("/", "_per_")
# paste the variable name and its units together for the column names
names(buoy44025) <- paste(measure, units, sep = "_")
```

]

---
class: inverse

# Web scraping

---

# Overview

What if data are not in a file but on a web page?

One solution is to 'scrape' the data using package `rvest` <a name=cite-rvest></a>([Wickham, 2019b](#bib-rvest)). Do you see what they did there? 😜

--

This is possible because the two core formats used to build web pages, HTML (Hypertext Markup Language) and CSS (Cascading Style Sheets), are just text which describes the structure of a page and its appearance with tags.

--

We will do just one small example of a big topic: getting the information from a Wikipedia page on [Global biodiversity](https://en.wikipedia.org/wiki/Global_biodiversity)

---

# Steps

There are several steps:

1. download and parse the html file using `read_html()`

--

2. determine which CSS selector marks the data in the page you want

--

3. find that selector in the parsed html and extract the text using `html_nodes()`.

--

4. process the text into an R data structure, for example using `html_table()` or `html_text()`.

--

5. do any additional processing to the data in the R object.

---

# Steps

🎬 Load the `rvest` package and assign the web address to a variable:

```r
library(rvest)
url <- "https://en.wikipedia.org/wiki/Global_biodiversity"
```

--

🎬 Go to the web page: https://en.wikipedia.org/wiki/Global_biodiversity

--

🎬 Scroll down the Global biodiversity page until you see the table. Right click on the page and choose Inspect. A box will appear on the right including the full page source.

---

# Steps

🎬 In the Elements window, click on the element that highlights the information you want - in this case `<table class="wikitable">`

🎬 Right click on that element and choose Copy | Copy selector. This is: `#mw-content-text > div.mw-parser-output > table`

---

# Steps

🎬 Now download and parse the html file using `read_html()`, extract the text at the copied selector using `html_nodes()`, process the text into a list using `html_table()` and get the dataframe which is the first (and only) element of that list:

```r
biodiv <- read_html(url) %>%
  html_nodes(css = '#mw-content-text > div.mw-parser-output > table') %>%
  html_table() %>%
  .[[1]]
```

---

# 🎉

🎬 Examine the resulting dataframe.

.scroll-output-width[
.code60[

```r
glimpse(biodiv)
```

```
## Rows: 1
## Columns: 2
## $ X1 <lgl> NA
## $ X2 <chr> "This article needs to be updated. Please help update this article ~
```

]
]

--

This resulting dataframe needs some work.

---
class: inverse

# APIs

---

# Overview

There are many data repositories online and you may have used a search page to retrieve data for download. That search page is designed for humans to interface with the data repository.

--

An Application Programming Interface (API) allows a data repository or program to be accessed programmatically. This is especially useful when your data updates frequently.

---

# Overview

Many packages have been written that simplify your use of an API<sup>1</sup>. You may have used these without knowing.

Using an API that already has a package is relatively simple.
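For example, the `rentrez` package (one of the packages listed on the next slide) wraps the NCBI EUtils API. A minimal sketch, assuming `rentrez` is installed:

```r
library(rentrez)
# search PubMed for records mentioning coronavirus; the API request happens behind the scenes
covid_search <- entrez_search(db = "pubmed", term = "coronavirus", retmax = 5)
covid_search$ids
```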
.footnote[
1. Searching for API on this [list of CRAN packages](https://cran.r-project.org/web/packages/available_packages_by_name.html) will give you an idea.
]

---

# API Package examples

Just a few examples are:

* `RGoogleAnalytics` for the Google Analytics API
* `rentrez` for the NCBI's 'EUtils' API, allowing users to search databases like 'GenBank'
* `aRxiv`, an interface to the arXiv API, allowing you to search the preprints database
* `CytobankAPI` to access the Cytobank API
* Many packages on [Bioconductor](https://www.bioconductor.org/) and [ROpenSci](https://ropensci.org/)

---

# Important

👀 Search for an existing package to access an API before you start with the methods shown on the next few slides!

--

This is an example of how to connect to an API to get data when the API doesn't have a package for R. It is a brief introduction, but if it is something you want to do you will undoubtedly need to learn more.

---

# API Terminology

Interaction with an API comprises a **request** (what you send) and a **response**.

--

The most common type of request is carried out with the GET HTTP method<sup>1</sup>

.footnote[
1. Other HTTP methods can be used for requests, such as POST, PUT and DELETE.
]

--

In R, `httr` is one package that allows you to create requests.

---

# API Terminology: Requests

An API request like `http://gtr.rcuk.ac.uk/search/project?term=coronavirus&page=1&fetchSize=100` consists of:

* a URL

<span style=" border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #D0DEE6 !important;" >http://gtr.rcuk.ac.uk/search/</span>project?term=coronavirus&page=1&fetchSize=100

--

* an 'endpoint' followed by a `?`. An endpoint is a dataset provided by the API.

http://gtr.rcuk.ac.uk/search/<span style=" border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #D0DEE6 !important;" >project</span>?term=coronavirus&page=1&fetchSize=100

--

Many APIs have multiple endpoints.

---

# API Terminology: Requests

* Endpoints are queried with 'parameters' separated by `&`, which are defined in the API documentation

http://gtr.rcuk.ac.uk/search/project?<span style=" border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #D0DEE6 !important;" >term=coronavirus&page=1&fetchSize=100</span>

---

# API Terminology: Responses

An API response is usually returned in one of two formats:

1. JavaScript Object Notation (JSON)

--

2. Extensible Markup Language (XML)

--

These formats are plain text but need processing. JSON data is often handled with the `jsonlite` package.

---

# Example

We are going to query the [UK Research and Innovation](https://www.ukri.org/) [Gateway to Research](https://gtr.ukri.org/) which enables users to search and analyse information about publicly funded research.

--

We will retrieve BBSRC research grants about coronavirus that start in 2021.

--

The funding council, type of grant, start year and topic are all parameters we will need to send in the request.

--

🎬 Load the packages

```r
library(httr)
library(jsonlite)
```

---

# Example

We need to:

1. Make a "GET" request to the API to pull raw data into R

--

2. Parse that data from its raw form (JSON) into a usable format

---

# Example: Construct query

🎬 Create variables for the URL, the endpoint and the parameters:

```r
base <- "https://gtr.ukri.org/search/"
endpoint <- "project"
page <- 1
fetchsize <- 100
term <- "coronavirus"
```

--

We are searching the `project` data resource; other possible endpoints include `person` and `publication`.
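--

For example, a query for people rather than projects would differ only in the endpoint. A sketch (not run here):

```r
# hypothetical: the same parameters sent to the person endpoint
paste(base, "person", "?", "term=", term, sep = "")
```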
`fetchSize` and `page` give the number of results returned per page and which page should be returned.

---

# Example

```
base <- "https://gtr.ukri.org/search/"
endpoint <- "project"
page <- 1
fetchsize <- 100
term <- "coronavirus"
```

Typically, we need to write a loop to retrieve all the results. For simplicity, I have used a search (research grants, funded by the BBSRC in 2021) that returns less than one page of results - meaning I could have omitted the page and fetchsize.

---

# Example: Construct query

Specifying the funding council, type of grant and year is done by passing **id codes** to the `selectedFacets` parameter.

--

This is information obtained from the API documentation.

--

But you can often examine the response to an exploratory request to obtain such information.

---

# Example: Construct query

An exploratory query to determine additional search parameters

🎬 Construct the query:

```r
explore <- paste(base, endpoint, "?",
                 "term=", term, "&",
                 "page=", page, "&",
                 "fetchSize=", fetchsize,
                 sep = "")
explore
```

```
## [1] "https://gtr.ukri.org/search/project?term=coronavirus&page=1&fetchSize=100"
```

---

# Example: Construct query

An exploratory query to determine additional search parameters

🎬 Make the request using the query:

```r
results <- GET(explore)
```

🎬 Process the JSON response into an R data structure:

```r
data <- results %>%
  content("text") %>%
  fromJSON()
```

---

# Example: Construct query

An exploratory query to determine additional search parameters

🎬 Examine the `data` list. **Demo**

---

# Example: Construct query

Having worked out what the id codes are for research grants, the BBSRC and 2021, we will want to add those to our query.

--

🎬 Create variables for the appropriate `selectedFacets` id codes:

.code60[

```r
# these are the id codes that filter the results by:
# "Project category",
research <- "Y2F0fFJlc2VhcmNoIEdyYW50fHN0cmluZw=="
# "Funder" and
bbsrc <- "ZnVuZGVyfEJCU1JDfHN0cmluZw=="
# "Start year"
y2021 <- "c3RhcnR8MTYwOTQ1OTIwMDAwMF8xNjQwOTk1MTk5MDU5fHJhbmdl"
```

]

---

# Example: Construct query

The id codes include some special characters that must be encoded as `%` plus a two-digit hexadecimal representation (known as percent-encoding).

--

You may have seen browsers replace spaces with `%20` - this is just one example.

--

The ids must be in a list separated by commas, and then encoded.

---

# Example: Construct query

🎬 Create a variable for the selectedFacets parameter by pasting the codes together separated by a comma, then encoding:

```r
selectedFacets <- paste(research, bbsrc, y2021, sep = ",") %>%
  URLencode()
```

---

# Example: Construct query

🎬 Paste all parts into the same query, adding `?` and `&` where required:

```r
call1 <- paste(base, endpoint, "?",
               "term=", term, "&",
               "page=", page, "&",
               "fetchSize=", fetchsize, "&",
               "selectedFacets=", selectedFacets,
               sep = "")
```

---

# Example: Check query

🎬 Check the query:

.scroll-output-width[

```r
call1
```

```
## [1] "https://gtr.ukri.org/search/project?term=coronavirus&page=1&fetchSize=100&selectedFacets=Y2F0fFJlc2VhcmNoIEdyYW50fHN0cmluZw==,ZnVuZGVyfEJCU1JDfHN0cmluZw==,c3RhcnR8MTYwOTQ1OTIwMDAwMF8xNjQwOTk1MTk5MDU5fHJhbmdl"
```

]

---

# Example: Run query

🎬 We use the `GET()` function in the same way as for our exploratory query:

```r
results <- GET(call1)
```

---

# Example: Check result

🎬 We can check the request was successful using:

```r
results[["status_code"]]
```

```
## [1] 200
```

--

200 means everything is OK.
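A programmatic alternative (a minimal sketch): `httr::stop_for_status()` raises an error if the status code indicates a failure, which is useful inside scripts and loops.

```r
stop_for_status(results)  # does nothing if the request succeeded, errors otherwise
```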
[List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes). You've almost certainly seen a 404 before.

---

# Example: Pages of results?

If we needed to know how to make our loop, we would need to look in:

```r
results[["headers"]][["link-pages"]]
```

```
## [1] "1"
```

```r
results[["headers"]][["link-records"]]
```

```
## [1] "16"
```

to find the number of pages and results.

---

# Example: Extract response

🎬 Extract the content of the response and parse it from JSON to an R structure.

```r
data <- results %>%
  content("text") %>%
  fromJSON()
```

---

# Example: Examine

🎬 Examine your results:

```r
data[["searchResult"]][["resourceHitCount"]]
```

```
##      resource count
## 1     project    16
## 2    outcomes  1361
## 3      person   849
## 4 publication   490
```

---

# Example: Examine

🎬 Examine your results:

.scroll-output-width[
.code60[

```r
data[["searchResult"]][["results"]][["projectComposition"]][["project"]][["title"]]
```

```
##  [1] "UK International coronavirus network (UK-ICN)"
##  [2] "An accurate eukaryotic plasma membrane assay for coronavirus binding"
##  [3] "Mechanisms of coronavirus replication: the role of cellular lipids in the generation of replication organelles"
##  [4] "Development of a virus-free sensor system to repurpose approved drugs for blocking Coronavirus replication"
##  [5] "Global trade of coronavirus hosts: bringing geographically isolated hosts and viruses together risks novel recombination and spillover to humans"
##  [6] "High-resolution structure, function, and anti-viral inhibition of the SARS-CoV2 E protein ion channel"
##  [7] "Common host proteins required for replication organelle function across coronaviruses"
##  [8] "Airway Epithelial-Myeloid cell crosstalk as a key mechanism in the pathogenesis of COVID-19"
##  [9] "The risk of SARS-2 establishing itself in animal reservoirs"
## [10] "BEACON - A multi-user and multi-project facility for biological macromolecules and nanoparticles characterization in solution"
## [11] "SARS-CoV-2 infections in cats: assessing their zoonotic potential and role in sustaining the COVID-19 pandemic"
## [12] "Discovery and characterisation of novel immunomodulatory Cullin-RING ubiquitin ligases (CRLs) in the airway"
## [13] "CATH-FunVar - Predicting Viral and Human Variants Affecting COVID-19 Susceptibility and Severity and Repurposing Therapeutics"
## [14] "Mapping the lipid envelope composition of SARS-CoV2 for reducing transmission, thrombosis and inflammation"
## [15] "Investigating Host and Viral Factors for Improved Design of Future Live Attenuated Vaccines for IBV"
## [16] "Probing the mechanism of action of Shiftless, a host restriction factor targeting programmed ribosomal frameshifting."
```

]
]
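---

# Example: More than one page?

This search fitted on one page. If it had not, one way to collect every page is to loop over the page numbers reported in `link-pages`, rebuilding the query each time. A sketch only (not run here), using the same variables as before:

```r
n_pages <- as.numeric(results[["headers"]][["link-pages"]])
page_data <- vector("list", n_pages)
for (p in 1:n_pages) {
  call <- paste(base, endpoint, "?",
                "term=", term, "&",
                "page=", p, "&",
                "fetchSize=", fetchsize, "&",
                "selectedFacets=", selectedFacets,
                sep = "")
  page_data[[p]] <- GET(call) %>%
    content("text") %>%
    fromJSON()
}
```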
---

# Summary

.font80[

* Consider whether the data are locally or remotely stored, their format, and whether they are in one file or several or distributed across other resources.

* If you use some form of database that presents information in your browser, check whether it has an API, then whether there are existing packages to access that API.

* Plain text files can be read in with base R, readr or data.table functions (and more).

* Other formats can usually be imported with specialised packages.

* Use iteration to import multiple files (rather than copying and pasting import statements).

* Files on the internet are imported in the same way as those stored locally.

* Some form of tidying is almost always required.

* `rvest` allows you to scrape data from webpages if you can identify the CSS selector.

* An API allows programmatic access to a data repository.

]

---

# References

.footnote[
.font60[
Slides made with xaringan <a name=cite-xaringan></a>([Xie, 2019](#bib-xaringan)) and xaringanExtra <a name=cite-xaringanExtra></a>([Aden-Buie, 2020](#bib-xaringanExtra))
]
]

.font60[
<a name=bib-xaringanExtra></a>[Aden-Buie, G.](#cite-xaringanExtra) (2020). _xaringanExtra: Extras And Extensions for Xaringan Slides_. R package version 0.2.3.9000. URL: [https://github.com/gadenbuie/xaringanExtra](https://github.com/gadenbuie/xaringanExtra).

<a name=bib-Dowle_Srinivasan_2019></a>[Dowle, M. and A. Srinivasan](#cite-Dowle_Srinivasan_2019) (2019). _data.table: Extension of `data.frame`_. R package version 1.12.6. URL: [https://CRAN.R-project.org/package=data.table](https://CRAN.R-project.org/package=data.table).

<a name=bib-rvest></a>[Wickham, H.](#cite-rvest) (2019b). _rvest: Easily Harvest (Scrape) Web Pages_. R package version 0.3.5. URL: [https://CRAN.R-project.org/package=rvest](https://CRAN.R-project.org/package=rvest).

<a name=bib-readxl></a>[Wickham, H. and J. Bryan](#cite-readxl) (2019). _readxl: Read Excel Files_. R package version 1.3.1. URL: [https://CRAN.R-project.org/package=readxl](https://CRAN.R-project.org/package=readxl).

<a name=bib-haven></a>[Wickham, H. and E. Miller](#cite-haven) (2018). _haven: Import and Export 'SPSS', 'Stata' and 'SAS' Files_. R package version 1.1.2. URL: [https://CRAN.R-project.org/package=haven](https://CRAN.R-project.org/package=haven).

<a name=bib-xaringan></a>[Xie, Y.](#cite-xaringan) (2019). _xaringan: Presentation Ninja_. R package version 0.12. URL: [https://CRAN.R-project.org/package=xaringan](https://CRAN.R-project.org/package=xaringan).
]

---

# Intro to Repro in R

Emma Rand [emma.rand@york.ac.uk](mailto:emma.rand@york.ac.uk)

Twitter: [@er13_r](https://twitter.com/er13_r)

GitHub: [3mmaRand](https://github.com/3mmaRand)

blog: https://buzzrbeeline.blog/

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">Licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

Rand, E. (2021). White Rose BBSRC DTP Training: An Introduction to Reproducible Analyses in R (Version v1.1). https://doi.org/10.5281/zenodo.4701167