Workshop

Transcripttranscriptomics 2: Statistical Analysis

Author

Emma Rand

Published

18 September, 2024

Introduction

Session overview

In the workshop, you will learn how to perform differential expression analysis on raw counts using DESeq2 (Love, Huber, and Anders 2014) or on logged normalised expression values using scran (Lun, McCarthy, and Marioni 2016) or both.

Set up

Either:

🎬 Open 🐸 frogs-88H Project and the cont-fgf-s30.R script.

Or

🎬 Open 🐭 mice-88H Project and the hspc-prog.R script.

🎬 Make a new folder results in the project directory. You can use the New Folder button in the Files pane but here I have used the fs (Hester, Wickham, and CsÑrdi 2023) package

fs::dir_create("results")

This is where we will save our results.

🎬 Load tidyverse (Wickham et al. 2019) You most likely have this code at the top of cont-fgf-s30.R or hspc-prog.R already.

── Attaching core tidyverse packages ─────────────────────────────────────────────── tidyverse 2.0.0 ──
βœ” dplyr     1.1.3     βœ” readr     2.1.4
βœ” forcats   1.0.0     βœ” stringr   1.5.0
βœ” ggplot2   3.4.3     βœ” tibble    3.2.1
βœ” lubridate 1.9.3     βœ” tidyr     1.3.0
βœ” purrr     1.0.2     
── Conflicts ───────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
βœ– dplyr::filter() masks stats::filter()
βœ– dplyr::lag()    masks stats::lag()
β„Ή Use the conflicted package to force all conflicts to become errors

Have you ever stopped to think about this message? It is telling us that there are functions in the dplyr package that have the same name as functions in the stats package and that R will use the dplyr version. As this is what you want, this has always been fine. It still is fine in this case. However, as you start to load more packages, you will want to know if you are using a function from a package that has the same name as a function in another loaded package. This is where the conflicted (Wickham 2023) package comes in. Conflicted will warn you when you are using a function that has the same name as a function in another package. You can then choose which function to use.

🎬 Load the conflicted package:

Instead of getting a warning every time you are using a function that has a function with the same name in another package, we can declare a preference for one function over another. This is useful for the functions you use a lot or ones where you are certain you always want to use a particular function.

🎬 Use the dplyr version of filter() by default:

conflict_prefer("filter", "dplyr")

🐸 Analysis

We will carry out several steps

  1. Import the data
  2. Find the genes that are expressed in only one treatment (the control or the FGF-treated)
  3. Create a DESeqDataSet object. This is a special object that is used by the DESeq2 package
  4. Prepare the normalised counts from the DESeqDataSet object.
  5. Do differential expression analysis on the genes. This needs to be done on the raw counts.

The creation of the DESeqDataSet object, the preparation of the normalised counts and the differential expression analysis are all done with the DESeq2 package

Import

We need to import the S30 data that were filtered to remove genes with 4, 5 or 6 zeros and those where the total counts was less than 20.

🎬 Import the data from the data-processed folder.

Genes expressed in one treatment

The genes expressed in only one treatment group are those with zeros in all three replicates in one group and non-zero values in all three replicates in the other group. For example, those shown here:

Error in `arrange()`:
β„Ή In argument: `..1 = S30_C_3`.
Caused by error:
! object 'S30_C_3' not found

We will use filter() to find these genes.

🎬 Find the genes that are expressed only in the FGF-treated group:

s30_fgf_only <- s30_filtered |> 
  filter(S30_C_1 == 0, 
         S30_C_2 == 0, 
         S30_C_3 == 0, 
         S30_F_1 > 0, 
         S30_F_2 > 0, 
         S30_F_3 > 0)
Error in `filter()`:
β„Ή In argument: `S30_C_1 == 0`.
Caused by error:
! object 'S30_C_1' not found

❓ How many genes are expressed only in the FGF-treated group?

🎬 Now you find any genes that are expressed only in the control group.

Error in `filter()`:
β„Ή In argument: `S30_C_1 > 0`.
Caused by error:
! object 'S30_C_1' not found

❓ Do the results make sense to you in light of what you know about the biology?

🎬 Write to file (saved in results) all the genes that are expressed one group only.

Error in eval(expr, envir, enclos): object 's30_fgf_only' not found

Create DESeqDataSet object

🎬 Load the DESeq2 package:

A DEseqDataSet object is a custom data type that is used by DESeq2. Custom data types are common in the Bioconductor1 packages. They are used to store data in a way that is useful for the analysis. These data types typically have data, transformed data, metadata and experimental designs within them.

To create a DESeqDataSet object, we need to provide three things:

  1. The raw counts - these are what we imported into s30_filtered
  2. The meta data which gives information about the samples and which treatment groups they belong to
  3. A design matrix which captures the design of the statistical model.

The counts must in a matrix rather than a dataframe. Unlike a dataframe, a matrix has columns of all the same type. That is, it will contain only the counts. The gene ids are given as row names rather than a column. The matrix() function will create a matrix from a dataframe of columns of the same type and the select() function can be used to remove the gene ids column.

🎬 Create a matrix of the counts:

s30_count_mat <- s30_filtered |>
  select(-xenbase_gene_id) |>
  as.matrix()

🎬 Add the gene ids as row names to the matrix:

# add the row names to the matrix
rownames(s30_count_mat) <- s30_filtered$xenbase_gene_id

You might want to view the matrix.

The metadata are in a file, frog_meta_data.txt. This is a tab-delimited file. The first column is the sample name and the second column is the treatment group.

🎬 Make a folder called meta and save the file to it.

🎬 Read the metadata into a dataframe:

meta <- read_table("meta/frog_meta_data.txt")

🎬 Examine the resulting dataframe.

We need to add the sample names as row names to the metadata dataframe. This is because the DESeqDataSet object will use the row names to match the samples in the metadata to the samples in the counts matrix.

🎬 Add the sample names as row names to the metadata dataframe:

row.names(meta) <- meta$sample_id

(you will get a warning message but you can ignore it)

We are dealing only with the S30 data so we need to remove the samples that are not in the S30 data.

🎬 Filter the metadata to keep only the S30 information:

meta_S30 <- meta |>
  dplyr::filter(stage == "stage_30")
# A tibble: 6 Γ— 4
  sample_id stage    treatment sibling_rep
* <chr>     <chr>    <chr>     <chr>      
1 S30_C_5   stage_30 control   five       
2 S30_C_6   stage_30 control   six        
3 S30_C_A   stage_30 control   A          
4 S30_F_5   stage_30 FGF       five       
5 S30_F_6   stage_30 FGF       six        
6 S30_F_A   stage_30 FGF       A          

We can now create the DESeqDataSet object. The design formula describes the statistical model You should notice that it is the same sort of formula we used in t.test(), lm(),glm() etc. The ~ indicates that the left hand side is the response variable (in this case counts) and the right hand side are the explanatory variables. We are interested in the difference between the treatments but we include sibling_rep to account for the fact that the data are paired. The names of the columns in the count matrix have to match the names in the metadata dataframe and the names of the explanatory variables in the design formula have to match the names of columns in the metadata.

🎬 Create the DESeqDataSet object:

dds <- DESeqDataSetFromMatrix(countData = s30_count_mat,
                              colData = meta_S30,
                              design = ~ treatment + sibling_rep)

The warning β€œWarning: some variables in design formula are characters, converting to factors” just means that the variable type of treatment and sibling_rep in the metadata dataframe are char. This is not a as DESeqDataSetFromMatrix() has made them into the factors it needs.

🎬 Examine the DESeqDataSet object.

The counts are in dds@assays@data@listData[["counts"]] and the metadata are in dds@colData but the easiest way to see them is to use the counts() and colData() functions from the DESeq2 package.

🎬 View the counts:

counts(dds) |> View()
Error in .External2(C_dataviewer, x, title): unable to start data viewer

You should be able to see that this is the same as in s30_count_mat.

colData(dds)
DataFrame with 6 rows and 4 columns
          sample_id       stage treatment sibling_rep
        <character> <character>  <factor>    <factor>
S30_C_5     S30_C_5    stage_30   control        five
S30_C_6     S30_C_6    stage_30   control        six 
S30_C_A     S30_C_A    stage_30   control        A   
S30_F_5     S30_F_5    stage_30   FGF            five
S30_F_6     S30_F_6    stage_30   FGF            six 
S30_F_A     S30_F_A    stage_30   FGF            A   

Prepare the normalised counts

The normalised counts are the counts that have been transformed to account for the library size (i.e., the total number of reads in a sample) and the gene length. We have to first estimate the normalisation factors and store them in the DESeqDataSet object and then we can get the normalised counts.

🎬 Estimate the factors for normalisation and store them in the DESeqDataSet object:

dds <- estimateSizeFactors(dds)

🎬 Look at the factors (just for information):

sizeFactors(dds)
  S30_C_5   S30_C_6   S30_C_A   S30_F_5   S30_F_6   S30_F_A 
0.8812200 0.9454600 1.2989886 1.0881870 1.0518961 0.8322894 

To get the normalised counts we again used the counts() function but this time we use the normalized=TRUE argument.

🎬 Save the normalised to a matrix:

normalised_counts <- counts(dds, normalized = TRUE)

We will write the normalised counts to a file so that we can use them in the future.

🎬 Make a dataframe of the normalised counts, add a column for the gene ids and write to file:

data.frame(normalised_counts,
           xenbase_gene_id = row.names(normalised_counts)) |>
  write_csv(file = "results/S30_normalised_counts.csv")

Differential expression analysis

We used the DESeq() function to do the differential expression analysis. This function fits the statistical model to the data and then uses the model to calculate the significance of the difference between the treatments. It again stored the results in the DESseqDataSet object. Note that the differential expression needs the raw (unnormalised counts) as it does its own normalisation as part of the process.

🎬 Run the differential expression analysis:

dds <- DESeq(dds)

The function will take only a few moments to run on this data but can take longer for bigger datasets.

We need to define the contrasts we want to test. We want to test the difference between the treatments so we will define the contrast as FGF and control.

🎬 Define the contrast:

contrast_fgf <- c("treatment", "FGF", "control")

Note that treatment is the name of the column in the metadata dataframe and FGF and control are the names of the levels in the treatment column. By putting them in the order FGF , control we are saying the fold change will be FGF / control. If we had put them in the order control, FGF we would have got the fold change as control / FGF. This means:

  • positive log fold changes indicate FGF > control and
  • negative log fold changes indicates control > FGF.

🎬 Extract the results from the DESseqDataSet object:

results_fgf <- results(dds,
                       contrast = contrast_fgf)

This will give us the log2 fold change and p-value for the contrast. ## Add gene information from Xenbase

If you want to emulate what I did you can use the following commands in the terminal after downloading the file:

gunzip xenbase.gpi.gz
less xenbase.gpi
q

gunzip unzips the file and less allows you to view the file. q quits the viewer. You will see the header lines and that the file contains both Xenopus tropicalis and Xenopus laevis. I read the file in with read_tsv (skipping the first header lines) then filtered out the Xenopus tropicalis entries, dropped some columns and saved the file as an excel file.

However, I have already done this for you and saved the file as xenbase_info.xlsx in the meta folder. We will import this file and join it to the results dataframe.

🎬 Load the readxl (Wickham and Bryan 2023) package:

🎬 Import the Xenbase gene information file:

gene_info <- read_excel("meta/xenbase_info.xlsx") 
Error: `path` does not exist: 'meta/xenbase_info.xlsx'

You should view the resulting dataframe to see what information is available. You can use glimpse() or View().

🎬 Merge the gene information with the results:

# join the gene info with the results
s30_results <- s30_results |>
  left_join(gene_info, by = "xenbase_gene_id")
Error in eval(expr, envir, enclos): object 's30_results' not found

We will also find it useful to import the metadata that maps the sample names to treatments. This will allow us to label the samples in the visualisations.

🎬 Save the results to a file:

data.frame(results_fgf,
           xenbase_gene_id = row.names(results_fgf)) |> 
  write_csv(file = "results/S30_results.csv")

🐭 Analysis

We will carry out several steps

  1. Import the prog and hspc data
  2. Combine the two datasets ready for analysis
  3. Filter the data to remove genes that are not expressed in any cell
  4. Find the genes that are expressed in only one cell type (the prog or the hspc)
  5. Do differential expression analysis on the genes using the scran package. This needs to be done on the logged normalised counts.

Import

🎬 Import surfaceome_hspc.csv and surfaceome_prog.csv into dataframes called hspc and prog respectively.

Combine the two datasets

We need to combine the two datasets of 701 and 798 cells into one dataset of 1499 cells, i.e., 1499 columns. The number of rows is the number of genes, 280. Before combining, we must make sure genes in the same order in both dataframes or we would be comparing the expression of one gene in one cell type to the expression of a different gene in the other cell type!

🎬 Check the gene ids are in the same order in both dataframes:

identical(prog$ensembl_gene_id, hspc$ensembl_gene_id)
[1] TRUE

scran can use a matrix or a dataframe of counts but theses must be log normalised counts. If using a dataframe, the columns must only contain the expression values (not the gene ids).

🎬 Combine the two dataframes (minus the gene ids) into one dataframe called prog_hspc:

prog_hspc <- bind_cols(prog[-1], hspc[-1])

🎬 Now add the gene ids as the row names:

row.names(prog_hspc) <- prog$ensembl_gene_id

Filter to remove unexpressed genes

In this dataset, we will not see and genes that are not expressed in any of the cells because we are using a specific subset of the transcriptome that was deliberately selected. However, we will go through how to do this because it is an important step in most analyses.

For the 🐸 frog data you should remember that we were able to filter out our unexpressed genes in Transcripttranscriptomics 1 because we were examining both groups to be compared. In that workshop, we discussed that we could not filter out unexpressed genes in the 🐭 mouse data because we only had one cell types at that time. During the Consolidate Independent Study you examined the hspc cells.

Where the sum of all the values in the rows is zero, all the entries must be zero. We can use this to find the filter the genes that are not expressed in any of the cells. To do row wise aggregates such as the sum across rows we can use the rowwise() function. c_across() allows us to use the colon notation Prog_001:HSPC_852 in sum() rather than having to list all the column names: sum(Prog_001, Prog_002, Prog_002, Prog_004,.....)

🎬 Find the genes that are 0 in every column of the prog_hspc dataframe:

prog_hspc |> 
  rowwise() |> 
  filter(sum(c_across(Prog_001:HSPC_852)) == 0)
# A tibble: 0 Γ— 1,499
# Rowwise: 
# β„Ή 1,499 variables: Prog_001 <dbl>, Prog_002 <dbl>, Prog_003 <dbl>,
#   Prog_004 <dbl>, Prog_006 <dbl>, Prog_007 <dbl>, Prog_008 <dbl>,
#   Prog_009 <dbl>, Prog_010 <dbl>, Prog_011 <dbl>, Prog_012 <dbl>,
#   Prog_013 <dbl>, Prog_014 <dbl>, Prog_015 <dbl>, Prog_016 <dbl>,
#   Prog_017 <dbl>, Prog_018 <dbl>, Prog_019 <dbl>, Prog_020 <dbl>,
#   Prog_021 <dbl>, Prog_022 <dbl>, Prog_023 <dbl>, Prog_024 <dbl>,
#   Prog_025 <dbl>, Prog_026 <dbl>, Prog_027 <dbl>, Prog_028 <dbl>, …

Notice that we have summed across all the columns.

❓ What do you conclude?

We might also examine the genes which are least expressed.

🎬 Find ten least expressed genes:

rowSums(prog_hspc) |> sort() |> head(10)
ENSMUSG00000041046 ENSMUSG00000012428 ENSMUSG00000022225 ENSMUSG00000027863 
          30.70322           35.35796           50.45975           61.27461 
ENSMUSG00000019359 ENSMUSG00000020701 ENSMUSG00000030772 ENSMUSG00000027376 
          68.90961           77.95594           84.11234           97.69333 
ENSMUSG00000023132 ENSMUSG00000026285 
         120.43065          126.95425 

❓ What do you conclude?

Find the genes that are expressed in only one cell type

To find the genes that are expressed in only one cell type, we can use the same approach as above but only sum the columns for one cell type.

🎬 Find the genes that are 0 in every column for the HSPC cells:

prog_hspc |> 
  rowwise() |> 
  filter(sum(c_across(HSPC_001:HSPC_852)) == 0)
# A tibble: 0 Γ— 1,499
# Rowwise: 
# β„Ή 1,499 variables: Prog_001 <dbl>, Prog_002 <dbl>, Prog_003 <dbl>,
#   Prog_004 <dbl>, Prog_006 <dbl>, Prog_007 <dbl>, Prog_008 <dbl>,
#   Prog_009 <dbl>, Prog_010 <dbl>, Prog_011 <dbl>, Prog_012 <dbl>,
#   Prog_013 <dbl>, Prog_014 <dbl>, Prog_015 <dbl>, Prog_016 <dbl>,
#   Prog_017 <dbl>, Prog_018 <dbl>, Prog_019 <dbl>, Prog_020 <dbl>,
#   Prog_021 <dbl>, Prog_022 <dbl>, Prog_023 <dbl>, Prog_024 <dbl>,
#   Prog_025 <dbl>, Prog_026 <dbl>, Prog_027 <dbl>, Prog_028 <dbl>, …

We have summed across the HSPC cells only. Note that if we knew there were some rows that were all zero across both cell types, we would need to add |> filter(sum(c_across(Prog_001:Prog_852)) != 0)

meaning zero in all the HSPC but not zero in all the Prog

🎬 Now you find the genes that are 0 in every column for the Prog cells:

# A tibble: 0 Γ— 1,499
# Rowwise: 
# β„Ή 1,499 variables: Prog_001 <dbl>, Prog_002 <dbl>, Prog_003 <dbl>,
#   Prog_004 <dbl>, Prog_006 <dbl>, Prog_007 <dbl>, Prog_008 <dbl>,
#   Prog_009 <dbl>, Prog_010 <dbl>, Prog_011 <dbl>, Prog_012 <dbl>,
#   Prog_013 <dbl>, Prog_014 <dbl>, Prog_015 <dbl>, Prog_016 <dbl>,
#   Prog_017 <dbl>, Prog_018 <dbl>, Prog_019 <dbl>, Prog_020 <dbl>,
#   Prog_021 <dbl>, Prog_022 <dbl>, Prog_023 <dbl>, Prog_024 <dbl>,
#   Prog_025 <dbl>, Prog_026 <dbl>, Prog_027 <dbl>, Prog_028 <dbl>, …

❓ What do you conclude?

Differential expression analysis

Like DESeq2, scran uses a statistical model to calculate the significance of the difference between the treatments and needs metadata to define the treatments.

🎬 Load the scran package:

The meta data needed for the frog data was information about which columns were in which treatment group and which sibling group and we had that information in a file. Similarly, here we need information on which columns are from which cell type. Instead of having this is a file, we will create a vector that indicates which column belongs to which cell type.

🎬 Create a vector that indicates which column belongs to which cell type:

cell_type <- rep(c("prog","hspc"), 
                 times = c(length(prog) - 1,
                           length(hspc) - 1))

The number of times each cell type is repeated is the number of columns in that cell type minus 1. This is because we have removed the column with the gene ids. Do check that the length of the cell_type vector is the same as the number of columns in the prog_hspc dataframe.

🎬 Run the differential expression analysis:

res_prog_hspc <- findMarkers(prog_hspc, 
                             cell_type)

findMarkers() is the function that runs the differential expression analysis. The first argument is the dataframe containing the data. The second argument is the vector indicating which columns are in which cell type. It gives us two dataframes of the results - rather unnecessarily. One is the results with fold changes that are Prog/HSPC and the other is the results with fold changes that are HSPC/Prog. These have the same magnitude, just a different sign

The dataframe res_prog_hspc$prog is log prog - log hspc (i.e.,Prog/HSPC). This means - Positive fold change: prog is higher than hspc - Negative fold change: hspc is higher than prog

The dataframe res_prog_hspc$hspc is log hspc - log prog (i.e., HSPC/Prog). . This means - Positive fold change: hspc is higher than prog - Negative fold change: prog is higher than hspc

The res_prog_hspc$prog dataframe
Top p.value FDR summary.logFC logFC.hspc ensembl_gene_id
ENSMUSG00000028639 1 0 0 1.596910 1.596910 ENSMUSG00000028639
ENSMUSG00000024053 2 0 0 3.035165 3.035165 ENSMUSG00000024053
ENSMUSG00000041329 3 0 0 3.261056 3.261056 ENSMUSG00000041329
ENSMUSG00000030336 4 0 0 -2.146491 -2.146491 ENSMUSG00000030336
ENSMUSG00000016494 5 0 0 -3.056730 -3.056730 ENSMUSG00000016494
ENSMUSG00000002808 6 0 0 3.000810 3.000810 ENSMUSG00000002808
The res_prog_hspc$hspc dataframe. Notice the sign of the fold change is the other way
Top p.value FDR summary.logFC logFC.prog ensembl_gene_id
ENSMUSG00000028639 1 0 0 -1.596910 -1.596910 ENSMUSG00000028639
ENSMUSG00000024053 2 0 0 -3.035165 -3.035165 ENSMUSG00000024053
ENSMUSG00000041329 3 0 0 -3.261056 -3.261056 ENSMUSG00000041329
ENSMUSG00000030336 4 0 0 2.146491 2.146491 ENSMUSG00000030336
ENSMUSG00000016494 5 0 0 3.056730 3.056730 ENSMUSG00000016494
ENSMUSG00000002808 6 0 0 -3.000810 -3.000810 ENSMUSG00000002808

Add gene information from Ensembl using biomaRt

Ensembl (Martin et al. 2023; Birney et al. 2004)is a bioinformatics project to organise all the biological information around the sequences of large genomes. The are a large number of databases but BioMart (Smedley et al. 2009) provides a consistent interface to the material. There are web-based tools to use these but the R package biomaRt (biomaRt?) gives you programmatic access making it easier to integrate information into R dataframes

🎬 Load the biomaRt (biomaRt?) package:

🎬 Connect to the mouse database and see the first 20 bits of information we can retrieve:

# Connect to the mouse database
ensembl <- useMart(biomart = "ensembl", 
                   dataset = "mmusculus_gene_ensembl")

# See what information we can retrieve
listAttributes(mart = ensembl) |> head(20)
                            name                                description
1                ensembl_gene_id                             Gene stable ID
2        ensembl_gene_id_version                     Gene stable ID version
3          ensembl_transcript_id                       Transcript stable ID
4  ensembl_transcript_id_version               Transcript stable ID version
5             ensembl_peptide_id                          Protein stable ID
6     ensembl_peptide_id_version                  Protein stable ID version
7                ensembl_exon_id                             Exon stable ID
8                    description                           Gene description
9                chromosome_name                   Chromosome/scaffold name
10                start_position                            Gene start (bp)
11                  end_position                              Gene end (bp)
12                        strand                                     Strand
13                          band                             Karyotype band
14              transcript_start                      Transcript start (bp)
15                transcript_end                        Transcript end (bp)
16      transcription_start_site             Transcription start site (TSS)
17             transcript_length Transcript length (including UTRs and CDS)
18                transcript_tsl             Transcript support level (TSL)
19      transcript_gencode_basic                   GENCODE basic annotation
20             transcript_appris                          APPRIS annotation
           page
1  feature_page
2  feature_page
3  feature_page
4  feature_page
5  feature_page
6  feature_page
7  feature_page
8  feature_page
9  feature_page
10 feature_page
11 feature_page
12 feature_page
13 feature_page
14 feature_page
15 feature_page
16 feature_page
17 feature_page
18 feature_page
19 feature_page
20 feature_page

There are many (2,985!) possible bits of information (attributes) that can be obtained. You can replace head(20) with View() to see them all.

We use the getBM() function to retrieve information from the database. The filters argument is used to specified what kind of identifier we are supplying to retrieve information. The attributes argument is used to select the information we want to retrieve. The values argument is used to specify the identifiers. The mart argument is used to specify the connection we created.

🎬 Get the gene information:

gene_info <- getBM(filters = "ensembl_gene_id",
                   attributes = c("ensembl_gene_id",
                                  "external_gene_name",
                                  "description"),
                   values = prog_hspc_results$ensembl_gene_id,
                   mart = ensembl)
Error in eval(expr, envir, enclos): object 'prog_hspc_results' not found

We are getting the gene name and and a description. We also need to get the id because we will use that to merge the gene_info dataframe with the prog_hspc_results dataframe. Notice the dataframe returned only has 279 rows - one of the ids does not have information.

🎬 We can find which is missing with:

# prog_hspc_results |> select(ensembl_gene_id) |> 
#   filter(!ensembl_gene_id %in% gene_info$ensembl_gene_id)

Oh, conflicted has flagged a conflict for us.

🎬 Take the appropriate action to resolve the conflict:

❓ What is the id which is missing information?

We might want to look that up - but let’s worry about it later if it turns out to be something important.

🎬 Merge the gene information with the results:

prog_hspc_results <- prog_hspc_results |> 
  left_join(gene_info, by = "ensembl_gene_id")
Error in eval(expr, envir, enclos): object 'prog_hspc_results' not found

I recommend viewing the dataframe to see the new columns. We now have dataframe with all the info we need, normalised counts, log2 normalised counts, statistical comparisons with fold changes and p values, information about the gene other than just the id

🎬 Write the results to file:

data.frame(res_prog_hspc$prog, 
           ensembl_gene_id = row.names(res_prog_hspc$prog)) |> 
  write_csv("results/prog_hspc_results.csv")

πŸ€— Look after future you!

🎬 Go through you script (cont-fgf-s30.R or hspc-prog.R) and tidy up. You might need to :

  • collect together library statements at the top of the script
  • remove code that you needed to start today but which wouldn’t be needed running the script from the top (e.g., importing)
  • edit your comments for clarity
  • rename variables for consistency or clarity
  • remove house keeping or exploratory code
  • restyle code, add code section headers etc

πŸ₯³ Finished

Well Done!

Independent study following the workshop

Consolidate

The Code file

These contain all the code needed in the workshop even where it is not visible on the webpage.

The workshop.qmd file is the file I use to compile the practical. Qmd stands for Quarto markdown. It allows code and ordinary text to be interleaved to produce well-formatted reports including webpages. Right-click on the link and choose Save-As to download. You will be able to open the Qmd file in RStudio. Alternatively, View in Browser. Coding and thinking answers are marked with #---CODING ANSWER--- and #---THINKING ANSWER---

Pages made with R (R Core Team 2024), Quarto (Allaire et al. 2024), knitr [Xie (2024); knitr2; knitr3], kableExtra (Zhu 2021)

References

Allaire, J. J., Charles Teague, Carlos Scheidegger, Yihui Xie, and Christophe Dervieux. 2024. β€œQuarto.” https://doi.org/10.5281/zenodo.5960048.
Birney, Ewan, T. Daniel Andrews, Paul Bevan, Mario Caccamo, Yuan Chen, Laura Clarke, Guy Coates, et al. 2004. β€œAn Overview of Ensembl.” Genome Research 14 (5): 925–28. https://doi.org/10.1101/gr.1860604.
Hester, Jim, Hadley Wickham, and GΓ‘bor CsΓ‘rdi. 2023. β€œFs: Cross-Platform File System Operations Based on ’Libuv’.”
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. β€œModerated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15: 550. https://doi.org/10.1186/s13059-014-0550-8.
Lun, Aaron T. L., Davis J. McCarthy, and John C. Marioni. 2016. β€œA Step-by-Step Workflow for Low-Level Analysis of Single-Cell RNA-Seq Data with Bioconductor.” F1000Res. 5: 2122. https://doi.org/10.12688/f1000research.9501.2.
Martin, Fergal J, M Ridwan Amode, Alisha Aneja, Olanrewaju Austine-Orimoloye, Andrey G Azov, If Barnes, Arne Becker, et al. 2023. β€œEnsembl 2023.” Nucleic Acids Research 51 (D1): D933–41. https://doi.org/10.1093/nar/gkac958.
R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Smedley, Damian, Syed Haider, Benoit Ballester, Richard Holland, Darin London, Gudmundur Thorisson, and Arek Kasprzyk. 2009. β€œBioMart Biological Queries Made Easy.” BMC Genomics 10 (1): 22. https://doi.org/10.1186/1471-2164-10-22.
Wickham, Hadley. 2023. Conflicted: An Alternative Conflict Resolution Strategy. https://conflicted.r-lib.org/.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain FranΓ§ois, Garrett Grolemund, et al. 2019. β€œWelcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, and Jennifer Bryan. 2023. Readxl: Read Excel Files. https://readxl.tidyverse.org.
Xie, Yihui. 2024. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.
Zhu, Hao. 2021. β€œkableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax.” https://CRAN.R-project.org/package=kableExtra.

Footnotes

  1. Bioconductor is a project that develops and supports R packages for bioinformatics.β†©οΈŽ