R/LinkedCharts Tutorial

Augmenting and checking a standard RNA-Seq analysis

Appendix

This document describes a standard analysis of the RNA-Seq dataset used in this tutorial.

We are working with data from this paper:

C. Conway et al.: Elucidating drivers of oral epithelial dysplasia formation and malignant transformation to cancer using RNAseq. Oncotarget, 6:40186-40201 (2015), doi:10.18632/oncotarget.5529

Conweay et al. have collected tissue samples from 19 patients with oral squamous cell carcinoma. From each patient, they took 3 samples, one of normal oral mucosa (“N”), one of epithelial dysplasia (i.e., abnormal but not yet malignant tissue, “D”), and one sample of the tumour (“T”). They prepared RNA-Seq libraries from the samples and sequenced them. They deposited their raw data at the European Read Archive (ERA) under accession PRJEB7455 (secondary accession: ERP007185).

The Recount2 project (Collado Torres et al., Nature Biotechnology, 2017, doi:10.1038/nbt.3838) has downloaded this data set (and thousands other RNA-Seq data sets), ran it through a standardized preprocessing pipeline and provides a gene count table for download from the Recount2 web site.

Getting data and metadata

We download this count matrix from Recount2 (available via the secondary accession ERP007185) and read it into R

download.file("http://duffel.rail.bio/recount/v2/ERP007185/counts_gene.tsv.gz", 
  "ERP007185_counts_gene.tsv.gz" )

countMatrix <- as.matrix( read.table( 
  gzfile("ERP007185_counts_gene.tsv.gz"), header=TRUE, row.names="gene_id" ) )

The upper-left corner of the matrix looks like this

countMatrix[1:5,1:5]

##                    ERR649035 ERR649053 ERR649022 ERR649026 ERR649029
## ENSG00000000003.14     37354    136965     96040    144832    261208
## ENSG00000000005.5        371       119      1100      1139      1635
## ENSG00000000419.12     55566     53733     29553    111980     54818
## ENSG00000000457.13     77985     73875     44700     89683     83028
## ENSG00000000460.16     25159     31507     20118     37592     38827

The column names are ERA run accessions, whcih we first need to map to useful sample names. Unfortuantely, the sample table available from Recount2 does not contain a column that would actually describe the samples, the more complete sample table available from ERA, however, does (though it took us a bit to find this).

We went to the ERA page for the project. There, we clicked on “Select column” and selected the columns “Run accession” and “Submitter’s sample name”, then downloaded this table. A copy of this table is available from the present site: PRJEB7455.txt.

We will use tidyverse functionality to process the table

library( tidyverse )

We read in the table

sampleTable <- read_tsv( "PRJEB7455.txt" )

## Parsed with column specification:
## cols(
##   run_accession = col_character(),
##   sample_alias = col_character()
## )

sampleTable

## # A tibble: 57 x 2
##    run_accession sample_alias
##    <chr>         <chr>       
##  1 ERR649018     PG038-D     
##  2 ERR649019     PG136-N     
##  3 ERR649020     PG049-T     
##  4 ERR649021     PG049-D     
##  5 ERR649022     PG049-N     
##  6 ERR649023     PG063-D     
##  7 ERR649024     PG063-T     
##  8 ERR649025     PG038-T     
##  9 ERR649026     PG063-N     
## 10 ERR649027     PG108-T     
## # ... with 47 more rows

We split the sample name into its two parts, the patient ID (before the dash) and the tissue type (after the dash), then change the tissue type to more descriptive labels, impose an ordering of the levels by severity ( normal < dysplasia < tumour ), and finally sort the table.

sampleTable %>%
   rename( sample_name = sample_alias ) %>%
   separate( sample_name, c( "patient", "tissue" ), "-", remove=FALSE ) %>%
   mutate( tissue = fct_recode( tissue, "normal"="N", "dysplasia"="D", "tumour"="T" ) ) %>%
   mutate( tissue = fct_relevel( tissue, "normal", "dysplasia", "tumour" ) ) %>%
   arrange( patient, tissue ) ->
      sampleTable

This is now our final sample table

sampleTable

## # A tibble: 57 x 4
##    run_accession sample_name patient tissue   
##    <chr>         <chr>       <chr>   <fct>    
##  1 ERR649059     PG004-N     PG004   normal   
##  2 ERR649060     PG004-D     PG004   dysplasia
##  3 ERR649061     PG004-T     PG004   tumour   
##  4 ERR649035     PG038-N     PG038   normal   
##  5 ERR649018     PG038-D     PG038   dysplasia
##  6 ERR649025     PG038-T     PG038   tumour   
##  7 ERR649022     PG049-N     PG049   normal   
##  8 ERR649021     PG049-D     PG049   dysplasia
##  9 ERR649020     PG049-T     PG049   tumour   
## 10 ERR649026     PG063-N     PG063   normal   
## # ... with 47 more rows

Now, we bring the count matrix columns into the same order as in the sample table and replace the run accessions in the column names by the more readable sample names

countMatrix <- countMatrix[ , sampleTable$run_accession ]
colnames(countMatrix) <- sampleTable$sample_name

Next, we replace the rownames, currently Ensembl gene IDs, with gene names. To this, we first download from Ensembl Biomart a table mapping all Ensembl human gene IDs to gene symbols

bmtbl <- biomaRt::getBM( 
   c( "ensembl_gene_id", "external_gene_name" ), 
   mart = biomaRt::useEnsembl( "ensembl", "hsapiens_gene_ensembl" ) )

head(bmtbl)

##   ensembl_gene_id external_gene_name
## 1 ENSG00000284532            MIR4723
## 2 ENSG00000238933            RF00019
## 3 ENSG00000275693            RF02116
## 4 ENSG00000275451            MIR6085
## 5 ENSG00000222870         RNU6-1328P
## 6 ENSG00000252023          RNU6-581P

We replace the rownames with the gene symbols, unless the gene symbol is missing or duplicated. As usual in R, such simple operations look most complicated:

tibble( rowname = rownames(countMatrix) ) %>% 
   mutate( ensembl_gene_id = str_extract( rowname, "ENSG\\d*" ) ) %>%
   left_join( bmtbl, by = "ensembl_gene_id" ) %>%
   mutate( new_rowname = 
      ifelse( ! is.na(external_gene_name) & 
            ! external_gene_name %in% external_gene_name[duplicated(external_gene_name)], 
         external_gene_name, 
         rowname ) ) %>%
   pull( new_rowname ) ->
     rownames(countMatrix)

Now, we have this

countMatrix[ 1:5, 1:5 ]

##          PG004-N PG004-D PG004-T PG038-N PG038-D
## TSPAN6     11642   25423    1526   37354   30699
## TNMD         405       0    1628     371       0
## DPM1       21828   32694     973   55566   33814
## SCYL3      31332   38436   11661   77985   63853
## C1orf112   14207   21808    8047   25159   25862

Running limma-voom

With everything in place, we now run a standard differential expression analysis using limma-voom (Law et al., Genome Biology, 2014, doi:10.1186/gb-2014-15-2-r29).

library( limma )

We aim to find any genes which differ significantly between any of the three tissue types, accounting for the patient as a blocking factor.

This is our model matrix

mm <- model.matrix( ~ patient + tissue, sampleTable )
colnames(mm)

##  [1] "(Intercept)"     "patientPG038"    "patientPG049"   
##  [4] "patientPG063"    "patientPG079"    "patientPG086"   
##  [7] "patientPG105"    "patientPG108"    "patientPG114"   
## [10] "patientPG122"    "patientPG123"    "patientPG129"   
## [13] "patientPG136"    "patientPG137"    "patientPG144"   
## [16] "patientPG146"    "patientPG174"    "patientPG187"   
## [19] "patientPG192"    "tissuedysplasia" "tissuetumour"

The model matrix columns concerning the tissue type are the last two

mm_tissue_columns <- which( attr( mm, "assign" ) == 2 )
mm_tissue_columns

## [1] 20 21

Now, we run the standard workflow of voom and limma

vm <- voom( countMatrix, mm )
fit <- lmFit( vm, mm )
fit <- eBayes( fit, robust=TRUE )

Last, we extract the results for an F test comparing the full model with the one missing the last two columns, to see which genes differ significantly between tissues.

voomResult <- topTable( fit, coef = mm_tissue_columns, number=Inf, sort.by="none" )

head(voomResult)

##          tissuedysplasia tissuetumour   AveExpr         F      P.Value
## TSPAN6        -0.6794364  -1.17596765  3.940362 17.369998 2.751020e-06
## TNMD          -2.2216329   0.29408807 -4.690724  4.164227 2.206521e-02
## DPM1           0.3353803   0.14224571  3.551261  1.659938 2.018351e-01
## SCYL3          0.2533487   0.06609826  4.212959  2.152911 1.282229e-01
## C1orf112       0.3422943   0.32698821  3.350802  3.262281 4.774102e-02
## FGR            0.6687112   1.06189945  2.673165  7.042142 2.221402e-03
##             adj.P.Val
## TSPAN6   8.515251e-05
## TNMD     7.345409e-02
## DPM1     3.391010e-01
## SCYL3    2.492604e-01
## C1orf112 1.264950e-01
## FGR      1.397246e-02

Let’s make an MA plot of these, showing the log fold change tumour/normal versus the average expression and highlighting genes significant at 10% FDR in the F test.

plot( voomResult$AveExpr, voomResult$tissuetumour, 
   col = ifelse( voomResult$adj.P.Val < .1, "red", "black" ), pch=20, cex=.3 )

plot of chunk maplot

Save the results

We save the result

save( countMatrix, sampleTable, voomResult, file="orcc.rda" )

This file is available here: orcc.rda

Session info

sessionInfo()

## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.18.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] limma_3.34.9    bindrcpp_0.2.2  forcats_0.3.0   stringr_1.3.1  
##  [5] dplyr_0.7.6     purrr_0.2.5     readr_1.1.1     tidyr_0.8.1    
##  [9] tibble_1.4.2    ggplot2_3.0.0   tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.18         lubridate_1.7.4      lattice_0.20-35     
##  [4] prettyunits_1.0.2    assertthat_0.2.0     digest_0.6.17       
##  [7] utf8_1.1.4           R6_2.2.2             cellranger_1.1.0    
## [10] plyr_1.8.4           backports_1.1.2      stats4_3.4.2        
## [13] RSQLite_2.1.1        evaluate_0.11        highr_0.7           
## [16] httr_1.3.1           pillar_1.3.0         rlang_0.2.2         
## [19] progress_1.2.0       lazyeval_0.2.1       curl_3.2            
## [22] readxl_1.1.0         rstudioapi_0.7       blob_1.1.1          
## [25] S4Vectors_0.16.0     statmod_1.4.30       RCurl_1.95-4.11     
## [28] bit_1.1-14           biomaRt_2.34.2       munsell_0.5.0       
## [31] broom_0.5.0          compiler_3.4.2       modelr_0.1.2        
## [34] pkgconfig_2.0.2      BiocGenerics_0.24.0  tidyselect_0.2.4    
## [37] IRanges_2.12.0       XML_3.98-1.16        fansi_0.3.0         
## [40] crayon_1.3.4         withr_2.1.2          bitops_1.0-6        
## [43] grid_3.4.2           nlme_3.1-137         jsonlite_1.5        
## [46] gtable_0.2.0         DBI_1.0.0            magrittr_1.5        
## [49] scales_1.0.0         cli_1.0.0            stringi_1.2.4       
## [52] xml2_1.2.0           org.Hs.eg.db_3.5.0   tools_3.4.2         
## [55] bit64_0.9-7          Biobase_2.38.0       glue_1.3.0          
## [58] hms_0.4.2            parallel_3.4.2       AnnotationDbi_1.40.0
## [61] colorspace_1.3-2     rvest_0.3.2          memoise_1.1.0       
## [64] knitr_1.20           bindr_0.1.1          haven_1.1.2