R/LinkedCharts Tutorial
Augmenting and checking a standard RNA-Seq analysis
In this simple tutorial, we show how to explore a standard RNA-Seq analysis.
The data
We are working with data from this paper:
C. Conway et al.: Elucidating drivers of oral epithelial dysplasia formation and malignant transformation to cancer using RNAseq. Oncotarget, 6:40186-40201 (2015), doi:10.18632/oncotarget.5529 |
Conway et al. have collected tissue samples from 19 patients with oral squamous cell carcinoma (OSCC). From each patient, they took 3 samples, one of normal oral mucosa (“N”), one of epithelial dysplasia (i.e., abnormal but not yet malignant tissue, “D”), and one sample of the tumour (“T”). We will use their data (available from the European Read Archive (ERA) under accession PRJEB7455) to demonstrate how LinkedCharts can be helpful in a standard bioinformatics task like analysing an RNA-Seq data set.
Fortunately, we do not have to redo the whole analysis, as the Recount2 project (Collado Torres et al., Nature Biotechnology, 2017, doi:10.1038/nbt.3838) gives us a headstart by providing a count table for this and other data sets.
Nevertheless, a bit of data wrangling is necessary, and in order to keep this tutorial short, we describe these preparatory steps in an appendix.
You can download the file that produced this document from here: oscc.Rmd.
We hence start by loading the data file resulting from the preparations, which is available here: oscc.rda
load( "oscc.rda" )
We have data for 57 samples (19 patients x 3 tissue samples per patient), with the metadata in sampleTable
:
sampleTable
## run_accession sample_name patient tissue
## 1 ERR649059 PG004-N PG004 normal
## 2 ERR649060 PG004-D PG004 dysplasia
## 3 ERR649061 PG004-T PG004 tumour
## 4 ERR649035 PG038-N PG038 normal
## 5 ERR649018 PG038-D PG038 dysplasia
## 6 ERR649025 PG038-T PG038 tumour
## 7 ERR649022 PG049-N PG049 normal
## 8 ERR649021 PG049-D PG049 dysplasia
## 9 ERR649020 PG049-T PG049 tumour
## 10 ERR649026 PG063-N PG063 normal
## 11 ERR649023 PG063-D PG063 dysplasia
## 12 ERR649024 PG063-T PG063 tumour
## 13 ERR649062 PG079-N PG079 normal
## 14 ERR649064 PG079-D PG079 dysplasia
## 15 ERR649065 PG079-T PG079 tumour
## 16 ERR649034 PG086-N PG086 normal
## 17 ERR649037 PG086-D PG086 dysplasia
## 18 ERR649036 PG086-T PG086 tumour
## 19 ERR649053 PG105-N PG105 normal
## 20 ERR649063 PG105-D PG105 dysplasia
## 21 ERR649045 PG105-T PG105 tumour
## 22 ERR649029 PG108-N PG108 normal
## 23 ERR649028 PG108-D PG108 dysplasia
## 24 ERR649027 PG108-T PG108 tumour
## 25 ERR649072 PG114-N PG114 normal
## 26 ERR649073 PG114-D PG114 dysplasia
## 27 ERR649074 PG114-T PG114 tumour
## 28 ERR649069 PG122-N PG122 normal
## 29 ERR649070 PG122-D PG122 dysplasia
## 30 ERR649071 PG122-T PG122 tumour
## 31 ERR649043 PG123-N PG123 normal
## 32 ERR649033 PG123-D PG123 dysplasia
## 33 ERR649044 PG123-T PG123 tumour
## 34 ERR649030 PG129-N PG129 normal
## 35 ERR649031 PG129-D PG129 dysplasia
## 36 ERR649032 PG129-T PG129 tumour
## 37 ERR649019 PG136-N PG136 normal
## 38 ERR649041 PG136-D PG136 dysplasia
## 39 ERR649042 PG136-T PG136 tumour
## 40 ERR649038 PG137-N PG137 normal
## 41 ERR649039 PG137-D PG137 dysplasia
## 42 ERR649040 PG137-T PG137 tumour
## 43 ERR649046 PG144-N PG144 normal
## 44 ERR649047 PG144-D PG144 dysplasia
## 45 ERR649048 PG144-T PG144 tumour
## 46 ERR649049 PG146-N PG146 normal
## 47 ERR649050 PG146-D PG146 dysplasia
## 48 ERR649051 PG146-T PG146 tumour
## 49 ERR649052 PG174-N PG174 normal
## 50 ERR649055 PG174-D PG174 dysplasia
## 51 ERR649054 PG174-T PG174 tumour
## 52 ERR649058 PG187-N PG187 normal
## 53 ERR649057 PG187-D PG187 dysplasia
## 54 ERR649056 PG187-T PG187 tumour
## 55 ERR649066 PG192-N PG192 normal
## 56 ERR649067 PG192-D PG192 dysplasia
## 57 ERR649068 PG192-T PG192 tumour
Our actual data is a matrix of read counts: The samples are the columns, the rows the genes, the matrix entries the
number of RNA-Seq reads that mapped to each gene in each sample. Here is the top left corner of countMatrix
:
countMatrix[ 1:5, 1:5 ]
## PG004-N PG004-D PG004-T PG038-N PG038-D
## TSPAN6 11642 25423 1526 37354 30699
## TNMD 405 0 1628 371 0
## DPM1 21828 32694 973 55566 33814
## SCYL3 31332 38436 11661 77985 63853
## C1orf112 14207 21808 8047 25159 25862
An interactive heatmap for quality assessment
When starting to work with such data, it is usually a good idea to first assess the quality of the data. It is unlikely that all of these many samples are of equally perfect quality. A good way to perform a first check is to calculate the correlation or distance between all pairs of samples. We use Spearman correlation so that we don not have to worry (yet) about how to normalize and transform the data.
corrMat <- cor( countMatrix, method="spearman" )
corrMat[1:5,1:5]
## PG004-N PG004-D PG004-T PG038-N PG038-D
## PG004-N 1.0000000 0.8927562 0.7745358 0.8963811 0.8948413
## PG004-D 0.8927562 1.0000000 0.7775693 0.9026254 0.9015365
## PG004-T 0.7745358 0.7775693 1.0000000 0.7569003 0.7538902
## PG038-N 0.8963811 0.9026254 0.7569003 1.0000000 0.9363055
## PG038-D 0.8948413 0.9015365 0.7538902 0.9363055 1.0000000
We visualize this matrix as a heatmap (using Raivo Kolde’s pheatmap
package)
rownames(sampleTable) <- sampleTable$sample_name # pheatmap insists on that
pheatmap::pheatmap( corrMat,
cluster_rows=FALSE, cluster_cols=FALSE,
annotation_col = sampleTable[,c("patient","tissue")] )
We can see that most samples pairs correlate well with each other, with correlation coefficients above ~0.85, in the yellow-orange colour range. Same samples, however, show consistently poorer correlation with all other samples. But is 0.8 really a good cut point, or is this just what the arbitrary color scale happens to highlight?
Each cell in this heatmap summerizes a scatter plot. For example, the cell relating to the first two samples, PG004-N and PG004-D, is the Spearman correlation associated with this plot:
plot(
log10( 1 + countMatrix[,"PG004-N"] ),
log10( 1 + countMatrix[,"PG004-D"] ),
asp=1, col=adjustcolor("black",0.2), pch=20, cex=.5 )
We have plotted here logarithms of the count values, log10(k+1) (with one pseudocount added to avoid zeroes, which could otherwise not be shown in a log-log plot, because log 0 = -∞).
If we looked at several such plots for different squares in the heatmap, maybe some orange ones, a few yellow ones, and some of the blueish ones, we could get a quick feeling about how good or bad a correlation value of 0.9 or 0.8 is.
With LinkedCharts, we can do precisely that. We can display the two plots side-by-side, and when one clicks with the mouse on a cell of the heatmap, the scatter plot will change to display the correlation between the two samples associated with the heatmap cell.
Here is first the code to display the two plots side-by-side, for now without linking them (i.e., without handling mouse clicks):
library( rlc )
openPage( useViewer=FALSE, layout="table1x2" )
lc_heatmap(
dat(
value = corrMat
),
place = "A1"
)
sampleX <- "PG004-N"
sampleY <- "PG004-D"
lc_scatter(
dat(
x = log10( 1 + countMatrix[,sampleX] ),
y = log10( 1 + countMatrix[,sampleY] ),
size = .3,
opacity = .3
),
place = "A2"
)
To run this code, you first need to install R/LinkedCharts. If you haven’t done so yet, see the simple instructions on the overview page.
Once you run the code, you should see, in your web browser, charts like these. (Give the scatter plot a few seconds to appear; it has nearly 60,000 points.) Note how sample names and gene names are displayed when you hover your mouse over a cell in the heatmap or a point in the scatter plot. You can also zoom in (draw a rectangle with the mouse) or zoom out (double-click) or use other functions in the tool menu (click on the arrow button).
We go through this code now and explain line for line:
First, we load the R/LinkedCharts package (“rlc
”). Then, we use openPage
to open a new page. We can open the page
either in the web browser (useViewer=FALSE
) or in the viewer pane of RStudio (useViewer=TRUE
, the default). As
we have two plots, we opt for a layout with 1 row and 2 columns (layout="table1x2"
).
Next, we insert the first chart, the heatmap, using the lc_heatmap
function. All charts in R/LinkedCharts are placed onto the page
with functions starting with lc_
, and they all want a first argument that sets all their data and that has to be enclosed
in dat(...)
(which we will explain later). For a heatmap, we just need a matrix, which has to be assigned (in the dat
phrase)
to value
. The second argument is the place
where the chart should be put. In our table1x2
layout, the places are
labelled A1
and A2
. (If we had, say, a table2x2
layout, there would also be B1
and B2
for the second row.)
Now, we set two global variables, sampleX
and sampleY
, to the names of the two samples that we want to initially
display in the scatter plot.
The scatter plot is inserted with lc_scatter
. Again, its first argument must be enclosed in dat(...)
. Within the dat
, we
set four properties: x
, y
, size
and opacity
. The first two are mandatory: They are vectors with the values of the x
and y coordinates. As before, when using R’s standard plot
function, we use log10( 1 + countMatrix[,sample])
.
The other two properties are optional: We set size = .3
to make the points a bit smaller than the default, and we make
them somewhat transparent, by reducing the opacity
, so that one can see whether several points sit on top of each other
(similar to the use of adjustcolor
above). We place the chart at position A2
, to the right of the heatmap at A1
.
Next, we need to “link” the charts. For this, we just add four very simple lines, marked below with hashes (#):
library( rlc )
openPage( useViewer=FALSE, layout="table1x2" )
sampleX <- "PG004-N"
sampleY <- "PG004-D"
lc_heatmap(
dat(
value = corrMat,
on_click = function(k) { # \
sampleX <<- k[1] # | Linking the
sampleY <<- k[2] # | charts
updateCharts( "A2" ) # /
}
),
place = "A1"
)
countMatrix_downsampled <-
countMatrix[ sample.int( nrow(countMatrix), 8000 ), ]
lc_scatter(
dat(
x = log10( 1 + countMatrix_downsampled[,sampleX] ),
y = log10( 1 + countMatrix_downsampled[,sampleY] ),
size = .3,
opacity = .3
),
place = "A2"
)
We have added a second property to the heatmap, inside the dat
: The property on_click
tells LinkedCharts what to
do when the user clicks on a cell in the heatmap. It is a function with one argument, k
, which R/LinkedCharts will
call whenever a mouse-click event happens in the heatmap, and R/LinkedCharts will then place in k
the row and column
indices of the square that was clicked.
Our on_click
function does just two things: First, it stores the row and column indices (passed as k[1]
and k[2]
) in
sampleX
and sampleY
, the two global variables that we used to indicate which samples the scatter plot should show. Now, they
indicate that the scatter plot should show the samples corresponding to the square that has just been clicked. We only need to tell
the scatter plot that its data has changes and that it should redraw itself. Hence, the call to updateCharts
, which causes
the indicated chart (here the one at A2
) to be redrawn.
Now we can also see why the property assignments have to be enclosed into dat
: dat
is a construct that keeps the code it
encloses in an unevaluated form, so that it can be re-evaluated over and over as needed. And here, the code in the scatter plots
dat
, e.g., x = log10( 1 + countMatrix_downsampled[,sampleX] )
, will get a different result whenever sampleX
has changed.
This is the general idea of LinkedCharts: You describe, with the dat
properties, how your plot should look like, using global
variables, which you can change, e.g., when the user clicks somewhere, and then cause the plot to be redrawn. This makes is extremely easy to link charts in the manner just shown.
One subtelty: Because the on_click
function needs to set a global variable, we have used in it the special global
assignment operator <<-
instead of the usual <-
or =
. It is important not to forget to use <<-
, as otherwise, R would
create a local variable sampleX
and discard it immediately instead of changing the global variable that also the other chart
can see.
And for completeness: There is a second change in the plot above: We have downsampled the count matrix from 58,000 genes to just 8,000 genes. This is merely to ensure that the app reacts smoothly to mouse clicks also on less powerful computers. It shouldn’t change the appearance of the plots much.
If you use the app, you can now easily see which samples are bad and how bad they are. For example, you will notice that they seem to have especially strong noise for the weaker genes.
Exploring the differentially expressed genes
In the appendix, which shows the data preparation, we have also performed a differential expression analysis using limma-voom. For details, see there. In brief, we have looked for genes whose expression differs between the three tissue types. The result table was also in the
data file oscc.rda
that we have loaded in the beginning:
head( voomResult )
## tissuedysplasia tissuetumour AveExpr F P.Value
## TSPAN6 -0.6794364 -1.17596765 3.940362 17.369998 2.751020e-06
## TNMD -2.2216329 0.29408807 -4.690724 4.164227 2.206521e-02
## DPM1 0.3353803 0.14224571 3.551261 1.659938 2.018351e-01
## SCYL3 0.2533487 0.06609826 4.212959 2.152911 1.282229e-01
## C1orf112 0.3422943 0.32698821 3.350802 3.262281 4.774102e-02
## FGR 0.6687112 1.06189945 2.673165 7.042142 2.221402e-03
## adj.P.Val
## TSPAN6 8.515251e-05
## TNMD 7.345409e-02
## DPM1 3.391010e-01
## SCYL3 2.492604e-01
## C1orf112 1.264950e-01
## FGR 1.397246e-02
Here, tissuedysplasia
and tissuetumour
are the log fold changes (LFCs) between dysplasia and normal, and between tumour and normal, respectively. P.Value
is the p value from an F test comparing the the three tissue types after account for patient-to-patient baseline differences, and adj.P.Val
is the result of a Benjamini-Hochberg adjustment.
The following code shows how to use R/LinkedCharts to explore these results
openPage( useViewer=FALSE, layout="table1x2" )
gene <- 1
lc_scatter(
dat(
x = voomResult$AveExpr,
y = voomResult$tissuetumour,
color = ifelse( voomResult$adj.P.Val < 0.1, "red", "black" ),
label = rownames(voomResult),
size = 1.3,
on_click = function(k) { gene <<- k; updateCharts("A2") }
),
"A1"
)
countsums <- colSums(countMatrix)
lc_scatter(
dat(
x = sampleTable$patient,
y = countMatrix[gene,] / countsums * 1e6 + .1,
logScaleY = 10,
colorValue = sampleTable$tissue,
title = rownames(countMatrix)[gene],
axisTitleY = "counts per million (CPM)",
ticksRotateX = 45
),
"A2"
)
Here is the app that this code creates:
The left plot is an MA plot: Each point is one gene in the result table, the x axis is the average expression of the gene (column AveExpr
in voomResult
), the y axis is the log fold change between tumour and normal tissue (column tissuetumour
). The colour of the points is red or black, indicating whether the gene is significant at 10% false discovery rate (voomResult$adj.P.Val < 0.1
). If one hovers the mouse over a point, the gene name is shown, which is taken from the rownames of the result table (label = rownames(voomResult)
).
The right plot now shows details of a selected gene, and as before, there is a global variable, gene
, which contains a number, indicating a row and hence a gene in the results table and the count matrix. (The rows of both voomResults
and countMatrix
are in the same order). Each point in the plot is one sample, the x position indicates the patient (x = sampleTable$patient
) and the colour the tissue type (colorValue = sampleTable$tissue
). The y axis shows the expression of the selected gene (countMatrix[gene,]
; remember that the rows in countMatrix
and resultsTable
are in the same order). We display the expression of the gene as “counts per million” (CPM). To this end, we divide the count value by the total count for each libary (countsums = colSums(countMatrix)
) and multiply by one million (1e6
). We also add 0.1 pseudocounts to avoid losing zeroes on the axis, which we have made logarithmic with decadic tick marks (logScaleY = 10
). Ticks on the other axis are rotated by 45 degrees (ticksRotateX = 45
) so that they don’t overlap with each other. Above the plot, a title shows the name of the displayed gene, as given by the rowname of the result table (title = rownames(countMatrix)[gene]
).
Note here that in the left plot, we used colour
, in the right plot colourValue
: The difference is that in the left plot, we passed a string vector with elements that directly describe a color ("red"
or "black"
), in the right plot, we pass a word (normal
, dysplasia
or tumour
) and expect LinkedCharts to choose three colours to map these to. LinkedCharts therefore also adds a legend.
Describing the two plots did not require more R code than when describing the same plots, when using e.g., ggplot2. To link the two plots, we
only needed a single extra line: on_click = function(k) { gene <<- k; updateCharts("A2") }
As explained before, this simply changes the global variable gene
to the index of the point the user clicked to and then redraws the
right-hand plot: There, reevaluating the content of the dat
call gives the new plot, because gene
has changed and now, another row of the countMatrix
is used.
If you play a bit with these linked charts you will quickly notice that they are helpful to gain a better feel for the data. You will quickly see how common it is that a gene really behaves the same way in all patients, and how strong such changes are; and you might even come up with ideas for a better statistic to use as y axis in the left-hand plot to really find the genes with the most consistant patterns.
Lastly, you may want to examine interesting genes. What is INHBA, for example? A look at Gene Cards might tell as more, but entering each gene symbol in Gene Cards’ search box is cumbersome. As Gene Cards URLs always have the form https://www.genecards.org/cgi-bin/carddisp.pl?gene=SYMBOL
, with SYMBOL
replaced by some gene symbol, we could automatically put a link below the right-hand plot. Here’s the code from above, with a few more lines to this end:
openPage( useViewer=FALSE, layout="table2x2" )
# Now two rows: ^
gene <- 1
lc_scatter(
dat(
x = voomResult$AveExpr,
y = voomResult$tissuetumour,
color = ifelse( voomResult$adj.P.Val < 0.1, "red", "black" ),
label = rownames(voomResult),
size = 1.3,
on_click = function(k) { gene <<- k; updateCharts( c( "A2", "B2" ) ) }
# Update the new chart, too: ^^
),
"A1"
)
countsums <- colSums(countMatrix)
lc_scatter(
dat(
x = sampleTable$patient,
y = countMatrix[gene,] / countsums * 1e6 + .1,
logScaleY = 10,
colorValue = sampleTable$tissue,
title = rownames(countMatrix)[gene],
axisTitleY = "counts per million (CPM)",
ticksRotateX = 45
),
"A2"
)
# one more chart:
lc_html(
dat( content = sprintf(
"<a href='https://www.genecards.org/cgi-bin/carddisp.pl?gene=%s' target='_blank'>Gene card for %s</a>",
rownames(countMatrix)[gene], rownames(countMatrix)[gene] ) ),
"B2"
)
Now, we have added one more chart, B2, below the scatter plot B1. This is not a real chart – rather, it is simply a piece of HTML code that is
inserted verbatim into the web page at position B2. Here, we use the standard R function sprintf
to construct a standard hyperlink tag. Unless
you have not yet worked with HTML at all yet, you will remeber that such a tag takes the form <a href="http://some.site.xy/some/link> Text </a>
,
and so we simply insert the gene name (as before: rownames(countMatrix)[gene]
) into the GeneCard URL behind href
and into the link text,
which reads “Gene card for [gene]”. We also add the tag attribute target='_blank'
, which instruct the browser to display the gene card in a new
window or tab. This is an easy way to quickly access online information when exploring your data.