R/LinkedCharts Tutorial:

Exploring single-cell data with R/LinkedCharts

This tutorial demonstrates how our R/LinkedCharts package can be used to created linked scatter charts.

We will develop an interactive app that produces, from less than a page of R code, this here:

The app above is provided for demonstration purposes. Functionally, it is identical to what you get after running the code in R, but uses subsampled data so that not to load too big data files.

Readers in a hurry can download the final script here, and run it as is in R. Of course, you will have to read on to see what the different charts are supposed to show.

You can also download the file that produced this document: example_1.Rmd

Example data

As example data, we use single-cell transcriptomics data from the “CiteSeq” paper (Stoeckius et al., Simultaneous epitope and transcriptome measurement in single cells, Nature Methods, 14:865, 2017). This paper describes a new method to simultaneously measure the transcriptome as well as the “epitome” (the abundance of selected surface markers) of single cells, using droplet sequencing. They demonstrate it with a sample of cord blood mononuclear cells.

Both the RNA-Seq and the epitome data are available from GEO (GSE100866. We will use only the RNA-Seq data, which is available as file GSE100866_CBMC_8K_13AB_10X-RNA_umi.csv.gz from the GEO page.

The file contains a table with one row per gene and one column per cell, giving for each gene and each cell the number of transcript molecules (UMIs – Unique Molecular Identifiers) detected in the cDNA of the cell that mapped to the gene. In the following, we will explore this data matrix.

Data preparation

We first perform some very simple data preparation, using only base R commands. If you have worked with expression data before, this will look completely familiar.

Some of these steps take time, but you can skip over them and then later just read in the R data file citeseq_data.rda.

First, we load the data matrix

# The path to the data file downloaded to GEO
fileName <- "~/Downloads/GSE100866_CBMC_8K_13AB_10X-RNA_umi.csv.gz"

# Load the file. (This takes a while as it is a large file.)
countMatrix <- as.matrix( read.csv( gzfile( fileName ), row.names = 1) )

Stoecklin et al. have spiked in a few mouse cells to their human cord blood sample for quality control purposes, and have therefore mapped everything against both the human and the mouse genome. We keep only the cells with nearly only human transcripts and also remove the mouse gene counts from the data matrix.

# Calculate for each cell ratio of molecules mapped to human genes
# versus mapped to mouse genes
human_mouse_ratio <- 
  colSums( countMatrix[ grepl( "HUMAN" , rownames(countMatrix) ), ] ) / 
  colSums( countMatrix[ grepl( "MOUSE" , rownames(countMatrix) ), ] )

# Keep only the cells with at least 10 times more human than mouse genes
# and keep only the counts for the human genes.
countMatrix <- countMatrix[ 
  grepl( "HUMAN" , rownames(countMatrix) ), 
  human_mouse_ratio > 10 ]

# Remove the "HUMAN_" prefix from the gene names
rownames(countMatrix) <- sub( "HUMAN_", "", rownames(countMatrix) )

Now, we have a matrix of counts that looks like this

# size of the matrix
dim(countMatrix)
## [1] 20400  8005
# top left corner of the matrix
countMatrix[ 1:5, 1:5 ]
##          TACAGTGTCTCGGACG GTTTCTACATCATCCC ATCATGGAGTAGGCCA
## A1BG                    0                0                0
## A1BG-AS1                0                0                0
## A1CF                    0                0                0
## A2M                     0                0                0
## A2M-AS1                 0                0                0
##          GTACGTATCCCATTTA ATGTGTGGTCGCCATG
## A1BG                    0                0
## A1BG-AS1                0                0
## A1CF                    0                0
## A2M                     0                0
## A2M-AS1                 0                0

Each column corresponds to a cell, identified by its DNA barcode, and each row to a gene. The numbers are the number of transcript molecules (UMIs – unique molecular identifiers) from a given cell that were mapped to the given gene.

Next, we calculate the sums of the counts for each cell, and divide them by 1000. We will call these the “size factors” and use them for normalization: when comparing the counts from two cells, we will consider them as comparable after dividing them by their respective size factors.

sf <- colSums(countMatrix) / 1000

For our first example, we will ask which genes actually differ in an interesting way from cell to cell. To this end, we calculate for each gene the mean and variance of the normalized counts

means <- apply( countMatrix, 1, function(x) mean( x / sf ) )
vars <- apply( countMatrix, 1, function(x) var( x / sf ) )

If you want to skip over preceding steps, you can also just load this R data file from here: citeseq_data.rda.

Analysis the old-fashioned way

Here is a plot of the variance-to-mean ratios versus the means

plot( means, vars / means, 
   pch = ".", log = "xy", col = adjustcolor( "black", 0.4 ) )
## Warning in xy.coords(x, y, xlabel, ylabel, log): 221 x values <= 0 omitted
## from logarithmic plot
abline( h = mean( 1/sf ), 
	col = adjustcolor( "red", 0.5 ) )

plot of chunk var_mean_simple

All the genes that are scattering close to the red line have no or only little signal: the Poisson noise dominates over any biological signal of interest that may there be. Only the genes that are clearly above the red line can convey useful information.

(To understand why the red line marks the expected strength of Poisson noise, and why it is has to be placed at mean( 1/sf ), please see our write-up at … [Sorry, still under preparation at the moment.])

We can see groups of genes emerging from the horizontal clouds at different points. Which genes are these? Are they active in only few cells or in many cells?

It would be helpful if, for each gene, we could get a plot like this one

gene <- "CD79A"

plot( sf, countMatrix[ gene, ], 
   log = "x", col = adjustcolor( "black", 0.4 ), main = gene )

plot of chunk CD79A_simple

But where is gene “CD79A” on the previous plot? And, given a specific point on the overview plot (the first one), how can we quickly get the detail plot (the second one)?

Here we can click on a point in the left chart, to change what is displayed in the right plot:

Now, we show how to create such linked charts with R/LinkedCharts.

Using LinkedCharts

If you haven’t done so yet, install the R/LinkedChart package with

devtools::install_github( "anders-biostat/rlc" )

Load the R/LinkedCharts library

library( rlc )

To produce the first scatter plot, use

openPage( useViewer = FALSE, layout = "table2x2" )
## WebSocket opened
lc_scatter(
  dat( 
    x = means, 
    y = vars / means,
    logScaleX = 10, 
    logScaleY = 10,
    size=2.5
  ), 
  "A1"
)
## Chart 'A1' added.
## Layer 'Layer1' is added to chart 'A1'.

The openPage function opens a new page, either in your web browser (for useViewer=FALSE) or in the “Viewer” pane of RStudio (for useViewer=TRUE, the default). We have chosen to use a browser window, as the viewer pane is a bit small to show two plots side by side. As layout for our charts, we use a simple 2x2 grid, with four chart positions labeled “A1”, “A2”, “B1”, and “B2”.

The lc_scatter function inserts a scatter plot. The first argument sepcifies values for the plots parameters. Unsurprisingly, we have to pass two vectors with x and y coordinates of the points, and furthermore, we specify a few more optional parameters: Both axes should be logarithmic, with tick marks labelling multiples of 10 and size of the points should be smaller than the default one (6). Note how all the parameters are wrapped in a call to a function named dat. This will be explained later.

The second parameter (“place”) specifies where to put the plot on the page: here, at position A1, i.e., top left.

We will explain more details further below; but first, let’s add the second plot:

gene <- "CD79A"

lc_scatter(
  dat( 
    x = sf, 
    y = countMatrix[ gene, ], 
    logScaleX = 10, 
    title = gene,
    opacity = 0.2 ),
  "A2"
)
## Chart 'A2' added.
## Layer 'Layer1' is added to chart 'A2'.

As before, we have used lc_scatter to insert a scatter plot, now at position A2 (to the right of A1). As before, we use the vector sf for the x axis, and one row of countMatrix for the y axis. The variable gene contains the rowname of the row to display. We also use two further optional parameters: title sets the plot title above the chart (in this case the gene name, as stored in the variable gene), and opacity = 0.2 renders the points semi-transparent, so that we can better see whether several points are sitting on top of each other.

Now, we need to link the two charts. When the user clicks on a point in the left chart (chart A1), then the value of the variable gene should be changed to the name of the gene onto which the user has clicked. If we then cause the second chart (A2) to be redrawn, it will show the data for that gene. To explain to R/LinkedChart that this should happen when the user clicks on a data point, we change the code for the first chart to the following:

lc_scatter(
  dat( 
    x = means, 
    y = vars / means,
    logScaleX=10, 
    logScaleY=10,
    size = 2.5,
    on_click = function( i ) {
       gene <<- rownames(countMatrix)[ i ]
       updateCharts( "A2" )
    }
  ), 
  "A1"
)
## Layer 'Layer1' is added to chart 'A1'.

As you can see, we have merely added one more optional parameter to the dat block. It is called on_click and set to a function. This function will now be called whenever the user clicks on a point and it gets passed, as its argument i, the index of this point. We want gene to contain not the number of the gene’s row but its name. Hence, we simply look up the name of the gene as the i-th rowname of count matrix and write this into the variable.

Note that we assign to the variable gene using the <<- operator. This variant of the usual <- (or =) operator assigns to a global rather than a local variable. Had we used the normal <-, R would have created a local variable called gene within the function and forgot it immediately afterwards, rather than changing the global variable that we had defined before.

After changing the variable, we simply instruct R/LinkedCharts to update chart A2: updateCharts( "A2" ).

This is all we need to link the two charts.

Here, again, is the complete code. It is not much longer than the R code one would have written for normal static plots: Adding interactivity is easy with R/LinkedCharts.

library( rlc )

load( "citeseq_data.rda" )

openPage( useViewer=FALSE, layout = "table2x2" )
## WebSocket opened
gene <- "CD79A"

lc_scatter(
  dat( 
    x = means, 
    y = vars / means,
    logScaleX = 10, 
    logScaleY = 10,
    size = 2.5,
    on_click = function( i ) {
       gene <<- rownames(countMatrix)[ i ]
       updateCharts( "A2" )
    }
  ), 
  "A1"
)
## Chart 'A1' added.
## Layer 'Layer1' is added to chart 'A1'.
lc_scatter(
  dat( 
    x = sf, 
    y = countMatrix[ gene, ], 
    logScaleX = 10, 
    title = gene,
    opacity = 0.2 ),
  "A2"
)
## Chart 'A2' added.
## Layer 'Layer1' is added to chart 'A2'.

How does it work? The call to updateCharts causes all the code within the dat function of chart A2 to be re-evaluated, and then, the new data to be sent to the browser. As gene has changed, the re-evaluation of the code in the dat call will now give different values to y and to title, and the plot changes. This is why all the plot parameters (means, vars, etc.) have to be placed inside the dat call: The dat function is a little trick to keep R from evaluating this code straight away when you call lc_scatter but rather to keep the code as is in order to be able to re-evaluate it whenever needed. (In other words, dat just defines an anonymous function without arguments, but does so in a manner that allows for a light, unobstructive syntax.)

Adding a t-SNE plot

So far, our app demonstrated well the principle of using R/LinkedChart, but practitioners of single-cell transcriptomics might still find it unconvincing. Hence, let’s add a bit more to make it actually useful.

In single-cell RNA-Seq, one often employs dimension reduction methods, such as t-SNE. These produce a plot with points representing the cells such that cells with similar transcriptome are closeby.

Let’s use the Rtsne package to calculate a t-SNE plot for our expression data. (The t-SNE calculation takes a few minutes. If you don’t like to wait, just load the citeseq_data.rda file, which contains the result.)

# install.packages( "Rtsne" )   # uncomment if needed

library( Rtsne )

# Normalize the counts by dividing by the size factors and take the square root
# (The square root stabilizes the variance more robustly than the logarithm;
# please see our write-up on this [under preparation] if you are interested.)
expr <- t( apply( countMatrix, 1, function( x ) sqrt( x / sf ) ) )

# Get the names of the 1000 genes with the highest variance-to-mean ratios
varGenes <- names( tail( sort( vars/means ), 1000 ) )

# Calculate the t-SNE embedding. This may take a while.
tsne <- Rtsne( t( expr[varGenes, ] ), verbose = TRUE )
## Performing PCA
## Read the 8005 x 50 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 3.03 seconds (sparsity = 0.017480)!
## Learning embedding...
## Iteration 50: error is 92.687413 (50 iterations in 1.89 seconds)
## Iteration 100: error is 81.436077 (50 iterations in 1.77 seconds)
## Iteration 150: error is 79.811144 (50 iterations in 1.51 seconds)
## Iteration 200: error is 79.370756 (50 iterations in 1.52 seconds)
## Iteration 250: error is 79.145878 (50 iterations in 1.49 seconds)
## Iteration 300: error is 3.120874 (50 iterations in 1.32 seconds)
## Iteration 350: error is 2.865478 (50 iterations in 1.27 seconds)
## Iteration 400: error is 2.721587 (50 iterations in 1.22 seconds)
## Iteration 450: error is 2.631423 (50 iterations in 1.19 seconds)
## Iteration 500: error is 2.568666 (50 iterations in 1.19 seconds)
## Iteration 550: error is 2.523436 (50 iterations in 1.21 seconds)
## Iteration 600: error is 2.489009 (50 iterations in 1.20 seconds)
## Iteration 650: error is 2.463170 (50 iterations in 1.19 seconds)
## Iteration 700: error is 2.443486 (50 iterations in 1.19 seconds)
## Iteration 750: error is 2.429011 (50 iterations in 1.21 seconds)
## Iteration 800: error is 2.418269 (50 iterations in 1.25 seconds)
## Iteration 850: error is 2.411005 (50 iterations in 1.25 seconds)
## Iteration 900: error is 2.405193 (50 iterations in 1.25 seconds)
## Iteration 950: error is 2.400266 (50 iterations in 1.23 seconds)
## Iteration 1000: error is 2.395971 (50 iterations in 1.26 seconds)
## Fitting performed in 26.61 seconds.
#save all the necessary data
save( countMatrix, sf, means, vars, tsne, expr, varGenes, file = "citeseq_data.rda" )

The embedding is in the Y field of the object returned by Rtsne. We display it with a third scatter plot:

lc_scatter(
  dat( 
    x = tsne$Y[,1], 
    y = tsne$Y[,2],
    label = colnames(countMatrix),  
    colourValue = expr[ gene, ],
    palette = RColorBrewer::brewer.pal( 9, "YlOrRd" ),
    colourLegendTitle = gene,
    size = 1 ),
  "B1"
)
## Chart 'B1' added.
## Layer 'Layer1' is added to chart 'B1'.

This plot is now placed a position “B1”, i.e., under A1. We use the two columns of the 2xN matrix in tsne$Y for x and y. Each data point represents a cell, and we label it with the cell tag and use the points’ colours to indicate the expression strength of the currently selected gene. To this end, we read the corresponding row from the expr matrix that we have also used to calculate the t-SNE plot.

For the colour palette, we set the parameter palette. We have to provide a vector of colours, and LinkedCharts will interpolate between these. Here, we have used the “YlOrRd” (for “Yellow-Orange-Red”) palette provided by the RColorBrewer package. We also use the name of the selected gene as a title for the automatically generated legend.

Note how it easy it is again to link this chart with the existing ones. If the use clicks on a point on chart A1 to select a gene, the gene variable will be changed and this will affect what data is shown now in both the charts A2 and B1. The only thing that is missing is that we need to change the event handler in A1 to also update B2. So, we simply add this to the updateCharts call from above:

lc_scatter(
  dat( 
    x = means, 
    y = vars / means,
    logScaleX = 10, 
    logScaleY = 10,
    size = 2.5,
    on_click = function( i ) {
       gene <<- rownames(countMatrix)[ i ]
       updateCharts( c( "A2", "B1" ) )
       #            ^^     ########   <- the only change to above
    }
  ), 
  "A1"
)
## Layer 'Layer1' is added to chart 'A1'.

Have a look below for the final code, if you are unsure how this fits in.

Play a bit with the app. Now you can easily find out for any gene with high variability whether it is expressed throughout, or whether is is specific to one of the clusters visible in the t-SNE embedding. This will help us to conveniently explore and understand the t-SNE plot.

Analysing marked cells

Mark one of the cell clusters in the t-SNE plot by drawing a rectangle with the mouse while holding the Shift key. (Drawing without Shift causes zooming in rather than selecting). You can unselect all cells by holding Shift and double-clicking. (Double-clicking without Shift undoes any zooms and returns to the original axis limits.)

After marking cells in the browser, you can go right back to your R session and inquire about them using the getMarked function.

I have just marked these 147 cells:

marked <- getMarked( "B1" )
> marked
  [1]    8   10   11   14   15   16   17   18   25   26   27   29   30   31
 [15]   33   35   40   42   43   44   46   48   52   54   57   63   68   73
...
[127]  671  700  733  746  774  792  793  814  824  879  884  907  927  967
[141]  975 1034 1465 1911 5803 7975 7996

They are all from the same small cluster at the periphery of the t-SNE plot:

plot( tsne$Y, asp=1, col=1+(1:nrow(tsne$Y)%in%marked), pch=".", 
   cex=2, xaxt="n", yaxt="n", xlab="", ylab="", bty="n" )

plot of chunk tsne_cluster

What genes might have particularly high expression in this cluster? To see, let’s calculate, for each gene, mean and standard deviation first over all the marked cells and then over all the remaining cells. To save time, let’s do it only for the informative genes that we have selected above and stored in the variable varGenes.

df <- data.frame(
  meanMarked   =  apply( expr[ varGenes,  marked ], 1, mean ),
  sdMarked     =  apply( expr[ varGenes,  marked ], 1, sd ),
  meanUnmarked =  apply( expr[ varGenes, -marked ], 1, mean ),
  sdUnmarked   =  apply( expr[ varGenes, -marked ], 1, sd )
)

Now, we can calculate a separation score for each gene, which tells us how well the distribution of this gene’s expression values in the marked cells is separated from the distribution in the unmarked cells. It’s a bit like a z-score: We take the difference of the means between marked and unmarked, and divide by the standard deviations (which we simply add up for both groups). Notably, if the standard deviations are extremely small they blow up the score too much; we circumvent this by setting a minimum value of 0.002. Not a very sophisticated approach, but it works well enough for this demonstration.

df$sepScore <- ( df$meanMarked - df$meanUnmarked ) / pmax( df$sdMarked + df$sdUnmarked, 0.002 )

Now we have this

head( df )
##             meanMarked   sdMarked meanUnmarked sdUnmarked    sepScore
## DGCR5      0.000000000 0.00000000 0.0002171349 0.01362520 -0.01593628
## AC091814.2 0.002994731 0.03630918 0.0004960033 0.02021686  0.04420489
## MIR663A    0.000000000 0.00000000 0.0003498924 0.01637477 -0.02136778
## HP         0.000000000 0.00000000 0.0053221283 0.06398697 -0.08317518
## PRKCQ-AS1  0.148005784 0.21193058 0.1709553286 0.33820823 -0.04171592
## TALDO1     0.677642630 0.27947742 0.3403140015 0.43947661  0.46919360

Here are the top genes, i.e., those with the highest score

head( df[ order( df$sepScore, decreasing=TRUE ), ] )
##              meanMarked  sdMarked meanUnmarked sdUnmarked sepScore
## PRSS57        0.7250667 0.3690533 0.0126227363 0.10363671 1.507212
## SPINK2        0.8310036 0.4939622 0.0131388229 0.09469020 1.389385
## C1QTNF4       0.6207673 0.4672307 0.0009496667 0.02462847 1.260152
## STMN1         1.0419737 0.3661231 0.1571042083 0.35102983 1.233864
## C17orf76-AS1  2.0924204 0.3084434 1.0117939060 0.59899410 1.190855
## KIAA0125      0.6097646 0.4287839 0.0095667244 0.08837219 1.160574

It would be nice to have this table appear in our app. There is still space at the bottom right-hand corner (position B2). We can use the hwrite function from the simple but very useful hwriter package to transform the table into HTML code

html_table <- hwriter::hwrite( head( df[ order( df$sepScore, decreasing=TRUE ), ] ) )
html_table
## [1] "<table border=\"1\">\n<tr>\n<td></td><td>meanMarked</td><td>sdMarked</td><td>meanUnmarked</td><td>sdUnmarked</td><td>sepScore</td></tr>\n<tr>\n<td>PRSS57</td><td>0.725066676278626</td><td>0.369053339059246</td><td>0.0126227362973354</td><td>0.103636705089765</td><td>1.5072116470401</td></tr>\n<tr>\n<td>SPINK2</td><td>0.831003575054457</td><td>0.493962176122786</td><td>0.013138822891984</td><td>0.0946902014959639</td><td>1.38938494646186</td></tr>\n<tr>\n<td>C1QTNF4</td><td>0.620767250693636</td><td>0.467230715177914</td><td>0.000949666672459759</td><td>0.024628474625193</td><td>1.26015249256457</td></tr>\n<tr>\n<td>STMN1</td><td>1.04197366701963</td><td>0.366123080494962</td><td>0.157104208312422</td><td>0.351029833213004</td><td>1.23386441272627</td></tr>\n<tr>\n<td>C17orf76-AS1</td><td>2.09242035687478</td><td>0.308443407217987</td><td>1.01179390598515</td><td>0.598994096952368</td><td>1.19085495797048</td></tr>\n<tr>\n<td>KIAA0125</td><td>0.609764572553002</td><td>0.42878392946118</td><td>0.00956672439381324</td><td>0.0883721856290961</td><td>1.16057382025622</td></tr>\n</table>\n"

This cryptic line will become nice and readable once we put it into our web app to let the web browser interpret the HTML code:

lc_html( content = html_table, place = "B2" )
## Chart 'B2' added.

The moment your R executes this line, the table appears on the web page.

lc_html is another simple type of charts that puts its content parameter on the web page. It behaves as any other chart in R/LinkedCharts. It also can be updated using the updateCharts function and its parameters can be wrapped in the dat function. This code is equivalent to the lc_html call above

lc_html(dat(
    content = html_table,), 
  "B2"
)

Note how you can also put parameters outside of the dat function if you don’t want them to be reevaluated in the future.

How can we make this automatic, so that all this happens whenever we mark a group of cells? Simple: We just have to write an event handler again, only this time we use on_marked rather than on_click.

Let’s add such an event handler to the t-SNE plot:

lc_scatter(
  dat(
    x = tsne$Y[,1],
    y = tsne$Y[,2],
    label = colnames(countMatrix), 
    colourValue = sqrt( countMatrix[ gene, ] / sf ),
    palette = RColorBrewer::brewer.pal( 9, "YlOrRd" ),
    size = 1,
    colourLegendTitle = gene,
    on_marked = function() {
      html_table <<- getHighGenes()
      updateCharts("B2")
    }),
  "B1"
)
## Layer 'Layer1' is added to chart 'B1'.

And the function showHighGenes contains the analysis just described:

getHighGenes <- function(){
  
  marked <- getMarked("B1")
  
  # If no genes are marked, clear output, do nothing else
  if( length(marked) == 0 ) {
    return( "" )
  }
  
  df <- data.frame(
    meanMarked   =  apply( expr[ varGenes,  marked ], 1, mean ),
    sdMarked     =  apply( expr[ varGenes,  marked ], 1, sd ),
    meanUnmarked =  apply( expr[ varGenes, -marked ], 1, mean ),
    sdUnmarked   =  apply( expr[ varGenes, -marked ], 1, sd )
  )
  df$sepScore <- ( df$meanMarked - df$meanUnmarked ) / pmax( df$sdMarked + df$sdUnmarked, 0.002 )
  
  # round to two decimal places
  df <- round(df * 100)/100
  
  head( df[ order( df$sepScore, decreasing=TRUE ), ], 15 )
}

Everything put together

In case you got lost with the individual pieces of code, here is everything put together.

What else is there?

R/LinkedCharts has many more possibilities than this simple example could demonstrate. Here are a few:

Some of the ideas behind R/LinkedCharts

A minimalistic model-view-controller pattern

If you have experience with GUI programming, you might be familiar with the model-view-controller (MVC) design pattern. What we have just built is a most minimalistic version of this.

LinkedCharts and jrc

The work horse under the hood of R/LinkedCharts is our LinkedCharts library, which is written in JavaScript, not in R. It can also used directly, in order to write stand-alone client-side data exploration apps, and offers all the functionality of R/LinkedCharts and some more. R/LinkedCharts uses our jrc library [documentation/tutorial to be written] to establish a bidirectional WebSocket communication between the R session on one end and the web browser session with its JavaScript interpreter on the other end. The WebSocket link is used to send data and commands (e.g., to add charts to the page) from R to JavaScript and to send event notifications about user actions (such as a mouse click) from JavaScript to R.