What is it?
When working with single-cell data, e.g., from single-cell RNA-Seq, we need an intuitive visual representation of our data. To do so, dimensionality reduction approaches are useful, and dozens of these have been developed by now: PCA, Diffusion Map, t-SNE, UMAP and many others.
The aim of these dimension-reduced embeddings is simple: each cell is represented by a point on a two-dimensional plot, and the points are arranged such that similar cells appear close to each other and dissimilar cells farther apart.
The same ideas are useful if, instead of many cells, we have many samples that have been assessed with some bulk omics method (e.g., ordinary RNA-Seq), and we want to see which samples are similar.
What do we mean by similar and different? We think of each cell (or sample) as a point in “feature space” – an imagined high-dimensional space whose axes are all the features (i.e., gene expressions, protein abundances, drug responses or any other measured values) and the cell’s measured values for all the features are its coordinates. Since we usually work with tens, hundreds or thousands of features, this is not very intuitive. Have you ever tried to imagine 1000-dimensional space?
However, we can imagine measuring distances in this high-dimensional space. To get the Euclidean distance between two cells, for example, we simply take for each feature the difference between the two cells, add up the squares of all these differences and take the square root – exactly as one does in ordinary 3D space using Pythagoras’ theorem.
A dimension-reduced embedding is now an arrangement of points representing the cells in fewer dimensions – only 2, if we want to show it on a two-dimensional computer screen – that, to some extent, preserves these distances.
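The distance calculation described above can be written out in a few lines of R. This is a minimal sketch on a made-up toy matrix; the cell names, the values and the helper `euclid` are assumptions for illustration only.

```r
# Toy feature matrix: 3 cells (rows) x 4 features (columns); values are made up.
features <- rbind(
  cellA = c(1, 0, 2, 5),
  cellB = c(1, 3, 2, 1),
  cellC = c(0, 0, 2, 5)
)

# Euclidean distance between two cells, written out step by step.
# (euclid is a hypothetical helper, defined here only for illustration.)
euclid <- function(x, y) {
  diffs <- x - y        # per-feature differences
  sqrt(sum(diffs^2))    # square, add up, take the square root
}

d_ab <- euclid(features["cellA", ], features["cellB", ])  # sqrt(0 + 9 + 0 + 16) = 5

# Base R's dist() computes the same quantity for all pairs of rows at once.
d_all <- as.matrix(dist(features))
```

For real data, `dist()` is the more convenient route, since it returns all pairwise distances in one call.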
This can never work perfectly – we will always introduce distortions when reducing dimensions.
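To get a feeling for such distortions, one can compare feature-space distances with distances in a 2D embedding. The sketch below uses purely synthetic data and, as a stand-in embedding, the first two principal components; this is an assumption made for simplicity. A linear projection like this can only shrink distances, whereas non-linear methods such as t-SNE or UMAP may also stretch them.

```r
set.seed(1)
# Simulated data: 100 "cells" with 20 features each (purely synthetic).
x <- matrix(rnorm(100 * 20), nrow = 100)

# Stand-in embedding: the first two principal components.
emb <- prcomp(x)$x[, 1:2]

d_full <- as.matrix(dist(x))    # feature-space distances
d_emb  <- as.matrix(dist(emb))  # distances on the 2D plot

# An orthogonal projection never increases a distance ...
never_longer <- all(d_emb <= d_full + 1e-8)

# ... but how much each distance shrinks differs from pair to pair.
ratio <- d_emb[upper.tri(d_emb)] / d_full[upper.tri(d_full)]
range(ratio)
```

A wide range of ratios means that some pairs of cells keep most of their distance on the plot while others are drawn far closer together than they really are.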
So, can you be sure that the visualisation you get by using t-SNE, UMAP, MDS or the like really gives you a faithful representation of your data? Are the points that lie almost on top of each other really all similar? Does a large distance on your 2D representation always mean lots of dissimilarities? Our sleepwalk package for the R statistical programming environment can help you answer these questions.
Explore an embedding
Below is an example of a t-SNE visualisation that you can explore with sleepwalk.
This is single-cell transcriptomics data from the “CiteSeq” paper (Stoeckius et al., Simultaneous epitope and transcriptome measurement in single cells, Nature Methods, 14:865, 2017). Each point is a cell from a human cord blood sample. A two-dimensional t-SNE embedding was obtained by running a Seurat workflow. Gene expression values were normalised and scaled. PCA was performed and only the first 13 principal components were kept to reduce the noise level. After that, t-SNE was used to further reduce the number of dimensions and get a 2D visualisation.
To the left you can see cell types assigned to each cluster in the Seurat vignette.
Move your mouse over the points. At any moment the colours will show you the real (i.e., based on the original data, before the dimension reduction) distances of all the cells to the one cell right under your mouse cursor. (By default, Euclidean feature-space distances are used as the “real” distances.) Thus, simply by moving the mouse you can explore the structure of the data and see whether the embedding gives a faithful representation and where it is distorted.
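The colouring logic described above can be sketched in base R. This illustrates the idea only and is not the package's actual implementation: the data are synthetic, and the index `focus`, the limit `maxdist` and the palette `pal` are assumptions made for this example.

```r
set.seed(42)
features <- matrix(rnorm(50 * 10), nrow = 50)  # synthetic: 50 cells x 10 features
focus    <- 7   # index of the cell "under the mouse cursor" (arbitrary choice)
maxdist  <- 4   # upper limit of the colour scale (arbitrary choice)

# Feature-space distance from the focus cell to every cell.
d_focus <- sqrt(rowSums(sweep(features, 2, features[focus, ])^2))

# Clamp at maxdist and rescale to [0, 1]: 0 = identical, 1 = at or beyond the limit.
scaled <- pmin(d_focus, maxdist) / maxdist

# Map to a dark-to-light palette (hypothetical; not the package's actual colours).
pal  <- colorRampPalette(c("black", "darkgreen", "yellow", "white"))(100)
cols <- pal[1 + floor(scaled * 99)]
```

Recomputing `d_focus` and `cols` for a new `focus` index corresponds to moving the mouse to another cell.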
For instance, we can immediately see that the cluster of T cells (red and green), which is the largest on the plot, is in fact among the densest ones. If you put the mouse over one of the T cells, almost the entire cluster immediately lights up in black or dark green. Compare that, for example, with the seemingly small and compact cluster of B cells (intense green), or, even more so, with megakaryocytes (lilac, close to the T cells) or erythrocytes (pink, upper-right corner). These small clusters contain cells that are further away from each other than some entire clusters are. To see any similarity between megakaryocytes you need to increase the limit of the colour scale (press the “+” button).
We can also see the reason why t-SNE failed to separate CD4+ (red) and CD8+ T cells (green). The differences between the two are so subtle that we can hardly see them. Moreover, there is no distinct border between them.
Try to move the mouse to the very right tip of the CD8+ T cell cluster to see a continuous change from CD8+ T cells to CD4+ T cells (from right to left). Yet we know that CD4+ T cells do not mature to become CD8+ T cells, or vice versa. So here sleepwalk indicates the possibility of some technical artifact that was picked up by t-SNE. It is possible that isolating it and regressing it out can help to distinguish the cell types.
Notice how, when you explore the T cell cluster, the upper tip of the B cell cluster lights up. After moving the mouse there, we may notice that these cells, even though they are clustered as B cells, resemble T cells as much as they resemble other cells in their own cluster. This may be an indication of doublets and can be worth further investigation.
If you had a running R session instead of exploring this web page, you would be able to select some cells with your mouse. For this you need to press the left mouse button and enclose the points in a selection contour. When you finish the selection, a variable with the indices of all selected points will appear in your R session and you can go on with further analysis.
You can also notice that some of the cells at the edges of the clusters are drastically different from everything else around them. They likely are some sort of outliers.
Compare two embeddings
sleepwalk can also help you compare two different embeddings. For example, let’s look at the same t-SNE plot as before and put next to it a visualisation of the same data obtained with another dimensionality reduction approach. This time, we have used UMAP.
The main idea is the same: you move the mouse over an embedding and the colour shows you the real distance from the current point to all others. But now you see the distances from the same point on the other embedding as well. This allows you to immediately identify the same clusters in both visualisations and gives you an easy way to compare two (or more, if you want) dimensionality reduction techniques applied to your data.
Compare two sets of samples
You can also use sleepwalk to compare not only different embeddings for the same data, but also different samples. As example data we use here glioma samples from two patients (out of six presented in the paper by Filbin et al., Developmental and oncogenic programs in H3K27M gliomas dissected by single-cell RNA-seq, Science, 360:6386, 2018).
Again, single-cell transcriptomics has been used to study each sample, and again we have used t-SNE to visualise them.
Here, our goal is to figure out whether there are corresponding groups of cells in the two samples and what those groups are. sleepwalk can help here, too. The colour now shows the distance from the cell under the mouse cursor to all other cells in all the embeddings, allowing us to find the most similar cells not only in the current sample but also in the other ones. We can immediately see that the two largest clusters not only correspond to each other in both samples, but also are arranged in a similar manner: move the mouse along the large cluster in one of the plots to see it.
The small cluster at the bottom of the right-hand plot probably corresponds to a group of clusters on the other plot. We also can clearly see that each sample has a population of cells not present in the other one.
To install sleepwalk, start R and type
install.packages( "devtools" )
devtools::install_github( "anders-biostat/JsRCom" )
devtools::install_github( "anders-biostat/sleepwalk" )
The sleepwalk package is easy to use. It has only one function, also called sleepwalk.
As its first argument (embeddings), sleepwalk expects an embedding, i.e., an n x 2 matrix of 2D coordinates for the n points. To get multiple plots (as in the second and third examples above), pass a list with several such matrices, one for each of the plots.
As the second argument (featureMatrices), pass the feature matrix: this is the n x m matrix with one row for each of the n points (e.g., cells) and one column for each of the m features (e.g., genes) that should be used to calculate the actual distances. Sleepwalk determines the colour of each point from the Euclidean distance between the row of the point to be coloured and the row of the point under the mouse cursor. For multiple plots, pass a list of several such matrices, one for each plot.
If m is large, it can be a good idea to pass only the first 30 or so principal components of the feature matrix. Use the prcomp_irlba function from the irlba package to calculate them efficiently.
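As a rough sketch of this reduction step, the code below computes 30 principal components of a synthetic feature matrix. The data and the variable names are made up for illustration, and the fallback to base R's `prcomp` when irlba is not installed is an addition for this example, not something the text above prescribes.

```r
set.seed(1)
# Synthetic feature matrix: 300 cells x 200 features.
x <- matrix(rnorm(300 * 200), nrow = 300)
n_pcs <- 30

if (requireNamespace("irlba", quietly = TRUE)) {
  # Truncated PCA: only the leading components are computed.
  pcs <- irlba::prcomp_irlba(x, n = n_pcs)$x
} else {
  # Fallback with base R (full decomposition, truncated to n_pcs components).
  pcs <- prcomp(x, rank. = n_pcs)$x
}

dim(pcs)  # 300 rows (cells), 30 columns (principal components)
```

The resulting `pcs` matrix still has one row per cell, so it can take the place of the full feature matrix.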
Instead of passing the n x m feature matrix, you can use the named argument distances and pass a square n x n distance matrix, giving the distances between all pairs of points that should be used to determine the point colours. Use this if you think that simple Euclidean distances are not appropriate for your data and you have something more suitable.
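As one example of an alternative to Euclidean distance, the sketch below builds a correlation-based dissimilarity matrix on synthetic data. The choice of dissimilarity, the stand-in PCA embedding and the colour-scale limit of 1 are all assumptions for illustration. The app is only opened in an interactive session with sleepwalk installed.

```r
set.seed(1)
x   <- matrix(rnorm(100 * 20), nrow = 100)  # synthetic: 100 cells x 20 features
emb <- prcomp(x)$x[, 1:2]                   # stand-in 2D embedding

# Correlation dissimilarity between cells:
# 0 for perfectly correlated profiles, up to 2 for perfectly anti-correlated ones.
distMatrix <- 1 - cor(t(x))

if (interactive() && requireNamespace("sleepwalk", quietly = TRUE)) {
  sleepwalk::sleepwalk(emb, distances = distMatrix, maxdists = 1)
}
```

Any other square matrix of pairwise dissimilarities can be passed in the same way, as long as its rows and columns follow the order of the rows of the embedding.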
You also have to specify maxdists: the maximum value for the colour scale. You can adjust this afterwards with the “+” and “-” buttons next to the colour scale. For multiple plots, you can pass a vector, with one value for each plot.
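Putting the three arguments together, a single-plot call might look like the sketch below. The data are synthetic, a PCA projection stands in for a real embedding, and using the median pairwise distance as the starting colour-scale limit is merely a heuristic assumed here. The app is only opened in an interactive session with sleepwalk installed.

```r
set.seed(1)
# Synthetic stand-ins: 200 "cells" x 50 features, and a 2D embedding from PCA.
featureMatrix <- matrix(rnorm(200 * 50), nrow = 200)
embedding     <- prcomp(featureMatrix)$x[, 1:2]

# One possible starting value for the colour-scale limit (assumed heuristic).
maxdist <- median(as.vector(dist(featureMatrix)))

if (interactive() && requireNamespace("sleepwalk", quietly = TRUE)) {
  sleepwalk::sleepwalk(embedding, featureMatrix, maxdists = maxdist)
}
```

If the plot looks uniformly dark or uniformly light, the limit can be adjusted in the app with the “+” and “-” buttons.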
If you ask for more than one plot, you can specify, with the parameter same, whether the plots share the same objects (same="objects"), as above in the second example with two embeddings compared, or the same features (same="features"), as in the third example, where we compared two samples. The default is same="objects". Note that for same="objects", all the matrices in the first and second argument must have the same number of rows. For same="features", the feature matrices in the second argument must have the same number of columns.
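The shape requirements for the two modes can be checked before calling sleepwalk. The helper `check_shapes` below is hypothetical (it is not part of the package) and assumes that both arguments are given as lists of matrices. The two "samples" are synthetic, and the colour-scale limits are arbitrary.

```r
# Hypothetical helper (not part of sleepwalk): check matrix shapes before plotting.
check_shapes <- function(embeddings, featureMatrices,
                         same = c("objects", "features")) {
  same <- match.arg(same)
  if (same == "objects") {
    # Same cells in every plot: all matrices need the same number of rows.
    n <- c(sapply(embeddings, nrow), sapply(featureMatrices, nrow))
    length(unique(n)) == 1
  } else {
    # Different cells, same features: column counts must agree, and each
    # embedding needs one row per row of its own feature matrix.
    length(unique(sapply(featureMatrices, ncol))) == 1 &&
      all(sapply(embeddings, nrow) == sapply(featureMatrices, nrow))
  }
}

set.seed(1)
featA <- matrix(rnorm(80 * 10), nrow = 80)    # sample A: 80 cells, 10 features
featB <- matrix(rnorm(120 * 10), nrow = 120)  # sample B: 120 cells, same 10 features
embA  <- prcomp(featA)$x[, 1:2]               # stand-in embeddings
embB  <- prcomp(featB)$x[, 1:2]

ok_features <- check_shapes(list(embA, embB), list(featA, featB), same = "features")  # TRUE
ok_objects  <- check_shapes(list(embA, embB), list(featA, featB), same = "objects")   # FALSE

if (ok_features && interactive() && requireNamespace("sleepwalk", quietly = TRUE)) {
  sleepwalk::sleepwalk(list(embA, embB), list(featA, featB),
                       maxdists = c(4, 4), same = "features")
}
```

Here the two samples contain different numbers of cells, so only same="features" is applicable.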
The optional parameter pointsize allows you to change the size of the points.
Finally, instead of opening the app in a browser, you can ask to save everything to an HTML file, by passing a file name in the optional saveToFile parameter.
Sleepwalk is being developed by Svetlana Ovchinnikova and Simon Anders at the Center for Molecular Biology of the University of Heidelberg. Please contact us for questions or feedback, or file an issue on GitHub if you find a bug.