Abstract
Single cell RNA-sequencing technology (scRNA-seq) provides a new avenue to discover and characterize cell types, but the experiment-specific technical biases and analytic variability inherent to current pipelines may undermine the replicability of these studies. Meta-analysis of rapidly accumulating data is further hampered by the use of ad hoc naming conventions. Here we demonstrate our replication framework, MetaNeighbor, that allows researchers to quantify the degree to which cell types replicate across datasets, and to rapidly identify clusters with high similarity for further testing. We first measure the replicability of neuronal identity by comparing more than 13 thousand individual scRNA-seq transcriptomes, then assess cross-dataset evidence for novel pyramidal neuron and cortical interneuron subtypes identified by scRNA-seq. We find that 24/45 cortical interneuron subtypes and 10/48 pyramidal neuron subtypes have evidence of replication in at least one other study. Identifying these putative replicates allows us to re-analyze the data for differential expression and provide lists of robust candidate marker genes. Across tasks we find that large sets of variably expressed genes can identify replicable cell types and subtypes with high accuracy, indicating many of the transcriptional changes characterizing cell identity are pervasive and easily detected.
Introduction
Single cell RNA-sequencing (scRNA-seq) has emerged as an important new technology enabling the dissection of heterogeneous biological systems into ever more refined cellular components. One popular application of the technology has been to try to define novel cell subtypes within a given tissue or within an already refined cell class, as in the lung (Treutlein et al., 2014), pancreas (Baron et al., 2016; Muraro et al., 2016; Segerstolpe et al., 2016; Wang et al., 2016), retina (Macosko et al., 2015; Shekhar et al., 2016), or others (Grun et al., 2015; Klein et al., 2015; Min et al., 2015). Because they aim to discover completely new cell subtypes, the majority of this work relies on unsupervised clustering, with most studies using customized pipelines with many unconstrained parameters, particularly in their inclusion criteria and statistical models (Grun et al., 2015; Habib et al., 2016; Macosko et al., 2015; Zeisel et al., 2015). While there has been steady refinement of these techniques as the field has come to appreciate the biases inherent to current scRNA-seq methods, including prominent batch effects (Hicks et al., 2015), expression drop-outs (Lun et al., 2016; Pierson and Yau, 2015), and the complexities of normalization given differences in cell size or cell state (Buettner et al., 2015; Vallejos et al., 2015), the question remains: how well do novel transcriptomic cell subtypes replicate across studies?
In order to answer this, we turned to the issue of cell diversity in the brain, a prime target of scRNA-seq as neuron diversity is critical for construction of the intricate, exquisite circuits underlying brain function. The heterogeneity of brain tissue makes it particularly important that results be assessed for replicability, while its popularity as a target of study makes this goal particularly feasible. Because a primary aim of neuroscience has been to derive a taxonomy of cell types (Ascoli et al., 2008), already more than twenty single cell RNA-seq experiments have been performed using mouse nervous tissue (Poulin et al., 2016). Remarkable strides have been made to address fundamental questions about the diversity of cells in the nervous system, including efforts to describe the cellular composition of the cortex and hippocampus (Tasic et al., 2016; Zeisel et al., 2015), to exhaustively discover the subtypes of bipolar neurons in the retina (Shekhar et al., 2016), and to characterize similarities between human and mouse midbrain development (La Manno et al., 2016). In spite of this wealth of data, there have been few attempts to compare, validate and substantiate cell type transcriptional profiles across scRNA-seq datasets, and no systematic or formal method has been developed for accomplishing this task.
To address this gap in the field, we propose a simple, supervised framework, MetaNeighbor (meta-analysis via neighbor voting), to assess how well cell type-specific transcriptional profiles replicate across datasets. Our basic rationale is that if a cell type has a biological identity rooted in the transcriptome then knowing its expression features in one dataset will allow us to find cells of the same type in another dataset. We make use of the cell type labels supplied by data providers, and assess the correspondence of cell types across datasets by taking the following approach (see schematic, Figure 1):
First we construct a kernel: we calculate correlations between all pairs of cells that we aim to compare across datasets based on the expression pattern of a set of genes. This generates a network where each cell is a node and the edges are the strength of the correlations between them.
Next, we do cross-dataset validation: we hide all cell type labels (‘identity’) for one dataset at a time. This dataset will be used as our test set. Cells from all other datasets remain labeled, and are used as the training set.
Finally, we predict the cell type labels of the test set: we use a neighbor voting algorithm to predict the identity of the held-out cells based on their similarity to the training data.
Conceptually, this resembles approaches for the validation of sample clustering (Dudoit et al., 2002; Kapp and Tibshirani, 2007) but it has been adapted to operate from within a supervised learning framework. This permits both systematic scoring and carefully defined control experiments to investigate the data features that drive high performance. Our implementation is extremely fast and robust to technical differences between experiments; because prediction is performed only within an individual dataset at a time, we are able to keep many aspects of technical variation constant. This essentially controls for any dataset specific effects that would otherwise swamp the subtler cell identity signal. The method provides a score that indicates the degree to which a cell type replicates for each gene set that is tested. This means that MetaNeighbor doubles as a low-tech ‘feature selection tool’ that we can use to identify the transcriptional features that are most discriminative between cell types. By comparing the scores returned from using Gene Ontology (GO) functions (“functional gene sets”) or sets of randomly chosen genes (“random gene sets”), we can determine whether co-expression of specific gene sets is characteristic of particular cell types, and thus important for cell function or identity.
We evaluate cell identity by taking sequential steps according to the basic taxonomy of brain cells: first classifying neurons vs. non-neuronal cells across eight single cell RNA-seq studies, then classifying cortical inhibitory neurons vs. excitatory neurons, and for our final step, we align interneuron and pyramidal cell subtypes across three studies. Critically, we discover that that almost any sufficiently large and highly variable set of genes can be used to distinguish between cell types, suggesting that cell identity is widely represented within the transcriptome. Furthermore, we find that cross-dataset analysis of pyramidal neurons results in broad definition of cortical vs. hippocampal types, and find evidence for the replication of five layer-restricted subtypes. In contrast, we find that cortical interneuron subtypes show clear lineage-specific structure, and we readily identify 11 subtypes that replicate across datasets, including Chandelier cells and five novel subtypes defined by transcriptional clustering in previous work. Meta-analysis of differential expression across these highly replicable cortical interneuron subtypes revealed evidence for canonical marker genes such as parvalbumin and somatostatin, as well as new candidates which may be used for improved molecular genetic targeting, and to understand the diverse phenotypes and functions of these cells.
Assessing neuronal identity with MetaNeighbor
We aimed to measure the replicability of cell identity across tasks of varying specificity. Broadly, these are divided into tasks where we are recapitulating known cell identities, and ones where are measuring the replicability of novel cell identities discovered in recent research. The former class of task is the focus of this subsection: first, by assessing how well we could distinguish neurons from non-neuronal cells (“task one”), and next assessing the discriminability of excitatory and inhibitory neurons (“task two”). As detailed in the methods, MetaNeighbor outputs a performance score for each gene set and task. This score is the mean area under the receiver operator characteristic curve (AUROC) across all folds of cross-dataset validation, and it can be interpreted as the probability that we will rank a positive higher than a negative (e.g. neuron vs. non-neuronal cell, when using neurons as the positive label set) based on the expression of a set of genes. This varies between 0 and 1, with 1 being perfect classification, 0.5 meaning that we have performed as well as if we had randomly guessed the cell’s identity, and 0.9 or above being extremely high. Comparison of scores across gene sets allows us to discover their relative importance for defining cell identity.
As described above, in task one we assessed how well we could identify neurons and non-neuronal cells across eight datasets with a total of 13928 cells (Table S1). Although this was designed to be fairly simple, we were surprised to find that AUROC scores were significantly higher than chance for all gene sets tested, including all randomly chosen sets (AUROCall sets=0.78 ± 0.1, Figure 2A). Reassuringly, a bootstrapped sampling of the datasets showed a trend toward increased performance with the inclusion of additional training data, indicating that we are recognizing an aggregate signal across datasets (Figure S1). However, the significant improvement of random sets over the null means that prior knowledge about gene function is not required to differentiate between these cell classes. Randomly chosen sets of genes have decidedly non-random expression patterns that enable discrimination between cell types.
Task two aimed to assess how well we could discriminate between cortical excitatory and inhibitory neurons across four studies with a total of 2809 excitatory and 1162 inhibitory neurons (Dueck et al., 2015; Habib et al., 2016; Tasic et al., 2016; Zeisel et al., 2015). Similar to our previous results, we saw that AUROC scores were significantly higher than chance (AUROC=0.69 ± 0.1, Figure 2B), suggesting that transcriptional differences are likely to be encoded in a large number of genes.
Consistent with the view that a large fraction of transcripts are useful for determining cell identity, we found a positive dependency of AUROC scores on gene set size, regardless of whether genes within the sets were randomly selected or shared some biological function (Figure 2B). This was further supported by a comparison of scores for task one using 100 sets of either 100 or 800 randomly chosen genes. AUROC score distributions and means were significantly different, with sets of 100 genes having lower scores but higher variability in performance, whereas sets of 800 genes were more restricted in variance and gave higher performance on average (Figure 2C, AUROC100=0.80 ± 0.05, AUROC800=0.90 ± 0.03, p<2.2E-16, Wilcoxon rank sum test). The variability in performance observed while keeping set size constant suggests that even in random sets, there are transcriptional features that contribute to cell identity. We delved into this further by comparing AUROC scores across gene sets chosen based on their mean expression as we have previously shown that this is a critical factor to control for in evaluating single cell gene co-expression (Crow et al., 2016). We performed task one again using expression-level based gene sets and found a strong positive relationship between expression level and our ability to classify cells (Figure 2D, rs=0.9).
These results provide evidence that MetaNeighbor can readily identify cells of the same type across datasets, without relying on specific knowledge of marker genes. In these two examples, all cells could be classified as one of two types, making this a binary classification task. We find that a gene set’s size and mean expression level are the key features that allow for cell type discrimination in this setting.
Investigating cortical interneuron subtypes using MetaNeighbor
Cortical inhibitory interneurons have diverse characteristics based on their morphology, connectivity, electrophysiology and developmental origins, and it has been an ongoing goal to define cell subtypes based on these properties (Ascoli et al., 2008). In a related paper (Paul et al., submitted), we describe the transcriptional profiles of GABAergic interneuron types which were targeted using a combinatorial strategy including intersectional marker gene expression, cell lineage, laminar distribution and birth timing, and have been extensively phenotyped both electrophysiologically and morphologically (He et al., 2016). Previously, two studies were published in which new interneuron subtypes were defined based on scRNA-seq transcriptional profiles (Tasic et al., 2016; Zeisel et al., 2015). These found different numbers of subtypes (16 in one and 23 in the other), and the authors of the later paper compared their outcomes by looking at the expression of a handful of marker genes, which yielded mixed results: a small number of cell types seemed to have a direct match but for others the results were more conflicting, with multiple types matching to one another, and others having no match at all. Here we aimed to more quantitatively assess the similarity of their results, and compare them with our own data which derives from phenotypically characterized sub-populations; i.e., not from unsupervised expression clustering (see Table S2 for sample information).
MetaNeighbor relies on coordinated variation in expression level to detect cell identity, which means that genes with high variability are particularly useful. Our preceding binary classifications showed that genes with high mean expression were more likely to have variation that allowed MetaNeighbor to learn cell identities. In the following analyses, we are examining both rare and common cell types across datasets. In this case, the mean expression level of marker genes should be a proxy for cell incidence: we can expect that the marker expression for a more abundant type would have a higher mean expression. Since variance scales with expression, the most highly variable genes in the dataset would likely only be discriminative for the abundant type. Because we would like to be able to identify both abundant and rare cell types, we select the genes with the highest variance at each mean expression level.
We identified 638 genes with high variability given their expression levels (detailed in Methods) and these were used as a ‘high variability gene set’ to measure AUROC scores between each pair of cells across datasets. When AUROCs were measured using all genes, we saw that clustering was subject to strong lab-specific effects (Figure S2). In contrast, the use of variable genes reproduced the known subtype structure, with major branches for the three main subtypes, Pv, Sst and Htr3a.
To examine how the previously identified interneuron subtypes are represented across the three studies, we tested the similarity of each pair of subtypes both within and across datasets using the high variability gene set. For each genetically-targeted interneuron type profiled by Paul et al., we found at least one corresponding subtype from the other two studies, which were defined by having a mean AUROC score across training/testing folds >0.95 (Figure 3). This includes Chandelier cells, a subtype that could not be definitively identified by either Tasic or Zeisel. Using our reciprocal testing and training protocol we find that the Tasic_Pvalb Cpne5 subtype are likely to be Chandelier cells (AUROC=0.99). In addition, expanding our criteria to include all reciprocal best matches in addition to those with ID scores >0.95, we found correspondence among five subtypes that were assessed only in the Tasic and Zeisel data, Tasic_Smad3/Zeisel_Int14 (AUROC=0.97), Tasic_Sncg/Zeisel_Int6 (AUROC=0.95), Tasic_Ndnf-Car4/Zeisel_Int15 (AUROC=0.95), Tasic_Igtp/Zeisel_Int13 (AUROC=0.94) and Tasic_Ndnf-Cxcl14/Zeisel_Int12 (AUROC=0.91). Overall, based on this high-variance gene set, we could identify 11 subtypes representing 24/45 (53%) types (Figure 3A), with total n for each subtype ranging from 25-189 out of 1583 interneurons across all datasets (1.5-11%). These results were robust to differences in data processing. Tasic et al. provided data as both RPKM and TPM values, and while thousands of genes had extremely divergent expression between the two, including some key markers like Vip, reciprocal average AUROCs among corresponding subtypes were nearly identical (Figure S3). Our corresponding subtypes also confirm the marker gene analysis performed by Tasic et al. (Table S3), without requiring manual gene curation. Because we quantify the similarity among types we can prioritize matches, and use these as input to MetaNeighbor for further evaluation.
In the above, we identified overlaps using a single gene set. To assess cell identification more broadly, we ran MetaNeighbor with these new across-dataset subtype labels, measuring predictive validity across all gene sets in GO (Figure 3A, far right). The distribution of AUROC scores varied across subtypes but we found that the score from the high variability gene set was representative of overall trends, with high performing groups showing higher mean AUROC scores over many gene sets. As detailed in the previous section, we note that AUROC scores are sensitive both to the number of training samples (n) and to underlying data features (e.g., transcriptome complexity), which complicates direct comparison of ID score distributions. Both the high mean AUROCs across all putative replicate subtypes (>0.6), and the similarity of maximum performance suggest that distinctive gene co-expression can be observed in each subtype (max AUROC=0.92 ± 0.04). As with previous tasks, we found little difference in average AUROCs using functional gene sets compared to random sets (mean AUROCRandom=0.67 ± 0.06, mean AUROCGO=0.68 ± 0.1).
These results indicate that highly variable gene sets can be used alongside pairwise testing and training as a heuristic to identify replicable subtypes.
Investigating pyramidal neuron subtypes using MetaNeighbor
The heterogeneity of pyramidal neurons is undisputed, but the organizing principles are still debated, with some suggesting that identity is discrete and modular (Habib et al., 2016; Zeisel et al., 2015) and others purporting that identities are more likely to be described by expression gradients or spectra (Cembrowski et al., 2016). With MetaNeighbor we are able to quantitatively assess the degree to which pyramidal subtypes defined by scRNA-seq replicate across diverse datasets. If cell types are discrete and modular, we would expect to see sharp differences, with some types showing very strong similarity to one another, and strong dissimilarities to other types.
To compare pyramidal neuron scRNA-seq datasets we permuted through all combinations of subtypes as testing and training data based on a set of 743 genes with high variability given their expression level (subtypes listed in Table S2). This was the same procedure that was used for cortical interneurons and while there were similar numbers of subtypes in total, a smaller fraction corresponded across datasets (10/48, ~21%) yielding five putative subtypes (Figure 3B). The AUROC score heatmap was generally less modular than the heatmap of interneuron scores. The most prominent feature was that types from the hippocampus and cortex tended to cluster separately from one another. Within each region-specific cluster some layer- or area-specific clustering was observed but it was not completely consistent. Particular discrepancy was observed between the cortical layer 5 subtypes which showed more similar AUROC score profiles to the hippocampal subtypes than to other deep layer types (Tasic L5b_Cdh13, L5_Chrna6, L5b_Tph). Note that these were also the same subtypes that Tasic et al. found no match for in their marker gene analysis. We suggest that the inclusion of additional datasets may help to resolve this inconsistency.
We assessed the five putative subtypes using MetaNeighbor. All subtypes were significantly discernable compared to the null (Figure 3B) and as with the interneuron subtypes, AUROC scores from the high variability gene set were well correlated with mean performance across all of GO (3888 gene sets). In line with previous tasks, we found that functional gene sets performed equally to random gene sets (mean AUROCRandom=0.71 ± 0.08, AUROCGO=0.70 ± 0.09).
Comparing gene set performance across tasks
Finally, we compared gene set results from the 11 replicate interneuron subtypes and the 5 pyramidal neuron subtypes. In agreement with our previous results, we found that the top groups were all related to neuronal function, which is unsurprising given the large size of these gene sets and their likelihood of expression and variation in these cells (Figure 3C). AUROCs were highly correlated across tasks (r~0.76), with slightly higher performance for identifying interneuron types compared to pyramidal types (Figure 3D). The linearity of the trend across all scores suggests that fundamental data features, like mean expression level and set size, underlie the differential discriminative value of gene sets. The high performance across many sets (mean AUROC ~0.7) also supports the notion that cell identity is encoded promiscuously across the transcriptome, and is not restricted to a small set of functionally important genes.
Identifying subtype specific genes
ScRNA-seq experiments often seek to define marker genes for novel subtypes. Though ideally marker genes are perfectly discriminative with respect to all cells, in practice marker genes are often contextual and defined relative to a particular out-group. Here we aimed to identify possible marker genes that would allow discrimination among interneuron subtypes or pyramidal neuron subtypes. For each of our identified replicate subtypes we generated a ranked list of possible marker genes by performing one-tailed, non-parametric differential expression analysis within each study for all subtypes (e.g., Int1 vs. all other interneurons in the Zeisel study, Int2 vs. all interneurons, etc.) and combining p-values for replicated types using Fisher’s method (Table S4). Figure 4A shows the FDR adjusted p-values for the top candidates based on fold change for the ten replicated interneuron subtypes with overlapping differential expression patterns. Figure 4B shows the same for the two pyramidal neuron subtypes with overlapping differential expression patterns. The majority of these genes have previously been characterized as having some degree of subtype- or layer-specific expression, for example we readily identify genes that were used for the Cre-driver lines in the Tasic and Paul studies (Sst, Pvalb, Vip, Cck, Htr3a, Ctgf). Even though we filtered for genes with high fold changes, we see that many genes are differentially expressed in more than one subtype. Notably, considerable overlap can be observed among the Htr3a-expressing types. For example, the Vip Sncg subtype (Tasic_Vip Sncg/Paul_Vip Cck) is only subtly different from the Sncg subtype (Tasic_Sncg/Zeisel_Int6) across this subset of genes, with the Sncg cells lacking differential expression of Cxcl14 and Nr2f2.
We also identify some novel candidates, including Ptn, or pleiotrophin, which is significantly more expressed in the three Nos1-expressing subtypes than in the others (Figure 4B). It is thus expected to be discriminative of Nos1-positive neurons compared to other interneuron types. We validated Ptn expression with in situ hybridization and we show clear expression in neurons that are positive for both Sst and Nos1 (Figure 4C). Ptn is a growth factor, and we suggest that its expression may be required for maintaining the long-range axonal connections that characterize these cells. These cells are well described by current markers, however this approach is likely to be of particular value for novel subtypes that lack markers, allowing researchers to prioritize genes for follow-up by assessing robustness across multiple data sources.
Discussion
Single-cell transcriptomics promises to have a revolutionary impact by enabling comprehensive sampling of cellular heterogeneity; nowhere is this variability more profound than within the brain, making it a particular focus of both single-cell transcriptomics and our own analysis into its replicability. The substantial history of transcriptomic analysis and meta-analysis gives us guidance about bottlenecks that will be critical to consider in order to characterize cellular heterogeneity. The most prominent of these is laboratory-specific bias, likely deriving from the adherence to a strict set of internal standards, which may filter for some classes of biological signal (e.g., poly-A selection) or induce purely technical grouping (e.g., by sequencing depth). Because of this, it is imperative to be able to align data across studies and determine what is replicable. In this work, we have provided a formal means of determining replicable cell identity by treating it as a quantitative prediction task. The essential premise of our method is that if a cell type has a distinct transcriptional profile within a dataset, then an algorithm trained from that data set will correctly identify the same type within an independent data set.
The currently available data allowed us to draw a number of conclusions. We validated the discrete identity of eleven interneuron subtypes, and described replicate transcriptional profiles to prioritize possible marker genes, including Ptn, a growth factor that is preferentially expressed in Sst Chodl cells. We performed a similar assessment for pyramidal neurons but found less correspondence among datasets, suggesting that additional data will be required to determine whether there is evidence for discrete pyramidal neuron types. One major surprise of our analysis is the degree of replicability in the current data. Our AUROC scores are exceptionally high, particularly when considered in the context of the well-described technical confounds of single-cell data. We suspect this reflects the fundamental nature of the biological problem we are facing: discrete cell types can be identified by their transcriptional profiles, and the biological clarity of the problem overcomes technical variation.
This is further suggested by our result that cell identity has promiscuous effects within transcriptional data. While in-depth investigation of the most salient gene functions is required to characterize cell types, to simply identify cell types is relatively straightforward. This is necessarily a major factor in the apparent successes of unsupervised methods in determining novel cell types and suggests that cell type identity is clearly defined by transcriptional profiles, regardless of cell selection protocols, library preparation techniques or fine-tuning of clustering algorithms. To us this result recalls the startling finding by Venet et al. that “Most random gene expression signatures are related to breast cancer outcome” (Venet et al., 2011). Where, until that point, research had often focused on demonstrating that highly specific genes or gene clusters could predict breast cancer outcome, Venet et al. clearly demonstrated that this was a more straightforward task than targeted analyses would reveal, and was due to the strength of the underlying biological signal: more aggressive cancers divide more, and so anything correlated with fast cycling times will be associated with poor clinical outcomes. Comparison of transcriptional signatures between different cell types provides an equally clear lens. Many gene sets show more correlated expression within than across types, and variation across types is likely to be accounted for by simple important factors, like cell size. This is not to say that more detailed characterization of cell types is not necessary: understanding the differences between cells and how they work will require focused investigation into the precise molecular players that are differentially utilized. However, we hope that this helps to demonstrate that the variations on dimension reduction and clustering methods in single cell RNA-seq are ‘working’, inevitably by taking advantage of this very clear signal.
In this work we opted to use the subtype or cluster labels provided by the original authors, in essence to characterize both the underlying data as well as current analytic practices. However, this has limitations where studies cluster to different levels of specificity. For example, the Tasic paper defines multiple Parvalbumin subtypes but the Zeisel and Paul work do not. Our method makes it extremely easy to identify highly overlapping types at the levels defined by each author, facilitating downstream work to validate the sub-clusters through meta-analysis and at the bench. Given the known noisiness of single-cell expression and the complex and idiosyncratic character of approaches taken to assessing it, the degree of replicability that we see is much higher than could have been expected were there not simple explanations for the derived clusters from individual laboratories. Our work shows that with additional data, comprehensive evaluation and replication is likely to be quantitatively straightforward, making it possible to have high confidence in derived cell sub-types quite rapidly. As this additional data is generated, our approach can provide consistent updates of the field-wide consensus.
The simplicity of our method makes it unlikely to be biased toward the exact cell identity tasks assessed here. For example, because of the method’s reliance on relative ranks, it is almost entirely immune to normalization as a potential confound. On the one hand, this limits our sensitivity to detect real signals of some type, but this cost is more than offset by the robustness of signals identified. Its simplicity also means that it is scalable, and readily admits to the incorporation of data from individual labs in their ongoing work. Ultimately we hope that by defining what is replicable clearly, MetaNeighbor will allow future studies involving cell-cell comparisons to build on a strong foundation toward a comprehensive delineation of cell types.
Experimental Procedures
Animals, manual cell sorting and scRNA-seq
Mice were bred and cared for in accordance with animal husbandry protocols at Cold Spring Harbor Laboratory, with access to food and water ad libitum and a 12 hour light-dark cycle. Transgenic animals bred to target the six phenotypically characterized subpopulations were generated using the following breeding strategies (detailed in He et al): Nkx2.1-CreER, Pv-ires-Cre animals were bred separately to Ai14 reporter to label ChC and Pv cells in the cortex. ChCs were enriched in frontal cortex with tamoxifen induction at embryonic day 17.5. Intersectional labeling was achieved by breeding (a) Sst-Flp, Nos1-CreER, (b) Sst-Flp, CR-Cre, (c) Vip-Flp, CR-Cre and (d) Vip-Flp, Cck-Cre separately to the Ai65 intersectional reporter that labels cells with tdTomato only when both the lox-STOP-lox and frt-STOP-frt cassettes are excised. Adult animals (P28-35) were sacrificed by cervical dislocation to harvest brains for single cell sorting. Cell sorting and scRNA-seq were performed as described previously (Crow et al., 2016). Single cells were collected by manual sorting then placed into single tubes with 1μl total volume of RNAseOUT (Invitrogen), 1:400K diluted ERCC spike-in RNA, and sample-specific RT primers. Cells were flash frozen in liquid nitrogen then stored at -80°C until processed. RNA was linearly amplified using the MessageAmp-II kit (Life Technologies) according to the manufacturer’s recommended protocol. Reverse transcription of amplified aRNA was done with SuperScript-III (Invitrogen) and cDNA libraries were prepared with the Illumina TruSeq small RNA library preparation kit (7-11 cycles of PCR). Libraries were size-selected with SPRISelect magnetic beads (Agencourt) and sequenced with paired-end 101bp reads using an Illumina HiSeq. PolyA-primed reads were mapped to the mouse reference genome (mm9) with Bowtie (v 0.12.7), while paired sequences were used for varietal tag counting. A custom python script was used map and tally sequences with unique tags for each mRNA in each cell (Crow et al., 2016). All data is available to download from GEO (accession GSE92522).
Public expression data
Data analysis was performed in R using custom scripts (github.com/maggiecrow/MetaNeighbor, 2016). Processed expression data tables were downloaded from GEO directly, then subset to genes appearing on both Affymetrix GeneChip Mouse Gene 2.0 ST array (902119) and the UCSC known gene list to generate a merged matrix containing all samples from each experiment. The mean value was taken for all genes with more than one expression value assigned. Where no gene name match could be found, a value of 0 was input. We considered only samples that were explicitly labeled as single cells, and removed cells that expressed fewer than 1000 genes with expression >0. Cell type labels were manually curated using sample labels and metadata from GEO (see Tables S1 and S2). Merged data and metadata are linked through our Github page.
Gene sets
Gene annotations were obtained from the GO Consortium ‘goslim_generic’ (August 2015). These were filtered for terms appearing in the GO Consortium mouse annotations ‘gene_association.mgi.gz’ (December 2014) and for gene sets with between 20-1000 genes, leaving 106 GO groups with 9221 associated genes. Random gene sets were generated by randomly choosing genes with the same set size distribution as GO slim. Sets of high variance genes were generated by binning data from each dataset into deciles based on expression level, then making lists of the top 25% of the most variable genes for each decile, excluding the most highly expressed bin. The high variance set was then defined as the intersect of the high variance gene lists across the relevant datasets.
MetaNeighbor
All scripts, sample data and detailed directions to run MetaNeighbor in R can be found on our Github page (github.com/maggiecrow/MetaNeighbor, 2016).
The input to MetaNeighbor is a set of genes, a data matrix and two sets of labels: one set for labeling each experiment, and one set for labeling the cell types of interest. For each gene set, the method generates a cell-cell similarity network by measuring the Spearman correlation between all cells across the genes within the set, then ranking and standardizing the network so that all values lie between 0 and 1. The use of rank correlations means that the method is robust to any rank-preserving normalization (i.e., log2, TPM, RPKM). Ranking and standardizing the networks ensures that distributions remain uniform across gene sets, and diminishes the role outlier similarities can play since values are constrained.
The node degree of each cell is defined as the sum of the weights of all edges connected to it (i.e., the sum of the standardized correlation coefficients between each cell and all others), and this is used as the null predictor in the neighbor voting algorithm to standardize for a cell’s ‘hubness’: cells that are generically linked to many cells are preferentially down-weighted, whereas those with fewer connections are less penalized. For each cell type assessment, the neighbor voting predictor produces a weighted matrix of predicted labels by performing matrix multiplication between the network and the binary vector (0,1) indicating cell type membership, then dividing each element by the null predictor (i.e., node degree). In other words, each cell is given a score equal to the fraction of its neighbors, including itself, which are part of a given cell type (Ballouz et al., 2016). For cross-validation, we permute through all possible combinations of leave-one-dataset-out cross-validation, sequentially hiding each experiment’s cell labels in turn, and then reporting how well we can recover cells of the same type as the mean area under the receiver operator characteristic curve (AUROC) across all folds. A key difference from conventional cross-validation is that there is no labeled data within the dataset for which predictions are being made. Labeled data comes only from external datasets, ensuring predictions are driven by signals that are replicable across data sources. To improve speed, AUROCs are calculated analytically, where the AUROC for each cell type j, is calculated based on the sum of the ranks of the scores for each cell i, belonging to that cell type. This can be expressed as follows: where N is the number of true positives, and NNeg is the number of true negatives. Note that for experiments with only one cell type this cannot be computed as there are no true negatives. AUROCs are reported as averages across all folds of cross-validation for each gene set (excluding NAs from experiments with no negatives), and the distribution across gene sets is plotted.
To test the dependency of results on the amount of training and testing data we repeated the neuron vs. non-neuronal cell discrimination task after randomly selecting between two and seven datasets ten times each. This was done for 21 representative gene sets. Means for each gene set and each number of included datasets were plotted.
Identifying putative replicates
In cases where cell identity was undefined across datasets (i.e., cortical interneuron and pyramidal subtypes) we treated each subtype label as a positive for each other subtype, and assessed similarity over the high variance gene set described above. For example, Int1 from the Zeisel dataset was used as the positive (training) set, and all other subtypes were considered the test set in turn. Mean AUROCs from both testing and training folds are plotted in the heatmap in Figure 3. A stringent cut-off of mean AUROC >0.95 and/or mutual best matches across datasets identified putative replicated types for further assessment with our supervised framework (detailed above). While lowering this threshold could increase the number of subtypes with some match, we found that reciprocal top hits alone provided an upper bound on the number of replicated types (i.e., lowering the thresholds did not allow for a higher number of subtypes). New cell type labels encompassing these replicate types (e.g. a combined Sst-Chodl label containing Int1 (Zeisel), Sst Chodl (Tasic) and Sst Nos1 (Paul)) were generated for MetaNeighbor across random and GO sets, and for meta-analysis of differential expression. While only reciprocal top-hits across laboratories were used to define novel cell-types, conventional cross-validation within laboratories was performed to fill in AUROC scores across labels contained within each lab.
Differential expression
For each cell type within a dataset (defined by the authors’ original labeling), differential gene expression was calculated using a one-sided Wilcoxon rank-sum test, comparing gene expression within a given cell type to all other cells within the dataset (e.g., Zeisel_Int1 vs all other Zeisel interneurons). Meta-analytic p-values were calculated for each putative replicated type using Fisher’s method (Fisher, 1925) then a multiple hypothesis test correction was performed with the Benjamini-Hochberg method (Benjamini and Hochberg, 1995). Top differentially expressed genes were those with an adjusted meta-analytic p-value <0.001 and with log2 fold change >2 in each dataset. All differential expression data for putative replicated subtypes can be found in Table S4.
Author Contributions
JG conceived the study. JG, MC, and JH designed experiments. MC and JG wrote the manuscript. MC and SB performed computational experiments. AP performed cell sorting, and generated and parsed the raw sequencing data. JH supervised wet-lab data collection. All authors read and approved the final manuscript.
Acknowledgments
MC, SB and JG were supported by a gift from T. and V. Stanley. Z.J.H. was supported by NIH 5R01 MH094705-04, R01 MH109665-01 and the CSHL Robertson Neuroscience Fund. A.P. was supported by a NARSAD Postdoctoral Fellowship. The authors would like to thank Paul Pavlidis, Bo Li and Jessica Tollkuhn for their thoughtful feedback on earlier drafts of this manuscript. We would also like to thank the dedicated researchers who have made their data publicly available. Our work would not be possible without their valuable contributions.
Footnotes
mcrow{at}cshl.edu, paula{at}cshl.edu, sballouz{at}cshl.edu, huangj{at}cshl.edu, jgillis{at}cshl.edu