Abstract
Recent efforts in systems immunology lead researchers to build quantitative models of cell activation and differentiation. One goal is to account for the distributions of proteins from single-cell measurements by flow cytometry or mass cytometry as a readout of biological regulation. In that context, large cell-to-cell variability is often observed in biological quantities. We show here that these readouts, viewed in logarithmic scale may result in two easily-distinguishable modes, while the underlying distribution (in linear scale) is uni-modal. We introduce a simple mathematical test to highlight this mismatch. We then dissect the flow of influence of cell-to-cell variability using a graphical model and its effect on measurement noise. Finally we show how acquiring additional biological information can be used to reduce uncertainty introduced by cell-to-cell variability, helping to clarify whether the data is uni- or bi-modal. This communication has cautionary implications for manual and automatic gating strategies, as well as clustering and modeling of single-cell measurements.
Author summary
Populations of cells are often composed of distinct sub-populations, each performing a unique biological function. A major tool to identify such populations is antibody staining followed by flow- and mass-cytometry. These technologies boast high acquisition speed, resolution and sensitivity, measuring ~ 102 − 105 molecules per cell in more than 15 different labels. With these data, identification of populations typically amounts to manually selecting clusters of cells with distinct molecular abundances. While such a strategy is sufficient for clearly distinct groups, our increasingly refined definitions of cell populations require objective criteria to partition a population. To establish the number of unique sub-populations, we apply both Hartigan’s dip test and a new method which examines the statistical mode of abundance distributions. We demonstrate the necessity of these criteria both mathematically and experimentally. We find that the number of unique populations depends on whether the tag abundances are scaled linearly or logarithmically, an often overlooked fact. Interestingly, theoretical models to explain such distributions are usually treated in linear space, whereas the experimental data is usually treated logarithmically. This mismatch between the two representations has the potential to mislead researchers, more so as technology advances and with it a growing reliance on automatic tools to distinguish populations in high-dimensional space. We detail an approach relying specifically on higher multiplexing in measurements that will be useful to circumvent this mismatch.
Introduction
Flow cytometry data typically stretches across several orders of magnitude, with fluorescence intensity I readily spanning values between 102 and 105. As such, when binning cytometry data to create histograms or distributions, it is natural to let bin sizes increase as a geometric progression, namely, to evenly bin the logarithm of the fluorescence intensity. As a result, instead of the distribution Q(I) of fluorescence intensity I, one usually analyzes the distribution of log I, which we denote P(log I). Indeed, P(log I) has many advantages: easy display of many orders of magnitude in I, easy to model as a two-component log-normal mixture model (as in [1]), and easy to intuitively understand the effect of changing the voltage gain on the flow-cytometer detector photo-multiplier. While such data presentation has been widely adopted in the field of cytometry out of these practical reasons, a rigorous assessment of this log-transformation reveals unwarranted features.
After estimating P(log I), by logarithmically binning or using a kernel-density method, one can formally derive Q(I) as [2], with log I ≡ y.
Simply plotting Q(I) vs. I is impractical as most of the data inevitably appears crowded against the I = 0 axis. Thus, it is common practice to plot P(log I) or variants thereof which deal with small and negative I values introduced by fluorescence compensation (e.g.“Logicle” [3], ”VLog” [4] and other transformations [5]). Displaying faithfully flow-cytometry data is not easy, as the logarithmic scale and fluorescence compensation introduce problems that are easy to miss [6] leading to uncertainty in the number of distinct populations present in the data. Previously, attention has been given to the possibility of effects produced by logarithmic binning [7], contrasting the difference between plotting logarithmic histograms P(log I) vs. log I as opposed to rescaling the x-axis by plotting Q(I) vs. logI. However, an additional, potentially confusing situation seems to have been overlooked: the possible appearance of a second mode in P(log I), rendering P(log I) bi-modal, while for the same data only one mode exists in Q(I). This is the focus of this work.
When considering biological measurements, I is proportional to the actual copy number of RNA or proteins. When theoretical considerations are applied to biological systems (such as biochemical dynamics [8–12,15], mass-action chemical equilibria, cell-cycle measurements [13] and Hill dose-reponse curves [14]), it is the copy number itself that is under consideration. Despite that, the logarithm of copy number is an appealing quantity because of its approximately Gaussian statistics, yielding insight into details easily lost if the data were to be analyzed only in linear scale. This leads to a mismatch, where for instance models posed in linear space and data plotted in logarithmic space seem unable to be reconciled without invoking additional effects such as stochastic gene expression noise [15] and cell-to-cell variability [16–18]. Even so, typically one must resort to approximations to analyze noise propagation linearly [8].
The difference between the convenient consideration of the logarithm of abundances and the theoretically-accurate analysis of the linear copy number renders the question of whether Q(I) has one or two modes (3 extrema) relevant in the following ways: (i) the existence of 1 or 3 extrema is often used to infer the fixed points of a dynamic stochastic biochemical network [1, 9,15] and other in silico methods [19] (ii) extrema are used to define cell-types in automatic (density based) gating and clustering algorithms [20–25]; (iii) The existence of a clearly bi-modal distribution is used for manual gating (e.g. discerning between activated and un-activated cells) in a way that appears more robust and compelling than it might truly be. It is this potential for confusion when the data is viewed in log-space, which we will elaborate on.
The rest of this paper is composed of two parts. In the first part, we point out and analyze the situation where a mismatch between the two representations can happen: we formally state the problem using theoretical modeling of cytometry data as a mixture of two log-normal distributions (colloquially the ”negative” and ”positive” modes), explicitly show situations where two modes appear in P(log I) while only one mode exists in Q(I), and demonstrate this confounding effect on experimental data. In the second part, we analyze the role of cell-to-cell variability in experimental data and show how by measuring a suitable extra dimension one can factor out some of this variability, thus reducing the broadness of the modes sufficiently so as to reduce the mismatch between the two representations. Thus we provide a prescription to design experiments and analyze them so as to resolve the uni-modal vs. bi-modal discrepancy.
Theoretical method
Cytometry data is often amenable to modeling as a log-normal mixture (e.g. [1]). To demonstrate the log/linear mismatch we consider a mixture of two populations, characterized by the distribution of intensity. We define P(log I) as follows: with y1,2 = log I1,2 which are the loci of the centers of the left and right Gaussians in log-space, respectively, and σ1,2 the log-space standard deviations. We then define as in Eq. 1.
In Fig. 1, to illustrate with typical measurement values, we set I1 = 100 and I2 = 1000 (arbitrary units) and α = 0.5, while varying σ1 = σ2. This figure presents the three cases we wish to contrast: on the left column, both P(log I) and Q(I) are bi-modal; in the central column, P(log I) is bi-modal whereas Q(I) is uni-modal; on the right column, both P(log I) and Q(I) are uni-modal, a situation which we examine in more detail in Eq. 7. Moreover, we see an example where even when both are bi-modal (left column), the loci of the modes are different.
The question of uni/multi modality has been investigated before, in the context of modeling flow-cytometry data [19], by using Hartigan’s dip test for uni-modality [26]. Briefly, Hartigan’s dip statistic measures the maximum difference between the empirical distribution and the uni-modal distribution that minimizes that maximum difference. This is compared to the appropriate null distribution which is, in this case, the uniform distribution, to give pu, a p-value for uni-modality. In Fig. 1, we report pu for the data, according to Hartigan’s test, by simulating 104 events drawn from the distributions under consideration [27]. We note that for the central column, Hartigan’s test quantifies and concurs with a visual inspection of the data, i.e., whereas P(log I) is not uni-modal (pu = 0), for the corresponding Q(I) it is uni-modal (pu = 1).
We define y = log I and proceed to the number of extrema for P(y) and Q(I). It is possible to discern between one or three extrema of the distribution P(log I), corresponding to one or two modes (respectively) by counting the number of solutions for . Similarly, can have either one or three solutions. This raises the possibility of there being three extrema (two modes) for P(log I) whereas only one mode in Q(I). We explicitly evaluate the extrema of the mixture of log-normal distributions by solving, which by algebraic rearrangement, provides a more transparent form.
Here we refer to the LHS and RHS as S3(y) and F(y), respectively. In like, computing the extrema of the linear scale distribution amounts to , which by change of variable is equivalent to, and by substitution, We refer to the quantity on the LHS, S1 (y). Thus we have introduced the following functions,
F(y) is the ratio of the two Gaussians in Eq. 2 and is therefore always non-negative; this implies that any extremum y* must satisfy S1 (y*) ≥ 0 and S3 (y*) ≥ 0. We note that in Eq. 4 the condition for extrema in Q(I) requires that whereas, of course, extremizing P(log I) sets its derivative to zero, demonstrating the fact that the loci of the modes for P(log I) and Q(I) are manifestly different. The region where the log-space distribution shows a second mode occurs when Eq. 3 for S3 admits three solutions whereas Eq. 5 for S1 admits only one. Given that Eq. 3 and 5 are transcendental, a graphical way to asses the number of solutions is to plot F(y), S1 (y), S3 (y) and count the number of times S1 and S3 intersect F.
In Fig. 2, we present an example of this graphical method. The mismatch between the number of extrema of P(log I) and Q(I) is apparent whenever (red curve) S3 (y) intersects F at 3 points, whereas (blue curve) S1 (y) only intersects F once.
In the plots along the diagonal, we have σ1 = σ2 (as in Fig. 1) which simplifies F(y) since the quadratic (Gaussian) terms cancel, leaving only an exponential. This leads to a simple criterion to determine whether P(log I) itself admits one or two modes - previously in Fig. 1(right) we saw an example where P(log I) is uni-modal despite being generated from a mixture. Graphically, we see that for S3 = F to have 3 solutions, log S3 (y) has to have a slope less than log F(y) about the extremum y*. In other words, , with equality as the threshold between 1 and 3 extrema, similarly to the way Landau theory defines the critical point in second order phase transitions [28]. This leads to the following intuitive criterion, which states that for P(log I) to appear bi-modal, it must have an extremum (y*) such that the variance of the individual Gaussian components of P(log I) must be smaller than the distance between y* and the Gaussian centers. Substituting for y* ≈ log 316, y1 = log 100 and y2 = log 1000 and σ2 = 1.44 we see that the criterion in Eq. 7 is not satisfied and indeed in Fig. 1(right) and Fig. 2(bottom right) we see that P(log I) has only one mode. A similar condition can be derived for Q(I), , such that it is, however, hard to compare the two bounds analytically because the y* which extremizes P(log I) is different from the y* which extremizes Q(I).
As a check for the predictive power of Hartigan’s test with regards to experimental data, we apply it on a log-normal mixture comparing its predictive power as a function of the number of tests and number of events in each test [27]. In Fig. 3, we test it on the situation in Fig. 2(top,middle) in which Q(I) is weakly bi-modal, meaning that its bi-modality is nearly marginal (σ1 = 0.4 and σ2 = 0.8). The Hartigan probability of unimodality (pu) is not sensitive to the number of bootstrap tests in a reasonable range but becomes strongly predictive of the (weak) bi-modality only when there are more than 105 events. Such an abundance of cells may not always be available in typical flow cytometry data, especially for sub-populations which have been selected (gated) and may comprise only a small fraction of all the cells acquired. Fig. 3 supports that in such weakly bi-modal situations, Hartigan’s p-value should be treated cautiously, a situation which we will encounter in Fig. 6.
We conclude this theoretical section of the manuscript by making a more intuitive argument. Above, we have demonstrated the theoretical existence of a situation where P(log I) has two modes when Q(I) has only one. But how does this come about ?
Intuitively, by binning in logarithmic scale, we are effectively making the bin sizes grow as I increases. A larger bin can only lead to a higher count in that bin and so we might stumble upon a regime where this creates a quasi-mode. Probing further, we established a simple criterion (Eq. 7) where by making the underlying Gaussian mixture composed of too-close-together Gaussians, even in P(log I) there exists only one peak.
Experimental results
We now apply our analysis to experimental data, namely the measured distribution of Extracellular Signal-regulated Kinase (ERK) phosphorylation (ppERK) signaling in CD8+ primary mouse T-cells responding to antigens and inhibited by the SRC inhibitor Dasatinib. These data were acquired in exactly the same manner as the experiments in Ref. [1], for brevity we refer the reader there for all experimental details. In Fig. 4(A) we see how a commercially-available analysis software (FlowJo [29]) plots the distribution of ppERK in such an experiment, which clearly shows a bi-modal structure. Fig. 4(B) plots those same data when subjected to logarithmic binning P(log I), giving the two modes as in FlowJo (red dots) whereas Q(I) has a single mode (blue dots). We fit P(log I) as a Gaussian mixture. This is followed in Fig. 4(C) by the same extrema analysis as in Fig. 2, revealing that indeed Q(I) has a single maximum.
Immunologists have relied on cytometry for over forty years to identify new cell populations, often based on manual gating of cytometry data. The success of this method (with validation by identification of new transcription factors) stands in contrast with the danger of generating modes in the distributions of log-transformed data P(log I), as presented above. This is particularly cogent to recent efforts at clustering single-cell measurements in high-dimensional space by mass cytometry. Hence, we wondered whether gating in logarithmic scale could be justified a posteriori, based on biological knowledge. We aimed to include additional information in our analysis such that the biological significance of the distributions in our single-cell measurements is better captured.
In Fig. 5(A), we show the experimental data we will use in our proposed solution. Here we returned to our single-cell measurements of ERK phosphorylation in primary mouse T cells in Ref. [16]. We show a heat map of the joint distribution of ppERK (IppERK) and total ERK1 (IERK1) expression in mouse CD8+ T-cells. Notably, whereas the two modes of ppERK significantly overlap when plotted in the marginal distribution P(log IppERK), ppERK expression correlates with total ERK1 levels in their joint distribution P2 (log IppERK, log IERK1). Each cell’s state encodes another latent variable, its activation status - which tells if the cell has been successfully activated by the stimulus. To deduce the activation status, it is common practice to use manual gating of the data by drawing of a boundary between the active and inactive states. To account for the correlation between ERK1 and ppERK we consider two manual gating strategies: (i) perpendicular gating (dashed red) according to P(log IppERK) with IppERK > ppERK* considered an activated cell, and (ii) diagonal gating according to the apparent correlation in P2 (dashed grey). We set the diagonal gate with a slope of unity, meaning that we take the dividing line, reflecting proportionality ppERK ∝ ERK1, as a good way to partition the two states. We define ”Inactive” to the left of the dashed line, and ”Active” to the right of it.
To understand the structure of these data, it is important to characterize explicitly the dependency structure of our observables (ERK1, ppERK), the latent activation status, and the influence of external factors on these three. The existence of two peaks, in ppERK which appear distinct from each other but correlated with ERK1 levels, guides us to use a Bayesian network to capture these features in the data as a graphical model. First - we test whether ERK1 and the cell’s activation status are independent. In S1 Fig we see that whereas independence implies that P(log ERK1) = P (log ERK 1| Activation), in fact there is a weak dependence between them regardless of whether we employ a vertical (red) or a diagonal (grey) gate (the diagonal gate showing a weaker dependence). The weak dependence between activation state and ERK1 levels is reasonable, if we account for cell-to-cell variability, since for a given stimulus some cells inevitably respond differently from the typical cell [30]. We summarize the causal structure for this system in Fig. 5(B) which depicts a probabilistic graphical model [31] of the flow of influence from cell-to-cell variability and activation signal, to ERK1 levels and activation status, and finally to the distribution of ppERK. We depict the pair ERK1-ppERK in a template, to suggest to the reader the existence of multiple other pairs. Importantly, in what follows we show how to better resolve the log-space peak; this recipe, together with the model in Fig. 5(B) can be used a priori in automatic gating and clustering algorithms to prevent some of the mismatch between logarithmic and linear binning strategies. For stochastic modeling, such a structure presents an opportunity to analyze the structure and propagation of noise in the system [8,32].
We treat the broadness of ppERK modes as generated by cell-to-cell variability in total ERK1 content - a reasonable assumption since the noise in the phosphorylation of ERK is negligible in comparison [33]. We further neglect the indirect influence between activation status and ERK1 levels due to its weakness (checked in S1 Fig). Thus we approximate that the conditional independence between ERK1 and activation status (given that both are influenced by cell-to-cell variability) is true independence. This implies an approximately linear relation IppERK ∝ IERK1 given activation status. We define the normalized intensity as the ratio of ppERK to ERK1 intensity, thereby eliminating the linear dependence of ppERK on ERK1 levels and reducing uncertainty due to cell-to-cell variability. The resulting may boast a sufficiently reduced noise in ppERK such that a clear bi-modal signature appears regardless of logarithmic or linear binning of . In Fig. 6 we show such an example, where in Fig. 6(A,B), P(log I) and Q(I) do not agree on the number of modes, whereas in Fig. 6(C,D) do agree. These data have order 25,000 events and so, similarly to Fig. 3, Hartigan’s test may not identify the number of peaks correctly, as is indicated in the pu values.
Thus we demonstrate how by suitably accounting for cell-to-cell variability one can reduce the measured noise so as to circumvent the mismatch in the number of modes between the logarithmic and linear treatment. For testing bi-modality, whereas our method relies on fitting a Gaussian mixture, Hartigan’s test requires no fitting yet may lack statistical power when applied to typical experimental situations.
Conclusion
The scale-dependent bi-modality as demonstrated in Fig. 4 and Fig. 6(A,B) may be not uncommon. Specifically, one must take extra care when attempting to manually gate, automatically cluster or build dynamical models which rely on an apparent bi-modal structure, as it might depend on whether the data was log-transformed or not. This becomes increasingly relevant as cytometry moves forward to higher dimensional measurements which become tractable only with automatic gating schemes. Instead, one might consider plotting Q(I) on the log-log scale, a presentation which preserves the number of maxima, at the expense of the measure of the distribution. It is possible to ameliorate the mismatch between the two scales, as we demonstrate in Fig. 6(C,D), if one can simultaneously measure correlated observables (in our example, ppERK and ERK1). This allows to control for cell-to-cell variability, increasing the resolution of the data. Recently, this favorable scenario has become more attainable with the introduction of mass cytometry - where one can rely on a large number of channels without compromising the flow-cytometry panel. Based on the analysis carried out in this paper, we conjecture that such extra channels, chosen wisely, can provide automatic clustering/gating algorithms the right information needed to make more reliable clustering and population defining. This is a simple way to introduce knowledge of the biological structure of the data into otherwise objective clustering algorithms, without compromising their objectivity. We propose and test some features of a graphical model that captures the structure of such dependencies in a way potentially useful for those interested in automatic gating and clustering algorithms. Though we caution on the use of P(log I), we find it remarkable how well the distribution of biological quantities can be modeled as a log-normal mixture. This highlights the deep and still little understood connection between distributions observed in living things and their relation to the logarithm of abundance, a subject likely to puzzle researchers for years to come.
Acknowledgments
This work was supported by Human Frontier Science Program grant LT000123/2014 (A. E.) and by the Intramural Research Program of the NCI, NIH.