Abstract
Microbes form complex and dynamic ecosystems that play key roles in the health of the animals and plants with which they are associated. The inter-species interactions are often represented by a directed, signed and weighted ecological network, where nodes represent microbial species and edges represent ecological interactions. Inferring the underlying ecological networks of microbial communities is a necessary step towards understanding their assembly rules and predicting their dynamical response to external stimuli. However, current methods for inferring such networks require assuming a particular population dynamics model, which is typically not known a priori. Moreover, those methods require fitting longitudinal abundance data, which is not readily available, and often does not contain the variation that is necessary for reliable inference. To overcome these limitations, here we develop a new method to map the ecological networks of microbial communities using steady-state data. Our method can qualitatively infer the inter-species interaction types or signs (positive, negative or neutral) without assuming any particular population dynamics model. Additionally, when the population dynamics is assumed to follow the classic Generalized Lotka-Volterra model, our method can quantitatively infer the inter-species interaction strengths and intrinsic growth rates. We systematically validate our method using simulated data, and then apply it to experimental data from a synthetic soil microbial community. Our method offers a novel framework to infer microbial interactions and reconstruct ecological networks, and represents a key step towards reliable modeling of complex, real-world microbial communities, such as human gut microbiota.
1. Introduction
The microbial communities (MCs) established in animals, plants, soils, oceans, and virtually every ecological niche on Earth perform vital functions for maintaining the health of the associated ecosystems1-5. Recently, our knowledge of the organismal composition and metabolic functions of diverse MCs has markedly increased, due to advances in DNA sequencing and metagenomics6. However, our understanding of the underlying ecological networks of these diverse MCs lagged behind7. Mapping the structure of those ecological networks and developing ecosystem-wide dynamic model will be important for a variety of applications, from predicting the outcome of community alterations and the effects of perturbations8, to the engineering of complex MCs7,9. We emphasize that the ecological network is fundamentally different from the correlation-based association or co-occurrence network7,10,11, which is undirected and do not encode any causal relations or direct ecological interactions, and hence cannot be used to faithfully predict the dynamic behavior of the MC.
To date, existing methods for inferring the ecological networks of MCs are all based on temporal abundance data12-16. The success of those methods has been impaired by two fundamental limitations. First, those inference methods require the a priori choice of a parameterized population dynamics model for the MC. These choices are hard to justify, given that species in the MC interact via a multitude of different mechanisms7,17,18,19 producing complex dynamics even at the scale of two species20. Any deviation of the chosen model from the “true” model of the MC can lead to inference errors, regardless of the inference method that is used16. Second, a successful temporal-data based inference requires sufficiently informative time-series data16,21. However, for many host-associated MCs, such as the human gut microbiota, the available temporal data (i.e., time series of the abundance of each taxa in the MC) is often poorly informative. This is due to the fact that such communities often display highly intrinsic stability and resilience22,23, which leads to measurements containing only their steady-state behavior. For MCs such as the human gut microbiota, trying to improve the informativeness of temporal data is challenging and even ethically questionable, as it requires applying drastic and frequent perturbations to the MC, with unknown effects on the host.
To circumvent the above fundamental limitations, here we develop a new inference method based on steady-state data. We rigorously prove that, if we collect enough independent steady states of the MC, it is possible to infer the microbial interaction types (positive, negative and neutral interactions) and the structure of the ecological network, without requiring any population dynamics model. We further derive a rigorous criterion to check if the steady-state data of an MC is consistent with the Generalized Lotka-Volterra (GLV) model, a classic population dynamics model for MCs in human bodies, soils and lakes12-16. We finally prove that, if the MC follows the GLV dynamics, then the crosssectional steady-state data can be used to accurately infer the model parameters, i.e., interspecies interaction strengths and intrinsic growth rates. We validated our inference method using simulated data generated from various classic population dynamics models. Then we applied it to real data collected from a synthetic soil MC in which experimental evidence of the microbial interactions is available.
2. Results
Microbes do not exist in isolation but form complex ecological networks7. The ecological network of an MC is encoded in its population dynamics, which can be described by a set of ordinary differential equations (ODEs):
Here, are some unspecified functions whose functional forms determine the structure of the underlying ecological network; x(t) = (x1(t),…, xN(t))T is an N - dimensional vector with xi(t) denoting the absolute abundance of the i-th taxon at time t. In this work, we don’t require ‘taxon’ to have a particular ranking, as long as the resulting abundance profiles are distinct enough across all the collected samples. Indeed, we can group microbes by species, genus, family or just operational taxonomical units (OTUs).
Note that in the right-hand side of Eq. (1) we explicitly factor out xi to emphasize that (i) without external perturbations those initially absent or later extinct species will never be present in the MC again as time goes by, which is a natural feature of population dynamics; (ii) there is a trivial steady state where all the species are absent; (iii) there are many different non-trivial steady states where at least one taxon is present. We assume that the steady-state samples collected in a dataset correspond to those non-trivial steady states x* of Eq. (1), which satisfy , i = 1,…, N. For many host-associated MCs, e.g., the human gut microbiota, those cross-sectional samples collected from different individuals contain quite different collections of taxa (up to the taxonomic level of phylum binned from OTUs)22. We will show later that the number of independent steady-state samples is crucial for inferring the ecological network.
To infer the ecological network underlying an MC, we make an explicit assumption: the nature of the ecological interactions (i.e., promotion, inhibition, or neutral) between any two taxa does not vary over time, though their interaction strengths might change. Mathematically, those ecological interactions are encoded by the Jacobian matrix with matrix elements Jij(x(t)) ≡ ∂fi(x(t))/∂xj. The condition Jij(x(t)) > 0 (or < 0) means that taxon j promotes (or inhibits) taxon i, respectively. The diagonal terms Jii(x(t)) represent intra-species interactions. Note that Jij(x(t)) might depend on the abundance of many other taxa beyond i and j (due to the so-called “higher-order” interactions). Though the magnitude of Jij(x(t)) by definition may vary over different states, we assume its sign remains invariant, i.e., the microbial interaction type does not change. This assumption requires that those steady-state samples be collected from the MC under very similar environmental conditions (e.g., nutrient availability)25, and the MC is well-mixed to avoid strong spatial segregation. Notably, as we will show later, the assumption is valid for many classic population dynamics models26-30.
Inferring interaction types
The above two assumptions enable us to mathematically prove that the sign-pattern of the Jacobian matrix J(x) satisfies a strong constraint. Thus, by collecting enough independent steady-state samples, we can solve for the sign pattern and hence map the structure of the ecological network.
The basic idea is as follows. Let be the set of all steady-state samples sharing taxon i Then, for any two of those samples xI and xK, where the superscripts I, denote the collections of present taxon in those samples, we can prove that the sign-pattern of the i-th row of Jacobian matrix, denoted as a ternary vector si ∈ {−,0, +}N, is orthogonal to (xI − xK). In other words, we can always find a real-valued vector , which has the same sign pattern as si and satisfies yT · (xI − xK) = 0. If we compute the sign patterns of all vectors orthogonal to (xI − xK) for all I, , then si must belong to the intersections of those sign patterns, denoted as . In fact, as long as the number Ω of steady-state samples in is above certain threshold Ω*, then will be minimum and contain only three sign-patterns {−a, 0, a}. To decide which of these three remaining sign-patterns is the true one, it is sufficient that we know the sign of only one non-zero interaction. If such prior knowledge is unavailable, one can at least make a reasonable assumption that sii = ‘–’, i.e., the intra-species interaction Jii is negative (which is often necessary for community stability). When has more than three sign-patterns, we proved that the steady-state data is not informative enough in the sense that all sign-patterns in are consistent with the data available in. This situation is not a limitation of any inference algorithm but of the data itself. To uniquely determine the sign-pattern in such a situation, one has to either collect more samples (thus increasing the informativeness of ) or use a priori knowledge of non-zero interactions to narrow down to just one possible sign-pattern.
We illustrate the application of the above method to small MCs with unknown population dynamics (Fig. 1). For the two-taxa MC (Fig. 1a), there are three possible steady-state samples, i.e., {x{1}, x{2}, x{1,2}}, depicted as colored pie charts in Fig. 1b. In order to infer s1 = (sign(J11), sign(J12)), we compute a straight line (shown in green in Fig. 1b) that is orthogonal to (x{1,2} − x{1}) and passes the origin. The regions (including the origin and two quadrants) crossed by this green line provide a minimum set of possible sign-patterns that s1 may belong to. A priori knowing that J11 < 0, our method correctly concludes that s1 = (−, +). Note that J12 > 0 is consistent with the observation that with the presence of taxon 2, the steady-state abundance of taxon 1 increases (Fig.1b), i.e., taxon 2 promotes the growth of taxon 1. We can apply the same method to infer the sign-pattern of s2 = (−, −).
For the three-taxa MC (Fig.1c), there are seven possible steady-state samples, i.e., {x{1}, x{2}, x{3}, x{1,2}, x{1,3}, x{2,3}, x{1,2,3}}. Four of them share taxon 1 (see colored pie charts in Fig. 1d), and six line segments connect the sample pairs of the form (xI − xK), I, . Considering a particular line segment (x{1,3} − x{1}), i.e., the solid blue line in Fig. 1d, we compute a plane (shown in orange in Fig.1d) that is orthogonal to (x{1,3} − x{1}) and passes the origin. The regions (including the origin and eights orthants) crossed by this orange plane provide a set of possible sign-patterns that may belong to (see Fig. 1d). We repeat the same process for all other vectors (xI − xK), I, , and compute the intersection of all the possible sign-patterns, finally yielding a minimum set that s1 may belong to. If the sign of one non-zero interaction is known (J11 < 0 for this example), the method correctly infers the true sign-pattern s1 = (−,0, +).
It is straightforward to generalize the above method to MCs of N taxa. But this brute-force method requires us to calculate all the sign-pattern candidates first, and then calculate their intersection to determine the minimum set that si may belong to. Since the solution space of sign-pattern is of size 3N, the time complexity of this brute force method is exponential with N, making it impractical for MCs with N > 10 taxa. To resolve this issue, we develop a heuristic algorithm, which pre-calculates many intersection lines of (N − 1) non-parallel hyperplanes that pass the origin and are orthogonal to (xI − xK), I, , and then determines based on the most probable intersection line. The solution space of this heuristic algorithm is determined by a user-defined parameter, i.e., the number of precalculated interaction lines (denoted as Ψ). Hence this algorithm naturally avoids searching the exponentially large solution space.
In reality, due to measurement noise and/or transient behavior of the MC, the abundance profiles of the collected samples may not exactly represent steady states of the MC. Hence for certain Jij’s their inferred signs might be wrong. Later on, we show that for considerable noise level the inference accuracy is still reasonably high.
Inferring interaction strengths
To quantitatively infer the inter-species interaction strengths, it is necessary to choose a priori a parameterized dynamic model for the MC. The classical GLV model can be obtained from Eq. (1) by choosing where is the intrinsic growth rate vector and is the interaction matrix characterizing the intra- and inter-species interactions.
From Eq. (2) we can easily calculate the Jacobian matrix J, which is nothing but the interaction matrix A itself. This also reflects the fact that the value of aij quantifies the interaction strength of species j on species i. The GLV model considerably simplifies the inference of the ecological network, since we proved that ai · (xI − xK) = 0, for all I, , where ai = (ai1,…, aiN) represents the i-th row of A matrix. In other words, all steady-state samples containing the i-th taxon will align exactly onto a hyperplane, whose orthogonal vector is precisely scalable with the vector ai that we want to infer (Fig. 2a). Thus, for the GLV model, the inference from steady-state data reduces to finding an (N − 1)-dimensional hyperplane that “best fits” the steady-state sample points in the N-dimensional state space. In order to exactly infer ai, it is necessary to know the value of at least one nonzero element in ai, say, aii. Otherwise, we can just determine the relative interaction strengths by expressing aij in terms of aii. Once we obtain ai, the intrinsic growth rate ri of the i-th taxon can be calculated by averaging (−ai · xI) over all , i.e., all the steady-state samples containing taxon i.
In case the samples are not collected exactly at steady states of the MC, those samples containing taxon i will not exactly align onto a hyperplane (Fig. 2b). A naive solution is to find a hyperplane that minimizes its distance to those noisy samples. But this solution is prone to induce false positive errors and will yield non-sparse solutions (corresponding to very dense ecological networks). This issue can be partly alleviated by introducing a Lasso regularization31, implicitly assuming that the interaction matrix A in the GLV model is sparse. However, the classical Lasso regularization may induce a high false discovery rate (FDR), meaning that many zero interactions are inferred as non-zeros ones. To overcome this drawback, we applied the knockoff filter32 procedure, which controls the FDR below a desired user-defined level q > 0.
The observation that for the GLV model all steady-state samples containing the i-th taxon align exactly onto a hyperplane can also be used to characterize how much the dynamics of an MC deviates from the GLV model. This deviation can be quantified by the normalized R2 of multiple linear regression when fitting the hyperplane (Fig. 2b). If R2 is close to 1 (the samples indeed align to a hyperplane), we conclude that the dynamics of the MC is consistent with the GLV model, and hence the inferred interaction strengths and intrinsic growth rates are reasonable. Otherwise, we should only aim to qualitatively infer the ecological interaction types that do not require specifying any population dynamics.
Validation on simulated data
1. Interaction types
To validate the efficacy of our method in inferring ecological interaction types, we numerically calculate the steady states of a small MC with N = 8 taxa, using four different population dynamics models26-30: Generalized Lotka-Volterra (GLV), Holling Type II (Holling II), DeAngelis-Beddington (DB) and Crowley-Martin (CM) models. Note that all these models satisfy the requirement that the sign pattern of the Jacobian matrix is time-invariant. To infer the ecological interaction types among the 8 taxa, we employed both the brute-force algorithm (with solution space ~ 38 = 6,561) and the heuristic algorithm (with solution space given by the number of the pre-calculated intersections Ψ = 5N = 40).
In the noiseless case, we find that when the number of steady-state samples satisfies Ω > 3N, the heuristic algorithm outperformed the brute-force algorithm for all the four datasets generated from different population dynamics models (Fig. 3a). This result is partly due to the fact that the former requires much less samples than the latter to reach high accuracy (the percentage of correctly inferred interaction types). However, when the sample size Ω is small, the heuristic algorithm completely fails while the brute-force algorithm still works to some extent.
We then fix Ω = 5N, and compare the performance of the brute-force and heuristic algorithms in the presence of noise (Fig. 3b). We add artificial noise to each non-zero entry of a steady-state sample xI replacing the value of its i-th entry by , where is a random number uniformly distributed on the interval and η ≥ 0 quantifies the noise level. We find that the heuristic algorithm again works better than the brute-force algorithm.
The above encouraging results on the heuristic algorithm prompt us to systematically study the key factor to obtain an accurate inference, i.e., the minimal sample size Ω*. Note that for an MC of N taxa, there are at most Ωmax = (2N − 1) possible steady-state samples. (Of course, not all of them will be ecologically feasible. For example, certain pair of taxa will never coexist.) In general, it is unnecessary to collect all possible steady-state samples to obtain a highly accurate inference result. Instead, we can rely on a subset of them. We numerically calculate the minimal sample size Ω* we need to reach three different accuracy levels (85%, 90%, 95%). For this, we considered two different taxa presence patterns: (1) uniform: all taxa have equal probability of being present in the steady-state samples (inset of Fig. 3c); and (2) heterogeneous: certain taxa have higher presence probability than others, reminiscent of human gut microbiome samples22 (inset of Fig. 3d). We found that at all three accuracy levels Ω* scales linearly with N in both taxa presence patterns, though the uniform taxa presence pattern requires much less samples (Fig. 3c,d).
Note that as N grows, the total possible steady-state samples Ωmax increases exponentially, while the minimal sample size Ω* we need for high inference accuracy increase linearly. Hence, interestingly, we have Ω*/Ωmax → 0 as N increases. This implies that as the number of taxa increases, the proportion of samples needed for accurate inference actually decreases. This is a rather counter-intuitive result because, instead of a “curse of dimensionality”, it suggests that a “blessing of dimensionality” exists when using the heuristic algorithm to infer interaction types for MCs with a large number of taxa.
2. Interaction strengths
To validate our method in quantitatively inferring interspecies interaction strengths, we numerically calculated steady states for an MC of N = 50 taxa, using the GLV model. For this we set aii = −1 for all taxa. During the inference, we just assume aii’s follow a half-normal distribution . The inference results on inter-species interaction strengths and intrinsic growth rates are shown in Fig. 4.
We find that the classical Lasso regularization could induce many false positives. Indeed, the false discovery rate (FDR) approaches 45.87%, indicating that almost half of inferred non-zero interactions are actually zero (Fig. 4a). In many cases, we are more concerned about low FDR than high false negative rates, because the topology of an inferred ecological network with even many missing links can still be very useful in the study of its dynamical and control properties33. To control FDR below a certain desired level q = 0.2, we further used the knockoff filter32 (Fig. 4b), and find that the knockoff filter succeeds in controlling the FDR below 20%, though it also introduces more false negatives.
The results presented in Fig. 4a,b were obtained with Ω = 5N samples and artificial noise added such that , where and η = 0.1. To study the minimal sample size Ω* required for perfect inference in the noiseless case, we again consider two different taxa presence patterns: (1) uniform; (2) heterogeneous. We find that for both taxa presence patterns Ω* scales linearly with N, though the uniform taxa presence pattern requires much less samples (Fig. 4c).
Application to experimental data
We finally applied our method to analyze experimental data derived from a synthetic soil MC of eight bacterial species34. This dataset consists of steady states of a total of 101 different species combinations: all 8 solos, 28 duos, 56 trios, all 8 septets, and 1 octet. For those steady-state samples that started from the same species collection but with different initial conditions, we average over their final steady states to get a representative steady state for this particular species combination. Note that true multistability was observed in only one of the 101 species combinations, suggesting that our assumption is at least partly supported by the experimental data.
In the experiments, it was found that several species grew to a higher density in the presence of an additional species than in monoculture. The impact of each additional species (competitor) j on each focal species i can be quantified by calculating the relative yield, defined as: , which represents a proxy of the ground truth of the interaction strength between species i and j. A negative relative yield indicates growth hindrance of species jon i, whereas positive values indicated facilitation (Fig. 5a). Though quantifying the relative yield is conceptually easy and implementable for certain small MCs, for many host-associated MCs with many taxa, such as the human gut microbiota, measuring these one- and two-species samples is simply impossible. This actually motivates the inference method we developed here.
Before we apply our inference method, we remove all these steady states involving one- or two species, and analyse only the remaining 65 steady states. (Note that for N = 8, the number of total possible steady-state samples is Ωmax = 255.)
During the inference, we first check if the population dynamics of this MC can be well described by the GLV model. We find that all the fitted hyperplanes show very small R2, indicating that the GLV model is not suitable for none of the eight species. Hence, we have to aim for inferring the ecological interaction types, without assuming any specific community dynamics model.
Since this MC has only eight species, we can use the brute-force algorithm to infer the sign pattern of the 8×8 Jacobian matrix, i.e., the ecological interaction types between the 8 species. Compared with the ground truth obtained from the relative yield (Fig. 5a), we find that 50 (78.13%) of the 64 signs were correctly inferred, 10 (15.62%) signs were wrong (denoted as ‘×’), and 4 (6.25%) signs cannot be determined (denoted as ‘?’) with the information provided by the 65 steady states (Fig. 5b).
We notice that the relative yield of many incorrectly inferred interactions is weak (with the exception of R13 and R15). We conjecture that these errors are caused by noise or measurement errors in the experiments. To test this conjecture, we analyzed the robustness of each inferred sij by calculating the percentage of unchanged sij after adding perturbations to the samples (Fig. 5c). Similar to adding noise on simulated data, here we add noise to each non-zero entry of a sample xI such that , where . The more robust the inferred results are, the larger the percentage of unchanged signs as η is increased. We found that most of the inferred signs were robust: the percentage of unchanged signs remained nearly 80% up to noise level η = 0.3 (Fig. 5c). Specifically, Fig. 5d plots the percentage of unchanged signs of the inferred Jacobian matrix when η = 0.04. We found that even if the perturbation is very small, 5 of 10 false inferred sij in Fig. 5c changed their signs very frequently (blue entries with false label in Fig. 5d). It demonstrated that those interactions were very sensitive to the difference of sample pairs, suggesting the validity of the hypothesis that some false inferences in Fig. 5c were caused by the noise.
3. Conclusion
We developed a new inference method to map the ecological networks of MCs using steady-state data. Our method can qualitatively infer ecological interaction types (signs) without specifying any population dynamics model. Furthermore, we show that steady-state data can be used to test if the dynamics of an MC can be well described by the classic GLV model. If yes, our method can quantitatively infer inter-species interaction strengths and the intrinsic growth rates.
The proposed method bears some resemblance to previous network reconstruction methods based on steady-state data35. But we emphasize that, unlike the previous methods, our method does not require any perturbations applied to the system. For certain MCs such as the human gut microbiota, applying perturbations may raise severe ethical and logistical concerns.
Note that our method requires the measurement of absolute taxon abundances. It fails on analyzing the relative abundance data. The compositionality of relative abundance profiles also causes serious trouble for inference methods based on temporal data12,16. Fortunately, for certain small synthetic MCs, we can assess the total cell density by measuring the optical density (OD) and species fractions (relative abundance) can be determined by plating on nutrient agar plates34. For host-associated MCs, we can combine two sources of information to measure absolute abundances: (1) data measuring relative abundances of microbes, typically consisting of counts (e.g., high-throughput 16S rRNA sequencing data); and (2) data measuring overall microbial biomass in the sample (e.g., universal 16S rRNA quantitative PCR (qPCR)12,13.
In contrast to the difficulties encountered in attempts to enhance the informativeness of temporal data that are often used to infer ecological networks, the informativeness of cross-sectional data can be enhanced by simply collecting more steady-state samples with distinct taxa collection (For host-associated MCs, this can be achieved by collecting steady-state samples from different hosts). Our numerical analysis suggests that the minimal number of samples with distinct taxa collections required for robust inference scales linearly with the taxon richness of the MC. Our analysis of experimental data from a small synthetic MC of eight species shows that collecting roughly one quarter of the total possible samples is enough to obtain a reasonably accurate inference. Furthermore, our numerical results suggest that this proportion can be significantly lower for larger MCs.
This blessing of dimensionality suggests that our method holds great promise for inferring the ecological networks of large and complex real-world MCs, such as the human gut microbiota. There are two more encouraging facts that support this idea. First of all, it has been shown that the composition of the human gut microbiome remains stable for months and possibly even years until a major perturbation occurs through either antibiotic administration or drastic dietary changes36-39. The striking stability and resilience of human gut microbiota suggest that the collected samples very likely represent the steady states of the gut microbial ecosystem. Second, for healthy adults the gut microbiota displays remarkable universal ecological dynamics40 across different individuals. This universality of ecological dynamics suggests that microbial abundance profiles of steady-state samples collected from different healthy individuals can be roughly considered as steady states of a conserved “universal gut dynamical” ecosystem and hence can be used to infer its underlying ecological network.
We expect that additional insights into microbial ecosystems will emerge from a comprehensive understanding of their ecological networks. Indeed, inferring ecological networks using the method developed here will enable enhanced investigation of the stability41 and assembly rules42 of MCs as well as facilitate the design of personalized microbe-based cocktails to treat diseases related to microbial dysbiosis8.
Contributions
Y.-Y.L conceived the project. Y.-Y.L and M.T.A. designed the project. Y.X. and M.T.A. did the analytical calculations. Y.X. did the numerical simulations and analyzed the empirical data. All authors analyzed the results. Y.X., M.T.A. and Y.-Y.L. wrote the manuscript. All authors edited the manuscript.
Author Information
The authors declare no competing financial interests. Correspondence and requests for materials should be addressed to Y.-Y.L. (yyl@channing.harvard.edu).
Acknowledgements
This work is supported in part by the John Templeton Foundation (Award number 51977). We thank Drs. Gabe Billings, Brigid Davis, Liang Tian for insightful comments and discussions on the manuscript.