%0 Journal Article
%A Stephen Woloszynek
%A Joshua Chang Mell
%A Gideon Simpson
%A Michael P. O’Connor
%A Gail L. Rosen
%T Uncovering thematic structure to link co-occurring taxa and predicted functional content in 16S rRNA marker gene surveys
%D 2017
%R 10.1101/146126
%J bioRxiv
%P 146126
%X Background: Analysis of microbiome data involves identifying co-occurring groups of taxa associated with sample features of interest (e.g., disease state). But elucidating key associations is often difficult since microbiome data are compositional, high dimensional, and sparse. Also, the configuration of co-occurring taxa may represent overlapping subcommunities that contribute to, for example, host status. Preserving the configuration of co-occurring microbes rather than detecting specific indicator species is more likely to facilitate biologically meaningful interpretations. In addition, analyses that utilize both taxonomic and predicted functional abundances typically characterize the taxonomic and functional profiles independently before linking them to sample information. This prevents investigators from identifying the specific functional components associated with particular subsets of co-occurring taxa. Results: We provide an approach to explore co-occurring taxa using “topics” generated via a topic model and then link these topics to specific sample classes (e.g., diseased versus healthy). Rather than inferring predicted functional content independently from taxonomic abundances, we instead focus on inference of functional content within topics, which we parse by estimating pathway-topic interactions through a multilevel, fully Bayesian regression model. We apply our methods to two large publicly available 16S amplicon sequencing datasets: an inflammatory bowel disease (IBD) dataset from Gevers et al. and data from the American Gut (AG) project. When applied to the Gevers et al. IBD study, we demonstrate that a topic highly associated with Crohn’s disease (CD) diagnosis is (1) dominated by a cluster of bacteria known to be linked with CD and (2) uniquely enriched for a subset of lipopolysaccharide (LPS) synthesis genes. In the AG data, our approach found that individuals with plant-based diets were enriched with Lachnospiraceae, Roseburia, Blautia, and Ruminococcaceae, as well as fluorobenzoate degradation pathways, whereas pathways involved in LPS biosynthesis were depleted. Conclusions: We introduce an approach for uncovering latent thematic structure in the context of sample features for 16S rRNA surveys. Using our topic-model approach, investigators can (1) capture groups of co-occurring taxa termed topics, (2) uncover within-topic functional potential, and (3) identify gene sets that may guide future inquiry. These methods have been implemented in a freely available R package (https://github.com/EESI/themetagenomics). Abbreviations: AG, American gut; BMI, body mass index; CD, Crohn’s disease; IBD, inflammatory bowel disease; LDA, latent Dirichlet allocation; LFC, log-fold change; LN, logistic Normal; LPS, lipopolysaccharide; OTU, operational taxonomic unit; PCDAI, pediatric Crohn’s disease activity index; PPD, posterior predictive distribution; STM, structural topic model.
%U https://www.biorxiv.org/content/biorxiv/early/2017/06/18/146126.full.pdf