Defining the Core: the Human Gut Pan-Microbiome


In 1996, Kenneth Wilson and Rhonda Blitchington pioneered the use of DNA sequencing to analyze the composition of microbes in human fecal samples (Wilson and Blitchington 1996). Since then, studies around the world have investigated the specific make-up of the human gut microbiome via sequencing technologies. The most common way to analyze the composition of the microbiome is Next Generation Sequencing (NGS). The use of NGS has identified a large number of microorganisms found in the human gut.

Analysis of sequence data can identify types, abundances, and shifts in the composition of the gut microbiome of people with medical conditions and health concerns. Finding potential links between diseased states and the makeup of the human gut microbiome begins with DNA analysis of the “condition” gut, and continues with comparison to the types and abundances of microbiota in the “healthy” human gut.

Central Question:  

The critical question, then: is there a core composition of bacterial groups in the human gut microbiome that is consistent across the guts of all healthy humans? To make connections between the diseased state and an altered gut microbiome, there should be a basis for comparison. What is the unaffected state from which the microbiome changed? There are as many definitions for the healthy phylogenetic, or evolutionarily related, core of gut microbiota as there are studies on the subject. Normalizing datasets which originate from varied analysis tools and protocols is one tactic which may yield a more coherent picture of the core gut microbiome (Panek et al. 2018). Daniel Aguirre de Carcer recently published one such study, describing a novel approach to grouping DNA sequences that uses data from previously published studies (Aguirre de Carcer 2018).


Aguirre de Carcer took the data from three large studies and processed it in a way slightly different from past methods (Aguirre de Carcer 2018). The method used in this meta-analysis may be a springboard from which more comprehensive investigations could be launched.

Most sequence-analysis software packages group DNA sequences, or “reads,” by similarity. “Similarity” can be a fluid term. Most analysis programs begin by identifying the order of nucleotides making up each DNA sequence. Once the order is obtained through NGS, every sequence is compared to vast databases of sequences whose taxonomic groups are known. From kingdom on down to genus and species, hundreds of thousands of genomes have been matched to their respective organisms.

These hundreds of thousands of clearly identified sequence-to-organism matches do not represent all sequences found in most samples, however. The remaining sequences fall into “closest match” categories, where as much as 20-30% of a given sequence has no exact match in the databases (Beiko 2015). Software designed to group these sequences with their closest matches within a database is used to assign the sequences to similarity-based categories. These categories are Operational Taxonomic Units, or OTUs, which each contain some wholly matched sequences and many “closest match” sequences. Unfortunately, this process often results in species and even genera being lumped into categories which may not represent what they really are, and therefore may not represent their true function inside the human gut (Lagier et al. 2012).
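To make the idea of similarity-based grouping concrete, here is a minimal sketch of greedy clustering of reads into OTUs. It is an illustration only, not the procedure of any particular software package: the function names `identity` and `greedy_cluster`, the toy reads, and the simple position-by-position identity measure (which assumes pre-aligned reads of equal length) are all assumptions made for this example.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length reads.

    Toy similarity measure (an assumption for this sketch); real tools
    align the reads before scoring them.
    """
    if len(a) != len(b):
        raise ValueError("this toy metric assumes pre-aligned, equal-length reads")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(reads, threshold=0.97):
    """Put each read into the first OTU whose seed it matches at >= threshold.

    Returns a list of (seed, members) pairs. A read that matches no
    existing seed becomes the seed of a new OTU.
    """
    otus = []
    for read in reads:
        for seed, members in otus:
            if identity(read, seed) >= threshold:
                members.append(read)
                break
        else:
            # No existing OTU was similar enough: start a new one.
            otus.append((read, [read]))
    return otus


# Hypothetical reads, invented for illustration only.
reads = [
    "ACGTACGTAC",
    "ACGTACGTAT",  # one mismatch against the first read
    "ACGTACGTAC",
    "TTTTGGGGCC",
]

for threshold in (1.0, 0.9):
    otus = greedy_cluster(reads, threshold)
    print(threshold, [len(members) for _, members in otus])
```

Lowering the threshold from 1.0 to 0.9 merges the near-identical read into the first OTU, which is the sense in which “similarity” is a fluid term: the same reads yield different categories depending on the cutoff chosen.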

Aguirre de Carcer re-processed each of three large datasets, and re-clustered the sequences into OTUs based not on known sequence alignment in the database, but on similarity to other sequences from all samples, in all three datasets. Aguirre de Carcer’s method grouped and regrouped sequences dynamically, using a range of minimum similarity criteria. The goal was to first group sequences by how common they were across the samples from over 2000 individuals, then identify the phylogeny of the common sequences. The phylogenetic OTUs resulting from these clustering thresholds revealed a core common to all three datasets (Aguirre de Carcer 2018).
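The general idea of pooling reads from every sample, clustering them against one another at several similarity thresholds, and then asking which clusters appear in all samples can be sketched as follows. This is a simplified illustration and not Aguirre de Carcer’s actual pipeline: the function names, the toy identity measure (equal-length reads compared position by position), the greedy clustering strategy, and the three invented subjects are all assumptions.

```python
from collections import defaultdict


def identity(a, b):
    """Toy similarity: fraction of matching positions (equal-length reads)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_pooled(samples, threshold):
    """Greedily cluster reads pooled from all samples, without a reference database.

    Returns a list of (seed, counts), where counts maps sample id to the
    number of that sample's reads that fell into the OTU.
    """
    otus = []
    for sample_id, reads in samples.items():
        for read in reads:
            for seed, counts in otus:
                if identity(read, seed) >= threshold:
                    counts[sample_id] += 1
                    break
            else:
                counts = defaultdict(int)
                counts[sample_id] = 1
                otus.append((read, counts))
    return otus


def core_otus(samples, threshold):
    """Keep only OTUs that contain at least one read from every sample."""
    return [
        (seed, dict(counts))
        for seed, counts in cluster_pooled(samples, threshold)
        if all(counts.get(sample_id, 0) > 0 for sample_id in samples)
    ]


# Hypothetical subjects and reads, invented for illustration only.
samples = {
    "subject_A": ["ACGTACGTAC", "TTTTGGGGCC"],
    "subject_B": ["ACGTACGTAT", "GGGGGGGGGG"],
    "subject_C": ["ACGTACGAAT", "ACGTACGTAC"],
}

for threshold in (1.0, 0.9, 0.8):
    print(threshold, core_otus(samples, threshold))
```

In this toy data no cluster is shared by all three subjects when exact identity is required, but a shared cluster appears once the threshold is relaxed. Note that greedy clustering depends on the order in which reads are processed, so a real analysis would need a more careful clustering strategy.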

Aguirre de Carcer’s OTUs were based on sequences common to all subjects from all three datasets. This grouping accounted for an average of 65.6% of the sequences per sample. Sequences in the core OTUs were most often identified with the bacterial families Lachnospiraceae and Ruminococcaceae (Aguirre de Carcer 2018). It is no surprise these bacterial families help make up the core; their connection to intestinal conditions such as colon cancer and irritable bowel syndrome is the subject of numerous scientific investigations. Though inconsistency in sampling or storage protocols might result in conflicting genus- and species-level assignments when comparing data from multiple studies (Panek et al. 2018), the commonality of these two families prevails across all three datasets analyzed.
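A figure such as “an average of 65.6% of the sequences per sample” is a per-sample fraction averaged over samples. The arithmetic can be sketched as below; the OTU labels and read counts are invented for illustration and are not data from the study, and the function name `core_fraction` is an assumption.

```python
from statistics import mean


def core_fraction(otu_counts, core_ids):
    """Fraction of one sample's reads that fall into the core OTUs."""
    total = sum(otu_counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    core = sum(n for otu, n in otu_counts.items() if otu in core_ids)
    return core / total


# Hypothetical read counts per OTU for three subjects (not real data).
table = {
    "subject_A": {"otu1": 60, "otu2": 20, "otu3": 20},
    "subject_B": {"otu1": 30, "otu2": 50, "otu4": 20},
    "subject_C": {"otu1": 45, "otu2": 5, "otu5": 50},
}
core_ids = {"otu1", "otu2"}  # OTUs present in every subject above

fractions = {s: core_fraction(counts, core_ids) for s, counts in table.items()}
average = mean(fractions.values())
print(fractions, round(average, 3))
```

Averaging the per-sample fractions, rather than pooling all reads first, keeps a deeply sequenced sample from dominating the result.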

Other core OTUs were populated by sequences associated with the families Porphyromonadaceae and Rikenellaceae, as well as Bacteroidaceae-like sequences including Prevotellaceae. These members of the Firmicutes and Bacteroidetes phyla are also present in the vast majority of human gut microbiome samples (Sekelja et al. 2011). These groups are found in samples from subjects of several races, in those consuming both Eastern and Western diets, in both sexes, and in a variety of subject ages (Aguirre de Carcer 2018). It appears, then, that these are the taxa which make up the core human gut pan-microbiome.

While Aguirre de Carcer’s work was novel in its approach to similarity grouping, there were some weaknesses. This researcher used software which predicted the functional composition of the created OTUs, and compared each predicted function to the known functionality of the taxa within each OTU. Functionality in this context refers to the ability to carry out a metabolic process. Sequences were occasionally assigned to taxa whose detected functions conflicted with their reported phylogenetic functionality, especially below the 0.90 similarity threshold. This could represent taxonomic breadth within an OTU; it could mean that the OTUs identified are the sum of diverse taxa, not limited to a single metabolic process. However, it is also possible these assignments were the result of clustering pitfalls or errors.

The author goes beyond this particular study to suggest several avenues for future investigation. Two core OTUs identified in this study, affiliated with the Subdoligranulum and Fusicatenibacter genera, lacked closely related genomes in the RefSeq collection, the library to which sequences were compared. These are taxa begging for future genome sequencing and functional composition analysis. Further research to establish the best clustering methods for preserving phylogenetic relationships among sequences should also be pursued. To this end, Aguirre de Carcer referenced a study by Ren and Wu describing software which assembles OTU, sequence, and phylogenetic data into nodes in a phylogenetic tree to produce an enhanced phylogenetic resolution of the core groups (Ren and Wu 2016). A combination of that method and Aguirre de Carcer’s trans-dataset similarity method could yield more detailed data on the core microbiome.

My Questions:  

I was impressed with Aguirre de Carcer’s humble perspective on his own study. He acknowledged his approach was analogous to that employed by Sekelja et al., mainly different in the basis for establishing similarity. I also found his acknowledgement of potential weak areas, such as accuracy of his OTUs’ predicted functionality compared to known bacterial function, to be quite encouraging for future studies.

I would like to know where the notion of assessing core OTUs based on clustering at successively lower similarity ranges originated. It makes sense to analyze the commonality of sequences across multiple datasets, and grouping by percentage of similarity seems just as logical. I think this was a novel concept, as was his means of arriving at a p-value for accuracy of detecting OTUs at each threshold.

Finally, I was intrigued with the notion of creating nodes in a phylogenetic tree based on OTUs (Ren and Wu 2016, cited in Aguirre de Carcer 2018). Given Aguirre de Carcer’s novel process in this evaluation of the core pan-microbiome, I’d value his take on possible methods of incorporating phylogenetic node-mapping into his own trans-dataset grouping method.

Further Reading:

To learn more about how the scientific community’s perspective on microbiota classification has shifted over the years, read Robert Beiko’s opinion piece in the November 2015 issue of Trends in Microbiology. Dr. Beiko provides a concise history of how genetic information has been used over the past half century, and some of the struggles facing a scientific community striving for progress amidst shifting definitions. It’s worth your time to look further into this researcher’s suggestions for how microbiome study might become more cohesive across the world.

Aguirre de Carcer’s combined re-analysis offers one new perspective on how these data might be interpreted, but his is not the first or only such meta-analysis. An excellent study was described in the April 2016 edition of Science. Dr. Gwen Falony and her team arrive at a definition of a core microbiota which emphasizes the role of host covariates in their article “Population-level analysis of gut microbiome variation.”

Including host covariates in analyses of gut microbiota is not a new concept. In 2012 a team of European researchers investigated the struggles associated with the many analysis methods for identifying microbiota and classifying their taxa. You can read about the need to include host-associated factors and to broaden funding sources in Jean-Christophe Lagier et al.’s review article, published by Frontiers in Cellular and Infection Microbiology.

If you’re interested in the investigation which prompted Daniel Aguirre de Carcer’s own meta-analysis, take a look at Monika Sekelja’s original article. The ISME Journal, published by the International Society for Microbial Ecology, is a great source for articles on the microbiota, molecular taxonomy, and all things microbiome.



  • Aguirre de Carcer, Daniel. “The human gut pan-microbiome presents a compositional core formed by discrete phylogenetic units.” Scientific Reports, vol. 8, no. 14069, Sep. 2018, doi:10.1038/s41598-018-32221-8.
  • Beiko, Robert. “Microbial Malaise: How Can We Classify the Microbiome?” Trends in Microbiology, vol. 23, no. 11, Nov. 2015, pp. 671-9, doi:10.1016/j.tim.2015.08.009.
  • Falony, Gwen, et al. “Population-level analysis of gut microbiome variation.” Science, vol. 352, no. 6285, Apr. 2016, pp. 560-4, doi:10.1126/science.aad3503.
  • Lagier, Jean-Christophe, et al. “Human gut microbiota: repertoire and variations.” Frontiers in Cellular and Infection Microbiology, vol. 2, no. 136, Nov. 2012, doi:10.3389/fcimb.2012.00136.
  • Panek, Marina, et al. “Methodology challenges in studying human gut microbiota — effects of collection, storage, DNA extraction and next generation sequencing technologies.” Scientific Reports, vol. 8, no. 5143, Mar. 2018, doi:10.1038/s41598-018-23296-4.
  • Ren, Tiantian and Wu, Martin. “PhyloCore: A phylogenetic approach to identifying core taxa in microbial communities.” Gene, vol. 593, no. 2, Nov. 2016, pp. 330-3, doi:10.1016/j.gene.2016.08.032.
  • Sekelja, Monika, et al. “Unveiling an abundant core microbiota in the human adult colon by a phylogroup-independent searching approach.” The ISME Journal, vol. 5, no. 3, Aug. 2010, doi:10.1038/ismej.2010.129.
  • Wilson, Kenneth H. and Blitchington, Rhonda B. “Human Colonic Biota Studied by Ribosomal DNA Sequence Analysis.” Applied and Environmental Microbiology, vol. 62, 1996, pp. 2273-8.