New microbiome tools just keep coming – fun times – hard to keep up

So many new tools and methods keep appearing in microbiome and microbial community studies that it is really hard to keep up with them. Here are some that have caught my eye recently:

PLOS ONE: IM-TORNADO: A Tool for Comparison of 16S Reads from Paired-End Libraries.

Jeraldo P, Kalari K, Chen X, Bhavsar J, Mangalam A, et al. (2014) IM-TORNADO: A Tool for Comparison of 16S Reads from Paired-End Libraries. PLoS ONE 9(12): e114804. doi:10.1371/journal.pone.0114804


16S rDNA hypervariable tag sequencing has become the de facto method for accessing microbial diversity. Illumina paired-end sequencing, which produces two separate reads for each DNA fragment, has become the platform of choice for this application. However, when the two reads do not overlap, existing computational pipelines analyze data from each read separately and underutilize the information contained in the paired-end reads.


We created a workflow known as Illinois Mayo Taxon Organization from RNA Dataset Operations (IM-TORNADO) for processing non-overlapping reads while retaining maximal information content. Using synthetic mock datasets, we show that the use of both reads produced answers with greater correlation to those from full length 16S rDNA when looking at taxonomy, phylogeny, and beta-diversity.

Availability and Implementation

IM-TORNADO is freely available and produces BIOM format output for cross compatibility with other pipelines such as QIIME, mothur, and phyloseq.

Update on RefSeq microbial genomes resources

NCBI RefSeq genome collection represents all three major domains of life: Eukarya, Bacteria and Archaea as well as Viruses. Prokaryotic genome sequences are the most rapidly growing part of the collection. During the year of 2014 more than 10 000 microbial genome assemblies have been publicly released bringing the total number of prokaryotic genomes close to 30 000. We continue to improve the quality and usability of the microbial genome resources by providing easy access to the data and the results of the pre-computed analysis, and improving analysis and visualization tools. A number of improvements have been incorporated into the Prokaryotic Genome Annotation Pipeline. Several new features have been added to RefSeq prokaryotic genomes data processing pipeline including the calculation of genome groups (clades) and the optimization of protein clusters generation using pan-genome approach.

Standards in Genomic Sciences | Full text | Quality Scores for 32,000 Genomes

More than 80% of the microbial genomes in GenBank are of ‘draft’ quality (12,553 draft vs. 2,679 finished, as of October, 2013). We have examined all the microbial DNA sequences available for complete, draft, and Sequence Read Archive genomes in GenBank as well as three other major public databases, and assigned quality scores for more than 30,000 prokaryotic genome sequences.

Scores were assigned using four categories: the completeness of the assembly, the presence of full-length rRNA genes, tRNA composition and the presence of a set of 102 conserved genes in prokaryotes. Most (~88%) of the genomes had quality scores of 0.8 or better and can be safely used for standard comparative genomics analysis. We compared genomes across factors that may influence the score. We found that although sequencing depth coverage of over 100x did not ensure a better score, sequencing read length was a better indicator of sequencing quality. With few exceptions, most of the 30,000 genomes have nearly all the 102 essential genes.

The score can be used to set thresholds for screening data when analyzing “all published genomes” and reference data is either not available or not applicable. The scores highlighted organisms for which commonly used tools do not perform well. This information can be used to improve tools and to serve a broad group of users as more diverse organisms are sequenced. Unexpectedly, the comparison of predicted tRNAs across 15,000 high quality genomes showed that anticodons beginning with an ‘A’ (codons ending with a ‘U’) are almost non-existent, with the exception of one arginine codon (CGU); this has been noted previously in the literature for a few genomes, but not with the depth found here.
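The pairing between "anticodons beginning with an ‘A’" and "codons ending with a ‘U’" follows from antiparallel base pairing: the first anticodon base sits opposite the third codon base. Here is a minimal sketch of that mapping, assuming strict Watson-Crick pairing only (it ignores wobble and modified bases); the function name is hypothetical and not taken from the paper.

```python
# Illustrative sketch only: strict Watson-Crick pairing, no wobble rules.
# The function name is hypothetical, not from the paper.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def codon_for_anticodon(anticodon: str) -> str:
    """Return the codon (5'->3') that pairs with an anticodon (5'->3')."""
    # Accept DNA-style input by converting T to U.
    rna = anticodon.upper().replace("T", "U")
    # Pairing is antiparallel, so complement each base and reverse the order.
    return "".join(COMPLEMENT[base] for base in reversed(rna))


# The arginine exception mentioned above: anticodon ACG pairs with codon CGU.
print(codon_for_anticodon("ACG"))
```

Any anticodon whose first base is A gives a codon whose last base is U under this rule.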

rrnDB: improved tools for interpreting rRNA gene abundance in bacteria and archaea and a new foundation for future development


Microbiologists utilize ribosomal RNA genes as molecular markers of taxonomy in surveys of microbial communities. rRNA genes are often co-located as part of an rrn operon, and multiple copies of this operon are present in genomes across the microbial tree of life. rrn copy number variability provides valuable insight into microbial life history, but introduces systematic bias when measuring community composition in molecular surveys. Here we present an update to the ribosomal RNA operon copy number database (rrnDB), a publicly available, curated resource for copy number information for bacteria and archaea. The redesigned rrnDB brings a substantial increase in the number of genomes described, improved curation, mapping of genomes to both NCBI and RDP taxonomies, and refined tools for querying and analyzing these data. With these changes, the rrnDB is better positioned to remain a comprehensive resource under the torrent of microbial genome sequencing. The enhanced rrnDB will contribute to the analysis of molecular surveys and to research linking genomic characteristics to life history.
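To make the copy number bias concrete, here is a minimal sketch of one common correction: divide each taxon's 16S read count by its rrn copy number, then renormalize to relative abundances. The function name, taxon labels, counts, and copy numbers are all hypothetical placeholders, not values from rrnDB, and this is not necessarily how the rrnDB tools do it.

```python
# Illustrative sketch only: names and numbers are made up, not from rrnDB.
def correct_for_copy_number(read_counts, copy_numbers, default_copies=1.0):
    """Convert 16S read counts to copy-number-adjusted relative abundances.

    read_counts:   dict of taxon -> number of 16S reads
    copy_numbers:  dict of taxon -> rrn operon copies per genome
    default_copies: assumed copy number for taxa missing from copy_numbers
    """
    adjusted = {}
    for taxon, reads in read_counts.items():
        copies = copy_numbers.get(taxon, default_copies)
        if copies <= 0:
            raise ValueError(f"copy number for {taxon} must be positive")
        # Reads per operon copy approximates the number of genomes sampled.
        adjusted[taxon] = reads / copies
    total = sum(adjusted.values())
    if total == 0:
        return {taxon: 0.0 for taxon in adjusted}
    return {taxon: value / total for taxon, value in adjusted.items()}


# Hypothetical community: taxon_A has 7 rrn copies, taxon_B has 1.
reads = {"taxon_A": 700, "taxon_B": 100}
copies = {"taxon_A": 7, "taxon_B": 1}
print(correct_for_copy_number(reads, copies))
```

In this toy case the raw reads suggest taxon_A is seven times more abundant, while the adjusted estimate puts the two taxa at equal abundance.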

A straightforward and efficient analytical pipeline for metaproteome characterization. Tanca A, Palomba A, Pisanu S, Deligios M, Fraumene C, Manghina V, Pagnozzi D, Addis MF, Uzzau S. Microbiome. 2014 Dec 10;2(1):49. doi: 10.1186/s40168-014-0049-2. eCollection 2014.

The massive characterization of host-associated and environmental microbial communities has represented a real breakthrough in the life sciences in the last years. In this context, metaproteomics specifically enables the transition from assessing the genomic potential to actually measuring the functional expression of a microbiome. However, significant research efforts are still required to develop analysis pipelines optimized for metaproteome characterization.
This work presents an efficient analytical pipeline for shotgun metaproteomic analysis, combining bead-beating/freeze-thawing for protein extraction, filter-aided sample preparation for cleanup and digestion, and single-run liquid chromatography-tandem mass spectrometry for peptide separation and identification. The overall procedure is more time-effective and less labor-intensive when compared to state-of-the-art metaproteomic techniques. The pipeline was first evaluated using mock microbial mixtures containing different types of bacteria and yeasts, enabling the identification of up to over 15,000 non-redundant peptide sequences per run with a linear dynamic range from 10^4 to 10^8 colony-forming units. The pipeline was then applied to the mouse fecal metaproteome, leading to the overall identification of over 13,000 non-redundant microbial peptides with a false discovery rate of <1%, belonging to over 600 different microbial species and 250 functionally relevant protein families. An extensive mapping of the main microbial metabolic pathways actively functioning in the gut microbiome was also achieved.
The analytical pipeline presented here may be successfully used for the in-depth and time-effective characterization of complex microbial communities, such as the gut microbiome, and represents a useful tool for the microbiome research community.

Your Wild Life Releases Home Microbiome Data Set for Visualization

Lang JM, Eisen JA, Zivkovic AM. (2014) The microbes we eat: abundance and taxonomy of microbes consumed in a day’s worth of meals for three diet types. PeerJ 2:e659

The microbes we eat: abundance and taxonomy of microbes consumed in a day’s worth of meals for three diet types

Far more attention has been paid to the microbes in our feces than the microbes in our food. Research efforts dedicated to the microbes that we eat have historically been focused on a fairly narrow range of species, namely those which cause disease and those which are thought to confer some “probiotic” health benefit. Little is known about the effects of ingested microbial communities that are present in typical American diets, and even the basic questions of which microbes, how many of them, and how much they vary from diet to diet and meal to meal, have not been answered.

We characterized the microbiota of three different dietary patterns in order to estimate: the average total amount of daily microbes ingested via food and beverages, and their composition in three daily meal plans representing three different dietary patterns. The three dietary patterns analyzed were: (1) the Average American (AMERICAN): focused on convenience foods, (2) USDA recommended (USDA): emphasizing fruits and vegetables, lean meat, dairy, and whole grains, and (3) Vegan (VEGAN): excluding all animal products. Meals were prepared in a home kitchen or purchased at restaurants and blended, followed by microbial analysis including aerobic, anaerobic, yeast and mold plate counts as well as 16S rRNA PCR survey analysis.

Based on plate counts, the USDA meal plan had the highest total amount of microbes at 1.3 × 10^9 CFU per day, followed by the VEGAN meal plan and the AMERICAN meal plan at 6 × 10^6 and 1.4 × 10^6 CFU per day respectively. There was no significant difference in diversity among the three dietary patterns. Individual meals clustered based on taxonomic composition independent of dietary pattern. For example, meals that were abundant in Lactic Acid Bacteria were from all three dietary patterns. Some taxonomic groups were correlated with the nutritional content of the meals. Predictive metagenome analysis using PICRUSt indicated differences in some functional KEGG categories across the three dietary patterns and for meals clustered based on whether they were raw or cooked.

Further studies are needed to determine the impact of ingested microbes on the intestinal microbiota, the extent of variation across foods, meals and diets, and the extent to which dietary microbes may impact human health. The answers to these questions will reveal whether dietary microbes, beyond probiotics taken as supplements–i.e., ingested with food–are important contributors to the composition, inter-individual variation, and function of our gut microbiota.

Weighted Statistic Binning: enabling statistically consistent genome-scale Phylogenetic Analyses by M Bayzid, S Mirarab, T Warnow

Because biological processes can make different loci have different evolutionary histories, species tree estimation requires multiple loci from across the genome. While many processes can result in discord between gene trees and species trees, incomplete lineage sorting (ILS), modeled by the multi-species coalescent, is considered to be a dominant cause for gene tree heterogeneity. Coalescent-based methods have been developed to estimate species trees, many of which operate by combining estimated gene trees, and so are called summary methods. Because summary methods are generally fast, they have become very popular techniques for estimating species trees from multiple loci. However, recent studies have established that summary methods can have reduced accuracy in the presence of gene tree estimation error, and also that many biological datasets have substantial gene tree estimation error, so that summary methods may not be highly accurate on biologically realistic conditions. Mirarab et al. (Science 2014) presented the statistical binning technique to improve gene tree estimation in multi-locus analyses, and showed that it improved the accuracy of MP-EST, one of the most popular coalescent-based summary methods. Statistical binning, which uses a simple statistical test for combinability and then uses the larger sets of genes to re-calculate gene trees, has good empirical performance, but using statistical binning within a phylogenomics pipeline does not have the desirable property of being statistically consistent. We show that weighting the recalculated gene trees by the bin sizes makes statistical binning statistically consistent under the multispecies coalescent, and maintains the good empirical performance. Thus, “weighted statistical binning” enables highly accurate genome-scale species tree estimation, and is also statistically consistent under the multi-species coalescent model.
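One simple way to picture "weighting the recalculated gene trees by the bin sizes" is to repeat each bin's recalculated tree once per gene in that bin before handing the trees to a summary method. The sketch below assumes that reading; the function name, gene names, and Newick strings are hypothetical, and it does not perform the binning test or the tree estimation.

```python
# Illustrative sketch only: assumes weighting means repeating each bin's
# recalculated tree once per gene in the bin. All names are hypothetical.
def weight_supergene_trees(bins):
    """Expand per-bin trees into a bin-size-weighted list of trees.

    bins: list of (newick_tree, gene_names) pairs, one pair per bin, where
          newick_tree was recalculated from all genes in that bin.
    """
    weighted = []
    for newick_tree, gene_names in bins:
        # A bin holding k genes contributes k copies of its tree.
        weighted.extend([newick_tree] * len(gene_names))
    return weighted


# Two hypothetical bins: one with three genes, one with a single gene.
bins = [
    ("((A,B),C);", ["gene1", "gene2", "gene3"]),
    ("((A,C),B);", ["gene4"]),
]
for tree in weight_supergene_trees(bins):
    print(tree)
```

With this weighting the summary method sees one tree per original gene, so a large bin is not reduced to a single vote.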

Standards in Genomic Sciences: New beginnings to reflect the association between the journal and BMC


Jonathan Eisen

I am an evolutionary biologist and a Professor at U. C. Davis. My lab is in the UC Davis Genome Center and I hold appointments in the Department of Medical Microbiology and Immunology in the School of Medicine and the Department of Evolution and Ecology in the College of Biological Sciences. My research focuses on the origin of novelty (how new processes and functions originate). To study this I focus on sequencing and analyzing genomes of organisms, especially microbes, and on using phylogenomic analysis (my lab site has more information on lab activities). In addition to research, I am heavily involved in the Open Access publishing and Open Science movements.