Of possible interest – DAS Tool for metagenomic binning

Saw an interesting Tweet

And this tool may be of interest – it is from a new preprint on bioRxiv. See abstract below:

Microbial communities are critical to ecosystem function. A key objective of metagenomic studies is to analyse organism-specific metabolic pathways and reconstruct community interaction networks. This requires accurate assignment of assembled genome fragments to genomes. Existing binning methods often fail to reconstruct a reasonable number of genomes and report many bins of low quality and completeness. Furthermore, the performance of existing algorithms varies between samples and biotopes. Here, we present a dereplication, aggregation and scoring strategy, DAS Tool, that combines the strengths of a flexible set of established binning algorithms. DAS Tool applied to a constructed community generated more accurate bins than any automated method. Further, when applied to environmental and host-associated samples of different complexity, DAS Tool recovered substantially more near-complete genomes, including novel lineages, than any single binning method alone. The ability to reconstruct many near-complete genomes from metagenomics data will greatly advance genome-centric analyses of ecosystems.

Source: Recovery of genomes from metagenomes via a dereplication, aggregation, and scoring strategy | bioRxiv

Are these microbes the “same”?

There are a number of cases where determining the relationship between microbes is at the center of a research question. Are the microbes inhabiting a building the same as those inhabiting its tenants? Are the microbes in a hospital room the same as those that colonize newborn babies? Is the E. coli living on a wood surface the same as the E. coli living on a plastic surface?

The most common metric for comparing sequenced microbial genomes is average nucleotide identity (ANI) [1]. The basic idea is to align two genomes and count the number of mismatches in the alignment. Genomes with an ANI of 99% have one mismatch between them every hundred bases, whereas genomes with an ANI of 95% have five mismatches between them every hundred bases, and so on. There are numerous methods to calculate average nucleotide identity, with the major difference being the algorithm used to align the genomes [2–4].
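As a rough illustration of the "align and count mismatches" idea, here is a minimal sketch. It assumes the two sequences have already been aligned by some other tool and that gaps are written as "-". The function name and the toy sequences are made up for this example, and real ANI tools differ in how they fragment, align, and filter genomes, so treat this as the arithmetic only.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where both sequences have a base."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gap columns; only base-to-base columns are counted
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no aligned positions to compare")
    return 100.0 * matches / compared


# Toy example: 100 aligned bases with a single mismatch at the first position.
seq_a = "ACGT" * 25
seq_b = "T" + seq_a[1:]
print(percent_identity(seq_a, seq_b))  # 99.0
```

One mismatch in one hundred compared bases gives 99%, matching the description above.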

Through calculating the ANI between genomes in a number of systems, some loose and general ANI breakpoints have been documented:

  • < 96% ANI   = Same 16S cluster (using standard 97% clustering) [5]
  • > 96% ANI   = Same bacterial species [4]
  • > 98% ANI   = Same E. coli clade [6]
  • > 98.8% ANI = Same Prochlorococcus clade [7]
  • > 99.9% ANI = Same K. pneumoniae outbreak strain [8]

At which ANI threshold it becomes appropriate to call genomes the “same” depends on the research question. If the question is whether the microbes in an office in Flagstaff are the same as those in an office in San Diego, two microbes of the same species should probably be considered the “same,” and thus an ANI of 95% (or 16S sequencing) would adequately address the question (and it did; Chase, 2016 [9]). If the question is whether microbes in two different body sites came from the same source, 95% ANI is too low. Just because E. coli is on two body sites doesn’t mean both strains came from the same place; one strain could have come from the soil and the other strain from the neighbor next door. An ANI above 95% is definitely needed to show both strains come from the same source, but how high an ANI is needed is another question (99.9% ANI was used to address this in a recent publication; Olm, 2016 [10]).

When picking an ANI threshold for a specific question, it is often helpful to visualize the relationship between the genomes. dRep, a Python program recently published on bioRxiv [11], was written to do just that. For example:


The figure above shows the ANI between strains of Streptomyces inhabiting different babies in the same NICU in Pittsburgh. From the figure, you can see that the ANI between conN3_174_037G1_concoct_13 and conN1_023_029G1_concoct_18 is about 99.25%, the ANI between conN3_174_023G1_concoct_19 and conN3_174_021G1_concoct_4 is about 100%, and so on. The figure also makes it clear that different ANI thresholds will result in different conclusions about which babies have the “same” strains. For example, calling genomes the “same” if their ANI is >= 98.5% (as shown at the dotted black line) will result in the conclusion that there is only one single strain of Streptomyces that all babies share. Calling genomes the “same” if their ANI is >= 99.5% (as shown at the dotted red line) will result in the conclusion that there are 5 different strains of Streptomyces, two of which (conN3_174_037G1_concoct_13 and conN1_023_029G1_concoct_18) are only in one infant. In this example, changing the ANI threshold by a single percentage point completely altered the conclusions drawn from the data, highlighting the importance of selecting a threshold carefully.
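The effect of the threshold can be sketched on made-up numbers. The snippet below is only an illustration: the genome names and ANI values are hypothetical (they are not the values in the figure), and it groups genomes by simple single-linkage (any pair at or above the threshold ends up in the same group), which is an assumption for this sketch and not a description of how dRep clusters.

```python
from itertools import combinations

# Hypothetical pairwise ANI values (percent) for four made-up genomes.
names = ["genome_A", "genome_B", "genome_C", "genome_D"]
ani = {
    frozenset(("genome_A", "genome_B")): 99.9,
    frozenset(("genome_C", "genome_D")): 99.8,
    frozenset(("genome_A", "genome_C")): 99.0,
    frozenset(("genome_A", "genome_D")): 98.9,
    frozenset(("genome_B", "genome_C")): 99.1,
    frozenset(("genome_B", "genome_D")): 98.8,
}


def count_clusters(names, ani, threshold):
    """Count single-linkage groups where linked pairs have ANI >= threshold."""
    parent = {n: n for n in names}

    def find(x):
        # Follow parent links to the group representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if ani[frozenset((a, b))] >= threshold:
            parent[find(a)] = find(b)  # merge the two groups
    return len({find(n) for n in names})


for threshold in (98.5, 99.5):
    print(threshold, count_clusters(names, ani, threshold))
```

With these toy values, a 98.5% threshold puts all four genomes in one group, while 99.5% splits them into two, which is the same kind of shift in conclusions described above.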

dRep, the program used to compute the ANI and generate the above figure, was recently published on bioRxiv [11]. Documentation is available on ReadTheDocs, and the source code is available on GitHub. dRep cannot tell you which ANI threshold is appropriate for your specific application, but it can produce figures like the one shown above to help guide the decision.


1. Konstantinidis, K. T., Ramette, A. & Tiedje, J. M. The bacterial species definition in the genomic era. Philos. Trans. R. Soc. B Biol. Sci. 361, 1929–1940 (2006).

2. Goris, J. et al. DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. Int. J. Syst. Evol. Microbiol. 57, 81–91 (2007).

3. Richter, M. & Rosselló-Móra, R. Shifting the genomic gold standard for the prokaryotic species definition. Proc. Natl. Acad. Sci. 106, 19126–19131 (2009).

4. Varghese, N. J. et al. Microbial species delineation using whole genome sequences. Nucleic Acids Res. 43, 6761–6771 (2015).

5. Kim, M., Oh, H.-S., Park, S.-C. & Chun, J. Towards a taxonomic coherence between average nucleotide identity and 16S rRNA gene sequence similarity for species demarcation of prokaryotes. Int. J. Syst. Evol. Microbiol. 64, 346–351 (2014).

6. Luo, C. et al. Genome sequencing of environmental Escherichia coli expands understanding of the ecology and speciation of the model bacterial species. Proc. Natl. Acad. Sci. 108, 7200–7205 (2011).

7. Kashtan, N. et al. Single-cell genomics reveals hundreds of coexisting subpopulations in wild Prochlorococcus. Science 344, 416–420 (2014).

8. Snitkin, E. S. et al. Tracking a hospital outbreak of carbapenem-resistant Klebsiella pneumoniae with whole-genome sequencing. Sci. Transl. Med. 4, 148ra116 (2012).

9. Chase, J. et al. Geography and Location Are the Primary Drivers of Office Microbiome Composition. mSystems 1, (2016).

10. Olm, M. R. et al. Identical bacterial populations colonize premature infant gut, skin, and oral microbiomes and exhibit different in situ growth rates. Genome Res. gr-213256 (2017).

11. Olm, M. R., Brown, C. T., Brooks, B. & Banfield, J. F. dRep: A tool for fast and accurate genome de-replication that enables tracking of microbial genotypes and improved genome recovery from metagenomes. bioRxiv (2017). doi:10.1101/108142



From genomes to phenotypes: Traitar, the microbial trait analyzer

There is an increasing number of studies with a large number of genomes recovered from isolate, metagenome, or single-cell sequencing. To bridge the gap between the available genome sequences and available phenotype information, we have developed Traitar, a bioinformatics software to phenotype bacteria based on their genome sequence (see workflow below). Traitar includes phenotype models for predicting 67 traits such as the use of different substrates as carbon and energy sources, oxygen requirement, morphology, and antibiotic susceptibility, and it provides the means to inspect the protein families (Pfams) that gave rise to these phenotype predictions.


In a paper recently published in mSystems (https://doi.org/10.1128/mSystems.00101-16), we describe the application of Traitar to two novel Clostridiales species with partial genomes recovered from metagenome shotgun sequencing of commercial biogas reactors. Traitar could verify an expert metabolic reconstruction and furthermore pinpoint additional traits that were missing in the manual metabolic reconstruction.

The software is easy to install and run. It only requires a nucleotide or protein FASTA file per sample as input. Users can inspect the phenotyping results from Traitar for their genome sequences in two prediction modes (phypat and phypat+PGL) through heatmaps (see example of Traitar applied to single-assembled genomes below; Fig 5 in Traitar paper) and flat text files. For phenotyping a single genome, Traitar only requires a couple of minutes. Computation is multithreaded (parallelized) and scales to data sets with hundreds of genomes. We also offer a web service for data sets of up to around ten genomes. If you have larger data sets and trouble running the Traitar stand-alone tool, get in touch with us (contact details below).

To build and validate the phenotype models in Traitar, we have used phenotype data from the Global Infectious Disease and Epidemiology Online Network (GIDEON) and Bergey’s Systematic Bacteriology. Internally, the models were created using a machine learning method, namely an L1-regularized, L2-loss support vector machine trained on information about the presence and absence of protein families as well as ancestral protein family gains and losses.

A word of advice when applying Traitar for phenotyping your genomes:
The training data from GIDEON and Bergey’s do not cover all known bacterial taxa, and some taxa are covered by more data than others. Thus, some of the phenotypes might be realized with different protein families in taxa that are less well represented here, and classification accuracy for these taxa may be lower than for others. Since Traitar provides the Pfam families responsible for your phenotype prediction, you could cross-reference the phenotypes predicted by Traitar and the associated protein families with a targeted metabolic reconstruction approach.

We are currently working on incorporating new phenotypes and on further extending the existing phenotype models. For instance, we will apply Traitar to several hundred isolate genomes of the pathogen Pseudomonas aeruginosa to learn phenotype models of antibiotic resistance. We will keep updating the software and models, so please check our GitHub or Twitter regularly. Traitar is designed to easily incorporate new prediction models. If you have data for phenotypes of interest, please get in touch with us. We’re also preparing a stand-alone software to allow users to train their own phenotype models.

Aaron Weimann: @aaron_weimann
Andreas Bremges: @abremges
Alice C. McHardy: @alicecarolyn

GitHub: https://github.com/hzi-bifo/traitar
Web service: https://research.bifo.helmholtz-hzi.de/webapps/wa-webservice/pipe.php?pr=traitar
Web service and general BIFO software support: bifo-software@helmholtz-hzi.de

16S rRNA Sequencing Using the Oxford Nanopore MinION

There has been some interest in our recent preprint describing Oxford Nanopore MinION™ sequencing for 16S rRNA microbiome characterization and I was asked to write a post for microBEnet on this technology. Disclaimers – this is a work in progress: our paper has not yet been peer-reviewed and we are continuing to revise our work and conduct additional experiments based on feedback. We welcome feedback as we continue to work on this topic. I was also part of the Oxford Nanopore ‘Early Access Program’ and was able to access the platform before release at a discounted cost. It is worth noting others are working on this topic as well: Benítez-Páez et al., Shin et al., and Mitsuhashi et al.

The MinION platform (pictured below) is a portable DNA sequencing platform that is capable of providing long read sequence data in near real time. Additional information is on the Oxford Nanopore website: https://nanoporetech.com/ and many academic reviews exist for specific applications. The flowcell can be washed after each run and reused multiple times with decreasing sequence output. Currently, a flowcell is ~$900 and can be used at least 6 times for relatively short runs. The primary drawbacks of this technology are a relatively high error rate (~8% per-base error rate) and higher per-base costs than other current technologies (e.g. Illumina).
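To give a feel for what a ~8% per-base error rate means for amplicon work, here is a back-of-the-envelope sketch. It assumes errors are independent substitutions that are equally likely to produce any of the three wrong bases; that is a simplification made only for this example (real nanopore errors also include insertions and deletions), and the function name is made up. Under that assumption, it estimates how similar two reads of the very same 16S molecule would look to each other, and compares that with the 97% similarity level commonly used for OTU clustering.

```python
def expected_pairwise_identity(error_rate: float) -> float:
    """Expected identity between two independent reads of the same template.

    Assumes substitution-only errors, independent between reads and positions.
    """
    both_correct = (1.0 - error_rate) ** 2
    # Both reads wrong at a position, but by chance wrong in the same way
    # (each wrong read picks one of three incorrect bases uniformly).
    same_wrong_base = error_rate ** 2 / 3.0
    return both_correct + same_wrong_base


identity = expected_pairwise_identity(0.08)
print(f"expected read-to-read identity: {identity:.3f}")
print("below a 97% clustering threshold:", identity < 0.97)
```

Under these assumptions two reads from an identical template agree at only about 85% of positions, well short of 97%, so similarity-based clustering of raw reads would be expected to struggle.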

The Oxford Nanopore MinION

The appeal of using MinION for 16S rRNA sequencing is the portability, the potential to get near full-length 16S rRNA reads, and the ability to obtain sequence data rapidly (same day). The capital costs are also low (a laptop), which is a step forward in the ‘democratization of sequencing’. While there are many potential applications, some may include sample screening prior to sequencing on another platform, sequencing in the field, or sequencing in the clinic for patient monitoring. The obvious challenge is the error rate.

To initially evaluate the potential of this technology, we sequenced 16S rRNA sequences from pure-culture E. coli and P. fluorescens, as well as a low-diversity sample from hydraulic fracturing produced water that we had previously analyzed using Illumina sequencing. We actually evaluated many more samples, but were forced to exclude them due to sample carryover between washes, which I discuss below.

We attempted to cluster the pure-culture reads into Operational Taxonomic Units (OTUs) but all approaches we tried failed – using a de novo approach at the typically used 97% similarity level, >99% of reads clustered into unique OTUs. Taxonomic assignment using the Ribosomal Database Project’s Naïve Bayesian Classifier was more successful, achieving 93.8% and 82.0% annotation accuracy at the phylum and genus levels, respectively (shown below). It will be necessary to examine more diverse pure-culture samples to determine the role of database representation in annotation accuracy. Comparisons of the mixed community sample showed much higher similarity when using ‘weighted’ (abundance) measures than when using ‘unweighted’ (presence/absence) measures. Taken together, these results suggest that this approach is potentially useful for initial assessment of microbial communities and for observing broad microbial community shifts. At this stage, this approach has limited utility for ‘fine-grained’ microbial community resolution.
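The gap between ‘weighted’ and ‘unweighted’ measures can be illustrated with a small sketch. The post does not say which measures were used, so the two below (Bray-Curtis similarity on counts and Jaccard similarity on presence/absence) are stand-ins chosen for simplicity, and the taxon names and read counts are entirely hypothetical. The second profile mimics a noisy dataset in which a few reads are assigned to taxa that are not really there.

```python
def bray_curtis_similarity(a: dict, b: dict) -> float:
    """Abundance-weighted similarity: 1 - sum|a_i - b_i| / sum(a_i + b_i)."""
    taxa = set(a) | set(b)
    diff = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    total = sum(a.get(t, 0) + b.get(t, 0) for t in taxa)
    return 1.0 - diff / total


def jaccard_similarity(a: dict, b: dict) -> float:
    """Presence/absence similarity: shared taxa divided by all observed taxa."""
    present_a = {t for t, n in a.items() if n > 0}
    present_b = {t for t, n in b.items() if n > 0}
    return len(present_a & present_b) / len(present_a | present_b)


# Hypothetical read counts for the same community from two datasets.
clean_profile = {"TaxonA": 700, "TaxonB": 250, "TaxonC": 50}
noisy_profile = {
    "TaxonA": 680, "TaxonB": 240, "TaxonC": 40,
    # A handful of reads assigned to taxa absent from the clean profile.
    "Spurious1": 10, "Spurious2": 10, "Spurious3": 10, "Spurious4": 10,
}

weighted = bray_curtis_similarity(clean_profile, noisy_profile)
unweighted = jaccard_similarity(clean_profile, noisy_profile)
print(f"weighted (abundance) similarity:          {weighted:.2f}")
print(f"unweighted (presence/absence) similarity: {unweighted:.2f}")
```

In this toy case the rare spurious taxa barely move the abundance-weighted score but count as much as the dominant taxa in the presence/absence score, which is one plausible way a pattern like the one described above can arise.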

Annotation accuracy of 16S rRNA amplicons from pure-culture samples. Annotation performed by the RDP classifier against the GreenGenes database, as described in the preprint.

When analyzing pure-culture sequence data we also observed apparent between-wash carryover of ~10% of sequence reads. We have since seen some notes to this effect in the literature and a technical note by Oxford Nanopore. Those working on this technology should be aware of this phenomenon (especially if working with mixed-culture samples where the carryover would be less obvious). Improved between-run washing or sample barcoding should help to alleviate this challenge.

Dr. Kyle Bibby is an Assistant Professor in Civil and Environmental Engineering at the University of Pittsburgh. You can find him on Twitter @kylejbibby

Edit 1/30/17 to fix typo on weighted vs. unweighted measures.

What kind of DNA lingers on ATM keypads? Your food, your skin microbes…and (maybe) parasites

Amidst the November/December holiday chaos, my co-authors and I were proud to witness the publication of a neat new paper focused on ATM keypads in New York City. Yes, just like all other surfaces in the Built Environment, those ATM keypads are harboring lots of microbes and bits of orphaned DNA!

This ATM keypad study was work that I carried out during my previous position in Jane Carlton’s lab in the Center for Genomics & Systems Biology at New York University. The project was a collaboration between the Carlton lab and the lab of Maria Gloria Dominguez-Bello (an associate professor in New York University School of Medicine’s Human Microbiome Program). The research was partially funded by a New York University Grand Challenge project called “Microbes, Sewage, Health and Disease: Mapping the New York City Metagenome” and a grant from the Alfred P. Sloan Foundation.

This study aimed to carry out a baseline assessment of the microbes found on ATM keypads, looking at both prokaryotic (bacteria/archaea) and eukaryotic microbes using 16S rRNA and 18S rRNA amplicon sequencing, respectively. Call it “exploratory research”  – we weren’t looking at any patterns in particular, because we weren’t sure what we were going to find. However, ATM keypads are an interesting micro-habitat, since they can be considered a highly trafficked surface in the Built Environment (think of how many thousands of people are probably using an average bank ATM every day in Manhattan – lots of fingers pressing those keys at all hours of the day!).

With that goal in mind, during June and July 2014 we used cotton swabs to sample microbes from 66 ATM keypads in eight neighborhoods across three New York City boroughs (Manhattan, Queens, and Brooklyn).

NYC neighborhoods sampled during this ATM study, with associated demographic information obtained from the NYC Open Data portal

What did we find? NOTHING. That is to say – we didn’t find that microbial communities lumped together in any obvious way according to NYC geography (neighborhood or borough). We also tried to correlate microbial community patterns with population demographics obtained from the NYC Open data portal – things like age group, predominant ethnicity, etc. in each neighborhood – but again, no strong patterns there.

Microbial communities in NYC did not cluster according to neighborhood, borough, site type, or any other metadata category. Patterns were consistent across 16S and 18S data

The most interesting things were the expected patterns and associations that were neat to actually see in our dataset. First, many microbes on ATM keypads seemed to be derived from the human skin microbiome – ATM microbes were similar to those found on household surfaces such as pillowcases and TVs (surfaces which probably amalgamate an “average” of skin microbes from everyone living in the house). Other microbes on ATM keypads were similar to those in outdoor air (perhaps representing dust/pollen/airborne microbes settling on ATM keypads) and on restroom surfaces (don’t think about that one too much…)

“Sources” of ATM microbes – household surfaces, outdoor air, and restrooms

Second, in some neighborhoods we found traces of food species (chicken and seafood DNA), indicating a “microbial echo” of a person’s last meal. You might not think about it much, but when you handle food, bits of DNA and cells are most likely sloughing off and sticking to your hands – so people in busy NYC may eat their meals on the go (perhaps without washing their hands) and then use an ATM keypad and transfer the food DNA onto the buttons.

Microbial taxa that were enriched (read: more abundant) on some ATM keypads: chicken, seafood, and an “extreme” mold species. Figure generated using LDA Effect Size (LEfSe) analysis.

Finally, we also found subtle patterns and exciting taxa in our ATM dataset, all of which are interesting topics for further research. Some ATMs showed higher abundances of specific microbial species, such as the “extreme” mold Xeromyces bisporus (the “enrichment” of this microbe was statistically significant in certain neighborhoods – see above figure), which could potentially be used as a biomarker for fungal species associated with baked goods. X. bisporus seems to be able to tolerate low water availability and live in comparatively harsh habitats such as sugary/processed foods, which is why we can consider it an extremophile. On other ATMs we also found sequences representing putatively parasitic taxa – Trichomonas, a genus that includes the cause of a sexually transmitted infection in humans, and Toxoplasma, the infamous parasite found in cats. While we can’t 100% confirm that these parasites linger on ATM keypads (the rRNA gene region we used does not allow us to definitively separate these parasites from other closely related protist species), these tantalizing data set the stage for future (more focused) studies of the “urban microbiome”.

If you want to hear more about this study (and more about urban microbes in general), have a listen to this radio interview I did for The Innovation Hub (a WGBH/PRI show), as part of an episode focused on “The Future of Cities”.


Bik HM, Maritz JM, Luong A, Shin H, Dominguez-Bello MG, Carlton JM (2016)
Microbial Community Patterns Associated with Automated Teller Machine Keypads in New York City mSphere, 1(6) e00226-16; DOI: 10.1128/mSphere.00226-16

Wrap up of #PSB17 – Pacific Symposium on Biocomputing

Just got back from the Pacific Symposium on Biocomputing. The meeting had some aspects that may be of interest to various folks.

Electronic proceedings of the meeting are here.

I talked in a session on functional predictions.  I was asked to do this at the last minute so I made my slides by hand. Here they are.


I also recorded audio of my talk. Have not synched it to the slides but this may be of interest to some.


And I made a Storify of Tweets related to the meeting:


Multiple Positions Open at University of Oregon BioBE Center

Kevin Van Den Wymelenberg and Jessica Green, of the Biology and the Built Environment Center (BioBE), are currently seeking a microbial ecology Research Associate / Research Assistant Professor / Research Associate Professor (non-tenure track faculty) to investigate fundamental questions surrounding the role of microorganisms (bacteria, archaea, fungi, protists, and viruses) in the built environment and in relation to human health outcomes. Applicants must have a Ph.D. in biology, bioinformatics, or a related discipline.

The ideal candidate will have a combination of domain expertise and leadership potential. With regard to domain expertise, candidates should possess a demonstrated ability to generate and interpret microbiome data. Deep knowledge in data analytics, bioinformatics, and/or clinical microbiology is highly desirable. From a leadership perspective, we are seeking candidates who: are comfortable working on multiple concurrent projects with interdisciplinary scientists comprising a diverse range of experience (undergraduate through postdoc); have demonstrated a record of scientific writing and scholarly productivity; and have a record of, or evidence of potential for, obtaining external research funding.

The successful candidate will have the ability to work with faculty, students, and industry partners from a variety of diverse backgrounds and the opportunity to creatively and independently engage in research at the BioBE Center (http://biobe.uoregon.edu/), funded by the Alfred P. Sloan Foundation, federal agencies, and members of industry.

The BioBE Center is training a new generation of innovators to study the built environment microbiome, including the diversity of microorganisms interacting with each other and with the indoor environment. The vision of this national research center is to understand buildings and urban environments as complex systems and to explore how urban, architectural, and building system (passive and active) design work to shape the microbiome, with the ultimate goal of designing healthy and sustainable buildings and cities.

For more information or to apply, see the full job post.

Fungal diversity surveys not using ITS amplicons


I was at a meeting a few weeks ago where the topic of fungal diversity surveys was discussed and many people there commented on how ITS-based surveys (one of the main approaches for culture-independent studies of fungi) had some limitations. ITS stands for the “internal transcribed spacer” and it is a region in between two rRNA genes that is highly variable and has come to be used as a means to identify fungi from sequence data. One limitation of ITS is that it is so variable that one cannot align many ITS regions to each other, and this in turn means that the best way to identify an organism from its ITS sequence is to have a reference database that has information on which ITS sequences map to which taxa. Unfortunately such reference databases are not available for many fungi of interest. There are also other limitations of ITS approaches. There are benefits to using ITS too, but that is not the issue here – the issue is that there are some disadvantages and thus people are looking for other approaches.

In my lab we have been looking into fungal diversity studies using approaches other than ITS surveys but were not convinced we had found all the examples out there of such work. So I posted to Twitter and Facebook asking for help and got some useful responses. I have collated them into a Storify. See below:

If you know of any other approaches, please share.

Wanted – recommended service providers for microbial sequencing (genomes, rRNA, metagenomes, etc)

In January 2014 I wrote this post about “microbiome” sequencing services.

I have gotten many outside requests for the following information — what places (companies, Universities, government agencies, etc) provide contract services for rRNA PCR and sequencing?

Source: Request — Information on Places that do rRNA sequencing as a service — microBEnet: the microbiology of the Built Environment network.

I am writing this new post to solicit information on this topic again.  If you have any recommendations for places that do some type of microbiome or microbe focused sequencing as a service, please post the information here.

I will compile the answers and add them to this post.