The Genomic Standards Consortium recently held its 13th Workshop from March 5-7 in Shenzhen, China, where advances in genomics research were discussed in close proximity to the massive sequencing power of BGI. As a workshop participant, I found the presentations both informative and exciting; the data spoke for themselves, and many projects reported significant biological findings. Here I provide an overview of the workshop proceedings, as seen from my seat in the audience.
The meeting’s theme, “From Genomes to Interactions to Communities to Models,” encapsulated the evolving landscape of high-throughput sequencing projects, highlighting a fundamental shift from baseline data generation towards integrated biological analyses and predictive modeling. The scope of the meeting was comprehensive: session topics ranged from behemoth model organisms (rice genomics) to poorly characterized viral taxa, and the agenda also included a session focused on the Microbiology of the Built Environment. A detailed program is archived on the GSC website, with SciVee video uploads for many of the speaker presentations.
Under the banner of the GSC, common links were highlighted between seemingly discrete projects, in the form of metadata and reporting standards driven by GSC efforts. Speakers noted that an emphasis on rich accompanying metadata will fundamentally expand the reach and utility of ongoing megasequencing projects, promoting community interactions, comparative studies and the dissemination of research products across the wider scientific community.
In addition to data standards, some common topics were echoed across GSC presentations: key insights equally applicable to, and challenges often plaguing, diverse research areas.
Critical Knowledge Gaps Remain
Rita Colwell provided a succinct overview of the most pressing “Elephants in the Room”: research areas that will require focused community efforts now that the initial frenzy of high-throughput sequencing has grown and transitioned into a maturing field. When 454 and Illumina platforms were brand new to the market, exploratory sequencing approaches could easily suffice; any data, regardless of sampling or sequencing methodology, provided stunning and novel insights. As the scientific community becomes somewhat desensitized to the data deluge, we must begin to think carefully about the impacts of study design and execution. At GSC13, Colwell urged the genomics community to focus on four key areas:
Sampling procedures. Colwell noted that we are currently looking at a “Teaspoon in the Ocean” in terms of data analyzed versus the scale of global biodiversity. For genomic studies, sample design must match the thorough methodology of classical ecology studies. Researchers must think about sampling across space and time, carefully planning how much material to collect, what sample volume is required to address the questions at hand, and how many technical and biological replicates are appropriate.
DNA extractions. The variation introduced by different DNA extraction methods has not been fully investigated. Colwell noted that the composition of environmental DNA preps (even those prepared rigorously) might not give a complete, or even accurate, picture of community assemblages. In the end, there may be no “perfect” way to handle samples; some bias is inherent to any approach. However, understanding the nature of this introduced bias will at least enable researchers to adjust analyses for accurate biological interpretation. Extrapolation is better than no information at all.
Sequencing. Colwell also stressed that variability across sequencing runs has not been adequately addressed; machine, operator, and inter-laboratory error can all contribute. Metrics such as standard deviation should be adopted to quantify variance across sequencing runs.
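As a minimal sketch of the kind of inter-run metric Colwell calls for, the snippet below computes the mean, standard deviation, and coefficient of variation of one taxon’s relative abundance across replicate sequencing runs. The run names and abundance values are invented purely for illustration, not real data:

```python
# Sketch: quantifying inter-run variability with simple summary statistics.
# The abundance values below are hypothetical, for illustration only.
from statistics import mean, stdev

# Relative abundance of one taxon measured across replicate sequencing runs
# (same library sequenced on different runs/operators/machines).
runs = {
    "run_A": 0.121,
    "run_B": 0.134,
    "run_C": 0.118,
    "run_D": 0.141,
}

values = list(runs.values())
mu = mean(values)
sd = stdev(values)  # sample standard deviation across runs
cv = sd / mu        # coefficient of variation (unitless, comparable across taxa)

print(f"mean={mu:.4f} sd={sd:.4f} cv={cv:.2%}")
```

Reporting a coefficient of variation alongside raw counts would let readers judge whether differences between samples exceed the noise introduced by the sequencing process itself.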
Data Banking. In genomic approaches, the current focus continues to fall on informatics: computational biology is an exciting way to interrogate data and represents the cutting edge of research. However, Colwell noted that while we are fixated on informatics, we are failing to devote equal energy to data storage. Data banking is a pressing concern, but who will pay for it? Colwell suggested that we consider banking source material instead of sequence data, as this might prove less expensive in the long run.
Progress and Future Directions
Aside from community challenges, GSC13 speakers relayed the status of ongoing projects and defined future research priorities. Three obvious themes emerged from the context of existing initiatives:
Project scopes shift towards ‘megagenomics’. The increasingly grand scale of new sequencing projects was immediately apparent at GSC13. Research efforts now focus on “megasequencing projects”: comparative (meta)genomics incorporating thousands or tens of thousands of samples and harnessing deep sequencing to generate trillions of base pairs on increasingly high-throughput platforms. Metadata, adherence to reporting standards, and data curation are important components of these projects; close coordination with GSC efforts looks set to establish precedents for such large-scale initiatives, providing robust models and design templates for future sequencing projects.
Lack of Reference Genomes Hinders Biological Analysis. Speaker after speaker aired a common woe: regardless of ecosystem, analyses of environmental sequences are consistently and significantly hampered by the lack of available reference data. Reference genomes are critical for training bioinformatic algorithms and for populating the comparative database resources used to assign taxonomy to unknown sequences. Patchy, sparsely populated reference databases can significantly reduce the accuracy of taxonomic or functional assignments, and commonly preclude any assignment whatsoever. Given this ongoing challenge, GSC13 highlighted some community efforts to increase the availability of reference genome data. Jun Wang explained BGI’s commitment to produce a digital library of reference genomes, with initial plans to sequence 1000 plant and animal species. The TARA oceans project is making similar inroads, aiming to build up reference databases for uncultured marine eukaryotes through single-cell genomics approaches.
The Evolution of Database Resources. A significant number of talks at GSC13 focused on database resources. A key step towards conquering the ongoing data deluge is designing effective, community-driven database resources that provide easy access to sequence data and enable large-scale comparative analyses. Speakers acknowledged that current databases are not perfect, but administrators are well aware of the challenges and priorities for designing effective portals. The need for intuitive, efficient database tools becomes ever more pressing as projects are undertaken at grander and grander scales. GSC13 highlighted progress across diverse genomic database tools, from metagenome annotation in MG-RAST to comparative fungal genomics. Jason Stajich highlighted innovative features being incorporated into FungiDB, such as word clouds for finding genes of interest and community comment features that promote user interaction.
What does this mean for microBEnet?
Given the challenges and future directions outlined at GSC13, the burgeoning research network focused on Microbiology of the Built Environment appears well poised to tackle the overarching themes outlined by the wider scientific community. Paula Olsiewski introduced the Sloan Foundation’s perspective with the succinct phrase “Do something early, that is catalytic,” noting that the first step in Sloan’s approach is to identify important problems. The bottlenecks and steep hurdles that persist in high-throughput sequencing are arguably some of the most critical issues preventing forward progress in genomics. Yet such widespread issues can be efficiently tackled by a coordinated community effort guided by central themes and core questions, such as those that underpin microBEnet.

In understudied environments such as the Built Environment, the lack of reference genomes can severely hinder our capacity to understand microbial community structure and function. A dedicated push by microBEnet could aim to establish an effective long-term genomics resource for the Built Environment, akin to the Human Microbiome Project’s effort to catalog typical human microbiota. The interdisciplinary nature of microBEnet can also accelerate progress through unique collaborations: Olsiewski urged the community to forge new, cross-discipline partnerships, hinting at transformative effects that may emerge from collaborations with institutions such as the National Institute of Standards and Technology (NIST).

The GSC community has witnessed significant progress since its inception. However, each meeting provides a stark (yet optimistic) reminder that we are still working diligently to overcome many formidable challenges.