(Note: This post was updated on 3/26/18 after re-running some of our analysis based on suggestions in the comments and on Twitter).
We were very excited to get our MinION sequencer from Nanopore 1.5 years ago… so many potential applications: sequencing in real time, use in the classroom, use in the field, not having to wait to get bacterial genomes, etc. But one thing led to another and it languished in a drawer. Recently we got it back out and tried to sequence a bacterial genome with it. This project is being led by two undergraduates in the lab, Marcus Cohen and Dennett Rodriguez, to whom I basically gave the MinION with the instructions of “figure out how this thing works”.
First observation: This was way harder than we thought. There are many different protocols/suggestions/approaches/ideas/optional steps. After a month, we managed to generate and then basecall some sequence, then assemble it with Canu into a single contig. Cool! Then it gets complicated.
So I threw out what I thought was a simple question to Twitter: “So we just assembled our first ever @nanopore bacterial genome. Not understanding how we can get only 65% completeness (CheckM) with 100X coverage. We used an entire flowcell.”
This tweet apparently touched a nerve, starting a wide-ranging discussion about the merits of Nanopore versus Illumina versus PacBio and the utility (or not) of finished (or even decent quality) genomes. A few hundred tweets were generated by many of the experts in the field, in addition to employees of at least one of the companies. The discussion was surprisingly combative, even degenerating into name-calling at one point. However, it was extraordinarily informative for a place like our lab, where we do some, but not a ton, of this kind of work.
I couldn’t figure out a good way to collect or archive the tweets, which is a shame. But here are my take-home messages from the discussion.
-Nanopore data alone is insufficient to get a decent bacterial genome, due to a high rate of error (homopolymer indels).
-These errors cause frameshifts, which make genes look like pseudogenes and render programs like CheckM (which looks for proteins) basically useless
-PacBio data alone is better than Nanopore… some people think it is sufficient for a finished genome and others disagree and think we always need Illumina plus long reads
-Obviously Illumina data alone is insufficient for closing genomes, but for many applications is quite sufficient
-Hybrid Illumina/PacBio or Illumina/Nanopore data is clearly the best approach for getting good genomes, combining accuracy with long reads. The relative cost of these various approaches is highly debatable
-Both PacBio and Nanopore data require extensive polishing and correction to be usable alone… it seems like these workflows are much more established for PacBio to date
-If I had one bacterium that I wanted a really good genome for, I would probably do PacBio (or PacBio/Illumina if I really really cared); if I had a dozen, I would probably do Nanopore/Illumina (because of cost); if I had 100, I would just do Illumina and call it a day.
-Other good uses for Nanopore/PacBio data alone would be anything with repeats, plasmids, or CRISPR arrays
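To make the frameshift point from the list above concrete, here is a toy sketch of how dropping a single base from a homopolymer run truncates a predicted protein. The 24-base "gene" is invented purely for illustration; it is not from our genome.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Invented toy gene containing a run of six A's (a homopolymer).
original = "ATGGCAAAAAAGTTAGCTGGCTAA"

# Simulate a typical homopolymer error: the run is read as five A's.
with_deletion = original.replace("AAAAAA", "AAAAA", 1)

print(translate(original))       # MAKKLAG
print(translate(with_deletion))  # MAKS (frameshift creates an early stop)
```

One missing base shifts every downstream codon, so the gene finder sees a short fragment instead of the full protein, which is why a protein-based tool like CheckM scores the assembly as incomplete.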
So back to our bacterium. Following a blog post from Mick Watson, we went ahead and performed his test: “First predict proteins using a gene finder. Then map those proteins to UniProt (using blastp or diamond). The ratio of the query length to the length of the top hit should be a tight and normal distribution around 1.” And… ours looks terrible.
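For anyone who wants to try the same check, here is a minimal sketch of the ratio calculation. It assumes the hits were written as tab-separated text with the columns qseqid, sseqid, qlen, slen, bitscore (for example via DIAMOND's `--outfmt 6 qseqid sseqid qlen slen bitscore`); that column layout, the function names, and the example rows are our assumptions for illustration, not part of the original test or our real results.

```python
from statistics import median


def top_hit_length_ratios(lines):
    """Return {query_id: query_length / top_hit_length}.

    Assumed input: tab-separated rows of qseqid, sseqid, qlen, slen, bitscore.
    The "top hit" is taken to be the row with the highest bitscore per query.
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        qseqid, _sseqid, qlen, slen, bitscore = line.split("\t")[:5]
        score = float(bitscore)
        if qseqid not in best or score > best[qseqid][0]:
            best[qseqid] = (score, int(qlen) / int(slen))
    return {query: ratio for query, (_score, ratio) in best.items()}


def summarize(ratios, tolerance=0.1):
    """Summarize how tightly the ratios cluster around 1."""
    values = list(ratios.values())
    near_one = sum(1 for r in values if abs(r - 1.0) <= tolerance)
    return {
        "n": len(values),
        "median": median(values),
        "fraction_near_1": near_one / len(values),
    }


# Invented example rows (not real data): gene2 looks truncated.
example_rows = [
    "gene1\tP1\t300\t300\t500",
    "gene1\tP2\t300\t600\t200",
    "gene2\tP3\t120\t400\t150",
    "gene3\tP4\t95\t100\t180",
]
ratios = top_hit_length_ratios(example_rows)
print(ratios)
print(summarize(ratios))
```

In a clean assembly most ratios should sit close to 1; a pile of ratios well below 1 suggests predicted proteins cut short by frameshifts. Plotting a histogram of the values is the natural next step.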
So we’re a bit stuck for the moment. Our next steps are to try some more polishing. One of the hardest points is understanding whether our run was just poor or if we’re seeing what’s normal for the technology. We’ve also sent off more DNA for Illumina sequencing so we’ll be trying hybrid approaches next.
After the comments below and some discussion on Twitter, Guillaume re-ran Nanopolish with different parameters and we got much better results… bringing us to 93% completeness instead of 65%. Still not great for many purposes but a huge improvement!
Here’s a Wakelet of the Twitter discussion…