I was first introduced to the swabs-to-genomes workflow project a little over a year ago. I had just started in the Eisen lab and was looking for a project I could work on over the summer and finish before graduating in December. David described the extensive work that had gone into the undergraduate reference genome project: while it had been difficult to figure out how to do various steps, now that he knew how, it was relatively simple, and he hoped to create an explanatory online resource to help other scientists pursuing similar projects. Using what he had learned, I would go through the entire process as a guinea pig, documenting each step and troubleshooting David’s directions. Because I am technologically cursed, once we had addressed all of the problems I ran into in the bioinformatics portion of the workflow, we were confident that a literate monkey would be able to follow it as well (let me know if you have a literate monkey we can test our hypothesis on).
Three months later we had scrapped the guinea pig idea and embraced the Martha Stewart method. Because the early steps were plagued by long waits and lab-wide PCR failures (a.k.a. the normal scientific process), I used alternative samples and sequence data to complete the bioinformatics steps of the workflow, the same way Martha Stewart cooks a chicken dinner with pre-prepared ingredients. One of the most frustrating aspects of the project was the constant updating of websites and submission forms: I would finish writing up the protocol for GenBank submission, and a week later, when someone else tried to follow it, everything had changed, rendering my protocol useless.
We struggled most with finding an inexpensive (free) open-source option for analyzing Sanger data. Although Sanger sequencing has been around for almost four decades (or perhaps because of that), it was nearly impossible to find programs that did what we wanted (let you view chromatograms of ABI trace files and create consensus sequences from forward and reverse reads) and weren’t written for Windows 95. A cry for help on Twitter revealed that the majority of our collaborators used Geneious free trials, downloading the program on a new computer every time they needed to analyze another set of Sanger data. Obviously we couldn’t recommend this as a strategy, and a full version of Geneious was out of the budget. Eventually, in desperation, we asked our bioinformatician (the awesome Guillaume Jospin) to write a new script to create a consensus sequence. He wrote a script that builds a consensus sequence from two FASTQ files, and we found an online resource to convert ABI files into FASTQ files. Although we thought the chromatogram was a useful visual tool, it wasn’t essential to the workflow, and we would rather have a script that did most of what we wanted than nothing at all. Shortly afterward we discovered SeqTrace, a graphical program that did everything we wanted and even lets you batch edit the sequences!
A final challenge was species identification and tree building. Many people simply BLAST their 16S sequence to determine the species of an unknown bacterium, but that only tells you the closest match, not whether your sequence belongs in that clade. We wanted to do our best not to contribute to the mess that is the current bacterial nomenclature system. We encouraged people using the workflow to build a phylogenetic tree for species identification if the top hits of their BLAST search were not all the same species, or did not have e-values of 0.0, good query coverage, and 99 to 100% identity. We chose not to use the BLAST-generated trees for two reasons. First, due to a flawed GUI they are annoying to work with and difficult to interpret. Second, they are generated by a strictly distance-based method, whereas the program we chose (FastTree) uses a combined approach, generating a distance-based tree and then optimizing it via maximum likelihood. Although we recommend RDP for generating the alignment, RDP’s trees are difficult to manipulate and tend to crash on most computers and operating systems due to an unresolved issue with Flash.
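The decision rule above can be expressed as a small check. This is a hypothetical sketch, not part of the published workflow; the hit fields (`species`, `evalue`, `coverage`, `identity`) and the numeric thresholds are illustrative names for values you would read off a BLAST result.

```python
# Illustrative sketch of the "should I build a tree?" rule; field names
# and thresholds are assumptions, not part of the published workflow.

def needs_tree(hits, min_coverage=95.0, min_identity=99.0):
    """Return True if BLAST hits alone are too ambiguous for a species
    call and a phylogenetic tree should be built instead.

    A tree is warranted when the top hits disagree on species, or when
    any hit falls short of an e-value of 0.0, good query coverage, or
    99-100% identity.
    """
    # Top hits name more than one species: BLAST alone can't decide.
    if len({h["species"] for h in hits}) > 1:
        return True
    # Even a single-species answer needs strong support on every hit.
    return any(
        h["evalue"] > 0.0
        or h["coverage"] < min_coverage
        or h["identity"] < min_identity
        for h in hits
    )
```

When the check returns True, the workflow's route is an alignment (e.g. via RDP) followed by FastTree, rather than trusting the top BLAST hit.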
Now, over a year later, the preprint is up on PeerJ (https://peerj.com/preprints/453/) and I am working with three new undergraduate guinea pigs to test and troubleshoot the workflow, with the eventual goal of developing it into an honors laboratory course.
Madison Dunitz (@MDunitz) is a Junior Specialist in Jonathan Eisen’s lab. A recent graduate of UC Davis, she studies the microbiology of the built environment and blogs about bioinformatics at bioinformatics101.wordpress.com.