December Mothur Workshop, part 2

Here are my notes from day 2 of the Mothur workshop taught by Pat Schloss (pdschloss at in December 2015. For those who are interested in learning to use Mothur for microbiome studies, Pat will be teaching another one in February.

Mothur is better for bacterial characterization than eukaryotes because the sequences are aligned before OTU clustering. This approach doesn’t work as well for 18S sequence data because the databases are so sparse.

It goes without saying that many pathogens (including anthrax and plague) cannot be identified based on sequencing 16S rRNA alone.

Questions that might be interesting for the community to discuss include: do we believe in OTU assignments and comparisons of different approaches in OTU clustering, such as open reference, closed reference, furthest neighbor and average neighbor.










Day 2:

We talked about:
16S has secondary structure
Need to make sure motifs are lined up across all our sequences
want to preserve the positioning of these motifs across sequences
–> positional homology
     overestimates similarity
     doesn’t care about secondary structure
     All vs All alignment ~N2
Multiple Sequence Alignment
     Pairwise and
     Get all PW alignments
     Merge PW alignments
     Effort ~N4
Global alignment:
Local alignment:
BLAST Basic Local Alignment Search Tool
We need a better approach for 16S data.
Profile alignments
     NAST (greengenes) used by Mothur
     Infernal (RDP)
Require reference alignment
     reflects secondary structure
Search refs for closest match to the unaligned sequence
Global PW alignment of the unaligned to the reference sequence
mer counting (k=7 used in Mothur)
Example of k=5
Compare 7mers with References, which ref shares the most kmers?
It seems kind of magical how well this works, like black magic
BLAST was slow and not as good as using kmer counting searching to find reference with closest match to reference sequence
Then use Needleman-Wunsch for alignment
Effort is proportional to the number of sequences that you have
Alignments in  (fasta)
RDP- variable – don’t try
gg – variable – look drunk
SILVA- manually curated
     50,000 columns wide
     16S 1500 bases
     16S and 18S
     Extract V4 region from the alignment
Larger: 100K
Mothur program for this is align.seqs
vary in length and sequence
can have multiple and different ITS fragments per genome
HUGE intragenomic variation
question of whether there is true homology
Could look at unique sequences
pre.cluster allows you to cluster sequences that are within a base count of each other
16S rRNA also has some intragenomic variation in different taxa
V6 evolves 4 times faster than other regions
99% similar across entire region, 3% cutoff in V6: example from Mitch Sogan study with 464 lots of variation in E. coli
16S rRNA – Why use it?
some variation
good taxonomy
not prone to HGT
Why not use 16S?
well conserved
no phenotype
not functional
multiple copies
intragenomic variation
universal (in plants)
Want sequences to overlap some alignment
* space in alignment
     removes sequences
     removes alignment positions
     vertical = trump
     trump= .  (if this column has a dot in it, it is removed) could also set it to remove columns with a –
. is missing data
– is alignment gap
trump = tidies alignment
vertical = removes columns that are 100% gap
then run unique.seqs again (though more important with 454 data than Illumina data)
Preclustering step to remove extra sequencing error
makes the assumption that we see later that high abundance readings are more reliable
Sort reads in decreasing order of abundance
Look for rarer reads that are within a threshold of the more abundant sequence
Remove rare sequence, add its counts to the abundant sequence
say differences allowed are 2 (diffs = 2)
100 reads
50 50 (within a base)
20 (within a base)
2 (within a base)
2 (within a base)
Remove the reads that are within 1 base of the 100 and add to 100 and the 20, end up with 224 instead of 100.
What do you use for number of diffs?
based on sequence length and threshold
length is 50 nt
2 diff is 4%
if length is 100
2 diff is 2%
They use 1 diff per 100 nt
For MiSeq 250, use diff=2
to stay under a threshold of 3%
0.06% TO 0.02%
not a huge reduction but still 3 fold reduction in error
PCR artifact, incomplete extension followed by heterologous priming–> chimera
annealing of a PCR fragment to another fragment that then gets completed by Taq on another fragment, ends up half from one and half from another
Use a 5 min extension time in PCR, as extension time increases, chimerism decreases (Brian Haus et al at Broad Institute)
increase DNA increased chimeras
increase PCR cycles increase chimeras
increase sheared DNA increases chimeras
      5-25% of reads were chimeras
Looked at HMP data for chimeras
26% found in 2 or more samples
14 chimeras in 20/30 libraries
–>not random
most abundant chimera appeared 30 times across the libraries
What can we do to remove chimeras?
Current tools include:
ChimeraSlayer dev by Brian Hass
uchime dev by Robert Edgar
Perseus dev by Chris Quince
All have different quirks
Some require a database (ChimeraSlayer and uchime)
but the databases are littered with chimeras
best sequences that are chimera free come from cultured bugs
For each sample:
     Sort by abundance
     Trust #1 and #2 most abundant sequences
     Ask is #3 a chimera of 1 and 2, if no then keep 3
     Continue to #4, work through whole list
                      FDR           Sensitivity (can detect chimera)         Specificity (is actually a chimera)
Slayer             2.5                      80                                                       94
Uchime           2.6                      88                                                       94
Perseus          2.4                      87                                                       93
Can run all three in Mothur and compare.
Schloss lab prefers Uchime.
Takes sequences breaks them into left and right and compares them to reference group, if similar to different places on the tree, then chimera
Challenge that we face in detecting chimeras: hard to detect them, factors influencing ability to detect them: length, similarity of parents, location of breakpoint
Seq is in ABC
Chimeric sequence is in B
What to do?
Prefer to just remove B
kmer counting
which ref to base classification on?
     k – nearest neighbor
     k=? k=1? k=10?
Probability of Genus j given our sequence
     G1  G2   G3   G4
     kmers that show up multiple times
     randomly pick some number N of khmers from query
     sample with replacement
     # taxonomy –> confidence
     also a bit of black magic
Require confidence > 80% in their classifications (default in QIIME is 50%)
All these methods can be done in a command called classified.seqs
Rarefaction controls for the same amount of junk per sample
only look at OTUs that show up in a certain number of samples
What about gaps? Need to do something with the gaps.
one approach is to ignore gaps 2/6 = 0.33
makes things with gaps look more similar to each other than they really are
count each gap  4 differences 4/8 = 0.50
one gap = 3/7 = .43
for 16S these are all pretty similar
Protein coding sequences
Correction for multiple substitutions
Use distances to calculate OTUs
set your own threshold
OTUs are not a surrogate for species
applies common threshold across sequences
slow/computationally demanding
How do OTUs relate to taxonomy?
Find many OTUs per genus
     subgenus level lineages
Don’t find multiple genera per OTU
Where did 3% come from?
Sequencing error
intragenomic variation
OTU clustering
de novo clustering
          usearch – fast
          vsearch – not great OTU assignment
furthest – everything in OTU is within threshold
nearest – everything in OTU is within threshold of at least one other
average – everything in OTU is on average less than the threshold apart
                    in same OTU               are similar to each other
True Pos      x                                   x
True Neg      0                                  0
False Pos     x                                  0
False Neg    O                                  x
Furthest neighbor has no false positives and more false negatives
Nearest neighbor has no false negatives and high number of false positives
Average has some false positives and some false negatives
Matthew’s Correlation Coefficient:
a metric that allows us to take these four parameters and weight them somewhat evenly
average neighbor is the best for MCC
average neighbor is also better than the greedy algorithms
QIIME closed and open reference:
numerous problems with reference-based approaches
search algorithm is a problem
for closed, take a sequence and find the closest reference and assign your sequence to that reference
for open, do closed reference and then with the leftover, they do usearch to cluster that to OTUs
both of these methods suck compared to average neighbors
Furthest neighbor is not a good method, get different numbers of OTUs (based on when you cluster relative to rarefaction?) see more in PeerJ paper
Can get variation in OTU assignments when you run it over and over again
But these slightly different results are all pretty good
Average neighbor is slow but better
Done some things to speed things up
don’t find OTUs that are made up of multiple genera, families
Take sequences that we’ve classified and then split by taxon and within each taxon, we then cluster
and then put data back together
     speeds things up
     uses less RAM
     can parallelize clustering steps
vsearch is free and will be brought into the next version of Mothur
for last few years average neighbor has been the default in Mothur
can do phylotyping as well as OTUs
     split sequences into OTUs by classification
     can use OTUs to classify
     then count: for each OTU can count the number of times a sequence is classified as say “T”
     use T > 50%
     then assign that taxon name to the OTU
     majority consensus classification
     alternative is to find representative sequence but this has issues
     output is *.otu.taxonomy
Today the goal is to go from aligned seqs to OTUs
Can take our list file and out group/count file and generate a shared file
the shared file is a table where the rows are samples and the columns are the OTUs and the values are the counts
also the biom format is needed for picrust and other programs
Mothur can also generate biom format
Yesterday we left of with count.seqs

Leave a Reply

Holly Ganz

Holly Ganz is a project scientist at UC Davis working with Jonathan Eisen on the microbiomes of built environments where animals live.