December Mothur Workshop, part 2

Here are my notes from day 2 of the Mothur workshop taught by Pat Schloss (pdschloss at gmail.com) in December 2015. For those who are interested in learning to use Mothur for microbiome studies, Pat will be teaching another one in February.

Mothur is better for bacterial characterization than eukaryotes because the sequences are aligned before OTU clustering. This approach doesn’t work as well for 18S sequence data because the databases are so sparse.

It goes without saying that many pathogens (including anthrax and plague) cannot be identified based on sequencing 16S rRNA alone.

Questions that might be interesting for the microbe.net community to discuss: do we believe in OTU assignments, and how do the different approaches to OTU clustering (open reference, closed reference, furthest neighbor and average neighbor) compare?

Day 2:

We talked about:

alignment

distances

classification

clustering

Alignment

16S has secondary structure

Need to make sure motifs are lined up across all our sequences

want to preserve the positioning of these motifs across sequences

--> positional homology

Pairwise

Needleman-Wunsch

Smith-Waterman

BLAST, ESPRIT, USEARCH

PAT--SCHLOSS

SARAHSCHLOSS

overestimates similarity

doesn’t care about secondary structure

All vs All alignment ~N^2
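A minimal sketch of why all-vs-all pairwise alignment scales as ~N^2: every pair of sequences needs its own Needleman-Wunsch run. The scoring values (match +1, mismatch -1, gap -1) and the function names are assumptions for illustration, not anything Mothur uses.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of a and b."""
    # prev[j] is the best score for the processed prefix of a against b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def all_vs_all(seqs):
    """Score every pair once: N * (N - 1) / 2 alignments, so work grows as ~N^2."""
    return {(i, j): needleman_wunsch_score(seqs[i], seqs[j])
            for i in range(len(seqs)) for j in range(i + 1, len(seqs))}


scores = all_vs_all(["PATSCHLOSS", "SARAHSCHLOSS", "SAMSCHLOSS"])
```

Three sequences already need three alignments; doubling the number of sequences roughly quadruples the work.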

Multiple Sequence Alignment

CLUSTAL and MUSCLE

Pairwise and

Get all PW alignments

Merge PW alignments

PAT--SCHLOSS

SARAHSCHLOSS

SAM--SCHLOSS

Effort ~N^4

Global alignment:

SAM--SCHLOSS

SARAHSCHLOSS

Local alignment:

SCHLOSS

BLAST Basic Local Alignment Search Tool

We need a better approach for 16S data.

Profile alignments

NAST (greengenes) used by Mothur

SINA (SILVA)

Infernal (RDP)

Require reference alignment

reflects secondary structure

Search refs for closest match to the unaligned sequence

Global PW alignment of the unaligned to the reference sequence

kmer counting (k=7 used in Mothur)

Example of k=5

ATGCCTTCCTGA

ATGCC

TGCCT

GCCTT

CCTTC

CTTCC

TTCCT

TCCTG

CCTGA

Compare 7mers with References, which ref shares the most kmers?

It seems kind of magical how well this works, like black magic

BLAST was slow and not as good as kmer counting for finding the reference that is the closest match to the unaligned sequence

Then use Needleman-Wunsch for alignment

Effort is proportional to the number of sequences that you have
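The kmer search above can be sketched roughly as follows. This is a toy version, assuming each reference is scored by how many distinct kmers it shares with the query; the reference names and sequences are made up, and Mothur's actual search is more involved.

```python
def kmers(seq, k):
    """Return the set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def closest_reference(query, references, k=7):
    """Return the name of the reference sharing the most k-mers with query."""
    query_kmers = kmers(query, k)
    return max(references, key=lambda name: len(query_kmers & kmers(references[name], k)))


# Hypothetical toy references; k=5 to match the example above.
refs = {"ref_a": "ATGCCTTCCTGAAC", "ref_b": "ATGGGTTAACTGAAC"}
best = closest_reference("ATGCCTTCCTGA", refs, k=5)
```

Only the single best reference then goes on to the Needleman-Wunsch step, which is why the effort grows with the number of sequences rather than with the number of pairs.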

Alignments in (fasta)

RDP- variable – don’t try

gg – variable – look drunk

SILVA- manually curated

50,000 columns wide

16S 1500 bases

16S and 18S

Extract V4 region from the alignment

SEED: 14K

Larger: 100K

Mothur program for this is align.seqs

ITS

vary in length and sequence

can have multiple and different ITS fragments per genome

HUGE intragenomic variation

question of whether there is true homology

Could look at unique sequences

pre.cluster allows you to cluster sequences that are within a base count of each other

16S rRNA also has some intragenomic variation in different taxa

V6 evolves 4 times faster than other regions

99% similar across entire region, 3% cutoff in V6: example from Mitch Sogin study with 454, lots of variation in E. coli

16S rRNA – Why use it?

universal

well-conserved

some variation

well-studied

sequenceable

good taxonomy

not prone to HGT

Why not use 16S?

well conserved

no phenotype

not functional

multiple copies

intragenomic variation

universal (in plants)

Want sequences to overlap the same region of the alignment

* space in alignment

screen.seqs

removes sequences

start

end

filter.seqs

removes alignment positions

vertical = trump

trump= . (if this column has a dot in it, it is removed) could also set it to remove columns with a -

. is missing data

- is alignment gap

trump = tidies alignment

vertical = removes columns that are 100% gap
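A rough sketch of what the two filter.seqs options do to an alignment. It assumes vertical treats both - and . as gap characters and that trump drops any column containing the trump character; the function name and toy alignment are hypothetical.

```python
def filter_columns(aligned, vertical=True, trump=None):
    """Drop alignment columns; aligned maps sequence name -> aligned string."""
    names = list(aligned)
    columns = list(zip(*(aligned[name] for name in names)))
    kept = []
    for column in columns:
        if vertical and all(ch in "-." for ch in column):
            continue  # gap or missing data in every sequence
        if trump is not None and trump in column:
            continue  # at least one sequence has the trump character here
        kept.append(column)
    rows = zip(*kept) if kept else [()] * len(names)
    return {name: "".join(row) for name, row in zip(names, rows)}


# Hypothetical toy alignment.
toy = {"s1": ".AC-GT.", "s2": "AAC-GTT", "s3": "-AC-GT-"}
vertical_only = filter_columns(toy, vertical=True)
vertical_and_trump = filter_columns(toy, vertical=True, trump=".")
```

With vertical alone only the all-gap column disappears; adding trump=. also trims the ragged ends so every sequence covers the same positions.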

then run unique.seqs again (though more important with 454 data than Illumina data)

pre.cluster

Preclustering step to remove extra sequencing error

rests on an assumption we see again later: high-abundance reads are more reliable

Sort reads in decreasing order of abundance

Look for rarer reads that are within a threshold of the more abundant sequence

Remove rare sequence, add its counts to the abundant sequence

say differences allowed are 2 (diffs = 2)

100 reads

50 50 (within a base)

20 (within a base)

2 (within a base)

Remove the reads that are within 1 base of the 100 and add to 100 and the 20, end up with 224 instead of 100.

What do you use for number of diffs?

based on sequence length and threshold

length is 50 nt

2 diff is 4%

if length is 100

2 diff is 2%

They use 1 diff per 100 nt

For MiSeq 250, use diff=2

to stay under a threshold of 3%

0.06% TO 0.02%

not a huge reduction but still 3 fold reduction in error
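The preclustering steps above can be sketched like this. It is a simplified version that assumes equal-length aligned reads and counts plain mismatches; the reads and counts are invented, and the helper names are not Mothur's.

```python
def count_diffs(a, b):
    """Number of mismatched positions between two equal-length aligned reads."""
    return sum(x != y for x, y in zip(a, b))


def pre_cluster(counts, diffs):
    """Merge rare reads into more abundant reads within `diffs` mismatches."""
    # Most abundant first; ties broken alphabetically so the result is deterministic.
    order = sorted(counts, key=lambda seq: (-counts[seq], seq))
    merged = {}
    absorbed = set()
    for i, seq in enumerate(order):
        if seq in absorbed:
            continue
        merged[seq] = counts[seq]
        for rare in order[i + 1:]:
            if rare not in absorbed and count_diffs(seq, rare) <= diffs:
                merged[seq] += counts[rare]
                absorbed.add(rare)
    return merged


def diffs_for_length(length):
    """Rule of thumb from the notes: 1 difference per 100 nt."""
    return length // 100


# Hypothetical reads and abundances.
toy = {"AAAAAAAAAA": 100, "AAAAAAAAAT": 50, "AAAAAAAATT": 20, "CCCCCCCCCC": 5}
result = pre_cluster(toy, diffs=1)
```

Here the 50-count read is one base away from the 100-count read and is folded into it, while the 20-count read is two bases away and survives at diffs=1.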

Chimeras

PCR artifact, incomplete extension followed by heterologous priming --> chimera

annealing of a PCR fragment to another fragment that then gets completed by Taq on another fragment, ends up half from one and half from another

Use a 5 min extension time in PCR; as extension time increases, chimerism decreases (Brian Haas et al. at the Broad Institute)

more DNA increases chimeras

more PCR cycles increase chimeras

more sheared DNA increases chimeras

5-25% of reads were chimeras

Looked at HMP data for chimeras

26% found in 2 or more samples

14 chimeras in 20/30 libraries

--> not random

most abundant chimera appeared 30 times across the libraries

What can we do to remove chimeras?

Current tools include:

ChimeraSlayer dev by Brian Haas

uchime dev by Robert Edgar

Perseus dev by Chris Quince

All have different quirks

Some require a database (ChimeraSlayer and uchime)

but the databases are littered with chimeras

best sequences that are chimera free come from cultured bugs

database-independent

For each sample:

Sort by abundance

Trust #1 and #2 most abundant sequences

Ask is #3 a chimera of 1 and 2, if no then keep 3

Continue to #4, work through whole list
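A toy sketch of the database-independent idea above. It assumes equal-length aligned reads, calls a read a chimera only when it is an exact left piece of one trusted read joined to the exact right piece of another, and checks each later read against all pairs of reads kept so far. Real tools such as uchime and Perseus score this far more carefully; the reads and counts are invented.

```python
def is_chimera_of(query, parent_a, parent_b):
    """True if query is the left part of one parent joined to the right part of the other."""
    if query in (parent_a, parent_b):
        return False
    for left, right in ((parent_a, parent_b), (parent_b, parent_a)):
        for cut in range(1, len(query)):
            if query[:cut] == left[:cut] and query[cut:] == right[cut:]:
                return True
    return False


def remove_chimeras(counts):
    """Walk reads from most to least abundant, keeping the non-chimeric ones."""
    order = sorted(counts, key=lambda seq: (-counts[seq], seq))
    trusted = order[:2]  # trust the two most abundant sequences
    for candidate in order[2:]:
        pairs = [(a, b) for i, a in enumerate(trusted) for b in trusted[i + 1:]]
        if not any(is_chimera_of(candidate, a, b) for a, b in pairs):
            trusted.append(candidate)
    return trusted


# Hypothetical sample: two parents, one chimera of them, one genuine rare read.
toy = {"AAAAAAAA": 100, "CCCCCCCC": 80, "AAAACCCC": 10, "GGGGGGGG": 5}
kept = remove_chimeras(toy)
```

The AAAACCCC read is dropped because it can be built from the two abundant parents, while the rare GGGGGGGG read is kept.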

Tool       FDR    Sensitivity    Specificity
Slayer     2.5    80             94
Uchime     2.6    88             94
Perseus    2.4    87             93

Sensitivity = can detect chimera; Specificity = is actually a chimera

Can run all three in Mothur and compare.

Schloss lab prefers Uchime.

Takes each sequence, breaks it into left and right pieces and compares them to a reference group; if the pieces are similar to different places on the tree, then it is a chimera

Challenge that we face in detecting chimeras: hard to detect them, factors influencing ability to detect them: length, similarity of parents, location of breakpoint

Seq is in ABC

Chimeric sequence is in B

What to do?

Prefer to just remove B

Classification

BLAST

kmer counting

which ref to base classification on?

k – nearest neighbor

k=? k=1? k=10?

Bayesian

Probability of Genus j given our sequence

G1 G2 G3 G4

kmers that show up multiple times

Bootstrapping

randomly pick some number N of kmers from query

sample with replacement

# taxonomy --> confidence

also a bit of black magic

Require confidence > 80% in their classifications (default in QIIME is 50%)

All these methods can be done in a command called classify.seqs
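A simplified sketch of how bootstrapping turns a classification into a confidence value. Note the swap: instead of the Bayesian probability, each genus is scored by how many of the sampled kmers it contains. Sampling one eighth of the query kmers per iteration, the genus names and the sequences are all assumptions for illustration.

```python
import random
from collections import Counter


def kmer_list(seq, k):
    """All overlapping k-mers in seq, duplicates included."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def best_genus(sample, genus_kmers):
    """Genus whose reference k-mers cover the most sampled k-mers (ties: alphabetical)."""
    return max(sorted(genus_kmers), key=lambda g: sum(km in genus_kmers[g] for km in sample))


def classify_with_bootstrap(query, references, k=8, iterations=100, seed=0):
    """Return (genus, confidence), confidence being the share of agreeing bootstraps."""
    rng = random.Random(seed)
    genus_kmers = {g: set(kmer_list(s, k)) for g, s in references.items()}
    query_kmers = kmer_list(query, k)
    call = best_genus(query_kmers, genus_kmers)
    n = max(1, len(query_kmers) // 8)  # assumed subsample size
    votes = Counter(
        best_genus(rng.choices(query_kmers, k=n), genus_kmers)  # sample with replacement
        for _ in range(iterations)
    )
    return call, votes[call] / iterations


# Hypothetical references, one sequence per genus.
refs = {"GenusA": "ATGCCTTCCTGAACGGTACCA", "GenusB": "TTGACCGGATTACAGGCATGA"}
genus, confidence = classify_with_bootstrap("CCTTCCTGAACGGTA", refs, k=5)
```

A classification would then be kept only if the confidence clears the chosen cutoff, such as the 80% mentioned above.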

Contaminants

chloroplasts

mitochondria

Eukarya/Archaea

Rarefaction controls for the same amount of junk per sample

only look at OTUs that show up in a certain number of samples

Gaps

What about gaps? Need to do something with the gaps.

ATGCCATG

AACC--TG

one approach is to ignore gaps 2/6 = 0.33

makes things with gaps look more similar to each other than they really are

count each gap 4 differences 4/8 = 0.50

one gap = 3/7 = .43

for 16S these are all pretty similar
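The three ways of treating gaps can be checked with a short sketch. The function name is hypothetical, and a stretch of consecutive gap positions is treated as one gap for the third option.

```python
def gap_distances(a, b, gap="-"):
    """Return (ignore_gaps, each_gap, one_gap) distances for two aligned sequences."""
    mismatches = 0     # both sequences have a base and the bases differ
    compared = 0       # both sequences have a base
    gap_positions = 0  # positions where one sequence has a gap
    gap_runs = 0       # contiguous runs of such positions
    in_run = False
    for x, y in zip(a, b):
        if x == gap and y == gap:
            continue   # shared gap columns carry no information
        if x == gap or y == gap:
            gap_positions += 1
            if not in_run:
                gap_runs += 1
            in_run = True
        else:
            in_run = False
            compared += 1
            mismatches += x != y
    ignore_gaps = mismatches / compared
    each_gap = (mismatches + gap_positions) / (compared + gap_positions)
    one_gap = (mismatches + gap_runs) / (compared + gap_runs)
    return ignore_gaps, each_gap, one_gap


ignore_gaps, each_gap, one_gap = gap_distances("ATGCCATG", "AACC--TG")
```

For the pair above this gives 2/6, 4/8 and 3/7, matching the hand calculation.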

Protein coding sequences

Correction for multiple substitutions

Jukes-Cantor
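For reference, the standard Jukes-Cantor correction takes the observed proportion of differing sites p and returns d = -3/4 ln(1 - 4p/3). A minimal sketch (the function name is just for illustration):

```python
import math


def jukes_cantor(p):
    """Correct an observed proportion of differing sites for multiple substitutions."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


corrected = jukes_cantor(0.03)
```

At small distances such as a 3% cutoff the corrected value is only slightly larger than the observed one.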

Use distances to calculate OTUs

Clustering

set your own threshold

OTUs are not a surrogate for species

reference-free

applies common threshold across sequences

slow/computationally demanding

How do OTUs relate to taxonomy?

Find many OTUs per genus

subgenus level lineages

Don’t find multiple genera per OTU

Where did 3% come from?

Stackebrandt and Goebel 1997 IJSEM http://ijs.microbiologyresearch.org/content/journal/ijsem/10.1099/00207713-47-2-479

Sequencing error

intragenomic variation

OTU clustering

de novo clustering

- hierarchical

- greedy

usearch – fast

vsearch – not great OTU assignment

swarm

See Westcott and Schloss 2015 PeerJ

classify:

“neighbors”

furthest – everything in OTU is within threshold

nearest – everything in OTU is within threshold of at least one other

average – everything in OTU is on average less than the threshold apart
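The three neighbor definitions differ only in how the distance between two groups is measured. A naive sketch, assuming a small precomputed distance table and merging while the closest pair of groups is at or below the threshold; this is far slower than Mothur's implementation and the distances are invented.

```python
from itertools import combinations


def cluster(names, dist, threshold, method="average"):
    """Agglomerative clustering; dist maps frozenset({a, b}) -> distance."""
    linkage = {
        "nearest": min,    # closest pair between the two groups
        "furthest": max,   # most distant pair between the two groups
        "average": lambda values: sum(values) / len(values),
    }[method]
    otus = [[name] for name in names]

    def between(x, y):
        return linkage([dist[frozenset((a, b))] for a in x for b in y])

    while len(otus) > 1:
        d, i, j = min((between(otus[i], otus[j]), i, j)
                      for i, j in combinations(range(len(otus)), 2))
        if d > threshold:
            break
        otus[i] = otus[i] + otus[j]
        del otus[j]
    return otus


# Hypothetical distances between three sequences.
toy_dist = {
    frozenset(("A", "B")): 0.02,
    frozenset(("B", "C")): 0.02,
    frozenset(("A", "C")): 0.035,
}
nearest = cluster(["A", "B", "C"], toy_dist, 0.03, "nearest")
furthest = cluster(["A", "B", "C"], toy_dist, 0.03, "furthest")
average = cluster(["A", "B", "C"], toy_dist, 0.03, "average")
```

In this toy case furthest neighbor splits C off because A and C are more than 3% apart, while nearest and average neighbor put all three in one OTU.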

             in same OTU    similar to each other
True Pos     x              x
True Neg     o              o
False Pos    x              o
False Neg    o              x

Furthest neighbor has no false positives and more false negatives

Nearest neighbor has no false negatives and high number of false positives

Average has some false positives and some false negatives

Matthews Correlation Coefficient (MCC):

a metric that allows us to take these four parameters and weight them somewhat evenly

average neighbor is the best for MCC

average neighbor is also better than the greedy algorithms
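A small sketch of how the four counts and the MCC could be computed for a set of OTU assignments. MCC = (TP x TN - FP x FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The function names, assignments and distances are hypothetical.

```python
import math
from itertools import combinations


def confusion_counts(otu_of, dist, threshold):
    """Count TP, TN, FP, FN over all pairs of sequences."""
    tp = tn = fp = fn = 0
    for a, b in combinations(sorted(otu_of), 2):
        same_otu = otu_of[a] == otu_of[b]
        similar = dist[frozenset((a, b))] <= threshold
        if same_otu and similar:
            tp += 1
        elif same_otu:
            fp += 1  # clustered together but too far apart
        elif similar:
            fn += 1  # close enough but split into different OTUs
        else:
            tn += 1
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Hypothetical assignments and distances for four sequences.
otu_of = {"A": 1, "B": 1, "C": 1, "D": 2}
toy_dist = {
    frozenset(("A", "B")): 0.01, frozenset(("A", "C")): 0.05,
    frozenset(("B", "C")): 0.02, frozenset(("A", "D")): 0.10,
    frozenset(("B", "D")): 0.10, frozenset(("C", "D")): 0.02,
}
counts = confusion_counts(otu_of, toy_dist, 0.03)
score = mcc(*counts)
```

An MCC of 1 would mean every pair is handled correctly; values near 0 mean the assignments are no better than chance.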

QIIME closed and open reference:

numerous problems with reference-based approaches

search algorithm is a problem

for closed, take a sequence and find the closest reference and assign your sequence to that reference

for open, do closed reference and then with the leftover, they do usearch to cluster that to OTUs

both of these methods suck compared to average neighbors

Furthest neighbor is not a good method, get different numbers of OTUs (based on when you cluster relative to rarefaction?) see more in PeerJ paper

Can get variation in OTU assignments when you run it over and over again

But these slightly different results are all pretty good

Average neighbor is slow but better

Done some things to speed things up

AEM paper and also in PeerJ 2015 Westcott and Schloss

don’t find OTUs that are made up of multiple genera, families

classify.seqs

Take sequences that we’ve classified and then split by taxon and within each taxon, we then cluster

and then put data back together

speeds things up

uses less RAM

can parallelize clustering steps

vsearch is free and will be brought into the next version of Mothur

for last few years average neighbor has been the default in Mothur

can do phylotyping as well as OTUs

Phylotyping

split sequences into OTUs by classification

can use OTUs to classify

classify.otu

then count: for each OTU can count the number of times a sequence is classified as say “T”

use T > 50%

then assign that taxon name to the OTU

majority consensus classification
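The majority consensus step can be sketched as below. It assumes one taxon label per sequence, counts each sequence once rather than weighting by abundance, and uses "unclassified" as a made-up fallback label; the OTU and taxon names are only examples.

```python
from collections import Counter


def consensus_taxonomy(otu_members, seq_taxon, cutoff=0.5):
    """Name each OTU after the taxon assigned to more than `cutoff` of its sequences."""
    result = {}
    for otu, members in otu_members.items():
        taxon, count = Counter(seq_taxon[s] for s in members).most_common(1)[0]
        result[otu] = taxon if count / len(members) > cutoff else "unclassified"
    return result


# Hypothetical OTUs and per-sequence classifications.
otu_members = {"Otu1": ["s1", "s2", "s3", "s4"], "Otu2": ["s5", "s6"]}
seq_taxon = {
    "s1": "Bacteroides", "s2": "Bacteroides", "s3": "Bacteroides",
    "s4": "Prevotella", "s5": "Blautia", "s6": "Roseburia",
}
labels = consensus_taxonomy(otu_members, seq_taxon)
```

Otu1 gets a name because three of its four sequences agree; Otu2 has no taxon above 50% and stays unnamed at this level.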

alternative is to find representative sequence but this has issues

output is *.otu.taxonomy

Today the goal is to go from aligned seqs to OTUs

Can take our list file and our group/count file and generate a shared file

the shared file is a table where the rows are samples and the columns are the OTUs and the values are the counts
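A rough sketch of how a shared-style table comes out of the two inputs. The dictionaries stand in for the list file (OTU to sequences) and the count file (sequence to per-sample counts); the names and layout are assumptions, not Mothur's file formats.

```python
def make_shared(otu_members, seq_sample_counts):
    """Return {sample: {otu: count}}: rows are samples, columns are OTUs."""
    otus = sorted(otu_members)
    samples = sorted({s for counts in seq_sample_counts.values() for s in counts})
    shared = {sample: {otu: 0 for otu in otus} for sample in samples}
    for otu, members in otu_members.items():
        for seq in members:
            for sample, n in seq_sample_counts.get(seq, {}).items():
                shared[sample][otu] += n
    return shared


# Hypothetical inputs.
otu_members = {"Otu1": ["s1", "s2"], "Otu2": ["s3"]}
seq_sample_counts = {
    "s1": {"mouse1": 3, "mouse2": 1},
    "s2": {"mouse1": 2},
    "s3": {"mouse2": 4},
}
shared = make_shared(otu_members, seq_sample_counts)
```

Each row of the result is one sample and each entry is the number of reads from that sample that fell into the OTU.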

also the biom format is needed for picrust and other programs

Mothur can also generate biom format

—

Yesterday we left off with count.seqs
