SeqEM: An adaptive genotype-calling approach for next-generation sequencing studies

Eden R Martin, D. D. Kinnamon, Mike Schmidt, E. H. Powell, Stephan L Zuchner, R. W. Morris

Research output: Contribution to journal › Article

50 Citations (Scopus)

Abstract

Motivation: Next-generation sequencing presents several statistical challenges, with one of the most fundamental being determining an individual's genotype from multiple aligned short read sequences at a position. Some simple approaches for genotype calling apply fixed filters, such as calling a heterozygote if more than a specified percentage of the reads have variant nucleotide calls. Other genotype-calling methods, such as MAQ and SOAPsnp, are implementations of Bayes classifiers in that they classify genotypes using posterior genotype probabilities.

Results: Here, we propose a novel genotype-calling algorithm that, in contrast to the other methods, estimates parameters underlying the posterior probabilities in an adaptive way rather than arbitrarily specifying them a priori. The algorithm, which we call SeqEM, applies the well-known Expectation-Maximization algorithm to an appropriate likelihood for a sample of unrelated individuals with next-generation sequence data, leveraging information from the sample to estimate genotype probabilities and the nucleotide-read error rate. We demonstrate using analytic calculations and simulations that SeqEM results in genotype-call error rates as small as or smaller than filtering approaches and MAQ. We also apply SeqEM to exome sequence data in eight related individuals and compare the results to genotypes from an Illumina SNP array, showing that SeqEM behaves well in real data that deviates from idealized assumptions.

Conclusion: SeqEM offers an improved, robust and flexible genotype-calling approach that can be widely applied in the next-generation sequencing studies.
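
The adaptive idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the likelihood, so the code assumes a simple binomial read model in which a read shows the variant nucleotide with probability alpha for a homozygous-reference individual, 0.5 for a heterozygote, and 1 - alpha for a homozygous-variant individual. The function name `em_genotype_calls`, its parameters, and the toy read counts are all hypothetical.

```python
def em_genotype_calls(variant_reads, depths, alpha=0.01, max_iter=200, tol=1e-10):
    """EM sketch under an ASSUMED binomial read model (not taken from the paper).

    variant_reads[i] / depths[i]: variant-read count and total read count for
    individual i at one position. Genotypes are coded 0, 1, 2 = number of
    variant alleles. Returns (calls, genotype_freqs, alpha).
    """
    pi = [1 / 3, 1 / 3, 1 / 3]  # starting genotype frequencies

    def posteriors(pi, alpha):
        # E-step: posterior genotype probabilities for each individual.
        # The binomial coefficient is omitted because it cancels on normalizing.
        p = (alpha, 0.5, 1 - alpha)  # P(read shows variant | genotype)
        out = []
        for x, n in zip(variant_reads, depths):
            lik = [pi[g] * p[g] ** x * (1 - p[g]) ** (n - x) for g in range(3)]
            total = sum(lik)
            out.append([v / total for v in lik])
        return out

    for _ in range(max_iter):
        post = posteriors(pi, alpha)

        # M-step for genotype frequencies: average posterior weight,
        # floored so that no genotype class collapses to exactly zero.
        new_pi = [max(sum(w[g] for w in post) / len(post), 1e-12) for g in range(3)]
        norm = sum(new_pi)
        new_pi = [v / norm for v in new_pi]

        # M-step for the error rate: expected mismatching reads among
        # homozygotes divided by expected total reads among homozygotes.
        err = sum(w[0] * x + w[2] * (n - x)
                  for w, x, n in zip(post, variant_reads, depths))
        hom = sum((w[0] + w[2]) * n for w, n in zip(post, depths))
        new_alpha = err / hom if hom > 0 else alpha
        new_alpha = min(max(new_alpha, 1e-6), 0.5 - 1e-6)  # keep strictly inside (0, 0.5)

        change = abs(new_alpha - alpha) + sum(abs(a - b) for a, b in zip(new_pi, pi))
        pi, alpha = new_pi, new_alpha
        if change < tol:
            break

    # Call the genotype with the largest posterior under the final estimates.
    post = posteriors(pi, alpha)
    calls = [max(range(3), key=lambda g: w[g]) for w in post]
    return calls, pi, alpha


# Toy data (invented for illustration): 12 individuals, 20 reads each.
reads = [0, 0, 1, 0, 10, 9, 11, 20, 19, 0, 10, 20]
depth = [20] * len(reads)
calls, freqs, err_rate = em_genotype_calls(reads, depth)
```

Unlike a fixed percentage filter, the genotype frequencies and the error rate here are estimated from the whole sample before each individual is called, which is the contrast the abstract draws. The plain products used for the likelihood are adequate only for modest read depths; a log-space version would be the safer choice for deep coverage.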

Original language: English
Article number: btq526
Pages (from-to): 2803-2810
Number of pages: 8
Journal: Bioinformatics
Volume: 26
Issue number: 22
DOIs: 10.1093/bioinformatics/btq526
State: Published - Nov 1 2010

Fingerprint

Genotype
Sequencing
Nucleotides
Classifiers
Error Rate
Genotyping Techniques
Exome
Bayes Classifier
Heterozygote
Posterior Probability
Expectation-maximization Algorithm
Single Nucleotide Polymorphism
Estimate
Percentage
Likelihood
Filtering
Classify
Filter

ASJC Scopus subject areas

  • Biochemistry
  • Molecular Biology
  • Computational Theory and Mathematics
  • Computer Science Applications
  • Computational Mathematics
  • Statistics and Probability
  • Medicine (all)

Cite this

SeqEM: An adaptive genotype-calling approach for next-generation sequencing studies. / Martin, Eden R; Kinnamon, D. D.; Schmidt, Mike; Powell, E. H.; Zuchner, Stephan L; Morris, R. W.

In: Bioinformatics, Vol. 26, No. 22, btq526, 01.11.2010, p. 2803-2810.

@article{f4db4a9a52dc478ab6127da7c71cf40c,
title = "SeqEM: An adaptive genotype-calling approach for next-generation sequencing studies",
abstract = "Motivation: Next-generation sequencing presents several statistical challenges, with one of the most fundamental being determining an individual's genotype from multiple aligned short read sequences at a position. Some simple approaches for genotype calling apply fixed filters, such as calling a heterozygote if more than a specified percentage of the reads have variant nucleotide calls. Other genotype-calling methods, such as MAQ and SOAPsnp, are implementations of Bayes classifiers in that they classify genotypes using posterior genotype probabilities. Results: Here, we propose a novel genotype-calling algorithm that, in contrast to the other methods, estimates parameters underlying the posterior probabilities in an adaptive way rather than arbitrarily specifying them a priori. The algorithm, which we call SeqEM, applies the well-known Expectation-Maximization algorithm to an appropriate likelihood for a sample of unrelated individuals with next-generation sequence data, leveraging information from the sample to estimate genotype probabilities and the nucleotide-read error rate. We demonstrate using analytic calculations and simulations that SeqEM results in genotype-call error rates as small as or smaller than filtering approaches and MAQ. We also apply SeqEM to exome sequence data in eight related individuals and compare the results to genotypes from an Illumina SNP array, showing that SeqEM behaves well in real data that deviates from idealized assumptions. Conclusion: SeqEM offers an improved, robust and flexible genotype-calling approach that can be widely applied in the next-generation sequencing studies.",
author = "Martin, {Eden R} and Kinnamon, {D. D.} and Schmidt, Mike and Powell, {E. H.} and Zuchner, {Stephan L} and Morris, {R. W.}",
year = "2010",
month = "11",
day = "1",
doi = "10.1093/bioinformatics/btq526",
language = "English",
volume = "26",
pages = "2803--2810",
journal = "Bioinformatics (Oxford, England)",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "22",

}

TY - JOUR

T1 - SeqEM

T2 - An adaptive genotype-calling approach for next-generation sequencing studies

AU - Martin, Eden R

AU - Kinnamon, D. D.

AU - Schmidt, Mike

AU - Powell, E. H.

AU - Zuchner, Stephan L

AU - Morris, R. W.

PY - 2010/11/1

Y1 - 2010/11/1

N2 - Motivation: Next-generation sequencing presents several statistical challenges, with one of the most fundamental being determining an individual's genotype from multiple aligned short read sequences at a position. Some simple approaches for genotype calling apply fixed filters, such as calling a heterozygote if more than a specified percentage of the reads have variant nucleotide calls. Other genotype-calling methods, such as MAQ and SOAPsnp, are implementations of Bayes classifiers in that they classify genotypes using posterior genotype probabilities. Results: Here, we propose a novel genotype-calling algorithm that, in contrast to the other methods, estimates parameters underlying the posterior probabilities in an adaptive way rather than arbitrarily specifying them a priori. The algorithm, which we call SeqEM, applies the well-known Expectation-Maximization algorithm to an appropriate likelihood for a sample of unrelated individuals with next-generation sequence data, leveraging information from the sample to estimate genotype probabilities and the nucleotide-read error rate. We demonstrate using analytic calculations and simulations that SeqEM results in genotype-call error rates as small as or smaller than filtering approaches and MAQ. We also apply SeqEM to exome sequence data in eight related individuals and compare the results to genotypes from an Illumina SNP array, showing that SeqEM behaves well in real data that deviates from idealized assumptions. Conclusion: SeqEM offers an improved, robust and flexible genotype-calling approach that can be widely applied in the next-generation sequencing studies.

AB - Motivation: Next-generation sequencing presents several statistical challenges, with one of the most fundamental being determining an individual's genotype from multiple aligned short read sequences at a position. Some simple approaches for genotype calling apply fixed filters, such as calling a heterozygote if more than a specified percentage of the reads have variant nucleotide calls. Other genotype-calling methods, such as MAQ and SOAPsnp, are implementations of Bayes classifiers in that they classify genotypes using posterior genotype probabilities. Results: Here, we propose a novel genotype-calling algorithm that, in contrast to the other methods, estimates parameters underlying the posterior probabilities in an adaptive way rather than arbitrarily specifying them a priori. The algorithm, which we call SeqEM, applies the well-known Expectation-Maximization algorithm to an appropriate likelihood for a sample of unrelated individuals with next-generation sequence data, leveraging information from the sample to estimate genotype probabilities and the nucleotide-read error rate. We demonstrate using analytic calculations and simulations that SeqEM results in genotype-call error rates as small as or smaller than filtering approaches and MAQ. We also apply SeqEM to exome sequence data in eight related individuals and compare the results to genotypes from an Illumina SNP array, showing that SeqEM behaves well in real data that deviates from idealized assumptions. Conclusion: SeqEM offers an improved, robust and flexible genotype-calling approach that can be widely applied in the next-generation sequencing studies.

UR - http://www.scopus.com/inward/record.url?scp=78149246036&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=78149246036&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btq526

DO - 10.1093/bioinformatics/btq526

M3 - Article

C2 - 20861027

AN - SCOPUS:78149246036

VL - 26

SP - 2803

EP - 2810

JO - Bioinformatics (Oxford, England)

JF - Bioinformatics (Oxford, England)

SN - 1367-4803

IS - 22

M1 - btq526

ER -