Statistical Methods for Next-Gen Sequencing in Disease Association Studies

Project: Research project

Project Details


Through this project we propose to develop statistical approaches and software for genotype calling and association testing in next-generation sequence data. The field is driven by molecular advances that allow for affordable, massively parallel sequencing. The rapid development of statistical methods for next-generation sequence data in disease studies is necessary to keep pace with the advancing molecular technology. Next-generation sequencing is based on random, short-read technology; thus the coverage of any nucleotide is highly variable and subject to error. Distinguishing random error from truly variable sites is required for "SNP-calling". One step beyond this is identifying the individual's actual genotype at the site. This is a highly statistical problem and we have yet to see this problem addressed in a statistically rigorous manner.

The solution that we propose, and what makes our approach novel, assumes that we have a sample of individuals, each with next-generation sequence data. We anticipate that sequencing may ultimately replace GWAS SNP arrays for disease-association studies. While this may be several years away for whole-genome sequencing, sequencing enough people individually for a small association study is already becoming practical with target capture arrays. We can leverage the information from a sample of individuals with next-generation sequence data to more accurately estimate an individual's genotype and the position-specific error rate. Our approach is to express the genotype probabilities and error rate in a likelihood framework. We can then use standard statistical theory to help us call genotypes. This approach should perform better than calling genotypes for a single individual at a time based on an arbitrary filter as is currently done.

A distinct advantage of this statistical framework is that the uncertainty in the genotype calls can be incorporated directly into our disease-association tests (e.g., case-control and rare variant analysis). In this way we will increase power of our association tests and reduce bias due to error or systematic missingness. Incorporation of next-generation sequence data into the association tests provides a complete analysis pipeline from sequence to association.
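
As a rough illustration of the kind of likelihood framework described above, the following Python sketch calls genotypes at a single site by pooling read counts across a sample of individuals. The project description does not specify its model, so everything here is an assumption made for illustration: the binomial read model, the Hardy-Weinberg prior on genotypes, the EM update for the allele frequency, and all function and parameter names. For brevity the per-site error rate `err` is held fixed, whereas the project proposes to estimate it from the data.

```python
import math


def genotype_likelihoods(n_alt, n_total, err):
    """P(read counts | genotype) for 0, 1, 2 copies of the alternate allele.

    Assumes each read independently shows the alternate allele with a
    probability set by the genotype and a symmetric error rate `err`.
    """
    liks = []
    for g in (0, 1, 2):
        # Chance that one read shows the alternate allele given genotype g.
        p_alt = (g / 2) * (1 - err) + (1 - g / 2) * err
        liks.append(
            math.comb(n_total, n_alt)
            * p_alt ** n_alt
            * (1 - p_alt) ** (n_total - n_alt)
        )
    return liks


def hwe_prior(freq):
    """Genotype prior under Hardy-Weinberg equilibrium (an assumption)."""
    return [(1 - freq) ** 2, 2 * freq * (1 - freq), freq ** 2]


def call_genotypes(counts, err=0.01, n_iter=50):
    """Estimate the allele frequency and per-individual genotype posteriors.

    counts: list of (n_alt, n_total) read counts, one pair per individual.
    Returns (allele frequency estimate, list of posteriors over g = 0, 1, 2).
    """
    liks = [genotype_likelihoods(a, n, err) for a, n in counts]
    freq = 0.2  # arbitrary starting value for the EM iterations
    posts = []
    for _ in range(n_iter):
        prior = hwe_prior(freq)
        # E-step: posterior genotype probabilities for each individual.
        posts = []
        for lik in liks:
            joint = [l * p for l, p in zip(lik, prior)]
            total = sum(joint)
            posts.append([j / total for j in joint])
        # M-step: expected alternate-allele count divided by 2N chromosomes.
        freq = sum(p[1] + 2 * p[2] for p in posts) / (2 * len(posts))
        # Keep the estimate away from 0 and 1 so the prior stays positive.
        freq = min(max(freq, 1e-6), 1 - 1e-6)
    return freq, posts


# Hypothetical read counts (alternate reads, total reads) for five people.
freq, posts = call_genotypes([(0, 20), (10, 20), (20, 20), (0, 15), (1, 30)])
calls = [max(range(3), key=lambda g: p[g]) for p in posts]
```

Because the function returns full posterior probabilities and not only hard calls, a downstream association test could in principle weight each individual by those probabilities, which is the sense in which genotype uncertainty can be carried into the analysis.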
Effective start/end date: 9/30/09 to 7/31/12


  • National Institutes of Health: $500,000.00
  • National Institutes of Health: $500,000.00


  • Medicine(all)
  • Biochemistry, Genetics and Molecular Biology(all)

