Quality control and integration of genotypes from two calling pipelines for whole genome sequence data in the Alzheimer's disease sequencing project

Alzheimer's Disease Sequencing Project (ADSP)

Research output: Contribution to journal › Article

Abstract

The Alzheimer's Disease Sequencing Project (ADSP) performed whole genome sequencing (WGS) of 584 subjects from 111 multiplex families at three sequencing centers. Genotype calling of single nucleotide variants (SNVs) and insertion-deletion variants (indels) was performed centrally using GATK-HaplotypeCaller and Atlas V2. The ADSP Quality Control (QC) Working Group applied QC protocols to project-level variant call format files (VCFs) from each pipeline, and developed and implemented a novel protocol, termed “consensus calling,” to combine genotype calls from both pipelines into a single high-quality set. QC was applied to autosomal bi-allelic SNVs and indels, and included pipeline-recommended QC filters, variant-level QC, and sample-level QC. Low-quality variants or genotypes were excluded, and sample outliers were noted. Quality was assessed by examining Mendelian inconsistencies (MIs) among 67 parent-offspring pairs, and MIs were used to establish additional genotype-specific filters for GATK calls. After QC, 578 subjects remained. Pipeline-specific QC excluded ~12.0% of GATK and 14.5% of Atlas SNVs. Between pipelines, ~91% of SNV genotypes across all QCed variants were concordant; 4.23% and 4.56% of genotypes were exclusive to Atlas or GATK, respectively; the remaining ~0.01% of discordant genotypes were excluded. For indels, variant-level QC excluded ~36.8% of GATK and 35.3% of Atlas indels. Between pipelines, ~55.6% of indel genotypes were concordant; while 10.3% and 28.3% were exclusive to Atlas or GATK, respectively; and ~0.29% of discordant genotypes were excluded. The final WGS consensus dataset contains 27,896,774 SNVs and 3,133,926 indels and is publicly available.
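
The genotype categories reported in the abstract (concordant, exclusive to one pipeline, discordant) can be illustrated with a minimal Python sketch. This is not the ADSP implementation: the function names, the status labels, and the use of `None` for "no call passed QC in this pipeline" are assumptions made for illustration, and the parent-offspring check is only the simplest allele-sharing test for a bi-allelic site.

```python
from typing import Optional, Tuple

# A genotype string such as "0/1"; None is an assumed marker meaning that the
# pipeline produced no call that passed its own QC for this sample and variant.
Genotype = Optional[str]


def normalize(gt: str) -> Tuple[str, ...]:
    """Treat a genotype as an unordered allele pair, so "1/0" equals "0/1"."""
    return tuple(sorted(gt.replace("|", "/").split("/")))


def consensus(gatk: Genotype, atlas: Genotype) -> Tuple[Genotype, str]:
    """Return (genotype, status) for one sample at one variant (hypothetical rule)."""
    if gatk is None and atlas is None:
        return None, "missing"
    if atlas is None:
        return gatk, "gatk_only"    # call exclusive to GATK is kept
    if gatk is None:
        return atlas, "atlas_only"  # call exclusive to Atlas is kept
    if normalize(gatk) == normalize(atlas):
        return gatk, "concordant"   # both pipelines agree
    return None, "discordant"       # conflicting calls are excluded


def is_mendelian_inconsistent(parent: str, child: str) -> bool:
    """Flag a parent-offspring pair that shares no allele at a bi-allelic site."""
    return not set(normalize(parent)) & set(normalize(child))
```

For example, `consensus("0/1", "1/0")` is treated as concordant, while `consensus("0/0", "0/1")` yields no genotype. A parent called `0/0` with a child called `1/1` would be counted as a Mendelian inconsistency under this simplified test.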

Original language: English (US)
Journal: Genomics
DOI: 10.1016/j.ygeno.2018.05.004
State: Accepted/In press - Jan 1 2018

Keywords

  • Atlas
  • Consensus calling
  • GATK
  • Mendelian inconsistencies
  • Quality control
  • Whole genome sequencing

ASJC Scopus subject areas

  • Genetics

Quality control and integration of genotypes from two calling pipelines for whole genome sequence data in the Alzheimer's disease sequencing project. / Alzheimer's Disease Sequencing Project (ADSP).

In: Genomics, 01.01.2018.

@article{5db64445a27f42b38dc3a639427cfd0d,
title = "Quality control and integration of genotypes from two calling pipelines for whole genome sequence data in the Alzheimer's disease sequencing project",
abstract = "The Alzheimer's Disease Sequencing Project (ADSP) performed whole genome sequencing (WGS) of 584 subjects from 111 multiplex families at three sequencing centers. Genotype calling of single nucleotide variants (SNVs) and insertion-deletion variants (indels) was performed centrally using GATK-HaplotypeCaller and Atlas V2. The ADSP Quality Control (QC) Working Group applied QC protocols to project-level variant call format files (VCFs) from each pipeline, and developed and implemented a novel protocol, termed ``consensus calling,'' to combine genotype calls from both pipelines into a single high-quality set. QC was applied to autosomal bi-allelic SNVs and indels, and included pipeline-recommended QC filters, variant-level QC, and sample-level QC. Low-quality variants or genotypes were excluded, and sample outliers were noted. Quality was assessed by examining Mendelian inconsistencies (MIs) among 67 parent-offspring pairs, and MIs were used to establish additional genotype-specific filters for GATK calls. After QC, 578 subjects remained. Pipeline-specific QC excluded $\sim$12.0{\%} of GATK and 14.5{\%} of Atlas SNVs. Between pipelines, $\sim$91{\%} of SNV genotypes across all QCed variants were concordant; 4.23{\%} and 4.56{\%} of genotypes were exclusive to Atlas or GATK, respectively; the remaining $\sim$0.01{\%} of discordant genotypes were excluded. For indels, variant-level QC excluded $\sim$36.8{\%} of GATK and 35.3{\%} of Atlas indels. Between pipelines, $\sim$55.6{\%} of indel genotypes were concordant; while 10.3{\%} and 28.3{\%} were exclusive to Atlas or GATK, respectively; and $\sim$0.29{\%} of discordant genotypes were excluded. The final WGS consensus dataset contains 27,896,774 SNVs and 3,133,926 indels and is publicly available.",
keywords = "Atlas, Consensus calling, GATK, Mendelian inconsistencies, Quality control, Whole genome sequencing",
author = "{Alzheimer's Disease Sequencing Project (ADSP)} and Naj, {Adam C.} and Honghuang Lin and Vardarajan, {Badri N.} and Simon White and Daniel Lancour and Yiyi Ma and Mike Schmidt and Fangui Sun and Mariusz Butkiewicz and Bush, {William S.} and Kunkle, {Brian W.} and John Malamon and Najaf Amin and Choi, {Seung Hoan} and Hamilton-Nelson, {Kara L.} and {van der Lee}, {Sven J.} and Namrata Gupta and Koboldt, {Daniel C.} and Mohamad Saad and Bowen Wang and Nato, {Alejandro Q.} and Sohi, {Harkirat K.} and Amanda Kuzma and Wang, {Li San} and Cupples, {L. Adrienne} and {van Duijn}, Cornelia and Sudha Seshadri and Schellenberg, {Gerard D.} and Eric Boerwinkle and Bis, {Joshua C.} and Jos{\'e}e Dupuis and Salerno, {William J.} and Wijsman, {Ellen M.} and Martin, {Eden R.} and DeStefano, {Anita L.}",
year = "2018",
month = "1",
day = "1",
doi = "10.1016/j.ygeno.2018.05.004",
language = "English (US)",
journal = "Genomics",
issn = "0888-7543",
publisher = "Academic Press Inc.",

}

TY - JOUR

T1 - Quality control and integration of genotypes from two calling pipelines for whole genome sequence data in the Alzheimer's disease sequencing project

AU - Alzheimer's Disease Sequencing Project (ADSP)

AU - Naj, Adam C.

AU - Lin, Honghuang

AU - Vardarajan, Badri N.

AU - White, Simon

AU - Lancour, Daniel

AU - Ma, Yiyi

AU - Schmidt, Mike

AU - Sun, Fangui

AU - Butkiewicz, Mariusz

AU - Bush, William S.

AU - Kunkle, Brian W.

AU - Malamon, John

AU - Amin, Najaf

AU - Choi, Seung Hoan

AU - Hamilton-Nelson, Kara L.

AU - van der Lee, Sven J.

AU - Gupta, Namrata

AU - Koboldt, Daniel C.

AU - Saad, Mohamad

AU - Wang, Bowen

AU - Nato, Alejandro Q.

AU - Sohi, Harkirat K.

AU - Kuzma, Amanda

AU - Wang, Li San

AU - Cupples, L. Adrienne

AU - van Duijn, Cornelia

AU - Seshadri, Sudha

AU - Schellenberg, Gerard D.

AU - Boerwinkle, Eric

AU - Bis, Joshua C.

AU - Dupuis, Josée

AU - Salerno, William J.

AU - Wijsman, Ellen M.

AU - Martin, Eden R.

AU - DeStefano, Anita L.

PY - 2018/1/1

Y1 - 2018/1/1

N2 - The Alzheimer's Disease Sequencing Project (ADSP) performed whole genome sequencing (WGS) of 584 subjects from 111 multiplex families at three sequencing centers. Genotype calling of single nucleotide variants (SNVs) and insertion-deletion variants (indels) was performed centrally using GATK-HaplotypeCaller and Atlas V2. The ADSP Quality Control (QC) Working Group applied QC protocols to project-level variant call format files (VCFs) from each pipeline, and developed and implemented a novel protocol, termed “consensus calling,” to combine genotype calls from both pipelines into a single high-quality set. QC was applied to autosomal bi-allelic SNVs and indels, and included pipeline-recommended QC filters, variant-level QC, and sample-level QC. Low-quality variants or genotypes were excluded, and sample outliers were noted. Quality was assessed by examining Mendelian inconsistencies (MIs) among 67 parent-offspring pairs, and MIs were used to establish additional genotype-specific filters for GATK calls. After QC, 578 subjects remained. Pipeline-specific QC excluded ~12.0% of GATK and 14.5% of Atlas SNVs. Between pipelines, ~91% of SNV genotypes across all QCed variants were concordant; 4.23% and 4.56% of genotypes were exclusive to Atlas or GATK, respectively; the remaining ~0.01% of discordant genotypes were excluded. For indels, variant-level QC excluded ~36.8% of GATK and 35.3% of Atlas indels. Between pipelines, ~55.6% of indel genotypes were concordant; while 10.3% and 28.3% were exclusive to Atlas or GATK, respectively; and ~0.29% of discordant genotypes were excluded. The final WGS consensus dataset contains 27,896,774 SNVs and 3,133,926 indels and is publicly available.

KW - Atlas

KW - Consensus calling

KW - GATK

KW - Mendelian inconsistencies

KW - Quality control

KW - Whole genome sequencing

UR - http://www.scopus.com/inward/record.url?scp=85049300452&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85049300452&partnerID=8YFLogxK

U2 - 10.1016/j.ygeno.2018.05.004

DO - 10.1016/j.ygeno.2018.05.004

M3 - Article

JO - Genomics

JF - Genomics

SN - 0888-7543

ER -