A multi-model based approach for driver missense identification

Ahmed T. Soliman, Mei-Ling Shyu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The rapid growth in DNA and protein sequencing techniques over the last decade has increased the availability and scale of mutation data, creating a need for automated approaches to predict driver mutations. Identifying driver mutations is essential to better understand and measure cancer progression, and thus to enable proper diagnosis and targeted treatment of cancer. Here, we present a scalable machine learning-based approach to identify driver missense mutations. The proposed approach builds on and expands our previously proposed framework. It deploys a group of independent, parallel classifiers, each of which handles a single feature set. A model fusion module then combines the classifiers' outputs to produce a final mutation label. Each classifier is trained and validated independently with its corresponding feature set. Feature sets undergo a feature selection process to filter out low-significance features. In this paper, four protein sequence-level feature sets are leveraged: two amino acid index feature sets (AAIndex1 and AAIndex2), one pseudo amino acid composition (PseAAC) feature set, and one feature set generated using wavelet analysis. Because its components run in parallel, the proposed approach can be extended with additional feature sets with minimal impact on computational complexity. Experiments were performed to assess the performance of the proposed approach and to compare it with similar approaches.
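The pipeline described in the abstract (per-feature-set feature selection, independently trained parallel classifiers, and a fusion step) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the variance-based feature filter, the nearest-centroid classifier, the score-averaging fusion rule, the names `select_features`, `CentroidClassifier`, `train_all`, and `fuse`, and the toy data.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, pvariance


def select_features(rows, min_variance=1e-3):
    """Return indices of columns whose variance reaches min_variance (assumed filter)."""
    columns = list(zip(*rows))
    return [i for i, col in enumerate(columns) if pvariance(col) >= min_variance]


class CentroidClassifier:
    """Toy stand-in for one per-feature-set model (labels: 1 = driver, 0 = passenger)."""

    def fit(self, rows, labels):
        # Feature selection happens inside each model, on its own feature set only.
        self.keep = select_features(rows)
        reduced = [[r[i] for i in self.keep] for r in rows]
        self.centroids = {}
        for label in set(labels):
            members = [r for r, y in zip(reduced, labels) if y == label]
            self.centroids[label] = [mean(col) for col in zip(*members)]
        return self

    def driver_score(self, row):
        """Return a score in [0, 1]; higher means closer to the driver centroid."""
        x = [row[i] for i in self.keep]
        dist = {
            label: sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5
            for label, c in self.centroids.items()
        }
        total = dist[0] + dist[1]
        return 0.5 if total == 0 else dist[0] / total


def train_all(feature_sets, labels):
    """Fit one classifier per feature set, each submitted as a separate task."""
    names = list(feature_sets)
    with ThreadPoolExecutor() as pool:
        models = pool.map(
            lambda n: CentroidClassifier().fit(feature_sets[n], labels), names
        )
        return dict(zip(names, models))


def fuse(models, sample, threshold=0.5):
    """Average the per-model scores and threshold the mean (assumed fusion rule)."""
    scores = [m.driver_score(sample[name]) for name, m in models.items()]
    fused = mean(scores)
    return ("driver" if fused >= threshold else "passenger"), fused


# Toy, made-up data: two feature sets describing the same four mutations.
feature_sets = {
    "aaindex": [[0.9, 1.0, 5.0], [0.8, 1.0, 5.0], [0.1, 1.0, 5.0], [0.2, 1.0, 5.0]],
    "wavelet": [[2.0, 0.1], [1.8, 0.2], [0.3, 0.9], [0.2, 1.0]],
}
labels = [1, 1, 0, 0]

models = train_all(feature_sets, labels)
label, score = fuse(models, {"aaindex": [0.85, 1.0, 5.0], "wavelet": [1.9, 0.15]})
print(label, round(score, 3))
```

In this sketch, adding a feature set means adding one more key to `feature_sets`; the other models are untouched, which mirrors the extensibility argument in the abstract. Threads are used only to keep the example short; for CPU-bound training in CPython, `ProcessPoolExecutor` with a module-level training function would likely be the better fit.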

Original language: English (US)
Title of host publication: Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 419-425
Number of pages: 7
ISBN (Print): 9781538626597
DOIs: https://doi.org/10.1109/IRI.2018.00068
State: Published - Aug 2 2018
Event: 19th IEEE International Conference on Information Reuse and Integration for Data Science, IRI 2018 - Salt Lake City, United States
Duration: Jul 7 2018 to Jul 9 2018

Other

Other: 19th IEEE International Conference on Information Reuse and Integration for Data Science, IRI 2018
Country: United States
City: Salt Lake City
Period: 7/7/18 to 7/9/18

Fingerprint

Identification (control systems)
Classifiers
driver
cancer
Amino acids
Proteins
Wavelet analysis
Learning systems
Feature extraction
Labels
Computational complexity
DNA
Fusion reactions
experiment
Availability
learning
performance
Group
Chemical analysis
Mutation

Keywords

  • Cancer genome
  • Driver mutation
  • Passenger mutation

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software
  • Artificial Intelligence
  • Information Systems and Management
  • Safety, Risk, Reliability and Quality
  • Public Administration

Cite this

Soliman, A. T., & Shyu, M-L. (2018). A multi-model based approach for driver missense identification. In Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018 (pp. 419-425). [8424739] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/IRI.2018.00068

@inproceedings{a926deb1f1b4425a9a28e6e4b1e2df10,
title = "A multi-model based approach for driver missense identification",
keywords = "Cancer genome, Driver mutation, Passenger mutation",
author = "Soliman, {Ahmed T.} and Mei-Ling Shyu",
year = "2018",
month = "8",
day = "2",
doi = "10.1109/IRI.2018.00068",
language = "English (US)",
isbn = "9781538626597",
pages = "419--425",
booktitle = "Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - GEN

T1 - A multi-model based approach for driver missense identification

AU - Soliman, Ahmed T.

AU - Shyu, Mei-Ling

PY - 2018/8/2

Y1 - 2018/8/2

KW - Cancer genome

KW - Driver mutation

KW - Passenger mutation

UR - http://www.scopus.com/inward/record.url?scp=85052302408&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85052302408&partnerID=8YFLogxK

U2 - 10.1109/IRI.2018.00068

DO - 10.1109/IRI.2018.00068

M3 - Conference contribution

SN - 9781538626597

SP - 419

EP - 425

BT - Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018

PB - Institute of Electrical and Electronics Engineers Inc.

ER -