Integrating biological knowledge with gene expression profiles for survival prediction of cancer

Research output: Contribution to journal, Article

41 Citations (Scopus)

Abstract

Due to the large variability in survival times between cancer patients and the plethora of genes on microarrays unrelated to outcome, building accurate prediction models that are easy to interpret remains a challenge. In this paper, we propose a general strategy for improving performance and interpretability of prediction models by integrating gene expression data with prior biological knowledge. First, we link gene identifiers in the expression dataset with gene annotation databases such as Gene Ontology (GO). Then we construct "supergenes" for each gene category by summarizing information from genes related to outcome using a modified principal component analysis (PCA) method. Finally, instead of using genes as predictors, we use these supergenes, each representing information from one gene category, as predictors of survival outcome. Besides identifying gene categories associated with outcome, the proposed approach also carries out within-category selection to identify important genes within each gene set. We show, using two real breast cancer microarray datasets, that prediction models built on gene set (or pathway) information outperform prediction models based on expression values of single genes, with improved prediction accuracy and interpretability.
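The supergene construction described in the abstract can be illustrated with a minimal sketch. The abstract does not say how PCA is modified or how genes "related to outcome" are chosen, so the sketch below makes its own assumptions: genes within a category are screened by absolute correlation with an uncensored numeric outcome, and the supergene is the first principal component of the retained genes. The function name `supergene`, the parameter `top_k`, and the synthetic data are all hypothetical and are not taken from the paper.

```python
import numpy as np


def supergene(expr, outcome, gene_ids, category_genes, top_k=3):
    """Summarize one gene category as a single score per sample.

    expr           : (n_samples, n_genes) expression matrix
    outcome        : (n_samples,) numeric outcome (assumption: no censoring)
    gene_ids       : list of gene identifiers, one per column of expr
    category_genes : set of identifiers annotated to the category
    top_k          : number of genes kept within the category (assumption)
    """
    # Step 1: link dataset identifiers to the category annotation.
    cols = [j for j, g in enumerate(gene_ids) if g in category_genes]
    if not cols:
        raise ValueError("no genes from this category in the dataset")

    # Standardize each gene so that PCA is not dominated by scale.
    sub = expr[:, cols]
    sub = (sub - sub.mean(axis=0)) / sub.std(axis=0)

    # Step 2: within-category selection. Absolute Pearson correlation with
    # the outcome is a stand-in for whatever screening the paper uses.
    y = (outcome - outcome.mean()) / outcome.std()
    scores = np.abs(sub.T @ y) / len(y)
    keep = np.argsort(scores)[::-1][:top_k]

    # Step 3: first principal component of the retained genes, via SVD.
    u, s, _ = np.linalg.svd(sub[:, keep], full_matrices=False)
    pc1 = u[:, 0] * s[0]

    chosen = [gene_ids[cols[k]] for k in keep]
    return pc1, chosen


# Synthetic example: g1 and g2 carry the outcome signal, the rest are noise.
rng = np.random.default_rng(0)
n = 60
signal = rng.normal(size=n)
expr = rng.normal(size=(n, 6))
expr[:, 0] += 2 * signal
expr[:, 1] += 2 * signal
outcome = signal + 0.1 * rng.normal(size=n)
gene_ids = ["g1", "g2", "g3", "g4", "g5", "g6"]

score, chosen = supergene(expr, outcome, gene_ids, {"g1", "g2", "g3", "g4"}, top_k=2)
print(chosen)
```

In a full analysis, one such score would be computed per gene category and the resulting columns would replace single genes as predictors in a survival model. With censored survival times, the correlation screen would need to be replaced by a statistic that accounts for censoring.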

Original language: English (US)
Pages (from-to): 265-278
Number of pages: 14
Journal: Journal of Computational Biology
Volume: 16
Issue number: 2
DOI: 10.1089/cmb.2008.12TT
State: Published - Feb 12 2009
Externally published: Yes

Keywords

  • Gene expression
  • Gene ontology
  • Microarrays
  • Pathway analysis
  • Survival prediction

ASJC Scopus subject areas

  • Modeling and Simulation
  • Molecular Biology
  • Genetics
  • Computational Theory and Mathematics
  • Computational Mathematics

Cite this

Integrating biological knowledge with gene expression profiles for survival prediction of cancer. / Chen, Xi; Wang, Lily.

In: Journal of Computational Biology, Vol. 16, No. 2, 12.02.2009, p. 265-278.

@article{e38788eb724c4dc987317ba35e05fe0b,
title = "Integrating biological knowledge with gene expression profiles for survival prediction of cancer",
abstract = "Due to the large variability in survival times between cancer patients and the plethora of genes on microarrays unrelated to outcome, building accurate prediction models that are easy to interpret remains a challenge. In this paper, we propose a general strategy for improving performance and interpretability of prediction models by integrating gene expression data with prior biological knowledge. First, we link gene identifiers in expression dataset with gene annotation databases such as Gene Ontology (GO). Then we construct {"}supergenes{"} for each gene category by summarizing information from genes related to outcome using a modified principal component analysis (PCA) method. Finally, instead of using genes as predictors, we use these supergenes representing information from each gene category as predictors to predict survival outcome. In addition to identifying gene categories associated with outcome, the proposed approach also carries out additional within-category selection to select important genes within each gene set. We show, using two real breast cancer microarray datasets, that the prediction models constructed based on gene sets (or pathway) information outperform the prediction models based on expression values of single genes, with improved prediction accuracy and interpretability.",
keywords = "Gene expression, Gene ontology, Microarrays, Pathway analysis, Survival prediction",
author = "Xi Chen and Lily Wang",
year = "2009",
month = "2",
day = "12",
doi = "10.1089/cmb.2008.12TT",
language = "English (US)",
volume = "16",
pages = "265--278",
journal = "Journal of Computational Biology",
issn = "1066-5277",
publisher = "Mary Ann Liebert Inc.",
number = "2",

}

TY - JOUR
T1 - Integrating biological knowledge with gene expression profiles for survival prediction of cancer
AU - Chen, Xi
AU - Wang, Lily
PY - 2009/2/12
Y1 - 2009/2/12
AB - Due to the large variability in survival times between cancer patients and the plethora of genes on microarrays unrelated to outcome, building accurate prediction models that are easy to interpret remains a challenge. In this paper, we propose a general strategy for improving performance and interpretability of prediction models by integrating gene expression data with prior biological knowledge. First, we link gene identifiers in expression dataset with gene annotation databases such as Gene Ontology (GO). Then we construct "supergenes" for each gene category by summarizing information from genes related to outcome using a modified principal component analysis (PCA) method. Finally, instead of using genes as predictors, we use these supergenes representing information from each gene category as predictors to predict survival outcome. In addition to identifying gene categories associated with outcome, the proposed approach also carries out additional within-category selection to select important genes within each gene set. We show, using two real breast cancer microarray datasets, that the prediction models constructed based on gene sets (or pathway) information outperform the prediction models based on expression values of single genes, with improved prediction accuracy and interpretability.

KW - Gene expression
KW - Gene ontology
KW - Microarrays
KW - Pathway analysis
KW - Survival prediction
UR - http://www.scopus.com/inward/record.url?scp=59649124237&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=59649124237&partnerID=8YFLogxK
U2 - 10.1089/cmb.2008.12TT
DO - 10.1089/cmb.2008.12TT
M3 - Article
C2 - 19183004
AN - SCOPUS:59649124237
VL - 16
SP - 265
EP - 278
JO - Journal of Computational Biology
JF - Journal of Computational Biology
SN - 1066-5277
IS - 2
ER -