A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification

Alexander Statnikov, Lily Wang, Constantin F. Aliferis

Research output: Contribution to journal › Article

337 Citations (Scopus)

Abstract

Background: Cancer diagnosis and clinical outcome prediction are among the most important emerging applications of gene expression microarray technology with several molecular signatures on their way toward clinical deployment. Use of the most accurate classification algorithms available for microarray gene expression data is a critical ingredient in order to develop the best possible molecular signatures for patient care. As suggested by a large body of literature to date, support vector machines can be considered "best of class" algorithms for classification of such data. Recent work, however, suggests that random forest classifiers may outperform support vector machines in this domain.

Results: In the present paper we identify methodological biases of prior work comparing random forests and support vector machines and conduct a new rigorous evaluation of the two algorithms that corrects these limitations. Our experiments use 22 diagnostic and prognostic datasets and show that support vector machines outperform random forests, often by a large margin. Our data also underlines the importance of sound research design in benchmarking and comparison of bioinformatics algorithms.

Conclusion: We found that both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines both in the settings when no gene selection is performed and when several popular gene selection methods are used.

Original language: English (US)
Article number: 319
Journal: BMC Bioinformatics
Volume: 9
DOI: 10.1186/1471-2105-9-319
State: Published - Jul 22 2008
Externally published: Yes

Fingerprint

Cancer Classification
Random Forest
Microarrays
Microarray
Support vector machines
Support Vector Machine
Gene Selection
Neoplasms
Gene expression
Signature
Genes
Gene Expression
Benchmarking
Bioinformatics
Classification Algorithm
Gene Expression Data
Microarray Data
Computational Biology
Margin
Diagnostics

ASJC Scopus subject areas

  • Structural Biology
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics

Cite this

A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. / Statnikov, Alexander; Wang, Lily; Aliferis, Constantin F.

In: BMC Bioinformatics, Vol. 9, 319, 22.07.2008.

@article{17625962cbe046b0bcbf836319b67dfe,
title = "A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification",
abstract = "Background: Cancer diagnosis and clinical outcome prediction are among the most important emerging applications of gene expression microarray technology with several molecular signatures on their way toward clinical deployment. Use of the most accurate classification algorithms available for microarray gene expression data is a critical ingredient in order to develop the best possible molecular signatures for patient care. As suggested by a large body of literature to date, support vector machines can be considered {"}best of class{"} algorithms for classification of such data. Recent work, however, suggests that random forest classifiers may outperform support vector machines in this domain. Results: In the present paper we identify methodological biases of prior work comparing random forests and support vector machines and conduct a new rigorous evaluation of the two algorithms that corrects these limitations. Our experiments use 22 diagnostic and prognostic datasets and show that support vector machines outperform random forests, often by a large margin. Our data also underlines the importance of sound research design in benchmarking and comparison of bioinformatics algorithms. Conclusion: We found that both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines both in the settings when no gene selection is performed and when several popular gene selection methods are used.",
author = "Statnikov, Alexander and Wang, Lily and Aliferis, Constantin F.",
year = "2008",
month = "7",
day = "22",
doi = "10.1186/1471-2105-9-319",
language = "English (US)",
volume = "9",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
pages = "319",
}

TY  - JOUR
T1  - A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification
AU  - Statnikov, Alexander
AU  - Wang, Lily
AU  - Aliferis, Constantin F.
PY  - 2008/7/22
Y1  - 2008/7/22
AB  - Background: Cancer diagnosis and clinical outcome prediction are among the most important emerging applications of gene expression microarray technology with several molecular signatures on their way toward clinical deployment. Use of the most accurate classification algorithms available for microarray gene expression data is a critical ingredient in order to develop the best possible molecular signatures for patient care. As suggested by a large body of literature to date, support vector machines can be considered "best of class" algorithms for classification of such data. Recent work, however, suggests that random forest classifiers may outperform support vector machines in this domain. Results: In the present paper we identify methodological biases of prior work comparing random forests and support vector machines and conduct a new rigorous evaluation of the two algorithms that corrects these limitations. Our experiments use 22 diagnostic and prognostic datasets and show that support vector machines outperform random forests, often by a large margin. Our data also underlines the importance of sound research design in benchmarking and comparison of bioinformatics algorithms. Conclusion: We found that both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines both in the settings when no gene selection is performed and when several popular gene selection methods are used.
UR  - http://www.scopus.com/inward/record.url?scp=48549094895&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=48549094895&partnerID=8YFLogxK
U2  - 10.1186/1471-2105-9-319
DO  - 10.1186/1471-2105-9-319
M3  - Article
C2  - 18647401
AN  - SCOPUS:48549094895
VL  - 9
JO  - BMC Bioinformatics
JF  - BMC Bioinformatics
SN  - 1471-2105
M1  - 319
ER  -