High-dimensional variable selection for survival data

Hemant Ishwaran, Udaya B. Kogalur, Eiran Z. Gorodeski, Andy J. Minn, Michael S. Lauer

Research output: Contribution to journal (Article)

153 Scopus citations

Abstract

The minimal depth of a maximal subtree is a dimensionless order statistic measuring the predictiveness of a variable in a survival tree. We derive the distribution of the minimal depth and use it for high-dimensional variable selection using random survival forests. In big p and small n problems (where p is the dimension and n is the sample size), the distribution of the minimal depth reveals a "ceiling effect" in which a tree simply cannot be grown deep enough to properly identify predictive variables. Motivated by this limitation, we develop a new regularized algorithm, termed RSF-Variable Hunting. This algorithm exploits maximal subtrees for effective variable selection under such scenarios. Several applications are presented demonstrating the methodology, including the problem of gene selection using microarray data. In this work we focus only on survival settings, although our methodology also applies to other random forests applications, including regression and classification settings. All examples presented here use the R-software package randomSurvivalForest.
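As an illustration of the minimal depth statistic described in the abstract, the following is a minimal Python sketch and not the authors' implementation (the paper's examples use the R package randomSurvivalForest). It assumes a hypothetical toy tree structure, `Node`, with made-up variable names, counts the root node as depth 0, and simply omits variables that never split. For each variable it records the depth of the shallowest node that splits on it, which is the root of that variable's closest maximal subtree.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    """Hypothetical binary tree node; split_var is None for a terminal node."""
    split_var: Optional[str] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def minimal_depths(root: Node) -> Dict[str, int]:
    """Return, for each variable, the smallest depth at which it splits a node.

    Assumption for this sketch: the root sits at depth 0, and variables
    that never split are left out of the result.
    """
    depths: Dict[str, int] = {}
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if node is None or node.split_var is None:
            continue  # terminal node: nothing to record
        # Breadth-first order visits shallow nodes first, so the first
        # depth seen for a variable is its minimal depth.
        depths.setdefault(node.split_var, depth)
        queue.append((node.left, depth + 1))
        queue.append((node.right, depth + 1))
    return depths


# Toy tree with illustrative variable names (not from the paper).
toy_tree = Node(
    "age",
    left=Node("gene_a", left=Node(), right=Node("age", Node(), Node())),
    right=Node("gene_b", left=Node(), right=Node()),
)

print(minimal_depths(toy_tree))
```

In a forest, one would typically average these per-tree values for each variable, with smaller averages suggesting more predictive variables. How to score variables that never split in a tree is a convention this sketch leaves open.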

Original language: English (US)
Pages (from-to): 205-217
Number of pages: 13
Journal: Journal of the American Statistical Association
Volume: 105
Issue number: 489
State: Published - Mar 1 2010


Keywords

  • Forest
  • Maximal subtree
  • Minimal depth
  • Random survival forest
  • Tree
  • VIMP

ASJC Scopus subject areas

  • Statistics and Probability
  • Statistics, Probability and Uncertainty
