Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival

Research output: Contribution to journal (Article)

4 Citations (Scopus)

Abstract

Random forests are a popular nonparametric tree ensemble procedure with broad applications to data analysis. While its widespread popularity stems from its prediction performance, an equally important feature is that it provides a fully nonparametric measure of variable importance (VIMP). A current limitation of VIMP, however, is that no systematic method exists for estimating its variance. As a solution, we propose a subsampling approach that can be used to estimate the variance of VIMP and for constructing confidence intervals. The method is general enough that it can be applied to many useful settings, including regression, classification, and survival problems. Using extensive simulations, we demonstrate the effectiveness of the subsampling estimator and in particular find that the delete-d jackknife variance estimator, a close cousin, is especially effective under low subsampling rates due to its bias correction properties. These 2 estimators are highly competitive when compared with the .164 bootstrap estimator, a modified bootstrap procedure designed to deal with ties in out-of-sample data. Most importantly, subsampling is computationally fast, thus making it especially attractive for big data settings.
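
As a rough illustration of the two variance estimators named in the abstract, the following Python sketch applies the textbook subsampling and delete-d jackknife formulas to a generic statistic. It is not the authors' implementation: the function name `subsample_variance`, its parameters, and the normal-approximation confidence interval are assumptions made for this example, and the sample mean stands in for VIMP, which in practice would be recomputed from a forest grown on each subsample.

```python
import random
from statistics import NormalDist, fmean


def subsample_variance(data, statistic, b, n_subsamples=200, alpha=0.05, seed=0):
    """Estimate Var(statistic(data)) from subsamples of size b drawn without replacement.

    Hypothetical helper for illustration; `statistic` is any function of a list.
    """
    n = len(data)
    if not 1 <= b < n:
        raise ValueError("b must satisfy 1 <= b < n")
    rng = random.Random(seed)
    pool = list(data)
    theta_full = statistic(pool)  # estimate on the full sample
    thetas = [statistic(rng.sample(pool, b)) for _ in range(n_subsamples)]
    theta_bar = fmean(thetas)

    # Subsampling estimator: spread around the subsample average, rescaled by b/n.
    subsample_var = (b / n) * fmean([(t - theta_bar) ** 2 for t in thetas])
    # Delete-d jackknife (d = n - b): spread around the full-sample estimate,
    # rescaled by b/(n - b).
    jackknife_var = (b / (n - b)) * fmean([(t - theta_full) ** 2 for t in thetas])

    # Assumption: a normal-approximation interval built from the jackknife variance.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * jackknife_var ** 0.5
    return {
        "estimate": theta_full,
        "subsample_var": subsample_var,
        "jackknife_var": jackknife_var,
        "ci": (theta_full - half_width, theta_full + half_width),
    }


# Stand-in data: 500 standard normal draws; the statistic is the sample mean.
rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(500)]
result = subsample_variance(sample, fmean, b=50)
print(result)
```

With these formulas the jackknife estimate is never smaller than the subsampling estimate, because it measures deviations from the full-sample value and uses the larger factor b/(n - b). For the sample mean, that larger factor offsets the shrinkage caused by drawing subsamples without replacement.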

Original language: English (US)
Pages (from-to): 558-582
Number of pages: 25
Journal: Statistics in Medicine
Volume: 38
Issue number: 4
DOI: 10.1002/sim.7803
State: Published - Feb 20 2019
Externally published: Yes

Fingerprint

Subsampling
Random Forest
Standard error
Confidence interval
Regression
Estimator
Bootstrap
Bias Correction
Jackknife
Variance Estimator
Performance Prediction
Tie
Data analysis
Ensemble
Forests
Estimate
Demonstrate
Simulation

Keywords

  • bootstrap
  • delete-d jackknife
  • permutation importance
  • prediction error
  • subsampling
  • VIMP

ASJC Scopus subject areas

  • Epidemiology
  • Statistics and Probability

Cite this

Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. / Ishwaran, Hemant; Lu, Min.

In: Statistics in Medicine, Vol. 38, No. 4, 20.02.2019, p. 558-582.

@article{3af17e9de57b4c6f885dfe19d95301fa,
title = "Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival",
abstract = "Random forests are a popular nonparametric tree ensemble procedure with broad applications to data analysis. While its widespread popularity stems from its prediction performance, an equally important feature is that it provides a fully nonparametric measure of variable importance (VIMP). A current limitation of VIMP, however, is that no systematic method exists for estimating its variance. As a solution, we propose a subsampling approach that can be used to estimate the variance of VIMP and for constructing confidence intervals. The method is general enough that it can be applied to many useful settings, including regression, classification, and survival problems. Using extensive simulations, we demonstrate the effectiveness of the subsampling estimator and in particular find that the delete-d jackknife variance estimator, a close cousin, is especially effective under low subsampling rates due to its bias correction properties. These 2 estimators are highly competitive when compared with the .164 bootstrap estimator, a modified bootstrap procedure designed to deal with ties in out-of-sample data. Most importantly, subsampling is computationally fast, thus making it especially attractive for big data settings.",
keywords = "bootstrap, delete-d jackknife, permutation importance, prediction error, subsampling, VIMP",
author = "Hemant Ishwaran and Min Lu",
year = "2019",
month = "2",
day = "20",
doi = "10.1002/sim.7803",
language = "English (US)",
volume = "38",
pages = "558--582",
journal = "Statistics in Medicine",
issn = "0277-6715",
publisher = "John Wiley and Sons Ltd",
number = "4",

}

TY - JOUR

T1 - Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival

AU - Ishwaran, Hemant

AU - Lu, Min

PY - 2019/2/20

Y1 - 2019/2/20

AB - Random forests are a popular nonparametric tree ensemble procedure with broad applications to data analysis. While its widespread popularity stems from its prediction performance, an equally important feature is that it provides a fully nonparametric measure of variable importance (VIMP). A current limitation of VIMP, however, is that no systematic method exists for estimating its variance. As a solution, we propose a subsampling approach that can be used to estimate the variance of VIMP and for constructing confidence intervals. The method is general enough that it can be applied to many useful settings, including regression, classification, and survival problems. Using extensive simulations, we demonstrate the effectiveness of the subsampling estimator and in particular find that the delete-d jackknife variance estimator, a close cousin, is especially effective under low subsampling rates due to its bias correction properties. These 2 estimators are highly competitive when compared with the .164 bootstrap estimator, a modified bootstrap procedure designed to deal with ties in out-of-sample data. Most importantly, subsampling is computationally fast, thus making it especially attractive for big data settings.

KW - bootstrap

KW - delete-d jackknife

KW - permutation importance

KW - prediction error

KW - subsampling

KW - VIMP

UR - http://www.scopus.com/inward/record.url?scp=85059163546&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85059163546&partnerID=8YFLogxK

U2 - 10.1002/sim.7803

DO - 10.1002/sim.7803

M3 - Article

VL - 38

SP - 558

EP - 582

JO - Statistics in Medicine

JF - Statistics in Medicine

SN - 0277-6715

IS - 4

ER -