Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival

Research output: Contribution to journal › Article

25 Scopus citations

Abstract

Random forests are a popular nonparametric tree ensemble procedure with broad applications to data analysis. While their widespread popularity stems from their prediction performance, an equally important feature is that they provide a fully nonparametric measure of variable importance (VIMP). A current limitation of VIMP, however, is that no systematic method exists for estimating its variance. As a solution, we propose a subsampling approach that can be used to estimate the variance of VIMP and to construct confidence intervals. The method is general enough that it can be applied to many useful settings, including regression, classification, and survival problems. Using extensive simulations, we demonstrate the effectiveness of the subsampling estimator and in particular find that the delete-d jackknife variance estimator, a close cousin, is especially effective under low subsampling rates due to its bias correction properties. These 2 estimators are highly competitive when compared with the .164 bootstrap estimator, a modified bootstrap procedure designed to deal with ties in out-of-sample data. Most importantly, subsampling is computationally fast, thus making it especially attractive for big data settings.
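The abstract does not state the estimators explicitly, so the following is only a minimal sketch of how a subsampling variance estimate and its delete-d jackknife cousin are commonly written, assuming a root-n statistic and subsamples of size b drawn without replacement. The function names `subsample_variance` and `normal_ci` are hypothetical, and the sample mean stands in for VIMP purely to keep the example self-contained; in practice `statistic` would refit a forest on each subsample and return the VIMP of one variable.

```python
import math
import random
from statistics import NormalDist, fmean


def subsample_variance(data, statistic, b, n_subsamples=200, seed=0):
    """Sketch of subsampling and delete-d jackknife variance estimates.

    Assumed (textbook) forms, not taken from the abstract:
      subsampling : (b / n)       * mean_k (theta_k - theta_bar)^2
      delete-d JK : (b / (n - b)) * mean_k (theta_k - theta_full)^2, d = n - b
    """
    n = len(data)
    if not 1 <= b < n:
        raise ValueError("subsample size b must satisfy 1 <= b < n")
    rng = random.Random(seed)
    theta_full = statistic(data)  # statistic on the full sample

    thetas = []
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), b)  # draw b indices without replacement
        thetas.append(statistic([data[i] for i in idx]))

    theta_bar = fmean(thetas)
    var_sub = (b / n) * fmean([(t - theta_bar) ** 2 for t in thetas])
    # Centering at the full-sample value is what supplies the bias correction.
    var_jack = (b / (n - b)) * fmean([(t - theta_full) ** 2 for t in thetas])
    return theta_full, var_sub, var_jack


def normal_ci(theta, variance, level=0.95):
    """Normal-approximation confidence interval from a variance estimate."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * math.sqrt(variance)
    return theta - half_width, theta + half_width


if __name__ == "__main__":
    rng = random.Random(42)
    toy = [rng.gauss(0.0, 1.0) for _ in range(500)]  # toy data, not from the paper
    theta, v_sub, v_jack = subsample_variance(toy, fmean, b=50)
    print("subsampling CI:", normal_ci(theta, v_sub))
    print("delete-d jackknife CI:", normal_ci(theta, v_jack))
```

Because each replicate uses only b of the n observations, the cost per replicate shrinks with the subsampling rate, which is consistent with the abstract's point about computational speed. In this sketch the delete-d jackknife estimate is never smaller than the subsampling estimate, since it uses a larger scaling factor and centers at the full-sample value.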

Original language: English (US)
Pages (from-to): 558-582
Number of pages: 25
Journal: Statistics in Medicine
Volume: 38
Issue number: 4
State: Published - Feb 20 2019

Keywords

  • VIMP
  • bootstrap
  • delete-d jackknife
  • permutation importance
  • prediction error
  • subsampling

ASJC Scopus subject areas

  • Epidemiology
  • Statistics and Probability
