Project Summary/Abstract Building from the PI's current R01, we propose next generation random forests (RF) designed for unprecedented accuracy and computational scalability to meet the challenges of today's complex and big data in the health sciences. Superior accuracy is achieved using super greedy trees which circumvent limitations on local adaptivity imposed by classical tree splitting. We identify a key quantity, forest weights, and show how these can be leveraged for further improvements and generalizability. In one application, improved survival estimators are applied to worldwide esophageal cancer data to develop guidelines for clinical decision making. Richer RF inference is another issue explored. Cutting edge machine learning methods rarely consider the problem of estimating variability. For RF, bootstrapping currently exists as the only tool for reliably estimating con?dence intervals, but due to heavy computations is rarely applied. We introduce tools to rapidily calculate standard errors based on U-statistic theory. These will be used to increase robustness of esophageal clinical recommendations and to investigate survival temporal trends in cardiovascular disease. In another application, we make use of our new massive data scalability for discovery of tumor and immune regulators of immunotherapy in cancers. This project will set the standard for RF computational performance. Building from the core libraries of the highly accessed R-package randomForestSRC (RF-SRC), software developed under the PIs current R01, we develop open source next generation RF software, RF-SRC Everywhere, Big Data RF-SRC, and HPC RF-SRC. The software will be deployable on a number of popular machine learning workbenches, use distributed data storage technologies, and be optimized for big-p, big-n, and big-np scenarios.
|Effective start/end date||9/1/17 → 5/31/21|
- National Institutes of Health: $347,834.00
- National Institutes of Health: $359,872.00
Data storage equipment