Survival Bump Hunting for Finding Informative Subgroups in High Dimensional Data.

Project: Research project

Project Details


DESCRIPTION (provided by applicant): Subgroup discovery based on high-dimensional genomic data can potentially provide novel insights into a disease process. Typically this has been done with various forms of cluster analysis (both supervised and unsupervised). Extreme subgroups are defined as those that are homogeneous in nature but present extreme-valued outcomes. Of particular interest in this project is the development of methodology to identify subgroups that are extreme with respect to survival outcomes (e.g. individuals who do unusually well on a cancer treatment and can be delineated based on high-dimensional genomic predictors). If such subgroups are real and are uncovered, the implications would include improved understanding of the disease etiology, discovery of new biomarkers with potential therapeutic targets, and early, personalized therapeutic interventions. Statistically, this problem can be framed within a sparse survival bump hunting framework. We have brought together a team of biostatisticians who pioneered the first sparse bump hunting models for continuous responses, as well as two internationally recognized laboratories as collaborators, which work on multi-platform genomic profiling for pediatric medulloblastoma and non-small cell lung cancer, respectively. We thus propose the following specific aims: 1) To develop new models for sparse bump hunting that allow survival outcomes with both continuous and nominal predictors (e.g. gene expression and SNPs); 2) To develop a sparse survival bump hunting approach that will allow us to integrate SNP and gene expression profile data by three different approaches: sparse coaching, bump phenotyping and sparse mediation analysis; 3) To develop detailed theory for the asymptotic performance of these sparse survival bump hunting models and theory for a new fence-based methodology for studying model validation, and to empirically study and compare performance in detailed simulations as well as on the datasets provided by our collaborator laboratories; 4) To develop a user-friendly Java-based interface and a command-line end-user CRAN package in the R language that will implement all of our methodologies and their extensions.
Effective start/end date: 3/1/13 - 2/28/18


  • National Institutes of Health: $250,249.00
  • National Institutes of Health: $261,828.00
  • National Institutes of Health: $268,926.00
  • National Institutes of Health: $260,014.00


  • Medicine(all)

