Regularized k-means clustering of high-dimensional data and its asymptotic consistency

Wei Sun, Junhui Wang, Yixin Fang

Research output: Contribution to journal › Article

44 Citations (Scopus)

Abstract

K-means clustering is a widely used tool for cluster analysis due to its conceptual simplicity and computational efficiency. However, its performance can be distorted when clustering high-dimensional data where the number of variables becomes relatively large and many of them may contain no information about the clustering structure. This article proposes a high-dimensional cluster analysis method via regularized k-means clustering, which can simultaneously cluster similar observations and eliminate redundant variables. The key idea is to formulate the k-means clustering in a form of regularization, with an adaptive group lasso penalty term on cluster centers. In order to optimally balance the trade-off between the clustering model fitting and sparsity, a selection criterion based on clustering stability is developed. The asymptotic estimation and selection consistency of the regularized k-means clustering with diverging dimension is established. The effectiveness of the regularized k-means clustering is also demonstrated through a variety of numerical experiments as well as applications to two gene microarray examples. The regularized clustering framework can also be extended to the general model-based clustering.
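
The key idea can be made concrete with a penalized objective along the following lines; this is a sketch reconstructed from the abstract, so the exact notation, the form of the weights w_j, and the scaling are assumptions rather than the authors' statement. With centered data x_1, ..., x_n in R^p and K cluster centers, a regularized k-means criterion of this type is

\[
\min_{\mu_1,\dots,\mu_K}\;\frac{1}{n}\sum_{i=1}^{n}\min_{1\le k\le K}\lVert x_i-\mu_k\rVert^2
\;+\;\lambda\sum_{j=1}^{p} w_j\,\lVert \mu_{(j)}\rVert,
\qquad
\mu_{(j)}=(\mu_{1j},\dots,\mu_{Kj})^{\top},
\]

where \mu_{(j)} collects the j-th coordinate of all K cluster centers and the w_j are adaptive weights (for instance, inverse group norms from an unpenalized k-means fit). The group lasso penalty can shrink an entire group \mu_{(j)} to zero, removing variable j from the clustering; tuning \lambda then trades off model fit against sparsity, which the article does through a clustering-stability criterion.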

Original language: English (US)
Pages (from-to): 148-167
Number of pages: 20
Journal: Electronic Journal of Statistics
Volume: 6
DOIs: 10.1214/12-EJS668
State: Published - Feb 21, 2012
Externally published: Yes

Keywords

  • Diverging dimension
  • K-means
  • Lasso
  • Selection consistency
  • Stability
  • Variable selection

ASJC Scopus subject areas

  • Statistics and Probability

Cite this

Regularized k-means clustering of high-dimensional data and its asymptotic consistency. / Sun, Wei; Wang, Junhui; Fang, Yixin.

In: Electronic Journal of Statistics, Vol. 6, 21.02.2012, p. 148-167.

Research output: Contribution to journal › Article

@article{51de1a4d963a4a6998b2e2dbfa7d7204,
title = "Regularized k-means clustering of high-dimensional data and its asymptotic consistency",
abstract = "K-means clustering is a widely used tool for cluster analysis due to its conceptual simplicity and computational efficiency. However, its performance can be distorted when clustering high-dimensional data where the number of variables becomes relatively large and many of them may contain no information about the clustering structure. This article proposes a high-dimensional cluster analysis method via regularized k-means clus- tering, which can simultaneously cluster similar observations and eliminate redundant variables. The key idea is to formulate the k-means clustering in a form of regularization, with an adaptive group lasso penalty term on cluster centers. In order to optimally balance the trade-off between the clustering model fitting and sparsity, a selection criterion based on clustering stabil- ity is developed. The asymptotic estimation and selection consistency of the regularized k-means clustering with diverging dimension is established. The effectiveness of the regularized k-means clustering is also demonstrated through a variety of numerical experiments as well as applications to two gene microarray examples. The regularized clustering framework can also be extended to the general model-based clustering.",
keywords = "Diverging dimension, K-means, Lasso, Selection consistency, Stability, Variable selection",
author = "Wei Sun and Junhui Wang and Yixin Fang",
year = "2012",
month = "2",
day = "21",
doi = "10.1214/12-EJS668",
language = "English (US)",
volume = "6",
pages = "148--167",
journal = "Electronic Journal of Statistics",
issn = "1935-7524",
publisher = "Institute of Mathematical Statistics",

}
