Algorithms for clustering high dimensional and distributed data

Tao Li, Shenghuo Zhu, Mitsunori Ogihara

Research output: Contribution to journal › Article

13 Citations (Scopus)

Abstract

Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. The clustering problem has been widely studied in machine learning, databases, and statistics. This paper studies the problem of clustering high-dimensional data. It proposes CoFD, a non-distance-based clustering algorithm for high-dimensional spaces. Based on the Maximum Likelihood Principle, CoFD attempts to optimize its parameter settings to maximize the likelihood of the data points given the model generated by the parameters. Distributed versions of the algorithm, called the D-CoFD algorithms, are also proposed. Experimental results on both synthetic and real data sets show the efficiency and effectiveness of the CoFD and D-CoFD algorithms.

Original language: English (US)
Pages (from-to): 305-326
Number of pages: 22
Journal: Intelligent Data Analysis
Volume: 7
Issue number: 4
State: Published - 2003
Externally published: Yes

Fingerprint

High-dimensional
Clustering
Likelihood Principle
High-dimensional Data
Clustering algorithms
Large Data Sets
Maximum likelihood
Learning systems
Partitioning
Likelihood
Machine Learning
Maximise
Optimise
Statistics
Experimental Results
Model

Keywords

  • clustering
  • CoFD
  • distributed
  • high dimensional
  • maximum likelihood

ASJC Scopus subject areas

  • Artificial Intelligence
  • Theoretical Computer Science
  • Computer Vision and Pattern Recognition

Cite this

Algorithms for clustering high dimensional and distributed data. / Li, Tao; Zhu, Shenghuo; Ogihara, Mitsunori.

In: Intelligent Data Analysis, Vol. 7, No. 4, 2003, p. 305-326.


@article{6ff9aea14937478092b2f6863f315a77,
title = "Algorithms for clustering high dimensional and distributed data",
keywords = "clustering, CoFD, distributed, high dimensional, maximum likelihood",
author = "Tao Li and Shenghuo Zhu and Mitsunori Ogihara",
year = "2003",
language = "English (US)",
volume = "7",
pages = "305--326",
journal = "Intelligent Data Analysis",
issn = "1088-467X",
publisher = "IOS Press",
number = "4",

}

TY - JOUR
T1 - Algorithms for clustering high dimensional and distributed data
AU - Li, Tao
AU - Zhu, Shenghuo
AU - Ogihara, Mitsunori
PY - 2003
Y1 - 2003
KW - clustering
KW - CoFD
KW - distributed
KW - high dimensional
KW - maximum likelihood
UR - http://www.scopus.com/inward/record.url?scp=18744411943&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=18744411943&partnerID=8YFLogxK
M3 - Article
AN - SCOPUS:18744411943
VL - 7
SP - 305
EP - 326
JO - Intelligent Data Analysis
JF - Intelligent Data Analysis
SN - 1088-467X
IS - 4
ER -