Exploiting dataset similarity for distributed mining

Srinivasan Parthasarathy, Mitsunori Ogihara

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Citations (Scopus)

Abstract

The notion of similarity is an important one in data mining. It can be used to provide useful structural information on data as well as enable clustering. In this paper we present an elegant method for measuring the similarity between homogeneous datasets. The algorithm presented is efficient in storage and scale, has the ability to adjust to time constraints, and can provide the user with likely causes of similarity or dissimilarity. One potential application of our similarity measure is in the distributed data mining domain. Using the notion of similarity across databases as a distance metric, one can generate clusters of similar datasets. Once similar datasets are clustered, each cluster can be independently mined to generate the appropriate rules for a given cluster. The similarity measure is evaluated on a dataset from the Census Bureau, and synthetic datasets from IBM.
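
The clustering application mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not define the similarity measure, so this example substitutes a plain Jaccard similarity over hypothetical per-dataset pattern sets; the function names, the site names, and the 0.6 distance threshold are all assumptions made for illustration, not the authors' method.

```python
from itertools import combinations


def jaccard_similarity(a, b):
    """Stand-in similarity (assumption): overlap of two pattern sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_datasets(summaries, max_distance):
    """Group datasets whose distance (1 - similarity) is at most
    max_distance, merging transitively (single linkage, union-find)."""
    names = list(summaries)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the cluster representative.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for x, y in combinations(names, 2):
        distance = 1.0 - jaccard_similarity(summaries[x], summaries[y])
        if distance <= max_distance:
            parent[find(x)] = find(y)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical summaries: each distributed site reduced to a set of patterns.
summaries = {
    "site_a": {"bread", "milk", "eggs"},
    "site_b": {"bread", "milk", "butter"},
    "site_c": {"nails", "hammer"},
}
print(cluster_datasets(summaries, max_distance=0.6))
# [['site_a', 'site_b'], ['site_c']]
```

Each resulting cluster could then be mined independently, as the abstract describes; any measure that yields a pairwise distance between datasets could replace the Jaccard stand-in here.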

Original language: English (US)
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 399-406
Number of pages: 8
Volume: 1800 LNCS
State: Published - 2000
Externally published: Yes
Event: 15 Workshops Held in Conjunction with the IEEE International Parallel and Distributed Processing Symposium, IPDPS 2000 - Cancun, Mexico
Duration: May 1 2000 to May 5 2000

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 1800 LNCS
ISSN (Print): 03029743
ISSN (Electronic): 16113349

Other

Other: 15 Workshops Held in Conjunction with the IEEE International Parallel and Distributed Processing Symposium, IPDPS 2000
Country: Mexico
City: Cancun
Period: 5/1/00 to 5/5/00

Fingerprint

Data mining
Mining
Similarity Measure
Distributed Data Mining
Distance Metric
Census
Data Mining
Likely
Similarity
Clustering

ASJC Scopus subject areas

  • Computer Science(all)
  • Theoretical Computer Science

Cite this

Parthasarathy, S., & Ogihara, M. (2000). Exploiting dataset similarity for distributed mining. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 1800 LNCS, pp. 399-406). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 1800 LNCS).

Exploiting dataset similarity for distributed mining. / Parthasarathy, Srinivasan; Ogihara, Mitsunori.

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 1800 LNCS 2000. p. 399-406 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 1800 LNCS).

Parthasarathy, S & Ogihara, M 2000, Exploiting dataset similarity for distributed mining. in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). vol. 1800 LNCS, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 1800 LNCS, pp. 399-406, 15 Workshops Held in Conjunction with the IEEE International Parallel and Distributed Processing Symposium, IPDPS 2000, Cancun, Mexico, 5/1/00.
Parthasarathy S, Ogihara M. Exploiting dataset similarity for distributed mining. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 1800 LNCS. 2000. p. 399-406. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
Parthasarathy, Srinivasan ; Ogihara, Mitsunori. / Exploiting dataset similarity for distributed mining. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 1800 LNCS 2000. pp. 399-406 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{05479e8d276e498e99bc484774579cc5,
title = "Exploiting dataset similarity for distributed mining",
abstract = "The notion of similarity is an important one in data mining. It can be used to provide useful structural information on data as well as enable clustering. In this paper we present an elegant method for measuring the similarity between homogeneous datasets. The algorithm presented is efficient in storage and scale, has the ability to adjust to time constraints, and can provide the user with likely causes of similarity or dissimilarity. One potential application of our similarity measure is in the distributed data mining domain. Using the notion of similarity across databases as a distance metric, one can generate clusters of similar datasets. Once similar datasets are clustered, each cluster can be independently mined to generate the appropriate rules for a given cluster. The similarity measure is evaluated on a dataset from the Census Bureau, and synthetic datasets from IBM.",
author = "Srinivasan Parthasarathy and Mitsunori Ogihara",
year = "2000",
language = "English (US)",
isbn = "354067442X",
volume = "1800 LNCS",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "399--406",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}

TY - GEN

T1 - Exploiting dataset similarity for distributed mining

AU - Parthasarathy, Srinivasan

AU - Ogihara, Mitsunori

PY - 2000

Y1 - 2000

N2 - The notion of similarity is an important one in data mining. It can be used to provide useful structural information on data as well as enable clustering. In this paper we present an elegant method for measuring the similarity between homogeneous datasets. The algorithm presented is efficient in storage and scale, has the ability to adjust to time constraints, and can provide the user with likely causes of similarity or dissimilarity. One potential application of our similarity measure is in the distributed data mining domain. Using the notion of similarity across databases as a distance metric, one can generate clusters of similar datasets. Once similar datasets are clustered, each cluster can be independently mined to generate the appropriate rules for a given cluster. The similarity measure is evaluated on a dataset from the Census Bureau, and synthetic datasets from IBM.

AB - The notion of similarity is an important one in data mining. It can be used to provide useful structural information on data as well as enable clustering. In this paper we present an elegant method for measuring the similarity between homogeneous datasets. The algorithm presented is efficient in storage and scale, has the ability to adjust to time constraints, and can provide the user with likely causes of similarity or dissimilarity. One potential application of our similarity measure is in the distributed data mining domain. Using the notion of similarity across databases as a distance metric, one can generate clusters of similar datasets. Once similar datasets are clustered, each cluster can be independently mined to generate the appropriate rules for a given cluster. The similarity measure is evaluated on a dataset from the Census Bureau, and synthetic datasets from IBM.

UR - http://www.scopus.com/inward/record.url?scp=84876371181&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84876371181&partnerID=8YFLogxK

M3 - Conference contribution

SN - 354067442X

SN - 9783540674429

VL - 1800 LNCS

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 399

EP - 406

BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

ER -