Category cluster discovery from distributed WWW directories

Mei-Ling Shyu, Choochart Haruechaiyasak, Shu Ching Chen

Research output: Contribution to journal › Article

33 Scopus citations

Abstract

Due to the inherently distributed nature of many networks, including the Internet, information and knowledge are generated and organized independently by different groups of people. To discover and exploit all the knowledge from different sources, a method of knowledge integration is usually required. Considering the document category sets as information sources, we define a problem of information integration called category merging. The purpose of category merging is to automatically construct a unified category set which represents and exploits document information from several different sources. This merging process is based on the clustering concept where categories with similar characteristics are merged into the same cluster under certain distributed constraints. To evaluate the quality of the merged category set, we measure the precision and recall values under three classification methods, Naive Bayes, Vector Space Model, and K-Nearest Neighbor. In addition, we propose a performance measure called cluster entropy, which determines how well the categories from different sources are distributed over the resulting clusters. We perform the merging process by using the real data sets collected from three different Web directories. The results show that our merging process improves the classification performance over the non-merged approach and also provides a better representation for all categories from distributed directories.
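The abstract does not give the formula for cluster entropy, so the following is only an illustrative sketch and not the authors' definition. It assumes the measure can be approximated by the Shannon entropy of the source-directory labels inside each cluster, averaged over clusters and weighted by cluster size. The function names, the weighting, and the directory labels ("dirA", "dirB", "dirC") are hypothetical.

```python
import math
from collections import Counter


def source_entropy(cluster_sources):
    """Shannon entropy (in bits) of the source labels in one non-empty cluster.

    ASSUMPTION: this is a generic entropy, not the paper's exact measure.
    """
    counts = Counter(cluster_sources)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mean_cluster_entropy(clusters):
    """Size-weighted average of per-cluster source entropy.

    ASSUMPTION: weighting by cluster size is an illustrative choice.
    """
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * source_entropy(c) for c in clusters)


# Hypothetical data: each inner list holds the source directory of every
# category assigned to that cluster.
mixed = [["dirA", "dirB", "dirC"], ["dirA", "dirB", "dirC"]]
separated = [["dirA", "dirA"], ["dirB", "dirB"], ["dirC", "dirC"]]

print(mean_cluster_entropy(mixed))      # higher: sources are spread evenly
print(mean_cluster_entropy(separated))  # zero: each cluster has one source
```

Under this assumed definition, a higher value means that categories from the different directories are mixed more evenly within clusters, while a value of zero means every cluster draws from a single directory.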

Original language: English
Pages (from-to): 181-197
Number of pages: 17
Journal: Information Sciences
Volume: 155
Issue number: 3-4
State: Published - Oct 15 2003

Keywords

  • Cluster analysis
  • Distributed information sources
  • Document classification
  • Information integration
  • Web mining

ASJC Scopus subject areas

  • Statistics and Probability
  • Electrical and Electronic Engineering
  • Statistics, Probability and Uncertainty
  • Information Systems and Management
  • Information Systems
  • Computer Science Applications
  • Artificial Intelligence
