Affinity-based similarity measure for Web document clustering

Mei-Ling Shyu, Shu Ching Chen, Min Chen, Stuart H. Rubin

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

11 Citations (Scopus)

Abstract

Compared to regular documents, the major distinguishing characteristic of Web documents is their dynamic hyper-structure. Thus, in addition to the terms or keywords used for regular document clustering, Web document clustering can incorporate dynamic information such as hyperlinks and the access patterns extracted from user query logs. In this paper, we extend the concept of document clustering to Web document clustering by introducing an affinity-based similarity measure, which uses user access patterns to determine the similarities among Web documents via a probabilistic model. Several comparison experiments are conducted on a real data set, and the experimental results demonstrate that the proposed similarity measure outperforms the cosine coefficient and the Euclidean distance method under different document clustering algorithms.
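The abstract names two baselines, the cosine coefficient and the Euclidean distance, and contrasts them with a measure derived from user access patterns. The sketch below illustrates that contrast only. The two baseline functions follow their standard definitions. The function `co_access_affinity` is a hypothetical stand-in (the share of query sessions in which two documents are accessed together) and is not the probabilistic model proposed in the paper, whose formula is not given here. All function names and the sample sessions are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(u, v):
    """Standard cosine coefficient between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for an all-zero vector
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Standard Euclidean distance between two equal-length term vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def co_access_affinity(sessions):
    """Hypothetical access-pattern score, NOT the paper's model.

    For each unordered document pair, returns the fraction of sessions
    in which both documents were accessed.
    """
    pair_counts = Counter()
    for docs in sessions:
        # Count each pair at most once per session.
        for a, b in combinations(sorted(set(docs)), 2):
            pair_counts[(a, b)] += 1
    total = len(sessions)
    return {pair: count / total for pair, count in pair_counts.items()}


# Illustrative data: term vectors and query sessions are made up.
sessions = [["a", "b"], ["a", "b", "c"], ["c"], ["a"]]
affinity = co_access_affinity(sessions)
```

Term-based measures such as the two baselines see only document content, while a score built from sessions can relate documents that share few terms but are often retrieved together, which is the kind of signal the abstract refers to.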

Original language: English
Title of host publication: Proceedings of the 2004 IEEE International Conference on Information Reuse and Integration, IRI-2004
Editors: A.M. Memon, N. Zhao
Pages: 247-252
Number of pages: 6
State: Published - Dec 1 2004
Event: Proceedings of the 2004 IEEE International Conference on Information Reuse and Integration, IRI-2004 - Las Vegas, NV, United States
Duration: Nov 8 2004 to Nov 10 2004

Other

Other: Proceedings of the 2004 IEEE International Conference on Information Reuse and Integration, IRI-2004
Country: United States
City: Las Vegas, NV
Period: 11/8/04 to 11/10/04

Fingerprint

Clustering algorithms
Experiments
Statistical Models

ASJC Scopus subject areas

  • Engineering(all)

Cite this

Shyu, M-L., Chen, S. C., Chen, M., & Rubin, S. H. (2004). Affinity-based similarity measure for Web document clustering. In A. M. Memon, & N. Zhao (Eds.), Proceedings of the 2004 IEEE International Conference on Information Reuse and Integration, IRI-2004 (pp. 247-252)

@inproceedings{ea113814333c4a9db051941497817dfc,
title = "Affinity-based similarity measure for Web document clustering",
abstract = "Compared to regular documents, the major distinguishing characteristic of Web documents is their dynamic hyper-structure. Thus, in addition to the terms or keywords used for regular document clustering, Web document clustering can incorporate dynamic information such as hyperlinks and the access patterns extracted from user query logs. In this paper, we extend the concept of document clustering to Web document clustering by introducing an affinity-based similarity measure, which uses user access patterns to determine the similarities among Web documents via a probabilistic model. Several comparison experiments are conducted on a real data set, and the experimental results demonstrate that the proposed similarity measure outperforms the cosine coefficient and the Euclidean distance method under different document clustering algorithms.",
author = "Mei-Ling Shyu and Chen, {Shu Ching} and Min Chen and Rubin, {Stuart H.}",
year = "2004",
month = "12",
day = "1",
language = "English",
isbn = "0780388194",
pages = "247--252",
editor = "A.M. Memon and N. Zhao",
booktitle = "Proceedings of the 2004 IEEE International Conference on Information Reuse and Integration, IRI-2004",

}

TY - GEN

T1 - Affinity-based similarity measure for Web document clustering

AU - Shyu, Mei-Ling

AU - Chen, Shu Ching

AU - Chen, Min

AU - Rubin, Stuart H.

PY - 2004/12/1

Y1 - 2004/12/1

N2 - Compared to regular documents, the major distinguishing characteristic of Web documents is their dynamic hyper-structure. Thus, in addition to the terms or keywords used for regular document clustering, Web document clustering can incorporate dynamic information such as hyperlinks and the access patterns extracted from user query logs. In this paper, we extend the concept of document clustering to Web document clustering by introducing an affinity-based similarity measure, which uses user access patterns to determine the similarities among Web documents via a probabilistic model. Several comparison experiments are conducted on a real data set, and the experimental results demonstrate that the proposed similarity measure outperforms the cosine coefficient and the Euclidean distance method under different document clustering algorithms.

UR - http://www.scopus.com/inward/record.url?scp=16244393736&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=16244393736&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:16244393736

SN - 0780388194

SP - 247

EP - 252

BT - Proceedings of the 2004 IEEE International Conference on Information Reuse and Integration, IRI-2004

A2 - Memon, A.M.

A2 - Zhao, N.

ER -