Finding global icebergs over distributed data sets

Qi Zhao, Mitsunori Ogihara, Haixun Wang, Jun Xu

Research output: Chapter in Book/Report/Conference proceedingConference contribution

39 Scopus citations

Abstract

Finding icebergs-items whose frequency of occurrence is above a certain threshold-is an important problem with a wide range of applications. Most of the existing work focuses on iceberg queries at a single node. However, in many real-life applications, data sets are distributed across a large number of nodes. Two naïve approaches might be considered. In the first, each node ships its entire data set to a central server, and the central server uses single-node algorithms to find icebergs. But it may incur prohibitive communication overhead. In the second, each node submits local icebergs, and the central server combines local icebergs to find global icebergs. But it may fail because in many important applications, globally frequent items may not be frequent at any node. In this work, we propose two novel schemes that provide accurate and efficient solutions to this problem: a sampling-based scheme and a counting-sketch-based scheme. In particular, the latter scheme incurs a communication cost at least an order of magnitude smaller than the naïve scheme of shipping all data, yet is able to achieve very high accuracy. Through rigorous theoretical and experimental analysis we establish the statistical properties of our proposed algorithms, including their accuracy bounds.

Original languageEnglish (US)
Title of host publicationProceedings of the Twenty-Fifth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2006
Pages298-307
Number of pages10
DOIs
StatePublished - Dec 1 2006
Externally publishedYes
Event25th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2006 - Chicago, IL, United States
Duration: Jun 26 2006Jun 28 2006

Publication series

NameProceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems

Other

Other25th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2006
CountryUnited States
CityChicago, IL
Period6/26/066/28/06

Keywords

  • Data streaming
  • Icebergs
  • Statistical inference

ASJC Scopus subject areas

  • Software
  • Information Systems
  • Hardware and Architecture

Fingerprint Dive into the research topics of 'Finding global icebergs over distributed data sets'. Together they form a unique fingerprint.

  • Cite this

    Zhao, Q., Ogihara, M., Wang, H., & Xu, J. (2006). Finding global icebergs over distributed data sets. In Proceedings of the Twenty-Fifth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2006 (pp. 298-307). (Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems). https://doi.org/10.1145/1142351.1142394