Global iceberg detection over distributed data streams

Haiquan Zhao, Ashwin Lall, Mitsunori Ogihara, Jun Xu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

15 Citations (Scopus)

Abstract

In today's Internet applications or sensor networks we often encounter large amounts of data spread over many physically distributed nodes. The sheer volume of the data and bandwidth constraints make it impractical to send all the data to one central node for query processing. Finding distributed icebergs - elements that may have low frequency at individual nodes but high aggregate frequency - is a problem that arises commonly in practice. In this paper we present a novel algorithm with two notable properties. First, its accuracy guarantee and communication cost are independent of the way in which element counts (for both icebergs and non-icebergs) are split amongst the nodes. Second, it works even when each distributed data set is a stream (i.e., one pass data access only). Our algorithm builds upon sketches constructed for the estimation of the second frequency moment (F2) of data streams. The intuition of our idea is that when there are global icebergs in the union of these data streams the F2 of the union becomes very large. This quantity can be estimated due to the summable nature of F2 sketches. Our key innovation here is to establish tight theoretical guarantees of our algorithm, under certain reasonable assumptions, using an interesting combination of convex ordering theory and large deviation techniques.
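The "summable nature of F2 sketches" mentioned in the abstract can be illustrated with a small sketch in the style of the classic tug-of-war (AMS) estimator. This is not the authors' algorithm, only a minimal illustration of why per-node sketches can be added at a coordinator. The class name `F2Sketch`, its `width` and `seed` parameters, and the use of a BLAKE2 hash as a stand-in for a 4-wise independent sign function are all assumptions made for this example.

```python
import hashlib
from collections import Counter


class F2Sketch:
    """Hypothetical tug-of-war style sketch for the second frequency moment."""

    def __init__(self, width=256, seed=0):
        self.width = width            # number of independent counters
        self.seed = seed              # all nodes must share the same seed
        self.counters = [0] * width

    def _sign(self, j, item):
        # Assumption: a cryptographic hash stands in for a 4-wise
        # independent hash family mapping items to {-1, +1}.
        data = f"{self.seed}:{j}:{item}".encode()
        bit = hashlib.blake2b(data, digest_size=1).digest()[0] & 1
        return 1 if bit else -1

    def update(self, item, count=1):
        # One-pass update: each counter adds +count or -count.
        for j in range(self.width):
            self.counters[j] += count * self._sign(j, item)

    def merge(self, other):
        # Sketches are linear, so the sketch of a union of streams
        # is the element-wise sum of the per-stream sketches.
        if (self.width, self.seed) != (other.width, other.seed):
            raise ValueError("sketches must share width and seed")
        merged = F2Sketch(self.width, self.seed)
        merged.counters = [a + b for a, b in zip(self.counters, other.counters)]
        return merged

    def estimate_f2(self):
        # Average of squared counters estimates sum of squared frequencies.
        return sum(c * c for c in self.counters) / self.width


def exact_f2(stream):
    """Exact second frequency moment, for comparison only."""
    return sum(c * c for c in Counter(stream).values())


# Element "a" is infrequent at each node but frequent in aggregate.
node_streams = [
    ["a"] * 3 + ["b"],
    ["a"] * 4 + ["c"],
    ["a"] * 5 + ["d", "e"],
]

# Each node sketches its own stream in one pass.
local_sketches = []
for stream in node_streams:
    sketch = F2Sketch()
    for item in stream:
        sketch.update(item)
    local_sketches.append(sketch)

# The coordinator only needs the sketches, not the raw data.
merged = local_sketches[0]
for sketch in local_sketches[1:]:
    merged = merged.merge(sketch)

# Reference: a single sketch built over the union of all streams.
union = [item for stream in node_streams for item in stream]
central = F2Sketch()
for item in union:
    central.update(item)

print("exact F2 of union:", exact_f2(union))
print("estimate from merged sketches:", merged.estimate_f2())
```

In this toy setup the merged sketch is identical to one built centrally over the union, which is the property that lets the F2 of the union be estimated without shipping raw data. How that estimate is turned into an iceberg detector with provable guarantees is the subject of the paper and is not reproduced here.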

Original language: English (US)
Title of host publication: Proceedings - International Conference on Data Engineering
Pages: 557-568
Number of pages: 12
DOI: https://doi.org/10.1109/ICDE.2010.5447825
State: Published - 2010
Event: 26th IEEE International Conference on Data Engineering, ICDE 2010 - Long Beach, CA, United States
Duration: Mar 1 2010 to Mar 6 2010

Other

Other: 26th IEEE International Conference on Data Engineering, ICDE 2010
Country: United States
City: Long Beach, CA
Period: 3/1/10 to 3/6/10

Fingerprint

Query processing
Sensor networks
Innovation
Internet
Bandwidth
Communication
Costs

ASJC Scopus subject areas

  • Information Systems
  • Signal Processing
  • Software

Cite this

Zhao, H., Lall, A., Ogihara, M., & Xu, J. (2010). Global iceberg detection over distributed data streams. In Proceedings - International Conference on Data Engineering (pp. 557-568). [5447825] https://doi.org/10.1109/ICDE.2010.5447825

Zhao, H, Lall, A, Ogihara, M & Xu, J 2010, Global iceberg detection over distributed data streams. in Proceedings - International Conference on Data Engineering., 5447825, pp. 557-568, 26th IEEE International Conference on Data Engineering, ICDE 2010, Long Beach, CA, United States, 3/1/10. https://doi.org/10.1109/ICDE.2010.5447825
Zhao H, Lall A, Ogihara M, Xu J. Global iceberg detection over distributed data streams. In Proceedings - International Conference on Data Engineering. 2010. p. 557-568. 5447825 https://doi.org/10.1109/ICDE.2010.5447825
Zhao, Haiquan ; Lall, Ashwin ; Ogihara, Mitsunori ; Xu, Jun. / Global iceberg detection over distributed data streams. Proceedings - International Conference on Data Engineering. 2010. pp. 557-568
@inproceedings{296343696e614d5caba312913a088b7e,
title = "Global iceberg detection over distributed data streams",
author = "Haiquan Zhao and Ashwin Lall and Mitsunori Ogihara and Jun Xu",
year = "2010",
doi = "10.1109/ICDE.2010.5447825",
language = "English (US)",
isbn = "9781424454440",
pages = "557--568",
booktitle = "Proceedings - International Conference on Data Engineering",

}

TY - GEN

T1 - Global iceberg detection over distributed data streams

AU - Zhao, Haiquan

AU - Lall, Ashwin

AU - Ogihara, Mitsunori

AU - Xu, Jun

PY - 2010

Y1 - 2010

N2 - In today's Internet applications or sensor networks we often encounter large amounts of data spread over many physically distributed nodes. The sheer volume of the data and bandwidth constraints make it impractical to send all the data to one central node for query processing. Finding distributed icebergs - elements that may have low frequency at individual nodes but high aggregate frequency - is a problem that arises commonly in practice. In this paper we present a novel algorithm with two notable properties. First, its accuracy guarantee and communication cost are independent of the way in which element counts (for both icebergs and non-icebergs) are split amongst the nodes. Second, it works even when each distributed data set is a stream (i.e., one pass data access only). Our algorithm builds upon sketches constructed for the estimation of the second frequency moment (F2) of data streams. The intuition of our idea is that when there are global icebergs in the union of these data streams the F2 of the union becomes very large. This quantity can be estimated due to the summable nature of F2 sketches. Our key innovation here is to establish tight theoretical guarantees of our algorithm, under certain reasonable assumptions, using an interesting combination of convex ordering theory and large deviation techniques.

UR - http://www.scopus.com/inward/record.url?scp=77952758693&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77952758693&partnerID=8YFLogxK

U2 - 10.1109/ICDE.2010.5447825

DO - 10.1109/ICDE.2010.5447825

M3 - Conference contribution

SN - 9781424454440

SP - 557

EP - 568

BT - Proceedings - International Conference on Data Engineering

ER -