TY - GEN
T1 - Global iceberg detection over distributed data streams
AU - Zhao, Haiquan
AU - Lall, Ashwin
AU - Ogihara, Mitsunori
AU - Xu, Jun
PY - 2010/6/1
Y1 - 2010/6/1
N2 - In today's Internet applications or sensor networks we often encounter large amounts of data spread over many physically distributed nodes. The sheer volume of the data and bandwidth constraints make it impractical to send all the data to one central node for query processing. Finding distributed icebergs - elements that may have low frequency at individual nodes but high aggregate frequency - is a problem that arises commonly in practice. In this paper we present a novel algorithm with two notable properties. First, its accuracy guarantee and communication cost are independent of the way in which element counts (for both icebergs and non-icebergs) are split amongst the nodes. Second, it works even when each distributed data set is a stream (i.e., one pass data access only). Our algorithm builds upon sketches constructed for the estimation of the second frequency moment (F2) of data streams. The intuition of our idea is that when there are global icebergs in the union of these data streams the F2 of the union becomes very large. This quantity can be estimated due to the summable nature of F2 sketches. Our key innovation here is to establish tight theoretical guarantees of our algorithm, under certain reasonable assumptions, using an interesting combination of convex ordering theory and large deviation techniques.
AB - In today's Internet applications or sensor networks we often encounter large amounts of data spread over many physically distributed nodes. The sheer volume of the data and bandwidth constraints make it impractical to send all the data to one central node for query processing. Finding distributed icebergs - elements that may have low frequency at individual nodes but high aggregate frequency - is a problem that arises commonly in practice. In this paper we present a novel algorithm with two notable properties. First, its accuracy guarantee and communication cost are independent of the way in which element counts (for both icebergs and non-icebergs) are split amongst the nodes. Second, it works even when each distributed data set is a stream (i.e., one pass data access only). Our algorithm builds upon sketches constructed for the estimation of the second frequency moment (F2) of data streams. The intuition of our idea is that when there are global icebergs in the union of these data streams the F2 of the union becomes very large. This quantity can be estimated due to the summable nature of F2 sketches. Our key innovation here is to establish tight theoretical guarantees of our algorithm, under certain reasonable assumptions, using an interesting combination of convex ordering theory and large deviation techniques.
UR - http://www.scopus.com/inward/record.url?scp=77952758693&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77952758693&partnerID=8YFLogxK
U2 - 10.1109/ICDE.2010.5447825
DO - 10.1109/ICDE.2010.5447825
M3 - Conference contribution
AN - SCOPUS:77952758693
SN - 9781424454440
T3 - Proceedings - International Conference on Data Engineering
SP - 557
EP - 568
BT - 26th IEEE International Conference on Data Engineering, ICDE 2010 - Conference Proceedings
T2 - 26th IEEE International Conference on Data Engineering, ICDE 2010
Y2 - 1 March 2010 through 6 March 2010
ER -