Frequency Moments in Noisy Streaming and Distributed Data under Mismatch Ambiguity

Kaiwen Liu, Qin Zhang

Abstract

We propose a novel framework for statistical estimation on noisy datasets. Within this framework, we focus on the frequency moments ($F_p$) problem and demonstrate that it is possible to approximate $F_p$ of the unknown ground-truth dataset using sublinear space in the data stream model and sublinear communication in the coordinator model, provided that the approximation ratio is parameterized by a data-dependent quantity, which we call the $F_p$-mismatch-ambiguity. We also establish a set of lower bounds, which are tight in terms of the input size. Our results yield several interesting insights: (1) In the data stream model, the $F_p$ problem is inherently more difficult in the noisy setting than in the noiseless one. In particular, while $F_2$ can be approximated in logarithmic space in terms of the input size in the noiseless setting, any algorithm for $F_2$ in the noisy setting requires polynomial space. (2) In the coordinator model, in sharp contrast to the noiseless case, achieving polylogarithmic communication in the input size is generally impossible for $F_p$ under noise. However, when the $F_p$-mismatch-ambiguity falls below a certain threshold, it becomes possible to achieve communication that is entirely independent of the input size.
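
For orientation, the quantity being approximated can be computed exactly on a small noiseless stream. The sketch below assumes the standard definition $F_p = \sum_i f_i^p$, where $f_i$ is the number of occurrences of item $i$; the paper's Definition 1.1 is not reproduced in this excerpt, so the match is an assumption. The function name `frequency_moment` and the toy stream are illustrative only. This is a linear-space baseline, not the paper's algorithm, and it does not model noise or mismatch ambiguity.

```python
from collections import Counter


def frequency_moment(stream, p):
    """Exact F_p under the standard definition: sum of (frequency ** p).

    Stores one counter per distinct item, so space is linear in the
    number of distinct items (unlike a sublinear streaming sketch).
    """
    counts = Counter(stream)  # item -> number of occurrences
    return sum(c ** p for c in counts.values())


# Toy stream (hypothetical data): frequencies are a:3, b:2, c:1.
stream = ["a", "b", "a", "c", "a", "b"]

f1 = frequency_moment(stream, 1)  # stream length: 3 + 2 + 1
f2 = frequency_moment(stream, 2)  # 9 + 4 + 1
```

Here `f1` equals the stream length and `f2` weights heavy items quadratically, which is why larger $p$ makes $F_p$ more sensitive to errors in the frequencies of frequent items.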

Paper Structure

This paper contains 9 sections, 17 theorems, 14 equations, 1 table, 4 algorithms.

Key Result

Theorem 2.1

For any constant $p \in \mathbb{Z}^+$, given a noisy input data stream of length $m$ with $\eta_p \leq \frac{1}{3(p!)}$, Algorithm alg:Fp-one-pass computes an $(\epsilon+O(\eta_p), 0.01)$-approximation of $F_p$, using a single pass and $O\left(\frac{1}{\epsilon^2}m^{1-1/p}\right)$ words of space.
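
To make the statement concrete, the precondition and the space bound can be instantiated for small $p$. The lines below only substitute values of $p$ into the theorem as stated; the symbols $\eta_p$, $\epsilon$, and $m$ are those of the theorem, and nothing beyond it is assumed.

```latex
\begin{align*}
p = 1:&\quad \eta_1 \le \frac{1}{3 \cdot 1!} = \frac{1}{3},
  &&\text{space } O\!\left(\frac{1}{\epsilon^2}\, m^{0}\right) = O\!\left(\frac{1}{\epsilon^2}\right), \\
p = 2:&\quad \eta_2 \le \frac{1}{3 \cdot 2!} = \frac{1}{6},
  &&\text{space } O\!\left(\frac{1}{\epsilon^2}\, m^{1/2}\right), \\
p = 3:&\quad \eta_3 \le \frac{1}{3 \cdot 3!} = \frac{1}{18},
  &&\text{space } O\!\left(\frac{1}{\epsilon^2}\, m^{2/3}\right).
\end{align*}
```

The admissible range for $\eta_p$ shrinks factorially in $p$, while the exponent $1 - 1/p$ of $m$ approaches $1$, so the space bound stays sublinear in $m$ for every constant $p$.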

Theorems & Definitions (24)

  • Definition 1.1: Frequency Moments
  • Definition 1.2: $F_p$-mismatch-ambiguity ($p \ge 1$)
  • Theorem 2.1
  • Lemma 2.2: Feige04
  • Definition 2.3: Ordered $p$-Clique
  • Lemma 2.5
  • Definition 2.6: Increasingly Ordered $p$-Clique
  • Definition 2.7: Tail Set
  • Lemma 2.8
  • Lemma 2.9: $\bar{X}$ is not too large
  • ...and 14 more