Simple and Optimal Algorithms for Heavy Hitters and Frequency Moments in Distributed Models

Zengfeng Huang, Zhongzheng Xiong, Xiaoyi Zhu, Zhewei Wei

TL;DR

This work addresses distributed computation of heavy hitters and frequency moments $F_p$ with $p\ge 2$ in both the coordinator and distributed tracking models. It introduces a simple, unified framework based on improved $\ell_2$-heavy hitter methods, thresholding, and non-linear sampling that yields near-optimal $\ell_p$-HH and $F_p$ algorithms, with one-round or two-round variants and polylogarithmic overhead. By combining these heavy hitter primitives with recursive sketching, the authors obtain bounds for $F_p$ estimation that are optimal up to logarithmic factors in both static and tracking settings, improving the log factors of earlier work and, in several cases, the prior bounds themselves (notably for $F_2$ and general $p$). The techniques, including sparsification, thresholded sampling, and careful tracking of heavy hitters, lead to communication costs of $\tilde{O}(k^{p-1}/\varepsilon^p)$ for $\ell_p$-HH and $\tilde{O}(k^{p-1}/\varepsilon^2)$ for $F_p$ tracking.

Abstract

We consider the problems of distributed heavy hitters and frequency moments in both the coordinator model and the distributed tracking model (also known as the distributed functional monitoring model). We present simple and optimal (up to logarithmic factors) algorithms for $\ell_p$ heavy hitters and $F_p$ estimation ($p \geq 2$) in these distributed models. For $\ell_p$ heavy hitters in the coordinator model, our algorithm requires only one round and uses $\tilde{O}(k^{p-1}/\varepsilon^p)$ bits of communication. For $p > 2$, this is the first near-optimal result. By combining our algorithm with the standard recursive sketching technique, we obtain a near-optimal two-round algorithm for $F_p$ in the coordinator model, matching a significant result from recent work by Esfandiari et al. (STOC 2024). Our algorithm and analysis are much simpler and have better costs with respect to logarithmic factors. Furthermore, our technique provides a one-round algorithm for $F_p$, which is a significant improvement over a result of Woodruff and Zhang (STOC 2012). Thanks to the simplicity of our heavy hitter algorithms, we manage to adapt them to the distributed tracking model with only a $\mathrm{polylog}(n)$ increase in communication. For $\ell_p$ heavy hitters, our algorithm has a communication cost of $\tilde{O}(k^{p-1}/\varepsilon^p)$, representing the first near-optimal algorithm for all $p \geq 2$. By applying the recursive sketching technique, we also provide the first near-optimal algorithm for $F_p$ in the distributed tracking model, with a communication cost of $\tilde{O}(k^{p-1}/\varepsilon^2)$ for all $p \geq 2$. Even for $F_2$, our result improves upon the bounds established by Cormode, Muthukrishnan, and Yi (SODA 2008) and Woodruff and Zhang (STOC 2012), nearly matching the existing lower bound for the first time.
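
To make the two target quantities concrete, the following is a minimal sketch of what the coordinator is asked to output. It is not the paper's algorithm. It is a naive baseline in which every site ships its full local frequency vector, so it ignores the communication bounds entirely. It assumes the common definitions $F_p = \sum_i f_i^p$ and "item $i$ is an $\ell_p$ heavy hitter if $f_i \ge \varepsilon \|f\|_p$", which this summary does not spell out. The function names and the toy data are hypothetical.

```python
from collections import Counter

def merge_sites(sites):
    """Sum the per-site frequency vectors into the global vector f.
    (Naive baseline: each site sends everything to the coordinator.)"""
    total = Counter()
    for local in sites:
        total.update(local)  # Counter.update adds counts
    return total

def frequency_moment(freq, p):
    """F_p = sum_i f_i^p over the global frequency vector (assumed definition)."""
    return sum(f ** p for f in freq.values())

def lp_heavy_hitters(freq, p, eps):
    """Items with f_i >= eps * ||f||_p, i.e. f_i^p >= eps^p * F_p
    (assumed definition of an l_p heavy hitter)."""
    threshold = (eps ** p) * frequency_moment(freq, p)
    return {item for item, f in freq.items() if f ** p >= threshold}

# Toy instance with k = 3 sites (hypothetical data).
sites = [
    Counter({"a": 50, "b": 3, "c": 1}),
    Counter({"a": 40, "b": 2, "d": 5}),
    Counter({"a": 10, "c": 4, "e": 6}),
]
global_freq = merge_sites(sites)
f2 = frequency_moment(global_freq, 2)
hh = lp_heavy_hitters(global_freq, 2, 0.5)
print(f2, hh)
```

In this toy instance the item `"a"` is heavy only because its counts add up across sites. A communication-efficient protocol has to detect exactly this kind of item without collecting every local vector.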

Paper Structure

This paper contains 28 sections, 11 theorems, 27 equations, 1 table, 11 algorithms.

Key Result

Theorem 1

The recursive sketching algorithm outputs a $(1 \pm \varepsilon)$-approximation of $|u|$ with probability at least 0.9. Its communication cost is $O\left(\log (n) \cdot \mu\left(n, \frac{\varepsilon^2 }{\log ^3(n)}, \varepsilon, \frac{1}{\log (n)}\right)\right)$ bits.

Theorems & Definitions (22)

  • Definition 1: ($\alpha, \varepsilon$)-cover
  • Theorem 1
  • Remark
  • Theorem 2
  • Remark
  • Lemma 1
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • ...and 12 more