The communication complexity of distributed estimation

Parikshit Gopalan, Raghu Meka, Prasad Raghavendra, Mihir Singhal, Avi Wigderson

TL;DR

The paper develops a framework for the communication complexity of distributed estimation, where two players hold distributions $p$ and $q$ and must estimate $\mathbb{E}_{x \sim p,\, y \sim q}[f(x, y)]$ to within additive error $\varepsilon$. It introduces a debiasing protocol that reduces the dependence on $\varepsilon$ from $1/\varepsilon^2$ to $1/\varepsilon$, and it provides spectral, discrepancy-based, and rank-based lower bounds that establish near-optimality for broad function classes. It gives concrete low-communication protocols for fundamental functions such as EQ and GT, as well as for smooth and convex Lipschitz functions, and it connects estimation complexity to direct-sum and lifting techniques. The results yield matching upper and lower bounds across several models, showing, for example, that EQ and GT are among the easier high-rank Boolean cases, while discrepancy imposes fundamental limits for general functions. Overall, the work lays out a unified theory of how function structure and error tolerance govern the communication cost of distributed estimation, with implications for sketching, joins, databases, and learning systems.

Abstract

We study an extension of the standard two-party communication model in which Alice and Bob hold probability distributions $p$ and $q$ over domains $X$ and $Y$, respectively. Their goal is to estimate \[ \mathbb{E}_{x \sim p,\, y \sim q}[f(x, y)] \] to within additive error $\varepsilon$ for a bounded function $f$, known to both parties. We refer to this as the distributed estimation problem. Special cases of this problem arise in a variety of areas, including sketching, databases, and learning. Our goal is to understand how the required communication scales with the communication complexity of $f$ and the error parameter $\varepsilon$. The random sampling approach -- estimating the mean by averaging $f$ over $O(1/\varepsilon^2)$ random samples -- requires $O(R(f)/\varepsilon^2)$ total communication, where $R(f)$ is the randomized communication complexity of $f$. We design a new debiasing protocol that improves the dependence on $1/\varepsilon$ from quadratic to linear. Additionally, we show better upper bounds for several special classes of functions, including the Equality and Greater-than functions. We introduce lower bound techniques based on spectral methods and discrepancy, and show the optimality of many of our protocols: the debiasing protocol is tight for general functions, and our protocols for the Equality and Greater-than functions are also optimal. Furthermore, we show that among full-rank Boolean functions, Equality is essentially the easiest.
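The random sampling baseline described in the abstract can be sketched as follows. This is a minimal single-process simulation, not the paper's debiasing protocol: both parties are simulated locally, each call to `f` stands in for one run of a communication protocol for $f$, and no bits are actually counted. The function names, the failure probability `delta`, and the Hoeffding-based sample count are assumptions introduced for illustration; the sample count assumes $f$ takes values in $[-1, 1]$.

```python
import math
import random


def hoeffding_sample_count(eps, delta):
    """Number of samples that suffices (by Hoeffding's inequality) for additive
    error eps with probability >= 1 - delta, assuming f takes values in [-1, 1]."""
    return math.ceil(2.0 * math.log(2.0 / delta) / eps ** 2)


def sample_index(dist, rng):
    """Draw one index from a probability vector."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def estimate_by_sampling(p, q, f, eps, delta=0.05, seed=0):
    """Baseline estimator: average f over independent pairs (x ~ p, y ~ q).

    In the two-party model, Alice would draw x and Bob would draw y, and each
    evaluation of f(x, y) would cost roughly R(f) bits, giving O(R(f) / eps^2)
    total communication. Here everything runs in one process.
    """
    rng = random.Random(seed)
    n = hoeffding_sample_count(eps, delta)
    total = 0.0
    for _ in range(n):
        x = sample_index(p, rng)  # Alice's private sample
        y = sample_index(q, rng)  # Bob's private sample
        total += f(x, y)
    return total / n, n


def exact_expectation(p, q, f):
    """Ground truth E_{x~p, y~q}[f(x, y)], computed with full knowledge of p and q."""
    return sum(p[x] * q[y] * f(x, y)
               for x in range(len(p)) for y in range(len(q)))


def eq(x, y):
    """Equality as a {-1, 1}-valued function."""
    return 1.0 if x == y else -1.0


if __name__ == "__main__":
    p = [0.5, 0.25, 0.25]
    q = [0.25, 0.25, 0.5]
    estimate, n = estimate_by_sampling(p, q, eq, eps=0.1)
    print("samples used:", n)
    print("estimate:", estimate, "exact:", exact_expectation(p, q, eq))
```

The number of samples grows as $1/\varepsilon^2$, which is the quadratic dependence the debiasing protocol is designed to avoid.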

Paper Structure

This paper contains 78 sections, 51 theorems, 248 equations, 5 algorithms.

Key Result

Theorem 1.3

For a function $f: \mathcal{X} \times \mathcal{Y} \to \{-1,1\}$, let [...]. If $\Pi$ is a protocol for distributed estimation of $f$ where the two players transmit $M_A$ and $M_B$ bits respectively, then [...]. In particular, $\bar{\mathrm{R}}^{\mathrm{ow}}(f) = \Omega(k/\varepsilon^2)$, and $\bar{\mathrm{R}}(f) = \Omega(k/\varepsilon)$.

Theorems & Definitions (109)

  • Definition 1.1: Distributed estimation for $f$
  • Definition 1.2
  • Theorem 1.3: Informal version of \ref{thm:main}
  • Corollary 1.4: Restatement of \ref{cor:ipcor}
  • Theorem 1.5: Informal version of \ref{thm:spectral}
  • Theorem 1.6: Restatement of \ref{thm:boolean-lb}
  • Theorem 2.1
  • proof
  • Theorem 2.2
  • Lemma 2.3
  • ...and 99 more