The communication complexity of distributed estimation

Parikshit Gopalan; Raghu Meka; Prasad Raghavendra; Mihir Singhal; Avi Wigderson

TL;DR

The paper develops a framework for the communication complexity of distributed estimation, where two players hold distributions p and q and must estimate E_{x~p,y~q}[f(x,y)] to within additive error ε. It introduces a debiasing protocol that improves the dependence on ε from 1/ε^2 to 1/ε, and it provides spectral and discrepancy-based lower bounds showing that this protocol is tight for general functions. It also gives low-communication protocols for the Equality (EQ) and Greater-than (GT) functions, as well as for smooth and convex Lipschitz functions, and it connects estimation complexity to direct-sum and lifting techniques. The upper and lower bounds match in several settings: the EQ and GT protocols are optimal, and EQ is essentially the easiest among full-rank Boolean functions, while discrepancy imposes fundamental limits for general functions. Overall, the work describes how function structure and error tolerance govern the communication cost of distributed estimation, with applications to sketching, databases, and learning.

Abstract

We study an extension of the standard two-party communication model in which Alice and Bob hold probability distributions $p$ and $q$ over domains $X$ and $Y$, respectively. Their goal is to estimate \[ \mathbb{E}_{x \sim p,\, y \sim q}[f(x, y)] \] to within additive error $\varepsilon$ for a bounded function $f$, known to both parties. We refer to this as the distributed estimation problem. Special cases of this problem arise in a variety of areas, including sketching, databases, and learning. Our goal is to understand how the required communication scales with the communication complexity of $f$ and the error parameter $\varepsilon$. The random sampling approach, which estimates the mean by averaging $f$ over $O(1/\varepsilon^2)$ random samples, requires $O(R(f)/\varepsilon^2)$ total communication, where $R(f)$ is the randomized communication complexity of $f$. We design a new debiasing protocol that improves the dependence on $1/\varepsilon$ from quadratic to linear. Additionally, we show better upper bounds for several special classes of functions, including the Equality and Greater-than functions. We introduce lower bound techniques based on spectral methods and discrepancy, and show the optimality of many of our protocols: the debiasing protocol is tight for general functions, and our protocols for the Equality and Greater-than functions are also optimal. Furthermore, we show that among full-rank Boolean functions, Equality is essentially the easiest.
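To make the random sampling baseline concrete, the following is a minimal sketch of it for the Equality function on a small domain. It is not the debiasing protocol, and it simulates both parties in one process rather than exchanging messages. The function names, the example distributions, the failure probability `delta`, and the use of Hoeffding's inequality to choose the sample count are all assumptions made for illustration. In an actual protocol, each of the $n = O(1/\varepsilon^2)$ evaluations of $f$ would cost about $R(f)$ bits, which gives the $O(R(f)/\varepsilon^2)$ total stated above.

```python
import math
import random


def eq(x, y):
    """Equality function, taking values in [0, 1]."""
    return 1.0 if x == y else 0.0


def hoeffding_sample_count(eps, delta):
    """Samples sufficient for additive error eps with probability >= 1 - delta,
    assuming f takes values in [0, 1] (Hoeffding's inequality)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def exact_expectation(f, p, q):
    """Ground truth E_{x~p, y~q}[f(x, y)], computed with full knowledge of p and q."""
    return sum(px * qy * f(x, y)
               for x, px in enumerate(p)
               for y, qy in enumerate(q))


def sampling_estimate(f, p, q, eps, delta, rng):
    """Baseline: average f over n independent sample pairs.

    Alice draws xs from p and Bob draws ys from q. Here f is evaluated
    locally; a real protocol would spend about R(f) bits per pair.
    """
    n = hoeffding_sample_count(eps, delta)
    xs = rng.choices(range(len(p)), weights=p, k=n)  # Alice's samples
    ys = rng.choices(range(len(q)), weights=q, k=n)  # Bob's samples
    estimate = sum(f(x, y) for x, y in zip(xs, ys)) / n
    return estimate, n


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]  # hypothetical distribution held by Alice
    q = [0.2, 0.3, 0.5]  # hypothetical distribution held by Bob
    rng = random.Random(0)
    estimate, n = sampling_estimate(eq, p, q, eps=0.05, delta=0.01, rng=rng)
    print(f"samples used: {n}")
    print(f"estimate: {estimate:.4f}, exact: {exact_expectation(eq, p, q):.4f}")
```

Halving `eps` quadruples the sample count in this sketch, which is the quadratic dependence that the debiasing protocol is designed to avoid.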
