Efficient Unbiased Sparsification

Leighton Barnes; Stephen Cameron; Timothy Chow; Emma Cohen; Keith Frankston; Benjamin Howard; Fred Kochman; Daniel Scheinerman; Jeffrey VanderKam

TL;DR

The paper studies efficient unbiased sparsification (EUS): constructing a random vector $Q$ with at most $m$ nonzero coordinates that satisfies $\mathbb{E}[Q]=p$ while minimizing the expected divergence $\mathbb{E}[\mathsf{Div}(Q,p)]$. It treats two classes of divergences. For permutation-invariant divergences, preserving the heavy coordinates and deterministically filling the tail with a common value (Algorithm USA) yields sparsifications that are optimal for every divergence in the class. For strictly convex, additively separable divergences, a coordinate-wise reduction leads to Algorithm UWSA, which combines a threshold $\lambda$ with marginal sampling. The results establish a robustness phenomenon: for permutation-invariant divergences, the optimal $Q$ depends on the divergence only through convexity and invariance, and UWSA provides the complementary construction for separable divergences. The proofs rely on facet concentration, optimality of critical points, and Lagrangian analysis, and they produce explicit sparsification schemes applicable to federated learning and to sampling sparse probability distributions. Open questions remain about general divergences and relaxed constraints.
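For reference, the problem described above can be written as a single optimization problem. This is a restatement of the summary, not a formula copied from the paper; the notation $\|Q\|_0$ for the number of nonzero coordinates is an assumption introduced here.

$$
\min_{Q}\ \mathbb{E}\big[\mathsf{Div}(Q,p)\big]
\quad\text{subject to}\quad
\mathbb{E}[Q]=p,
\qquad
\|Q\|_0 \le m \ \text{almost surely}.
$$

The minimization ranges over the distribution of the random vector $Q\in\mathbb{R}^n$; a minimizer is what the paper calls an efficient unbiased $m$-sparsification of $p$.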

Abstract

An unbiased $m$-sparsification of a vector $p\in \mathbb{R}^n$ is a random vector $Q\in \mathbb{R}^n$ with mean $p$ that has at most $m<n$ nonzero coordinates. Unbiased sparsification compresses the original vector without introducing bias; it arises in various contexts, such as in federated learning and sampling sparse probability distributions. Ideally, unbiased sparsification should also minimize the expected value of a divergence function $\mathsf{Div}(Q,p)$ that measures how far away $Q$ is from the original $p$. If $Q$ is optimal in this sense, then we call it efficient. Our main results describe efficient unbiased sparsifications for divergences that are either permutation-invariant or additively separable. Surprisingly, the characterization for permutation-invariant divergences is robust to the choice of divergence function, in the sense that our class of optimal $Q$ for squared Euclidean distance coincides with our class of optimal $Q$ for Kullback-Leibler divergence, or indeed any of a wide variety of divergences.
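To make the definition concrete, here is a minimal Python sketch of one unbiased $m$-sparsification in the spirit of the "keep heavy coordinates, fill the tail with a common value" idea. It is an illustration under stated assumptions, not the paper's Algorithm USA verbatim: the function name `unbiased_sparsify`, the threshold rule for `tau`, and the use of systematic sampling to realize the inclusion probabilities are choices made for this sketch, and $p$ is assumed to have strictly positive entries.

```python
import math
import random


def unbiased_sparsify(p, m, u=None):
    """Return q with at most m nonzeros such that E[q] = p over uniform u.

    Hypothetical helper for illustration. Assumes every p[i] > 0 and
    0 < m < len(p). All randomness comes from one draw u in [0, 1).
    """
    n = len(p)
    if not 0 < m < n or min(p) <= 0:
        raise ValueError("need 0 < m < len(p) and strictly positive p")
    if u is None:
        u = random.random()

    # Indices sorted from largest to smallest entry.
    order = sorted(range(n), key=lambda i: p[i], reverse=True)

    # Count the heavy coordinates k: those too large to be represented by
    # the common tail value tau = (remaining mass) / (m - k).
    k = 0
    tail = sum(p)
    while k < m - 1 and p[order[k]] * (m - k) > tail:
        tail -= p[order[k]]
        k += 1
    tau = tail / (m - k)

    q = [0.0] * n
    for i in order[:k]:
        q[i] = p[i]  # heavy coordinates are kept exactly

    # Systematic sampling: tail coordinate i is selected with probability
    # p[i] / tau. These probabilities are at most 1 and sum to m - k, so
    # (up to rounding) m - k tail coordinates are selected.
    cum = 0.0
    for i in order[k:]:
        lo = cum
        cum += p[i] / tau
        if math.ceil(cum - u) > math.ceil(lo - u):
            q[i] = tau  # E[q[i]] = (p[i] / tau) * tau = p[i]
    return q
```

For example, with `p = [0.6, 0.2, 0.1, 0.06, 0.04]` and `m = 2`, the first coordinate is always kept at 0.6 and exactly one of the remaining coordinates is set to the common value of about 0.4, chosen with probability proportional to its entry. Whether this construction is optimal for a given divergence is the subject of the paper's theorems and is not claimed by the sketch.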

Paper Structure (18 sections, 5 theorems, 64 equations, 1 figure)

Introduction
Sampling with Specified Marginals
Efficient Unbiased Sparsification
Permutation-Invariant Divergences
Facet Concentration
Optimality of Critical Points
Solving \ref{opt:dist} for Permutation-Invariant Divergences
Additively Separable Divergences
Generalizations and Open Questions
Proofs
Proof of Lemma \ref{lem:face-concentration}
Proof of Lemma \ref{thm:critical}
Proof of Theorem \ref{thm:pinv}
Removing the Smoothness Condition
Uniqueness under Strict Convexity
...and 3 more sections

Key Result

Lemma 1

Assume that $\mathsf{Div}$ is convex, and let $Q$ be an $m$-sparsification of $p\in \mathbb{R}_{>0}^n$. For each $I$ with $\Pr(Q\in \Delta^I) > 0$, write $q^I := \mathbb{E}[Q \mid Q \in \Delta^I]$. Then the sparsification $Q'$ such that $\Pr(Q'\in \Delta^I) = \Pr(Q\in\Delta^I)$ …

Figures (1)

Figure 1: Illustration of the probability simplex when using Algorithm USA on a probability distribution with $n=3$ and $m=2$.

Theorems & Definitions (13)

Definition 1.1
Definition 1.2
Definition 1.3
Lemma 1: Facet concentration
proof
Lemma 2
proof
Definition 2.1
Theorem 1
proof
...and 3 more
