Efficient Unbiased Sparsification

Leighton Barnes, Stephen Cameron, Timothy Chow, Emma Cohen, Keith Frankston, Benjamin Howard, Fred Kochman, Daniel Scheinerman, Jeffrey VanderKam

TL;DR

The paper studies efficient unbiased $m$-sparsification (EUS): constructing a random vector $Q$ with at most $m$ nonzero coordinates that satisfies $\mathbb{E}[Q]=p$ while minimizing the expected divergence $\mathbb{E}[\mathsf{Div}(Q,p)]$. It develops two frameworks. For permutation-invariant divergences, preserving the heavy coordinates and deterministically filling a common tail value (Algorithm USA) yields universally optimal sparsifications. For strictly convex, additively separable divergences, a coordinate-wise reduction leads to Algorithm UWSA, which uses a principled threshold $\lambda$ and marginal sampling. The results establish a robustness phenomenon: for permutation-invariant divergences, the class of optimal $Q$ does not depend on the particular divergence, provided it is convex and permutation-invariant; UWSA provides the complementary approach for separable divergences. The methods rely on facet concentration, critical-point optimality, and Lagrangian analysis to produce explicit, efficient sparsification schemes applicable to federated learning and to sampling sparse probability distributions. Open questions remain about general divergences and relaxed constraints.
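
As a reading aid, the problem described above can be written as a constrained optimization. This is a sketch of the formulation, not a quotation from the paper; the notation $\|Q\|_0$ for the number of nonzero coordinates of $Q$ is an assumption introduced here.

$$
\min_{Q}\ \mathbb{E}\big[\mathsf{Div}(Q,p)\big]
\quad\text{subject to}\quad
\mathbb{E}[Q]=p,
\qquad
\|Q\|_0 \le m \ \text{ with probability } 1 .
$$

The first constraint is unbiasedness, the second is $m$-sparsity, and a feasible $Q$ that attains the minimum is what the paper calls efficient.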

Abstract

An unbiased $m$-sparsification of a vector $p\in \mathbb{R}^n$ is a random vector $Q\in \mathbb{R}^n$ with mean $p$ that has at most $m<n$ nonzero coordinates. Unbiased sparsification compresses the original vector without introducing bias; it arises in various contexts, such as in federated learning and sampling sparse probability distributions. Ideally, unbiased sparsification should also minimize the expected value of a divergence function $\mathsf{Div}(Q,p)$ that measures how far away $Q$ is from the original $p$. If $Q$ is optimal in this sense, then we call it efficient. Our main results describe efficient unbiased sparsifications for divergences that are either permutation-invariant or additively separable. Surprisingly, the characterization for permutation-invariant divergences is robust to the choice of divergence function, in the sense that our class of optimal $Q$ for squared Euclidean distance coincides with our class of optimal $Q$ for Kullback-Leibler divergence, or indeed any of a wide variety of divergences.
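
To make the definition concrete, here is a minimal Python sketch of one simple unbiased $m$-sparsification. It is a hypothetical baseline chosen for illustration, not the paper's Algorithm USA or UWSA, and nothing here suggests it is efficient: choose a size-$m$ subset of coordinates uniformly at random, rescale the kept coordinates by $n/m$, and zero the rest. The function names are assumptions. Exact rational arithmetic lets the unbiasedness check be carried out by enumerating every outcome.

```python
from fractions import Fraction
from itertools import combinations


def uniform_subset_sparsification(p, m):
    """Enumerate (probability, q) outcomes of a simple unbiased m-sparsification.

    Hypothetical baseline, not the paper's algorithm: pick a size-m subset S
    uniformly at random, keep p[i] * n / m for i in S, and zero the rest.
    """
    n = len(p)
    subsets = list(combinations(range(n), m))
    prob = Fraction(1, len(subsets))  # every subset is equally likely
    scale = Fraction(n, m)            # each coordinate is kept with probability m/n
    outcomes = []
    for subset in subsets:
        q = [Fraction(0)] * n
        for i in subset:
            q[i] = scale * p[i]
        outcomes.append((prob, q))
    return outcomes


def mean(outcomes):
    """Coordinate-wise expectation of the enumerated random vector."""
    n = len(outcomes[0][1])
    return [sum(prob * q[i] for prob, q in outcomes) for i in range(n)]


def expected_squared_distance(outcomes, p):
    """Expected squared Euclidean distance between Q and p."""
    return sum(
        prob * sum((qi - pi) ** 2 for qi, pi in zip(q, p))
        for prob, q in outcomes
    )


p = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
outcomes = uniform_subset_sparsification(p, m=2)

print(mean(outcomes) == p)                     # unbiasedness check
print(expected_squared_distance(outcomes, p))  # divergence incurred by this baseline
```

The second printed value is the quantity an efficient sparsification would minimize when $\mathsf{Div}$ is squared Euclidean distance; the baseline only shows that the constraints are satisfiable.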

Paper Structure

This paper contains 18 sections, 5 theorems, 64 equations, and 1 figure.

Key Result

Lemma 1

Assume that $\mathsf{Div}$ is convex, and let $Q$ be an $m$-sparsification of $p\in \mathbb{R}_{>0}^n$. For each $I$ with $\Pr(Q\in \Delta^I) > 0$, write $q^I := \mathbb{E}[Q \mid Q \in \Delta^I]$. Then the sparsification $Q'$ such that $\Pr(Q'\in \Delta^I) = \Pr(Q\in\Delta^I)$ ...

Figures (1)

  • Figure 1: Illustration of the probability simplex when using Algorithm USA on a probability distribution with $n=3$ and $m=2$.

Theorems & Definitions (13)

  • Definition 1.1
  • Definition 1.2
  • Definition 1.3
  • Lemma 1: Facet concentration
  • proof
  • Lemma 2
  • proof
  • Definition 2.1
  • Theorem 1
  • proof
  • ...and 3 more