Efficient Unbiased Sparsification
Leighton Barnes, Stephen Cameron, Timothy Chow, Emma Cohen, Keith Frankston, Benjamin Howard, Fred Kochman, Daniel Scheinerman, Jeffrey VanderKam
TL;DR
The paper tackles efficient unbiased sparsification (EUS): constructing a random vector $Q$ with at most $m$ nonzero coordinates that satisfies $\mathbb{E}[Q]=p$ while minimizing the expected divergence $\mathbb{E}[\mathsf{Div}(Q,p)]$. It develops two principal frameworks. For permutation-invariant divergences, preserving the heavy coordinates and filling the tail with a common value (Algorithm USA) yields universally optimal sparsifications. For strictly convex, additively separable divergences, a coordinate-wise reduction leads to Algorithm UWSA, which combines a principled threshold $\lambda$ with marginal sampling. The results establish a robustness phenomenon: for permutation-invariant divergences, the class of optimal $Q$ depends on the divergence only through convexity and invariance, so the same class is optimal for squared Euclidean distance, Kullback-Leibler divergence, and a wide variety of other divergences. UWSA provides a complementary approach for separable divergences. The methods use facet concentration, critical-point optimality, and Lagrangian analysis to produce explicit, efficient sparsification schemes applicable to federated learning and to sampling sparse probability distributions. Open questions remain about general divergences and relaxed constraints.
Abstract
An unbiased $m$-sparsification of a vector $p\in \mathbb{R}^n$ is a random vector $Q\in \mathbb{R}^n$ with mean $p$ that has at most $m<n$ nonzero coordinates. Unbiased sparsification compresses the original vector without introducing bias; it arises in various contexts, such as in federated learning and sampling sparse probability distributions. Ideally, unbiased sparsification should also minimize the expected value of a divergence function $\mathsf{Div}(Q,p)$ that measures how far away $Q$ is from the original $p$. If $Q$ is optimal in this sense, then we call it efficient. Our main results describe efficient unbiased sparsifications for divergences that are either permutation-invariant or additively separable. Surprisingly, the characterization for permutation-invariant divergences is robust to the choice of divergence function, in the sense that our class of optimal $Q$ for squared Euclidean distance coincides with our class of optimal $Q$ for Kullback-Leibler divergence, or indeed any of a wide variety of divergences.
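
To make the definition concrete, here is a minimal Python sketch of one unbiased $m$-sparsification of a nonnegative vector. It is an illustration only and not the paper's algorithm: the function name `sparsify_unbiased`, the restriction to nonnegative $p$, and the use of systematic sampling for the tail are all assumptions introduced here. The sketch keeps the largest coordinates exactly, then sets a random subset of the remaining coordinates to a common value $c$, selecting coordinate $i$ with probability $p_i/c$ so that $\mathbb{E}[Q_i]=p_i$. No claim is made that this particular scheme minimizes any divergence.

```python
import bisect
import itertools
import random


def sparsify_unbiased(p, m, rng=random):
    """Return a random vector Q with at most m nonzeros and E[Q] = p.

    Hypothetical helper: assumes every entry of p is nonnegative
    and 1 <= m < len(p).
    """
    n = len(p)
    if not (1 <= m < n) or any(x < 0 for x in p):
        raise ValueError("need nonnegative p and 1 <= m < len(p)")

    # Indices of p sorted from largest to smallest entry.
    order = sorted(range(n), key=lambda i: p[i], reverse=True)

    # Grow the heavy set until the largest remaining entry is at most the
    # common tail value c = (tail mass) / (free slots).
    # The choice k = m - 1 always satisfies this, so the loop breaks.
    for k in range(m):
        tail = order[k:]
        tail_sum = sum(p[i] for i in tail)
        c = tail_sum / (m - k)
        if p[order[k]] <= c:
            break

    q = [0.0] * n
    for i in order[:k]:
        q[i] = float(p[i])  # heavy coordinates are kept exactly
    if c == 0.0:
        return q  # the tail carries no mass, nothing left to sample

    # Systematic sampling: lay the tail entries end to end on [0, tail_sum)
    # and drop m - k equally spaced points with one shared uniform offset.
    # Entry i is hit with probability p[i] / c, so E[Q_i] = c * p[i] / c.
    cum = list(itertools.accumulate(p[i] for i in tail))
    u = rng.random()
    for j in range(m - k):
        pos = bisect.bisect_right(cum, (u + j) * c)
        q[tail[min(pos, len(tail) - 1)]] = c
    return q
```

For example, with `p = [0.5, 0.2, 0.1, 0.1, 0.05, 0.05]` and `m = 3`, the first coordinate is kept at 0.5 and two of the remaining five coordinates are set to the common value 0.25, so each draw has three nonzero entries while the average over many draws approaches `p`.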
