Beyond Worst-Case Dimensionality Reduction for Sparse Vectors
Sandeep Silwal, David P. Woodruff, Qiuyi Zhang
TL;DR
This paper studies when dimensionality reduction for sparse data can go beyond worst-case guarantees. It proves average-case lower bounds showing that the folklore birthday-paradox embedding is tight for general sparse vectors under linear and smooth encodings. For non-negative sparse vectors, it gives a nonlinear, non-smooth embedding that achieves near-optimal dimension (up to polylog factors) for $\ell_p$ distances and an exact embedding for $\ell_\infty$, which separates the non-negative setting from the general sparse one. Further lower bounds show that non-linearity and non-smoothness are necessary, and applications to diameter, Max-Cut, clustering, and distance estimation show where the non-negative embedding yields speedups with accuracy guarantees. Together, the results clarify the trade-offs between linear and nonlinear maps, smooth and non-smooth maps, and non-negative and general sparse vectors, with implications for efficient geometric algorithms in ML and data analysis.
Abstract
We study beyond worst-case dimensionality reduction for $s$-sparse vectors. Our work is divided into two parts, each focusing on a different facet of beyond worst-case analysis. We first consider average-case guarantees. A folklore upper bound based on the birthday paradox states that for any collection $X$ of $s$-sparse vectors in $\mathbb{R}^d$, there exists a linear map to $\mathbb{R}^{O(s^2)}$ which \emph{exactly} preserves the norm of $99\%$ of the vectors in $X$ in any $\ell_p$ norm (as opposed to the usual setting where guarantees hold for all vectors). We give lower bounds showing that this is indeed optimal in many settings: any oblivious linear map satisfying similar average-case guarantees must map to $\Omega(s^2)$ dimensions. The same lower bound also holds for a wide class of smooth maps, including `encoder-decoder schemes', where we compare the norm of the original vector to that of a smooth function of the embedding. These lower bounds reveal a separation result, as an upper bound of $O(s \log(d))$ is possible if we instead use arbitrary (possibly non-smooth) functions, e.g., via compressed sensing algorithms. Given these lower bounds, we specialize to sparse \emph{non-negative} vectors. For a dataset $X$ of non-negative $s$-sparse vectors and any $p \ge 1$, we can non-linearly embed $X$ to $O(s\log(|X|s)/\varepsilon^2)$ dimensions while preserving all pairwise distances in $\ell_p$ norm up to $1\pm \varepsilon$, with no dependence on $p$. Surprisingly, the non-negativity assumption enables much smaller embeddings than arbitrary sparse vectors, where the best known bounds suffer exponential dependence. Our map also guarantees \emph{exact} dimensionality reduction for $\ell_{\infty}$ by embedding into $O(s\log |X|)$ dimensions, which is tight. We show that both the non-linearity of $f$ and the non-negativity of $X$ are necessary, and provide downstream algorithmic improvements.
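To make the folklore birthday-paradox bound concrete, the following is a minimal Python sketch of one standard way to realize it: hash each of the $d$ coordinates into one of $m = O(s^2)$ buckets and sum the coordinates that land in the same bucket. This map is linear and oblivious, and whenever the $s$ nonzero coordinates of a vector fall into distinct buckets, the embedded vector has the same nonzero entries and therefore the same $\ell_p$ norm for every $p$. The function names (`make_hash`, `embed`, `collision_free`), the constant 40 in $m = 40 s^2$, and the random test data are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def make_hash(d, m, seed=0):
    """Draw an oblivious hash h: [d] -> [m], chosen independently of the data."""
    return np.random.default_rng(seed).integers(0, m, size=d)

def embed(idx, vals, h, m):
    """Linear map R^d -> R^m for a sparse vector given by (indices, values):
    each nonzero coordinate j is added into bucket h[j]."""
    y = np.zeros(m)
    np.add.at(y, h[idx], vals)  # unbuffered add, so colliding entries are summed
    return y

def collision_free(idx, h):
    """True if the support of the vector hashes to distinct buckets."""
    return len(np.unique(h[idx])) == len(idx)

# Illustrative parameters (assumed): s-sparse vectors in R^d, m = 40 * s^2 buckets.
d, s, n = 10_000, 5, 1_000
m = 40 * s * s
rng = np.random.default_rng(1)
h = make_hash(d, m)

preserved = 0
for _ in range(n):
    idx = rng.choice(d, size=s, replace=False)  # random support of size s
    vals = rng.standard_normal(s)               # nonzero values
    y = embed(idx, vals, h, m)
    # Check the l_1, l_2, and l_inf norms against the original vector.
    same = all(
        np.isclose(np.linalg.norm(y, p), np.linalg.norm(vals, p))
        for p in (1, 2, np.inf)
    )
    preserved += int(same)

fraction_preserved = preserved / n
print(f"m = {m}, fraction of vectors with all norms preserved: {fraction_preserved:.3f}")
```

By a union bound over the $\binom{s}{2}$ pairs of nonzero coordinates, a fixed vector sees a collision with probability at most $\binom{s}{2}/m$, which is why $m$ proportional to $s^2$ suffices for a constant fraction such as $99\%$; the guarantee is on average over the hash and says nothing about the vectors that do collide.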
