Ridge Leverage Score Sampling for $\ell_p$ Subspace Approximation
David P. Woodruff, Taisuke Yasuda
TL;DR
This work constructs strong coresets for ℓ_p subspace approximation whose sizes have nearly optimal dependence on k for all p ≠ 2, using a sampling framework based on ridge leverage scores. The analysis combines root ridge leverage score sampling, additive-multiplicative ℓ_p subspace embeddings, flattening, and a reduction to low-rank matrix embeddings to obtain dimension-independent coreset sizes of Õ(k/ε^{4/p}) for 1 ≤ p < 2 and Õ(k^{p/2}/ε^{p}) for p > 2, where Õ hides polylogarithmic factors. The paper also gives nearly optimal online and streaming coresets and connects these coresets to entrywise ℓ_p low-rank approximation. The methods avoid the representative-subspace argument that prior dimension-independent coresets rely on.
Abstract
The $\ell_p$ subspace approximation problem is an NP-hard low rank approximation problem that generalizes the median hyperplane ($p = 1$), principal component analysis ($p = 2$), and center hyperplane problems ($p = \infty$). A popular approach to cope with the NP-hardness is to compute a strong coreset, which is a weighted subset of input points that simultaneously approximates the cost of every $k$-dimensional subspace, typically to $(1+ε)$ relative error for a small constant $ε$. We obtain an algorithm for constructing a strong coreset for $\ell_p$ subspace approximation of size $\tilde O(kε^{-4/p})$ for $p<2$ and $\tilde O(k^{p/2}ε^{-p})$ for $p>2$. This offers the following improvements over prior work:
- We construct the first strong coresets with nearly optimal dependence on $k$ for all $p\neq 2$. In prior work, [SW18] constructed coresets of modified points with a similar dependence on $k$, while [HV20] constructed true coresets with polynomially worse dependence on $k$.
- We recover or improve the best known $ε$ dependence for all $p$. In particular, for $p > 2$, the [SW18] coreset of modified points had a dependence of $ε^{-p^2/2}$ and the [HV20] coreset had a dependence of $ε^{-3p}$.

Our algorithm is based on sampling by root ridge leverage scores, which admits fast algorithms, especially for sparse or structured matrices. Our analysis avoids the use of the representative subspace theorem [SW18], which is a critical component of all prior dimension-independent coresets for $\ell_p$ subspace approximation. Our techniques also lead to the first nearly optimal online strong coresets for $\ell_p$ subspace approximation with similar bounds as the offline setting, resolving a problem of [WY23]. All prior approaches lose $\mathrm{poly}(k)$ factors in this setting, even when allowed to modify the original points.
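To make the sampling idea concrete, the following is a minimal NumPy sketch of importance sampling by a power of the rank-$k$ ridge leverage scores. It is an illustration under stated assumptions, not the paper's algorithm: the abstract does not define the root ridge leverage scores precisely, so the regularizer $\lambda = \|A - A_k\|_F^2 / k$ (the standard choice for ridge leverage scores), the exponent `root=0.5`, the sample size `m`, and the function names `ridge_leverage_scores`, `sample_coreset`, and `lp_subspace_cost` are all hypothetical choices made here.

```python
import numpy as np

def ridge_leverage_scores(A, k):
    """Rank-k ridge leverage scores tau_i = a_i^T (A^T A + lam I)^+ a_i.

    Assumption: lam = ||A - A_k||_F^2 / k, the usual ridge parameter.
    """
    s = np.linalg.svd(A, compute_uv=False)       # singular values, descending
    lam = np.sum(s[k:] ** 2) / k                  # tail energy spread over k
    G = A.T @ A + lam * np.eye(A.shape[1])
    G_inv = np.linalg.pinv(G, hermitian=True)     # pinv covers the lam == 0 case
    return np.einsum("ij,jk,ik->i", A, G_inv, A)  # row-wise quadratic forms

def sample_coreset(A, k, m, rng, root=0.5):
    """Sample m row indices with probability proportional to tau_i ** root.

    Assumption: root=0.5 is a placeholder for the 'root' scores; the exact
    exponent and sample size used in the paper are not given in the abstract.
    Returns (indices, weights), with weights 1 / (m * p_i) so that the
    weighted cost is an unbiased estimate of the full cost.
    """
    tau = np.clip(ridge_leverage_scores(A, k), 0.0, None)
    probs = tau ** root
    probs = probs / probs.sum()
    idx = rng.choice(A.shape[0], size=m, replace=True, p=probs)
    weights = 1.0 / (m * probs[idx])
    return idx, weights

def lp_subspace_cost(A, V, p, weights=None):
    """Sum of w_i * dist(a_i, span(V))^p, where V has orthonormal columns."""
    residual = A - (A @ V) @ V.T
    dist = np.linalg.norm(residual, axis=1)
    if weights is None:
        weights = np.ones(A.shape[0])
    return float(np.sum(weights * dist ** p))

# Usage: compare the full and sampled cost for one candidate subspace.
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 20))
k, p, m = 3, 1.0, 200
idx, w = sample_coreset(A, k, m, rng)
V = np.linalg.svd(A, full_matrices=False)[2][:k].T   # top-k right singular vectors
full_cost = lp_subspace_cost(A, V, p)
coreset_cost = lp_subspace_cost(A[idx], V, p, weights=w)
```

The usage lines check a single subspace only. A strong coreset must approximate the cost of every $k$-dimensional subspace simultaneously, which is what the paper's analysis establishes and what this sketch does not attempt to verify.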
