High-Dimensional Geometric Streaming for Nearly Low Rank Data
Hossein Esfandiari, Vahab Mirrokni, Praneeth Kacham, David P. Woodruff, Peilin Zhong
TL;DR
This work develops scalable streaming algorithms for high-dimensional $\ell_p$ subspace approximation in insertion-only settings. It introduces a deterministic strong coreset for the $\ell_\infty$ subspace problem, achieving distortion $O(\sqrt{k}\,\log(n\kappa))$ with size $O(k\log^2(n\kappa))$, and then extends to general $\ell_p$ via exponential embeddings, yielding $\text{poly}(k, \log(n\kappa))$-approximate solutions with modest space. A fast online algorithm based on online ridge leverage scores constructs the coreset efficiently, and lower bounds show the distortion is near-optimal up to polylogarithmic factors. The approach yields applications to outer radius, width estimation, and Löwner–John ellipsoid problems, and empirical results demonstrate fast, scalable performance on large datasets. The results push practical streaming subspace methods closer to offline guarantees while keeping the coreset a subset of the input points.
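A standard fact behind exponential-scaling reductions from $\ell_p$ to $\ell_\infty$ is the min-stability of exponential random variables. The following is only a sketch: the symbols $d_i$, $e_i$, and $e$ are introduced here for illustration, and the exact scaling used in the paper may differ. Fix a subspace $V$, write $d_i = d(a_i, V)$ and assume each $d_i > 0$ (points with $d_i = 0$ contribute nothing to either side). Let $e_1, \ldots, e_n$ be independent standard exponential random variables. Since $V$ is a subspace, scaling the point $a_i$ by $e_i^{-1/p}$ scales its distance to $V$ by the same factor, so the $\ell_\infty$ cost of the scaled points is

$$\max_{i} \frac{d_i}{e_i^{1/p}} = \Big(\min_{i} \frac{e_i}{d_i^{p}}\Big)^{-1/p} \overset{d}{=} \frac{\big(\sum_{i=1}^{n} d_i^{p}\big)^{1/p}}{e^{1/p}},$$

where $e$ is again a standard exponential random variable. The equality in distribution holds because $e_i / d_i^p$ is exponential with rate $d_i^p$, and the minimum of independent exponentials is exponential with rate equal to the sum of the rates. For a fixed $V$, the $\ell_\infty$ cost of the scaled points is therefore the $\ell_p$ cost of the original points times a random factor that does not depend on $V$.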
Abstract
We study streaming algorithms for the $\ell_p$ subspace approximation problem. Given points $a_1, \ldots, a_n$ as an insertion-only stream and a rank parameter $k$, the $\ell_p$ subspace approximation problem is to find a $k$-dimensional subspace $V$ such that $(\sum_{i=1}^n d(a_i, V)^p)^{1/p}$ is minimized, where $d(a, V)$ denotes the Euclidean distance between $a$ and $V$ defined as $\min_{v \in V}\|{a - v}\|_{2}$. When $p = \infty$, we need to find a subspace $V$ that minimizes $\max_i d(a_i, V)$. For $\ell_{\infty}$ subspace approximation, we give a deterministic strong coreset construction algorithm and show that it can be used to compute a $\text{poly}(k, \log n)$ approximate solution. We show that the distortion obtained by our coreset is nearly tight for any sublinear space algorithm. For $\ell_p$ subspace approximation, we show that by suitably scaling the points and then using our $\ell_{\infty}$ coreset construction, we can compute a $\text{poly}(k, \log n)$ approximation. Our algorithms are easy to implement and run very fast on large datasets. We also use our strong coreset construction to improve the results in a recent work of Woodruff and Yasuda (FOCS 2022) which gives streaming algorithms for high-dimensional geometric problems such as width estimation, convex hull estimation, and volume estimation.
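To make the objective concrete, the following minimal Python sketch evaluates the $\ell_p$ subspace approximation cost of a given candidate subspace. It is not the paper's streaming algorithm or coreset construction; it only computes the quantity being minimized. The function names (`orthonormalize`, `distance_to_subspace`, `subspace_cost`) and the toy points are assumptions made for illustration.

```python
import math

def orthonormalize(basis):
    """Gram-Schmidt: return an orthonormal basis for the span of `basis`."""
    ortho = []
    for v in basis:
        w = list(v)
        for q in ortho:
            dot = sum(wi * qi for wi, qi in zip(w, q))
            w = [wi - dot * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > 1e-12:  # skip (numerically) linearly dependent vectors
            ortho.append([wi / norm for wi in w])
    return ortho

def distance_to_subspace(a, ortho):
    """Euclidean distance d(a, V) from point a to V = span(ortho)."""
    residual = list(a)
    for q in ortho:
        dot = sum(ri * qi for ri, qi in zip(residual, q))
        residual = [ri - dot * qi for ri, qi in zip(residual, q)]
    return math.sqrt(sum(ri * ri for ri in residual))

def subspace_cost(points, basis, p):
    """(sum_i d(a_i, V)^p)^(1/p), or max_i d(a_i, V) when p is math.inf."""
    ortho = orthonormalize(basis)
    dists = [distance_to_subspace(a, ortho) for a in points]
    if math.isinf(p):
        return max(dists)
    return sum(d ** p for d in dists) ** (1.0 / p)

# Hypothetical toy input: three points in R^3 and the x-axis as V (k = 1).
points = [(3.0, 0.0, 4.0), (1.0, 2.0, 0.0), (0.0, 0.0, 1.0)]
basis = [(1.0, 0.0, 0.0)]

cost_l1 = subspace_cost(points, basis, 1)
cost_l2 = subspace_cost(points, basis, 2)
cost_linf = subspace_cost(points, basis, math.inf)
```

In this toy input the distances to the x-axis are 4, 2, and 1, so the $\ell_1$ cost is 7, the $\ell_2$ cost is $\sqrt{21}$, and the $\ell_\infty$ cost is 4. The sketch stores all points and so uses linear space, which is exactly what a streaming coreset is meant to avoid.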
