High-Dimensional Geometric Streaming for Nearly Low Rank Data

Hossein Esfandiari, Vahab Mirrokni, Praneeth Kacham, David P. Woodruff, Peilin Zhong

TL;DR

This work develops scalable streaming algorithms for high-dimensional $\ell_p$ subspace approximation in insertion-only settings. It introduces a deterministic strong coreset for the $\ell_\infty$ subspace problem, achieving distortion $O(\sqrt{k}\,\log(n\kappa))$ with size $O(k\log^2(n\kappa))$, and then extends to general $\ell_p$ via exponential embeddings, yielding $\text{poly}(k, \log(n\kappa))$-approximate solutions with modest space. A fast, online algorithm based on online ridge leverage scores constructs the coreset efficiently, and lower bounds show the distortion is near-optimal up to polylog factors. The approach yields applications to outer radius, width estimation, and Löwner–John ellipsoid problems, and empirical results demonstrate fast, scalable performance on large datasets. The results push practical streaming subspace methods closer to offline guarantees while maintaining strong subset-selection properties.

Abstract

We study streaming algorithms for the $\ell_p$ subspace approximation problem. Given points $a_1, \ldots, a_n$ as an insertion-only stream and a rank parameter $k$, the $\ell_p$ subspace approximation problem is to find a $k$-dimensional subspace $V$ such that $(\sum_{i=1}^n d(a_i, V)^p)^{1/p}$ is minimized, where $d(a, V)$ denotes the Euclidean distance between $a$ and $V$, defined as $\min_{v \in V}\|{a - v}\|_{2}$. When $p = \infty$, we need to find a subspace $V$ that minimizes $\max_i d(a_i, V)$. For $\ell_{\infty}$ subspace approximation, we give a deterministic strong coreset construction algorithm and show that it can be used to compute a $\text{poly}(k, \log n)$ approximate solution. We show that the distortion obtained by our coreset is nearly tight for any sublinear space algorithm. For $\ell_p$ subspace approximation, we show that by suitably scaling the points and then using our $\ell_{\infty}$ coreset construction, we can compute a $\text{poly}(k, \log n)$ approximation. Our algorithms are easy to implement and run very fast on large datasets. We also use our strong coreset construction to improve the results in a recent work of Woodruff and Yasuda (FOCS 2022), which gives streaming algorithms for high-dimensional geometric problems such as width estimation, convex hull estimation, and volume estimation.
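
To make the objective in the abstract concrete, the following is a minimal sketch that evaluates the $\ell_p$ subspace approximation cost of a given candidate subspace. It is not the paper's algorithm: it only computes the cost $(\sum_i d(a_i, V)^p)^{1/p}$ (or $\max_i d(a_i, V)$ for $p = \infty$). The function names `dist_to_subspace` and `lp_subspace_cost`, the toy points, and the assumption that $V$ is supplied as an orthonormal basis are all illustrative choices, not taken from the source.

```python
import math

def dist_to_subspace(a, basis):
    """Euclidean distance from point a to span(basis).

    Assumption: `basis` is a list of orthonormal vectors, so the
    projection of a is the sum of (a . u) * u over the basis vectors.
    """
    residual = list(a)
    for u in basis:
        coeff = sum(x * y for x, y in zip(a, u))
        residual = [r - coeff * y for r, y in zip(residual, u)]
    return math.sqrt(sum(r * r for r in residual))

def lp_subspace_cost(points, basis, p):
    """l_p subspace approximation cost of span(basis) on the given points."""
    dists = [dist_to_subspace(a, basis) for a in points]
    if math.isinf(p):
        return max(dists)                        # p = infinity: worst-case distance
    return sum(d ** p for d in dists) ** (1.0 / p)

# Hypothetical toy input: three points in R^3 and the k = 1 subspace
# spanned by the x-axis.
points = [(3.0, 4.0, 0.0), (1.0, 0.0, 2.0), (0.0, 0.0, 0.0)]
basis = [(1.0, 0.0, 0.0)]

for p in (1, 2, math.inf):
    print(p, lp_subspace_cost(points, basis, p))
```

As $p$ grows, the cost is increasingly dominated by the farthest point, which is why the $p = \infty$ case reduces to minimizing the maximum distance.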

Paper Structure (21 sections, 14 theorems, 73 equations, 1 figure, 1 algorithm)

This paper contains 21 sections, 14 theorems, 73 equations, 1 figure, 1 algorithm.

Key Result

Theorem 1.1

Given a parameter $k$ and $n$ points $a_1,\ldots,a_n \in \mathbb R^d$, Algorithm 1 selects a subset $S \subseteq [n]$ of points with $|S| = O(k\log^2 n\kappa)$ such that, for all $k$-dimensional subspaces $V$, $\max_{i \in [n]} d(a_i, V) \le O(\sqrt{k}\log(n\kappa)) \cdot \max_{i \in S} d(a_i, V)$. The streaming algorithm requires only enough space to store $O(k\log^{2} n\kappa)$ rows of $A$ and can be implemented in time $O(\operatorname{nnz}(A)\log n + d\operatorname{pol
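
The strong coreset property can be illustrated with a brute-force check on a toy instance. The sketch below assumes the guarantee takes the usual strong-coreset form, namely that for every subspace $V$ the ratio $\max_{i \in [n]} d(a_i, V) / \max_{i \in S} d(a_i, V)$ is bounded by the distortion. It does not construct a coreset and is not the paper's algorithm; it only measures the worst ratio of a hand-picked subset over sampled lines through the origin in the plane ($k = 1$, $d = 2$). The names `dist_to_line` and `worst_distortion`, the points, and the direction grid are hypothetical.

```python
import math

def dist_to_line(a, theta):
    """Distance from 2-D point a to the line through the origin at angle theta.

    For the unit direction u = (cos theta, sin theta), the distance is the
    magnitude of the 2-D cross product of a and u.
    """
    return abs(a[0] * math.sin(theta) - a[1] * math.cos(theta))

def worst_distortion(points, subset_idx, num_dirs=180):
    """Largest ratio (max over all points) / (max over subset) on sampled lines.

    Only `num_dirs` evenly spaced directions are checked, so this is a
    lower bound on the true worst case over all 1-dimensional subspaces.
    """
    worst = 1.0
    for j in range(num_dirs):
        theta = math.pi * j / num_dirs
        full = max(dist_to_line(a, theta) for a in points)
        sub = max(dist_to_line(points[i], theta) for i in subset_idx)
        if full == 0.0:
            continue                  # every point lies on this line
        if sub == 0.0:
            return math.inf           # subset sees zero cost, full set does not
        worst = max(worst, full / sub)
    return worst

# Hypothetical toy input. Point 2 is the midpoint of points 0 and 1.
points = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (2.0, 0.1)]
good = worst_distortion(points, [0, 1, 3])   # drops only the midpoint
bad = worst_distortion(points, [0, 1])       # also drops the far point
print(good, bad)
```

Dropping the midpoint costs nothing, because the distance to a subspace is convex and a convex combination can never be farther than both of its endpoints. Dropping the far point `(2.0, 0.1)` leaves the subset underestimating the cost on some lines, which is the kind of loss the distortion factor bounds.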

Figures (1)

  • Figure 1: Images used for experiments

Theorems & Definitions (24)

  • Theorem 1.1: Informal
  • Theorem 1.2: Informal
  • Theorem 1.3: Informal
  • Lemma 3.1
  • Lemma 3.2
  • Definition 3.3: Online Rank-$k$ Condition Number
  • Lemma 3.4: Sum of online rank-$k$ ridge leverage scores
  • Theorem 3.5
  • Theorem 3.6
  • Theorem 3.7: Outer $(d-k)$ radius estimation
  • ...and 14 more