Approximating the Top Eigenvector in Random Order Streams

Praneeth Kacham, David P. Woodruff

TL;DR

This work studies memory-efficient, one-pass streaming algorithms for approximating the top eigenvector of $A^{\mathsf{T}}A$ when the rows of $A$ arrive in uniformly random order. It combines a row-norm sampling scheme with a block power method to obtain a high-correlation solution using $O(h\,d\,\mathrm{polylog}(d))$ bits of space, where $h$ is the number of heavy rows, whenever the gap $R = \sigma_1(A)^2/\sigma_2(A)^2$ exceeds a sufficiently large constant. The paper also proves a lower bound: any algorithm using $O(h\,d/R)$ bits of space achieves correlation at most $1 - \Omega(1/R^2)$ with the top eigenvector. For the Price–Xun framework, it relaxes the gap requirement from $R=\Omega(\log n \cdot \log d)$ to $R=\Omega(\log^2 d)$ for arbitrary-order streams and $R=\Omega(\log d)$ for random-order streams. Additionally, it gives a hard instance with $R=\Omega(\log d/\log\log d)$ on which Oja's algorithm, with any fixed learning rate, fails to approximate the top eigenvector. Overall, the results clarify how the random-order assumption and the number of heavy rows determine the accuracy and space achievable in streaming PCA.
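
For orientation, the fixed-learning-rate Oja update mentioned above can be sketched as follows. This is a minimal illustration of the baseline update only, not the paper's row-norm sampling or block power method; the function name `oja_top_eigenvector`, the learning rate `eta=0.1`, and the toy stream are all assumptions made for this example.

```python
import math
import random


def oja_top_eigenvector(rows, eta, seed=0):
    """One pass of Oja's update with a fixed learning rate eta (illustrative)."""
    rng = random.Random(seed)
    d = len(rows[0])
    # Random unit-norm starting vector.
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for a in rows:
        # v <- v + eta * <a, v> * a, then renormalize to keep v a unit vector.
        dot = sum(ai * vi for ai, vi in zip(a, v))
        v = [vi + eta * dot * ai for ai, vi in zip(a, v)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v


# Hypothetical toy stream in d = 5: a dominant direction e_1 plus weak rows
# along the other coordinate axes, presented in a random order.
rows = [[2.0, 0.0, 0.0, 0.0, 0.0] for _ in range(200)]
for k in range(1, 5):
    for _ in range(50):
        rows.append([0.5 if j == k else 0.0 for j in range(5)])
random.Random(1).shuffle(rows)

v = oja_top_eigenvector(rows, eta=0.1)
correlation = v[0] ** 2  # squared inner product with the top eigenvector e_1
```

The toy stream has a very large gap, so a fixed learning rate works here; the hard instance described above concerns the low-gap regime, which this example does not reproduce.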

Abstract

When rows of an $n \times d$ matrix $A$ are given in a stream, we study algorithms for approximating the top eigenvector of the matrix $A^{\mathsf{T}}A$ (equivalently, the top right singular vector of $A$). We consider worst case inputs $A$ but assume that the rows are presented to the streaming algorithm in a uniformly random order. We show that when the gap parameter $R = \sigma_1(A)^2/\sigma_2(A)^2 = \Omega(1)$, then there is a randomized algorithm that uses $O(h \cdot d \cdot \operatorname{polylog}(d))$ bits of space and outputs a unit vector $v$ that has a correlation $1 - O(1/\sqrt{R})$ with the top eigenvector $v_1$. Here $h$ denotes the number of heavy rows in the matrix, defined as the rows with Euclidean norm at least $\|A\|_F/\sqrt{d \cdot \operatorname{polylog}(d)}$. We also provide a lower bound showing that any algorithm using $O(hd/R)$ bits of space can obtain at most $1 - \Omega(1/R^2)$ correlation with the top eigenvector. Thus, parameterizing the space complexity in terms of the number of heavy rows is necessary for high accuracy solutions. Our results improve upon the $R = \Omega(\log n \cdot \log d)$ requirement in a recent work of Price and Xun (FOCS 2024). We note that the algorithm of Price and Xun works for arbitrary order streams whereas our algorithm requires a stronger assumption that the rows are presented in a uniformly random order. We additionally show that the gap requirements in their analysis can be brought down to $R = \Omega(\log^2 d)$ for arbitrary order streams and $R = \Omega(\log d)$ for random order streams. The requirement of $R = \Omega(\log d)$ for random order streams is nearly tight for their analysis as we obtain a simple instance with $R = \Omega(\log d/\log\log d)$ for which their algorithm, with any fixed learning rate, cannot output a vector approximating the top eigenvector $v_1$.
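
The heavy-row definition above can be made concrete with a small offline sketch. The abstract does not specify the $\operatorname{polylog}(d)$ factor, so the code below assumes it is $(\ln d)^c$ with a hypothetical exponent `polylog_exponent`; the function name and the toy matrix are likewise assumptions. It reads the whole matrix to compute $\|A\|_F$, so it illustrates the definition rather than a streaming procedure.

```python
import math


def count_heavy_rows(rows, polylog_exponent=1.0):
    """Count rows with norm >= ||A||_F / sqrt(d * polylog(d)).

    polylog(d) is ASSUMED to be max(1, ln d) ** polylog_exponent; the
    source does not fix this factor.
    """
    d = len(rows[0])
    frob_sq = sum(x * x for row in rows for x in row)  # ||A||_F^2
    polylog = max(1.0, math.log(d)) ** polylog_exponent
    threshold_sq = frob_sq / (d * polylog)
    # Compare squared norms to avoid taking square roots.
    return sum(1 for row in rows if sum(x * x for x in row) >= threshold_sq)


# Hypothetical toy matrix: one large row and 100 small rows in d = 4.
rows = [[10.0, 0.0, 0.0, 0.0]] + [[0.1, 0.1, 0.1, 0.1] for _ in range(100)]
h = count_heavy_rows(rows)
```

In this toy matrix only the first row clears the threshold, so `h` is 1; a space bound of the form $O(h \cdot d \cdot \operatorname{polylog}(d))$ scales with this count.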

Paper Structure

This paper contains 13 sections, 12 theorems, 69 equations, 1 algorithm.

Key Result

Theorem 1.1

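Let $a_1, \ldots, a_n \in \mathbb{R}^d$ be a randomly ordered stream and let $A$ denote the $n \times d$ matrix with rows given by $a_1, \ldots, a_n$. If $R = \lambda_1(A^{\mathsf{T}}A)/\lambda_2(A^{\mathsf{T}}A) > C$ for a large enough constant $C$ and the number of heavy rows in the stream is at most $h$, then there is a streaming algorithm using $O(h \cdot d \cdot \operatorname{polylog}(d))$ bits of space that outputs a unit vector $v$ with correlation $1 - O(1/\sqrt{R})$ with the top eigenvector $v_1$, with probability $\ge 4/5$.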
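
To make the gap parameter $R = \lambda_1(A^{\mathsf{T}}A)/\lambda_2(A^{\mathsf{T}}A)$ concrete, the following offline sketch estimates it on a small matrix using power iteration with deflation. It stores the full $d \times d$ matrix $A^{\mathsf{T}}A$, so it is not a streaming algorithm and not the method of the paper; all function names, the iteration count, and the example matrix are assumptions made for illustration, and it presumes $\lambda_2 > 0$.

```python
import math
import random


def gram_matrix(rows):
    """Return the d x d matrix A^T A for a list of length-d rows."""
    d = len(rows[0])
    g = [[0.0] * d for _ in range(d)]
    for a in rows:
        for i in range(d):
            for j in range(d):
                g[i][j] += a[i] * a[j]
    return g


def top_eigenpair(g, iters=500, seed=0):
    """Power iteration on a symmetric PSD matrix; returns (eigenvalue, vector)."""
    rng = random.Random(seed)
    d = len(g)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(g[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # g v = 0: nothing more to learn from iterating
            break
        v = [x / norm for x in w]
    # Rayleigh quotient v^T g v for the unit vector v.
    gv = [sum(g[i][j] * v[j] for j in range(d)) for i in range(d)]
    return sum(vi * gi for vi, gi in zip(v, gv)), v


def spectral_gap(rows):
    """Estimate R = lambda_1 / lambda_2 of A^T A (assumes lambda_2 > 0)."""
    g = gram_matrix(rows)
    lam1, v1 = top_eigenpair(g)
    d = len(g)
    # Deflation: subtract the top component so lambda_2 becomes the largest.
    deflated = [[g[i][j] - lam1 * v1[i] * v1[j] for j in range(d)]
                for i in range(d)]
    lam2, _ = top_eigenpair(deflated, seed=1)
    return lam1 / lam2


gap = spectral_gap([[3.0, 0.0], [0.0, 1.0]])
```

For the example rows, $A^{\mathsf{T}}A = \operatorname{diag}(9, 1)$, so the estimated gap is close to 9. Power iteration converges slowly when $\lambda_1$ and $\lambda_2$ are close, which is the low-gap regime the theorem's condition $R > C$ excludes.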

Theorems & Definitions (21)

  • Theorem 1.1
  • Theorem 1.2
  • Theorem 2.1
  • proof
  • Lemma 2.3
  • proof
  • Theorem 2.4: wang1997some
  • Lemma 2.5
  • proof
  • Theorem 2.6
  • ...and 11 more