Approximating the Top Eigenvector in Random Order Streams
Praneeth Kacham, David P. Woodruff
TL;DR
This work studies memory-efficient, one-pass streaming algorithms for approximating the top eigenvector of $A^{\mathsf{T}}A$ when the rows of $A$ arrive in uniformly random order. It introduces a row-norm sampling scheme and a block power method that together output a unit vector with correlation $1 - O(1/\sqrt{R})$ with the top eigenvector using $O(h\,d\,\mathrm{polylog}(d))$ bits of space, where $h$ is the number of heavy rows and the gap $R = \sigma_1(A)^2/\sigma_2(A)^2$ need only be $\Omega(1)$. The paper also proves a lower bound: any algorithm using $O(h\,d/R)$ bits of space achieves correlation at most $1 - \Omega(1/R^2)$, so the dependence on the number of heavy rows is necessary for high-accuracy solutions. In addition, it relaxes the gap requirement in the analysis of Price and Xun from $R = \Omega(\log n \cdot \log d)$ to $R=\Omega(\log^2 d)$ for arbitrary order streams and $R=\Omega(\log d)$ for random order streams. Finally, it gives a hard instance with $R = \Omega(\log d/\log\log d)$ on which Oja's algorithm with any fixed learning rate fails to approximate the top eigenvector, showing that the $\Omega(\log d)$ requirement is nearly tight for that analysis. Together, these results clarify how the random-order assumption and the heavy-row structure determine the accuracy and space achievable in streaming PCA.
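For readers unfamiliar with Oja's algorithm, the following is a minimal sketch of the textbook update with a fixed learning rate, run over a randomly permuted row stream. It is not the paper's algorithm and does not reproduce the hard instance; the function names `oja_fixed_rate` and `random_order_stream`, the Gaussian initialization, and the per-step normalization are illustrative assumptions.

```python
import numpy as np


def random_order_stream(A, rng):
    """Return the rows of A in a uniformly random order (hypothetical helper)."""
    return A[rng.permutation(A.shape[0])]


def oja_fixed_rate(rows, eta, rng):
    """One pass of Oja's update v <- v + eta * a * <a, v> with a fixed rate eta.

    Assumptions (not taken from the paper): random Gaussian start and
    renormalization after every row. Memory use is a single d-vector.
    """
    d = rows.shape[1]
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    for a in rows:
        v = v + eta * a * (a @ v)  # rank-one update with the current row
        v /= np.linalg.norm(v)     # keep v a unit vector
    return v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 500, 5
    # Synthetic input: a dominant first coordinate plus small isotropic noise.
    A = 0.05 * rng.standard_normal((n, d))
    A[:, 0] += rng.standard_normal(n)
    v = oja_fixed_rate(random_order_stream(A, rng), eta=0.1, rng=rng)
    print("alignment with e_1:", abs(v[0]))
```

On a large-gap input like this synthetic one, the fixed-rate update aligns with the dominant direction; the hard instance mentioned above concerns the regime where the gap is only about $\log d/\log\log d$.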
Abstract
When rows of an $n \times d$ matrix $A$ are given in a stream, we study algorithms for approximating the top eigenvector of the matrix $A^{\mathsf{T}}A$ (equivalently, the top right singular vector of $A$). We consider worst-case inputs $A$ but assume that the rows are presented to the streaming algorithm in a uniformly random order. We show that when the gap parameter $R = \sigma_1(A)^2/\sigma_2(A)^2 = \Omega(1)$, there is a randomized algorithm that uses $O(h \cdot d \cdot \operatorname{polylog}(d))$ bits of space and outputs a unit vector $v$ that has correlation $1 - O(1/\sqrt{R})$ with the top eigenvector $v_1$. Here $h$ denotes the number of heavy rows in the matrix, defined as the rows with Euclidean norm at least $\|A\|_F/\sqrt{d \cdot \operatorname{polylog}(d)}$. We also provide a lower bound showing that any algorithm using $O(hd/R)$ bits of space can obtain at most $1 - \Omega(1/R^2)$ correlation with the top eigenvector. Thus, parameterizing the space complexity in terms of the number of heavy rows is necessary for high-accuracy solutions. Our results improve upon the $R = \Omega(\log n \cdot \log d)$ requirement in a recent work of Price and Xun (FOCS 2024). We note that the algorithm of Price and Xun works for arbitrary order streams, whereas our algorithm requires the stronger assumption that the rows are presented in a uniformly random order. We additionally show that the gap requirements in their analysis can be brought down to $R = \Omega(\log^2 d)$ for arbitrary order streams and $R = \Omega(\log d)$ for random order streams. The requirement of $R = \Omega(\log d)$ for random order streams is nearly tight for their analysis, as we obtain a simple instance with $R = \Omega(\log d/\log\log d)$ for which their algorithm, with any fixed learning rate, cannot output a vector approximating the top eigenvector $v_1$.
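The quantities in the abstract can be made concrete with a small offline sketch. It computes the gap $R$, the top right singular vector $v_1$, the number of heavy rows $h$, and a correlation score for a matrix held fully in memory, so it illustrates the definitions only and is not a streaming algorithm. The function names are hypothetical, `polylog_factor` stands in for the unspecified $\operatorname{polylog}(d)$ term, and taking correlation to be $|\langle v, v_1\rangle|$ is an assumption (the squared inner product is another common convention).

```python
import numpy as np


def gap_and_top_vector(A):
    """Return (R, v1) with R = sigma_1(A)^2 / sigma_2(A)^2 and v1 the top right singular vector."""
    # np.linalg.svd returns singular values in descending order;
    # the rows of Vt are the right singular vectors.
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (s[0] / s[1]) ** 2, Vt[0]


def count_heavy_rows(A, polylog_factor):
    """Count rows with Euclidean norm >= ||A||_F / sqrt(d * polylog_factor).

    polylog_factor is a placeholder for the polylog(d) term, whose exact
    form is not given in the abstract.
    """
    d = A.shape[1]
    threshold = np.linalg.norm(A, "fro") / np.sqrt(d * polylog_factor)
    row_norms = np.linalg.norm(A, axis=1)
    return int(np.sum(row_norms >= threshold))


def correlation(v, v1):
    """Absolute cosine between v and v1 (assumed convention, sign-invariant)."""
    return abs(v @ v1) / (np.linalg.norm(v) * np.linalg.norm(v1))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 8))
    A[0] *= 20.0  # plant one row with a large norm
    R, v1 = gap_and_top_vector(A)
    h = count_heavy_rows(A, polylog_factor=np.log(A.shape[1]) ** 2)
    print("R =", R, "h =", h, "corr(v1, row 0) =", correlation(A[0], v1))
```

Because eigenvectors are defined only up to sign, the correlation helper takes an absolute value, so $v$ and $-v$ score the same.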
