Accumulation of Sub-Sampling Matrices with Applications to Statistical Computation
Yifan Chen, Yun Yang
TL;DR
This work addresses the computational bottlenecks of large-scale statistical computation by proposing accumulative sub-sampling, a data-adaptive random projection that aggregates multiple sub-sampled sketches to reduce the effective projection dimension. The authors prove spectral-norm guarantees for approximate matrix multiplication (AMM) showing that a small projection dimension $d$ (up to poly-log factors) suffices, while the total sub-sample budget $md$, with $m$ the number of accumulated sketches, is governed by the sampling quality parameter $\beta$ and the desired accuracy. They connect the framework to compositional sketching, relate it to Gaussian sketching and classical sub-sampling, and demonstrate substantial computational savings in downstream tasks such as eigendecomposition (via randomized SVD) and kernel ridge regression (KRR, via Nyström approximation) while maintaining statistical accuracy. Experiments on matrix multiplication, spectral clustering, and KRR show consistent gains in efficiency and accuracy under suboptimal sampling schemes, supporting scalable statistical inference on large datasets.
Abstract
With appropriately chosen sampling probabilities, sampling-based random projection can be used to implement large-scale statistical methods, substantially reducing computational cost while maintaining low statistical error. However, computing optimal sampling probabilities is often itself expensive, and in practice one typically resorts to suboptimal schemes. This generally increases time and space costs: more sub-samples are required and the resulting projection matrices become larger, which makes the inference procedure more computationally demanding. In this paper, we extend the framework of sampling-based random projection and propose a new projection method, \emph{accumulative sub-sampling}. By carefully accumulating multiple such projections, accumulative sub-sampling improves statistical efficiency while controlling the effective matrix size throughout the statistical computation. On the theoretical side, we quantify how the quality of the sub-sampling scheme affects the error in approximating matrix products and positive semidefinite matrices, and show how the proposed accumulation strategy mitigates this effect. Moreover, we apply our method to statistical models involving intensive matrix operations, such as eigendecomposition in spectral clustering and matrix inversion in kernel ridge regression, and demonstrate that reducing the effective matrix size leads to substantial computational savings. Numerical experiments across a range of problems further show that our approach consistently improves computational efficiency compared to existing random projection baselines under suboptimal sampling schemes.
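
To make the accumulation idea concrete, the following is a minimal sketch of one plausible construction, not necessarily the exact estimator analyzed in the paper. It assumes that the projection matrix is a sum of $m$ independent sub-sampling matrices with $d$ columns each, that every sampled entry carries a random sign, and that entries are rescaled by $1/\sqrt{d\,m\,p_i}$ so that the sketch is unbiased for matrix products. The function names `accumulated_sketch` and `sketched_product`, the random signs, and the uniform sampling probabilities are illustrative assumptions.

```python
import numpy as np


def accumulated_sketch(n, d, m, probs, rng):
    """Return an n x d sketch built by summing m signed sub-sampling matrices.

    Assumed construction (not taken from the text above): in each of the m
    rounds, every column receives one nonzero entry at a row index drawn from
    `probs`, with a random sign and the scaling 1 / sqrt(d * m * p_row), so
    that E[S @ S.T] is the identity.
    """
    probs = np.asarray(probs, dtype=float)
    S = np.zeros((n, d))
    cols = np.arange(d)
    for _ in range(m):
        rows = rng.choice(n, size=d, p=probs)      # one sampled row per column
        signs = rng.choice([-1.0, 1.0], size=d)    # independent random signs
        # (rows, cols) pairs are distinct within a round, so += is safe here.
        S[rows, cols] += signs / np.sqrt(d * m * probs[rows])
    return S


def sketched_product(A, B, S):
    """Approximate A.T @ B by (S.T @ A).T @ (S.T @ B)."""
    return (S.T @ A).T @ (S.T @ B)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, m = 2000, 50, 8
    A = rng.standard_normal((n, 5))
    B = rng.standard_normal((n, 3))
    probs = np.full(n, 1.0 / n)  # uniform sampling, a deliberately simple choice
    S = accumulated_sketch(n, d, m, probs, rng)
    err = np.linalg.norm(sketched_product(A, B, S) - A.T @ B, 2)
    print("spectral-norm error:", err, "exact norm:", np.linalg.norm(A.T @ B, 2))
```

In this sketch the projected matrices `S.T @ A` and `S.T @ B` have only $d$ rows regardless of $m$, although $md$ sub-samples are drawn in total. This is the sense in which accumulation can spend a larger sampling budget without enlarging the matrices passed to downstream steps such as eigendecomposition or matrix inversion.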
