A Framework for Statistical Inference via Randomized Algorithms
Zhixiang Zhang, Sokbae Lee, Edgar Dobriban
TL;DR
This work develops a general statistical-inference framework for outputs of randomized algorithms, treating the data as deterministic and the randomness as arising from algorithmic procedures such as sketching and stochastic optimization. It introduces three main inference methods—sub-randomization, multi-run plug-in, and multi-run aggregation—along with an asymptotically pivotal baseline, and demonstrates how to apply them to least-squares problems under growing-dimension regimes and to stochastic optimization, including SGD with Polyak-Ruppert averaging and momentum methods. The results establish conditions under which valid confidence regions can be constructed with modest overhead, quantify bias and variance properties for different sketching schemes (i.i.d. and Haar), and provide extensive simulations showing competitive coverage and interval lengths. The framework offers practical guidelines for balancing accuracy and computation in large-scale data analysis, with extensions to iterative sketching and PCA, and broad applicability to stochastic approximation with dependent data. The work also includes a thorough cost analysis and discusses data-access considerations for scalable, parallel inference.
Abstract
Randomized algorithms, such as randomized sketching or stochastic optimization, are a promising approach to ease the computational burden of analyzing large datasets. However, randomized algorithms also produce non-deterministic outputs, which raises the problem of evaluating their accuracy. In this paper, we develop a statistical inference framework for quantifying the uncertainty of the outputs of randomized algorithms. Our key conclusion is that one can perform statistical inference for the target of a sequence of randomized algorithms as long as, in the limit, their outputs fluctuate around the target according to any (possibly unknown) probability distribution. In this setting, we develop appropriate statistical inference methods (sub-randomization, multi-run plug-in, and multi-run aggregation) by estimating the unknown parameters of the limiting distribution, either using multiple runs of the randomized algorithm or by tailored estimates. As illustrations, we develop methods for statistical inference when using stochastic optimization (such as Polyak-Ruppert averaging in stochastic gradient descent and stochastic optimization with momentum). We also illustrate our methods in inference for least squares parameters via randomized sketching, by characterizing the limiting distributions of sketching estimates in a possibly growing-dimensional setting. We further characterize the computation and communication cost of our methods, showing that in certain cases they add negligible overhead. The results are supported by a broad range of simulations.
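To make the multi-run idea concrete, here is a minimal sketch in Python, loosely in the spirit of the multi-run methods named in the abstract and not the authors' exact procedure. It treats the data (X, y) as fixed, reruns a Gaussian sketch-and-solve least squares solver K times with independent sketches, and forms a normal-approximation interval for a linear functional of the full-data least squares solution. The function names `sketched_ls` and `multi_run_interval`, the Gaussian sketch, the averaging of runs, and all sizes are assumptions chosen for illustration.

```python
import numpy as np
from statistics import NormalDist


def sketched_ls(X, y, m, rng):
    """One run of the randomized algorithm: Gaussian sketch-and-solve least squares."""
    n = X.shape[0]
    # i.i.d. Gaussian sketching matrix (an illustrative choice of sketch)
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    beta, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)
    return beta


def multi_run_interval(X, y, c, m, K, alpha=0.05, seed=0):
    """Interval for c @ beta_full built from K independent sketched runs.

    The only randomness is the algorithm's; the data are held fixed.
    """
    rng = np.random.default_rng(seed)
    vals = np.array([c @ sketched_ls(X, y, m, rng) for _ in range(K)])
    center = vals.mean()
    # Standard error of the average of K runs, estimated from the runs themselves
    se = vals.std(ddof=1) / np.sqrt(K)
    # Normal quantile; for small K a Student-t quantile would be more cautious
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return center - z * se, center + z * se


# Small demo on synthetic data (sizes are arbitrary assumptions)
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5))
y = X @ np.arange(1.0, 6.0) + rng.standard_normal(2000)
c = np.zeros(5)
c[0] = 1.0

lo, hi = multi_run_interval(X, y, c, m=200, K=30)
target = c @ np.linalg.lstsq(X, y, rcond=None)[0]  # full-data solution, for comparison
print(f"interval: [{lo:.4f}, {hi:.4f}], full-data target: {target:.4f}")
```

Each run touches only an m-by-p sketched problem, and the K runs are independent, so they could be executed in parallel. Centering on the average of the runs relies on the sketched estimator having negligible bias around the target, which should be checked for the sketch actually used.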
