A Framework for Statistical Inference via Randomized Algorithms

Zhixiang Zhang, Sokbae Lee, Edgar Dobriban

TL;DR

This work develops a general statistical-inference framework for outputs of randomized algorithms, treating the data as deterministic and the randomness as arising from algorithmic procedures such as sketching and stochastic optimization. It introduces three main inference methods—sub-randomization, multi-run plug-in, and multi-run aggregation—along with an asymptotically pivotal baseline, and demonstrates how to apply them to least-squares problems under growing-dimension regimes and to stochastic optimization, including SGD with Polyak-Ruppert averaging and momentum methods. The results establish conditions under which valid confidence regions can be constructed with modest overhead, quantify bias and variance properties for different sketching schemes (i.i.d. and Haar), and provide extensive simulations showing competitive coverage and interval lengths. The framework offers practical guidelines for balancing accuracy and computation in large-scale data analysis, with extensions to iterative sketching and PCA, and broad applicability to stochastic approximation with dependent data. The work also includes a thorough cost analysis and discusses data-access considerations for scalable, parallel inference.

Abstract

Randomized algorithms, such as randomized sketching or stochastic optimization, are a promising approach to ease the computational burden in analyzing large datasets. However, randomized algorithms also produce non-deterministic outputs, leading to the problem of evaluating their accuracy. In this paper, we develop a statistical inference framework for quantifying the uncertainty of the outputs of randomized algorithms. Our key conclusion is that one can perform statistical inference for the target of a sequence of randomized algorithms as long as in the limit, their outputs fluctuate around the target according to any (possibly unknown) probability distribution. In this setting, we develop appropriate statistical inference methods -- sub-randomization, multi-run plug-in and multi-run aggregation -- by estimating the unknown parameters of the limiting distribution either using multiple runs of the randomized algorithm, or by tailored estimates. As illustrations, we develop methods for statistical inference when using stochastic optimization (such as Polyak-Ruppert averaging in stochastic gradient descent and stochastic optimization with momentum). We also illustrate our methods in inference for least squares parameters via randomized sketching, by characterizing the limiting distributions of sketching estimates in a possibly growing dimensional case. We further characterize the computation and communication cost of our methods, showing that in certain cases, they add negligible overhead. The results are supported via a broad range of simulations.

Paper Structure

This paper contains 68 sections, 31 theorems, 348 equations, 15 figures, 12 tables, and 2 algorithms.

Key Result

Proposition 2.1

Consider a sequence of problems as defined above. Suppose that as $m,n\to\infty$, for a known distribution $J$. For $\alpha\in (0,1)$, let $\Xi$ be a measurable set such that $J(\Xi)\geqslant 1-\alpha$. If $(\widehat{T}_{m,n})_{n\geqslant 1}$ is invertible with probability tending to unity and $\Xi$ is an open set, then Moreover, if $\Xi$ is a continuity set of $J$, then $\lim_{m,n\to\infty} P\l

Figures (15)

  • Figure 1: Flowchart illustrating our proposed framework. We consider some large data set $z_n$, which we cannot access directly due to its size. Instead, we observe the output $Z_{m,n} = \mathcal{A}_m(z_n, S_{m,n})$ of a randomized algorithm, where $S_{m,n}$ is a source of randomness. We are interested in some parameter $\theta_n(z_n)$ of the unobserved data set, and aim to build a confidence region $C_m$ that contains this parameter with some pre-specified probability, so that $P(\theta_n(z_n) \in C_m) \geqslant 1-\alpha$, at least asymptotically. We propose several approaches to reach this goal; some rely on generating additional smaller datasets $\{Z_{b,i} =\mathcal{A}_b(z_n, S_{b,i})\}_{i=1}^K$ by running the randomized algorithm repeatedly or in a distributed manner, and using them to construct the estimate $L_{b,m,n}$ of the error distribution of the output of the randomized algorithm.
  • Figure 2: Methods for statistical inference via randomized algorithms, categorized by the conditions under which they are applicable. Here, $\widehat{J}_{m,n}$ is the distribution of ${\widehat{T}_{m,n}(\widehat{\theta}_{m}-\theta_n)}$, where the randomness is only due to $S_{m,n}$. We consider two sets of conditions: either that $\widehat{J}_{m,n}$ converges to a limiting distribution $J$, or that $\widehat{\theta}_{m}$ is nearly unbiased.
  • Figure 3: Left: Coverage of 90% intervals for the first coordinate of $\beta_n$, and a 95% Clopper-Pearson interval for the coverage, in a synthetic data example. Right: Length of the confidence intervals. We use sketch-and-solve estimators obtained via i.i.d. sketching, and data generated from the model in Case 1, with $p=500$, $n=8{,}000$, $b=600$, $K=100$, and 500 Monte Carlo trials for each setting.
  • Figure 4: Inference using momentum and vanilla SGD algorithms. Methods compared include sub-randomization and multi-run plug-in inference with varying learning rates. The learning rates are $\gamma_t = 0.4/(t+1)^a$, and momentum parameters are $1- \gamma_t$.
  • Figure 5: Time for generating $K$ small sketches of size $b=200$ with $X\in \mathbb{R}^{2^{17}\times 100}$: "Block" refers to the memory-efficient computation of $K$ sketches using data blocking, and "full" represents the naive method requiring loading the full data $K$ times. Loading time indicates the time of loading the data, and total time encompasses both loading and sketch computation.
  • ...and 10 more figures
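
The Figure 3 caption reports a 95% Clopper-Pearson interval around each empirical coverage rate. As a rough illustration of how such an interval can be computed from Monte Carlo counts, here is a minimal standard-library sketch. The function names (`binom_cdf`, `clopper_pearson`) and the count used in the usage line are hypothetical and are not taken from the paper; the sketch uses the standard definition of the exact binomial interval, solved by bisection.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with terms computed in log space."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def clopper_pearson(k, n, level=0.95, tol=1e-10):
    """Exact (Clopper-Pearson) interval for a binomial proportion.

    k: number of successes (e.g. trials in which the interval covered the target)
    n: number of trials
    """
    alpha = 1.0 - level

    def solve(f, target):
        # f is decreasing in p on [0, 1]; find p with f(p) = target by bisection.
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    # Lower endpoint: P(X >= k | p) = alpha / 2, i.e. P(X <= k - 1 | p) = 1 - alpha / 2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1.0 - alpha / 2)
    # Upper endpoint: P(X <= k | p) = alpha / 2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


# Hypothetical usage: 450 of 500 Monte Carlo intervals covered the target.
lower, upper = clopper_pearson(450, 500, level=0.95)
```

Because the interval is exact rather than based on a normal approximation, it remains valid when the empirical coverage is close to 0 or 1, which is common when checking nominal 90% or 95% intervals.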

Theorems & Definitions (46)

  • Proposition 2.1: Classical asymptotically pivotal inference
  • Theorem 2.2: Inference via sub-randomization
  • Corollary 2.3: Sub-randomization inference under converging scale
  • Theorem 2.4: Multi-run plug-in inference for a normal limit
  • Corollary 2.5: Multi-run plug-in inference with centering and scaling estimated using different output sizes
  • Theorem 2.6: Inference by multi-run aggregation
  • Theorem 3.2: Distributions of estimators obtained via sketching with i.i.d. entries
  • Remark 3.3
  • Corollary 3.4: Simplified distributions of i.i.d. sketching estimators
  • Proposition 3.5: Variance estimation for Gaussian sketching
  • ...and 36 more