Worth Their Weight: Randomized and Regularized Block Kaczmarz Algorithms without Preprocessing

Gil Goldshlager, Jiang Hu, Lin Lin

TL;DR

The paper addresses large-scale linear least-squares problems without preprocessing by analyzing the randomized block Kaczmarz method under uniform sampling (RBK-U) and introducing ReBlocK, a regularized variant of RBK. It shows that RBK-U converges in a Monte Carlo sense to a weighted least-squares solution but can fail when the data contain nearly singular blocks, and that adding mild regularization yields robust convergence with controllable bias and variance. An analysis for Gaussian data gives conditions under which RBK-U recovers $x^*$, with faster rates than minibatch stochastic gradient descent (mSGD) when the spectrum decays rapidly; ReBlocK adds practical robustness and efficiency, including in natural gradient applications. Experiments show ReBlocK-U outperforming both RBK-U and mSGD on inconsistent problems, with tail averaging further improving convergence. Together these results support a preprocessing-free, Monte Carlo-based approach to large-scale least-squares and to related machine learning and scientific computing tasks.

Abstract

Due to the ever-growing amounts of data leveraged for machine learning and scientific computing, it is increasingly important to develop algorithms that sample only a small portion of the data at a time. In the case of linear least-squares, the randomized block Kaczmarz method (RBK) is an appealing example of such an algorithm, but its convergence is only understood under sampling distributions that require potentially prohibitively expensive preprocessing steps. To address this limitation, we analyze RBK when the data is sampled uniformly, showing that its iterates converge in a Monte Carlo sense to a $\textit{weighted}$ least-squares solution. Unfortunately, for general problems the bias of the weighted least-squares solution and the variance of the iterates can become arbitrarily large. We show that these quantities can be rigorously controlled by incorporating regularization into the RBK iterations, yielding the regularized algorithm ReBlocK. Numerical experiments, including examples arising from natural gradient optimization, demonstrate that ReBlocK can outperform both RBK and minibatch stochastic gradient descent for inconsistent problems with rapidly decaying singular values.
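To make the iteration described in the abstract concrete, here is a minimal NumPy sketch of block Kaczmarz with uniformly sampled row blocks and tail averaging. It is an illustration under stated assumptions, not the authors' implementation: the function name `block_kaczmarz`, its parameters, and in particular the regularized step $(A_S A_S^\top + \lambda k I)^{-1}$ used when `lam > 0` are assumptions chosen for this sketch. The unregularized step uses the identity $A_S^\top (A_S A_S^\top)^+ = A_S^+$.

```python
import numpy as np


def block_kaczmarz(A, b, k, steps, lam=0.0, burn_in=0, seed=0):
    """Sketch of block Kaczmarz with uniformly sampled row blocks.

    lam == 0 : pseudoinverse step, x += A_S^+ (b_S - A_S x)   (RBK-U style)
    lam > 0  : regularized step with (A_S A_S^T + lam*k*I)^{-1}
               (a ReBlocK-style step; the lam*k scaling is an assumption)
    Returns the last iterate and the average of the iterates with index
    t >= burn_in (the tail average).
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)  # x_0 = 0 lies in range(A^T)
    tail_sum = np.zeros(n)
    tail_count = 0
    for t in range(steps):
        # Uniform sample of k distinct row indices; no preprocessing of A.
        S = rng.choice(m, size=k, replace=False)
        A_S = A[S]
        r_S = b[S] - A_S @ x  # residual restricted to the sampled block
        if lam > 0:
            gram = A_S @ A_S.T + lam * k * np.eye(k)
            x = x + A_S.T @ np.linalg.solve(gram, r_S)
        else:
            # Minimum-norm least-squares solve equals A_S^+ r_S.
            x = x + np.linalg.lstsq(A_S, r_S, rcond=None)[0]
        if t >= burn_in:
            tail_sum += x
            tail_count += 1
    return x, tail_sum / max(tail_count, 1)
```

On a consistent system both variants should approach the exact solution; on an inconsistent one, the tail average is the quantity to inspect, since individual iterates keep fluctuating.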

Paper Structure

This paper contains 26 sections, 15 theorems, 123 equations, 14 figures, 1 algorithm.

Key Result

Theorem 3.1

Consider the RBK-U algorithm, namely Algorithm 1 with $M(A_S) = (A_S A_S^\top)^+$ and $\rho = \mathbf{U}(m,k)$. Let $\alpha = \sigma^+_{\rm min}(\overline{P})$ and assume that $x_0 \in \operatorname{range}(A^\top)$. Then the expectation of the RBK-U iterates $x_T$ converges to $x^{(\rho)}$ as [...]. Furthermore, the tail averages $\overline{x}_T$ converge to $x^{(\rho)}$ as [...] with $V = \mathbb{E}\,_{S \sim \
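For a very small system, the quantities named in the theorem can be computed exactly by enumerating every block of $k$ rows. The sketch below does this under assumptions that are not stated in the excerpt above: it takes $\overline{P}$ to be the average of $A_S^\top M(A_S) A_S$ over uniformly sampled blocks, and takes $x^{(\rho)}$ to be the minimum-norm solution of the averaged fixed-point equation $\overline{P} x = \mathbb{E}[A_S^\top M(A_S) b_S]$. The function name `weighted_solution_by_enumeration` is hypothetical.

```python
import itertools

import numpy as np


def weighted_solution_by_enumeration(A, b, k):
    """Enumerate all k-row blocks of a tiny system (assumed definitions).

    Returns (x_rho, alpha, P_bar), where P_bar averages the projectors
    A_S^+ A_S, x_rho solves P_bar x = mean(A_S^+ b_S) with minimum norm,
    and alpha is the smallest nonzero singular value of P_bar.
    """
    m, n = A.shape
    blocks = list(itertools.combinations(range(m), k))
    P_bar = np.zeros((n, n))
    q_bar = np.zeros(n)
    for S in blocks:
        idx = list(S)
        pinv_S = np.linalg.pinv(A[idx])  # equals A_S^T (A_S A_S^T)^+
        P_bar += pinv_S @ A[idx]
        q_bar += pinv_S @ b[idx]
    P_bar /= len(blocks)
    q_bar /= len(blocks)
    x_rho = np.linalg.lstsq(P_bar, q_bar, rcond=None)[0]
    sv = np.linalg.svd(P_bar, compute_uv=False)
    alpha = sv[sv > 1e-12 * sv.max()].min()
    return x_rho, alpha, P_bar
```

Enumeration costs on the order of $\binom{m}{k}$ small pseudoinverses, so this is only a checking tool for toy problems; for a consistent system with generic data the returned `x_rho` should coincide with the exact solution.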

Figures (14)

  • Figure 1: Visual depiction of the $3 \times 2$ linear system \ref{eq:isosceles} that causes RBK-U to fail catastrophically when $k=2$. As $\epsilon \rightarrow 0$ the top vertex walks off to infinity and the recovered solution $x^{(\rho)}$ goes with it, while the true solution $x^*$ approaches the $x$-axis.
  • Figure 1: Comparison of methods on two problems with Gaussian data, with no singular value decay (left) and rapid singular value decay (right). The vertical dotted line indicates the burn-in time, before which results are shown for individual iterates.
  • Figure 1: Size of the bias $x^{(\rho)} - x^*$ for ReBlocK-U with $k=2$ on the isosceles triangle problem of Example 3.2, for various values of $\epsilon$ and $\lambda$. Left: for $\lambda=0$ the bias grows without bound as $1/\epsilon \rightarrow \infty$, whereas for any $\lambda > 0$ the bias reaches a fixed maximum value and then decays to zero. Right: for any fixed $\epsilon$ the bias decreases monotonically with $\lambda$.
  • Figure 1: Comparison of methods for calculating natural gradient directions for a small neural network. The network parameters $\theta$ are taken from three snapshots of a single training run: one from the "pre-descent" phase before the loss begins to decrease (top left), one from the "descent" phase during which the loss decreases rapidly (top right), and one from the "post-descent" phase when the decay rate of the loss has slowed significantly (bottom). The algorithms are measured by their progress in reducing the relative residual $\tilde{r} = \left\lVert J x - [f_\theta - f] \right\rVert / \left\lVert f_\theta - f \right\rVert$ for the least-squares problem \ref{eq:ngd_ls}, which measures how well the function-space update direction $J x$ agrees with the function-space loss gradient $f_\theta - f$. The burn-in time is set to $T_b \approx T/100$ in each case, as indicated by the vertical dotted line.
  • Figure 1: Target function for neural network training.
  • ...and 9 more figures

Theorems & Definitions (29)

  • Theorem 3.1
  • Example 3.2: No-go for RBK-U
  • Theorem 4.1
  • Corollary 4.2
  • Theorem 5.1
  • Corollary 5.2
  • Lemma A.1: Convergence of single-iterate expectation
  • Proof 1: Proof of Lemma A.1
  • Lemma A.2
  • Proof 2
  • ...and 19 more