Table of Contents
Fetching ...

Subspace-constrained randomized coordinate descent for linear systems with good low-rank matrix approximations

Jackie Lok, Elizaveta Rebrova

TL;DR

This work introduces SC-RCD, a memory-efficient solver for large PSD linear systems that leverages a cheap, low-rank Nyström approximation computed via RPCholesky to constrain RCD updates to an affine subspace. The subspace constraint acts as an implicit preconditioner, making convergence depend on the spectrum of the residual $\mathbf{A}^{\circ}=\mathbf{A}-\mathbf{A}\langle\mathcal{S}\rangle$ rather than the full matrix spectrum, and enabling robust performance when the original system has large spectral outliers. The authors develop a general subspace-constrained sketch-and-project framework and prove linear convergence under suitable conditions, with concrete complexity bounds that scale favorably when the spectrum decays rapidly. Numerical experiments on synthetic PSD systems and kernel ridge regression problems demonstrate that SC-RCD can outperform standard RCD and competitive solvers while using modest memory, illustrating its practical potential for large-scale, dense linear systems.

Abstract

The randomized coordinate descent (RCD) method is a classical algorithm with simple, lightweight iterations that is widely used for various optimization problems, including the solution of positive semidefinite linear systems. As a linear solver, RCD is particularly effective when the matrix is well-conditioned; however, its convergence rate deteriorates rapidly in the presence of large spectral outliers. In this paper, we introduce the subspace-constrained randomized coordinate descent (SC-RCD) method, in which the dynamics of RCD are restricted to an affine subspace corresponding to a column Nyström approximation, efficiently computed using the recently analyzed RPCholesky algorithm. We prove that SC-RCD converges at a rate that is unaffected by large spectral outliers, making it an effective and memory-efficient solver for large-scale, dense linear systems with rapidly decaying spectra, such as those encountered in kernel ridge regression. Experimental validation and comparisons with related solvers based on coordinate descent and the conjugate gradient method demonstrate the efficiency of SC-RCD. Our theoretical results are derived by developing a more general subspace-constrained framework for the sketch-and-project method. This framework, which may be of independent interest, generalizes popular algorithms such as randomized Kaczmarz and coordinate descent, and provides a flexible, implicit preconditioning strategy for a variety of iterative solvers.

Subspace-constrained randomized coordinate descent for linear systems with good low-rank matrix approximations

TL;DR

This work introduces SC-RCD, a memory-efficient solver for large PSD linear systems that leverages a cheap, low-rank Nyström approximation computed via RPCholesky to constrain RCD updates to an affine subspace. The subspace constraint acts as an implicit preconditioner, making convergence depend on the spectrum of the residual rather than the full matrix spectrum, and enabling robust performance when the original system has large spectral outliers. The authors develop a general subspace-constrained sketch-and-project framework and prove linear convergence under suitable conditions, with concrete complexity bounds that scale favorably when the spectrum decays rapidly. Numerical experiments on synthetic PSD systems and kernel ridge regression problems demonstrate that SC-RCD can outperform standard RCD and competitive solvers while using modest memory, illustrating its practical potential for large-scale, dense linear systems.

Abstract

The randomized coordinate descent (RCD) method is a classical algorithm with simple, lightweight iterations that is widely used for various optimization problems, including the solution of positive semidefinite linear systems. As a linear solver, RCD is particularly effective when the matrix is well-conditioned; however, its convergence rate deteriorates rapidly in the presence of large spectral outliers. In this paper, we introduce the subspace-constrained randomized coordinate descent (SC-RCD) method, in which the dynamics of RCD are restricted to an affine subspace corresponding to a column Nyström approximation, efficiently computed using the recently analyzed RPCholesky algorithm. We prove that SC-RCD converges at a rate that is unaffected by large spectral outliers, making it an effective and memory-efficient solver for large-scale, dense linear systems with rapidly decaying spectra, such as those encountered in kernel ridge regression. Experimental validation and comparisons with related solvers based on coordinate descent and the conjugate gradient method demonstrate the efficiency of SC-RCD. Our theoretical results are derived by developing a more general subspace-constrained framework for the sketch-and-project method. This framework, which may be of independent interest, generalizes popular algorithms such as randomized Kaczmarz and coordinate descent, and provides a flexible, implicit preconditioning strategy for a variety of iterative solvers.

Paper Structure

This paper contains 37 sections, 14 theorems, 86 equations, 5 figures, 2 algorithms.

Key Result

Theorem 1.1

Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be a psd matrix with eigenvalues $\lambda_1(\mathbf{A}) \geq \ldots \geq \lambda_n(\mathbf{A}) \geq 0$, and denote $\log_+(t) = \max \{ \log t, 0 \}$ for $t > 0$. Suppose that for some fixed integer $r \geq 0$ and real $\delta > 0$, the approximation rank Then, the rank-$d$ Nyström approximation $\mathbf{A} \langle \mathcal{S} \rangle$ output by RPChole

Figures (5)

  • Figure 4.1: Solving a synthetic $8,192 \times 8,192$ psd system $\mathbf{A} \mathbf{x} = \mathbf{b}$ with approximate rank $r = 400$. (Left) Relative residual norm $\lVert \mathbf{A} \mathbf{x}^k - \mathbf{b} \rVert_2 / \lVert \mathbf{A} \mathbf{x}^0 - \mathbf{b} \rVert_2$ over 300 epochs for SC-RCD (with $d = 500$ and $\ell = 500$), as well as CG, RCD and CD++ (also with $\ell = 500$), using the same initial iterate as SC-RCD. Each epoch corresponds to a single pass over the entire dataset (i.e., one iteration of SC-RCD/RCD/CD++ corresponds to $\ell / n$ epochs). The lines depict the median over 100 independent runs with the same Nyström approximation. (Right) Eigenvalue spectra of $\mathbf{A}$ and the residual matrix $\mathbf{A}^{\circ}$ corresponding to the rank-$d$ approximation, with their condition numbers $\sum_i \lambda_i / \lambda_{\mathrm{min}}^+$ reported in brackets.
  • Figure 4.3: Solving the KRR problem $(\mathbf{K} + \lambda \mathbf{I}) \mathbf{x} = \mathbf{y}$ on the hls4ml_lhc_jets dataset with $n = 100,000$ samples and a very small amount of regularization $\lambda = 10^{-9} n$. (Top left) Relative residual norm $\lVert (\mathbf{K} + \lambda \mathbf{I}) \mathbf{x}^k - \mathbf{y} \rVert_2 / \lVert \mathbf{y} \rVert_2$ over 100 epochs for SC-RCD (with $d = 1,000$ and $\ell = 1,000$), as well as RCD (also with $\ell = 1,000$), CG, and PCG (also with $d = 1,000$). The lines (resp. shaded interval) depict the median (resp. 0.2- and 0.8-quantiles) over 100 independent runs. (Top right) Error in terms of time elapsed. The kernel matrix $\mathbf{K}$ is not stored in memory, and entry evaluations represent the dominant computational cost. (Bottom left) The leading $20,000$ eigenvalues of $\mathbf{A} = \mathbf{K} + \lambda \mathbf{I}$, the residual $\mathbf{A}^{\circ}$, and the preconditioned $\lambda \mathbf{M}^{-1/2} \mathbf{A} \mathbf{M}^{-1/2}$ corresponding to the rank-$d$ Nyström approximation, with their condition numbers $\sum_i \lambda_i / \lambda_{\mathrm{min}}^+$ reported in brackets. (Bottom right) The error after 50 epochs for SC-RCD and PCG with various approximation ranks $d$.
  • Figure 4.4: Solving the KRR problem $(\mathbf{K} + \lambda \mathbf{I}) \mathbf{x} = \mathbf{y}$ on the sensorless dataset with $n = 20,000$ samples and a small regularization parameter $\lambda = 10^{-7} n$. (Left) Relative residual norm over 300 epochs for SC-RCD (with $d = 1,000$ and $\ell = 1,000$), RCD (also with $\ell = 1,000$), CG, and PCG (also with $d = 1,000$). (Right) Eigenvalue spectra of $\mathbf{A} = \mathbf{K} + \lambda \mathbf{I}$, the residual $\mathbf{A}^{\circ}$, and the preconditioned $\lambda \mathbf{M}^{-1/2} \mathbf{A} \mathbf{M}^{-1/2}$ corresponding to the rank-$d$ Nyström approximation, with their condition numbers $\sum_i \lambda_i / \lambda_{\mathrm{min}}^+$ reported in brackets.
  • Figure C.1: For kernel matrices exhibiting rapid spectral decay, the SC-RCD method is particularly effective: (Left) Convergence trajectories and (right) eigenvalues for the (i) ACSIncome$(p=11)$, (ii) Airlines_DepDelay_1M$(p=9)$, (iii) cod-rna$(p=8)$, and (iv) diamonds$(p=9)$ datasets.
  • Figure C.2: For matrices with slower spectral decay, SC-RCD with uniformly sampled blocks typically works quite well: (Left) Convergence trajectories and (right) eigenvalues for the (i) covtype.binary$(p=54)$, (ii) creditcard$(p=29)$, (iii) HIGGS$(p=28)$, and (iv) SensIT Vehicle$(p=100)$ datasets.

Theorems & Definitions (33)

  • Theorem 1.1: ChenEtAl2024
  • Remark 1.2
  • Theorem 1.3
  • Remark 1.4: Randomized low-rank approximation
  • Remark 1.5: Approximation rank
  • Remark 1.6: Block size and sampling
  • Theorem 1.7
  • Remark 1.8
  • Proposition 1.9
  • Lemma 2.1: Closed-form updates
  • ...and 23 more