Learning to Relax: Setting Solver Parameters Across a Sequence of Linear System Instances

Mikhail Khodak; Edmond Chow; Maria-Florina Balcan; Ameet Talwalkar

TL;DR

The paper presents a principled framework for setting solver parameters across sequences of linear system instances by casting parameter tuning as online learning with bandit feedback. Focusing on SOR, it introduces a surrogate upper bound on iteration counts and develops Tsallis-INF-based bandit algorithms, achieving sublinear regret relative to the best fixed parameter and extending to contextual settings with diagonal shifts and to CG with SSOR preconditioning. It also provides a stochastic analysis for SSOR and a Chebyshev-regression approach (ChebCB) for context-rich diagonal-shift problems, yielding near instance-optimal performance in practice. The results establish end-to-end guarantees for data-driven numerical methods, showing that well-understood learning algorithms can meaningfully speed up high-precision linear solvers in sequential settings, albeit with limitations that motivate future work on broader solver families and non-stationary regimes.
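The bandit loop summarized above can be sketched as follows. This is a minimal illustration, not the paper's algorithm as stated: it assumes a fixed grid of candidate $\omega$ values, costs rescaled to $[0,1]$ by a known cap `max_cost`, importance-weighted loss estimates, a learning rate of $2/\sqrt{t}$, and bisection for the normalization step. The names `tsallis_inf_probs`, `run_bandit`, and `toy_cost` are hypothetical, and `toy_cost` is a synthetic stand-in for a solver's iteration count.

```python
import math
import random

def tsallis_inf_probs(cum_loss, eta, iters=60):
    """Sampling distribution of Tsallis-INF (1/2-Tsallis entropy):
    p_i is proportional to 4 / (eta * (L_i - x))^2, with the normalizer
    x < min(L) found by bisection so that the weights sum to one."""
    K = len(cum_loss)
    m = min(cum_loss)
    lo = m - 2.0 * math.sqrt(K) / eta  # weights sum to at most 1 here
    hi = m - 2.0 / eta                 # weights sum to at least 1 here
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        total = sum(4.0 / (eta * (L - mid)) ** 2 for L in cum_loss)
        if total > 1.0:
            hi = mid
        else:
            lo = mid
    x = 0.5 * (lo + hi)
    w = [4.0 / (eta * (L - x)) ** 2 for L in cum_loss]
    s = sum(w)
    return [wi / s for wi in w]  # renormalize away residual bisection error

def run_bandit(cost_fn, grid, T, max_cost, seed=0):
    """Pick one parameter from `grid` per instance, observing only its cost."""
    rng = random.Random(seed)
    K = len(grid)
    cum_loss = [0.0] * K   # importance-weighted cumulative loss estimates
    counts = [0] * K
    total_cost = 0.0
    for t in range(1, T + 1):
        probs = tsallis_inf_probs(cum_loss, eta=2.0 / math.sqrt(t))
        i = rng.choices(range(K), weights=probs)[0]
        cost = cost_fn(t, grid[i])            # bandit feedback: one number
        total_cost += cost
        loss = min(cost / max_cost, 1.0)      # rescale to [0, 1]
        cum_loss[i] += loss / probs[i]        # unbiased estimate for arm i
        counts[i] += 1
    return counts, total_cost

def toy_cost(t, omega):
    """Hypothetical cost curve with its minimum at omega = 1.7."""
    return 20 + 200 * abs(omega - 1.7)

if __name__ == "__main__":
    grid = [1.0, 1.3, 1.7, 1.9]
    counts, total = run_bandit(toy_cost, grid, T=2000, max_cost=200.0)
    print(dict(zip(grid, counts)), total)
```

In a real deployment `cost_fn` would call the solver on the instance arriving at step `t` and return its iteration count; the grid resolution and the cost cap are tuning choices this sketch does not address.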

Abstract

Solving a linear system $Ax=b$ is a fundamental scientific computing primitive for which numerous solvers and preconditioners have been developed. These come with parameters whose optimal values depend on the system being solved and are often impossible or too expensive to identify; thus in practice sub-optimal heuristics are used. We consider the common setting in which many related linear systems need to be solved, e.g. during a single numerical simulation. In this scenario, can we sequentially choose parameters that attain a near-optimal overall number of iterations, without extra matrix computations? We answer in the affirmative for Successive Over-Relaxation (SOR), a standard solver whose parameter $\omega$ has a strong impact on its runtime. For this method, we prove that a bandit online learning algorithm, using only the number of iterations as feedback, can select parameters for a sequence of instances such that the overall cost approaches that of the best fixed $\omega$ as the sequence length increases. Furthermore, when given additional structural information, we show that a contextual bandit method asymptotically achieves the performance of the instance-optimal policy, which selects the best $\omega$ for each instance. Our work provides the first learning-theoretic treatment of high-precision linear system solvers and the first end-to-end guarantees for data-driven scientific computing, demonstrating theoretically the potential to speed up numerical methods using well-understood learning algorithms.
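To make the role of $\omega$ concrete, the following minimal sketch runs textbook SOR on a small system and returns the iteration count, the only feedback the abstract says the learner uses. The test matrix (a 1D discrete Laplacian with an optional diagonal shift), the relative-residual stopping rule, the zero initial guess, and the helper names `laplacian_1d` and `sor_iterations` are illustrative assumptions rather than details taken from the paper.

```python
import math

def laplacian_1d(n, shift=0.0):
    """Dense tridiagonal 1D discrete Laplacian plus shift * I (list of lists)."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 2.0 + shift
        if i > 0:
            A[i][i - 1] = -1.0
        if i < n - 1:
            A[i][i + 1] = -1.0
    return A

def residual_norm(A, x, b):
    """Euclidean norm of b - A x."""
    return math.sqrt(sum(
        (bi - sum(aij * xj for aij, xj in zip(row, x))) ** 2
        for row, bi in zip(A, b)
    ))

def sor_iterations(A, b, omega, eps=1e-8, max_iter=5000):
    """Run SOR from x = 0; return the number of sweeps until
    ||b - A x|| <= eps * ||b||, or max_iter if that is never reached."""
    n = len(b)
    x = [0.0] * n
    b_norm = math.sqrt(sum(bi * bi for bi in b))
    for k in range(1, max_iter + 1):
        for i in range(n):
            # Gauss-Seidel value for coordinate i, using already-updated entries
            sigma = sum(A[i][j] * x[j] for j in range(n) if j != i)
            gs = (b[i] - sigma) / A[i][i]
            # Relaxation: blend the old value and the Gauss-Seidel value
            x[i] = (1.0 - omega) * x[i] + omega * gs
        if residual_norm(A, x, b) <= eps * b_norm:
            return k
    return max_iter

if __name__ == "__main__":
    A = laplacian_1d(20)
    b = [1.0] * 20
    for omega in (1.0, 1.4, 1.7, 1.9):
        print(omega, sor_iterations(A, b, omega))
```

Setting `omega = 1.0` recovers Gauss-Seidel, so comparing its count against the other values shows how strongly the relaxation parameter affects the cost on this toy problem.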

Paper Structure (35 sections, 29 theorems, 50 equations, 6 figures, 8 algorithms)

Introduction
Core contributions
Technical and theoretical contributions
Related work and comparisons
Asymptotic analysis of learning the relaxation parameter
Setup
Establishing a surrogate upper bound
Performing as well as the best fixed $\omega$
The diagonally shifted setting
Tuning preconditioned conjugate gradient
A stochastic analysis of symmetric SOR
Regularity of the expected cost function
Chebyshev regression for diagonal shifts
Conclusion and limitations
Related work and comparisons
...and 20 more sections

Key Result

Lemma 2.1

Define $U(\omega)=1+\frac{-\log\varepsilon}{-\log(\rho(\mathbf{C}_\omega)+\tau(1-\rho(\mathbf{C}_\omega)))}$, $\alpha=\tau+(1-\tau)\max\{\beta^2,\omega_{\textup{\tiny max}}-1\}$, and $\omega^\ast=1+\beta^2/(1+\sqrt{1-\beta^2})^2$, where $\beta=\rho(\mathbf{I}_n-\mathbf{D}^{-1}\mathbf{A})$. Then the
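The three quantities defined in the lemma can be evaluated directly once their inputs are known. The sketch below is only a transcription of those formulas into Python: $\rho(\mathbf{C}_\omega)$, $\tau$, $\beta$, and $\omega_{\max}$ are passed in as plain numbers because this excerpt does not say how they are computed, and the function names are hypothetical.

```python
import math

def omega_star(beta):
    """omega* = 1 + beta^2 / (1 + sqrt(1 - beta^2))^2, for 0 <= beta < 1."""
    return 1.0 + beta ** 2 / (1.0 + math.sqrt(1.0 - beta ** 2)) ** 2

def alpha(tau, beta, omega_max):
    """alpha = tau + (1 - tau) * max(beta^2, omega_max - 1)."""
    return tau + (1.0 - tau) * max(beta ** 2, omega_max - 1.0)

def surrogate_U(rho, tau, eps):
    """U = 1 + (-log eps) / (-log(rho + tau * (1 - rho))).
    Requires 0 < eps < 1 and 0 < rho + tau * (1 - rho) < 1."""
    return 1.0 + (-math.log(eps)) / (-math.log(rho + tau * (1.0 - rho)))

if __name__ == "__main__":
    beta = 0.9
    print(omega_star(beta), alpha(0.1, beta, 1.9), surrogate_U(0.8, 0.1, 1e-8))
```

Expanding the fraction shows that $\omega^\ast$ as written equals $2/(1+\sqrt{1-\beta^2})$, the classical optimal relaxation parameter, which gives a quick sanity check on any implementation.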

Figures (6)

Figure 1: Left: comparison of different cost estimates. Center-left: mean performance of different parameters across forty instances of form $\mathbf{A}+\frac{12c-3}{20}\mathbf{I}_n$, where $c\sim$ Beta$(2,6)$. Center-right: the same but for $c\sim$ Beta$(1/2,3/2)$, which is relatively higher-variance. In both cases the dashed line indicates instance-optimal performance, the matrix $\mathbf{A}$ is a discrete Laplacian of a $100\times100$ square domain, and the targets $\mathbf{b}$ are truncated Gaussians. Right: asymptoticity as measured by the difference between the spectral norm at iteration $k$ and the spectral radius, together with its upper bound $\tau(1-\rho(\mathbf{C}_\omega))$.
Figure 2: Left: solver cost for $\mathbf{b}$ drawn from a truncated Gaussian vs. $\mathbf{b}$ a small eigenvector of $\mathbf{C}_{1.4}$. Center-left: cost to solve 5K diagonally shifted systems $\mathbf{A}_t=\mathbf{A}+\frac{12c_t-3}{20}\mathbf{I}_n$ for $c_t\sim$ Beta$(2,6)$. Center-right: total SSOR-preconditioned CG iterations taken while solving the 2D heat equation with a time-varying diffusion coefficient (used as context) on different grids, as a function of the linear system dimension. Right: (smoothed) parameters chosen at each timestep of one such simulation, overlaid on a contour plot of the cost of solving the system at step $t$ with parameter $\omega$ (cf. Appendix \ref{app:experimental-details}).
Figure 3: Values of $\tau$ and $\beta$ for $\mathbf{A}+c\mathbf{I}_n$ for different $c$.
Figure 4: Comparison of actual cost of running SSOR-preconditioned CG and the upper bounds computed in Section \ref{app:cg} as functions of the tuning parameter $\omega\in[2\sqrt2-2,1.9]$ on various domains.
Figure 5: Average across forty trials of the time needed to solve 5K diagonally shifted systems with $\mathbf{A}_t=\mathbf{A}+\frac{12c-3}{20}\mathbf{I}_n$ for $c\sim$ Beta$(\frac{1}{2},\frac{3}{2})$ (center) and $c\sim$ Beta$(2,6)$ (otherwise).
...and 1 more figures

Theorems & Definitions (53)

Remark 1.1
Lemma 2.1
Theorem 2.1
Theorem 2.2: cf. Theorem \ref{thm:contextual-tsallis-inf}
Theorem 2.3
Theorem 3.1
Theorem 3.2: Corollary of Theorem \ref{thm:squarecb-ftl-general}
Corollary A.1
proof
Corollary A.2
...and 43 more
