An Additively Preconditioned Trust Region Strategy for Machine Learning

Samuel Cruz Alegría, Bindi Çapriqi, Shega Likaj, Ken Trotti, Rolf Krause

TL;DR

The paper tackles the challenge of training large, nonconvex neural networks by introducing a nonlinear right-preconditioned Additively Preconditioned Trust-Region Strategy (APTS) that leverages Additive Schwarz domain decomposition to perform parallel local updates. By constructing local subproblems with first-order consistency and aggregating their corrections within a global Trust-Region framework, APTS achieves robust convergence with reduced hyperparameter tuning. An Inexact APTS (IAPTS) variant further reduces computational cost by caching intermediate computations and using a constrained Adam optimizer within subdomains, while still benefiting from a global TR safeguard. Numerical experiments on MNIST and CIFAR-10 show competitive performance without hyperparameter tuning, with larger subdomain counts improving accuracy and efficiency, highlighting the method’s potential for scalable, parallel ML optimization.

Abstract

Modern machine learning, especially the training of deep neural networks, depends on solving large-scale, highly nonconvex optimization problems, whose objective functions exhibit a rough landscape. Motivated by the success of parallel preconditioners in the context of Krylov methods for large-scale linear systems, we introduce a novel nonlinearly preconditioned Trust-Region method that makes use of an additive Schwarz correction at each minimization step, thereby accelerating convergence. More precisely, we propose a variant of the Additively Preconditioned Trust-Region Strategy (APTS), which combines a right-preconditioned additive Schwarz framework with a classical Trust-Region algorithm. By decomposing the parameter space into sub-domains, APTS solves local non-linear sub-problems in parallel and assembles their corrections additively. The resulting method not only shows fast convergence; due to the underlying Trust-Region strategy, it furthermore largely obviates the need for hyperparameter tuning.
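The idea of decomposing the parameters, computing local corrections, summing them, and safeguarding the result with a Trust-Region test can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions, not the authors' implementation: the objective `f`, the function names `local_correction` and `apts_sketch`, the use of a few clipped gradient steps as the local solver, the first-order predicted reduction, and all constants (`lr`, `eta`, the radius update factors) are hypothetical choices made for illustration. The sketch runs the local solves sequentially and omits the global Trust-Region step that a full method would typically perform after the preconditioning step.

```python
import numpy as np

def f(x):
    # Toy nonconvex objective (assumption): double wells plus neighbour coupling.
    return 0.25 * np.sum((x**2 - 1.0) ** 2) + 0.5 * np.sum(np.diff(x) ** 2)

def grad(x):
    g = (x**2 - 1.0) * x          # gradient of the double-well terms
    d = np.diff(x)                # d[i] = x[i+1] - x[i]
    g[:-1] -= d                   # coupling contribution to x[i]
    g[1:] += d                    # coupling contribution to x[i+1]
    return g

def local_correction(x, idx, delta, steps=5, lr=0.1):
    # Local subproblem: minimize f over the block `idx` only, with all other
    # parameters frozen. At s = 0 its gradient equals the restricted global
    # gradient, so the local model is first-order consistent by construction.
    s = np.zeros_like(x)
    for _ in range(steps):
        s[idx] -= lr * grad(x + s)[idx]
        s[idx] = np.clip(s[idx], -delta, delta)   # stay inside the local region
    return s

def apts_sketch(x0, n_sub=2, delta=0.5, iters=50, eta=0.1):
    x = np.asarray(x0, dtype=float).copy()
    blocks = np.array_split(np.arange(x.size), n_sub)   # non-overlapping subdomains
    history = [f(x)]
    for _ in range(iters):
        g = grad(x)
        # Additive assembly: the local solves are independent of each other
        # and could run in parallel; here they run one after another.
        s = sum(local_correction(x, idx, delta) for idx in blocks)
        pred = -(g @ s)                    # first-order predicted reduction
        actual = f(x) - f(x + s)
        rho = actual / pred if pred > 0 else -np.inf
        if rho >= eta:                     # accept the assembled correction
            x = x + s
            if rho >= 0.75:
                delta = min(2.0 * delta, 1.0)
        else:                              # reject and shrink the radius
            delta *= 0.5
        history.append(f(x))
    return x, history

x_final, losses = apts_sketch(np.linspace(-2.0, 2.0, 8), n_sub=4)
```

Because a correction is accepted only when the actual reduction is a positive fraction of the predicted one, the recorded loss never increases in this sketch, and the radius `delta` adapts on its own rather than being tuned by hand, which loosely mirrors the role the abstract attributes to the Trust-Region safeguard.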

Paper Structure

This paper contains 12 sections, 22 equations, 3 figures, 2 algorithms.

Figures (3)

  • Figure 1: Graphical representation of the pipelined NN across 3 GPUs (a), and the decoupled subdomains (b).
  • Figure 2: Average training accuracy (left axis) and loss (right axis) over 100 epochs with a batch size of 10 000 on the MNIST dataset. Solid lines denote the mean across five runs.
  • Figure 3: Average training accuracy (left axis) and loss (right axis) over 25 epochs with a batch size of 200 on the CIFAR-10 dataset. Solid lines denote the mean across five runs.