Effectively Leveraging Momentum Terms in Stochastic Line Search Frameworks for Fast Optimization of Finite-Sum Problems

Matteo Lapucci, Davide Pucci

TL;DR

The paper tackles fast optimization of large-scale finite-sum problems by integrating momentum into stochastic line search frameworks through a mini-batch persistency strategy. The authors develop a data-persistent, conjugate-gradient-style framework (MBCG-DP) with safeguards and derive convergence results under interpolation and the Polyak-Łojasiewicz condition, supported by a bias-corrected gradient estimator when persistency is used. Empirically, the proposed MBCG_FR variant achieves state-of-the-art performance on convex problems and competitive results on nonconvex deep-learning tasks, particularly at larger batch sizes. The work highlights the practical benefits of data overlap between consecutive mini-batches and offers a foundation for further study of how momentum and line search interact in scalable optimization.

Abstract

In this work, we address unconstrained finite-sum optimization problems, with particular focus on instances originating in large-scale deep learning scenarios. Our main interest lies in exploring the relationship between recent line search approaches for stochastic optimization in the overparametrized regime and momentum directions. First, we point out that combining these two elements with computational benefits is not straightforward. To address this, we propose a solution based on mini-batch persistency. We then introduce an algorithmic framework that exploits a mix of data persistency, conjugate-gradient type rules for the definition of the momentum parameter, and stochastic line searches. The resulting algorithm provably possesses convergence properties under suitable assumptions and is empirically shown to outperform other popular methods from the literature, obtaining state-of-the-art results in both convex and nonconvex large-scale training problems.
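
To make the idea of mini-batch persistency concrete, the following is a minimal sketch of a sampler in which each mini-batch reuses a fixed fraction of the previous one. The function name `persistent_batches`, its parameters, and the choice to pick the retained indices uniformly at random from the previous batch are all assumptions made for illustration; the paper's own sampling scheme may differ.

```python
import random


def persistent_batches(n_samples, batch_size, persistency, n_batches, seed=0):
    """Yield lists of sample indices with overlap between consecutive batches.

    `persistency` is the fraction of each batch carried over from the
    previous one (hypothetical parameter name, e.g. 0.5 for 50% overlap).
    """
    rng = random.Random(seed)
    n_keep = int(round(persistency * batch_size))
    previous = rng.sample(range(n_samples), batch_size)
    yield previous
    for _ in range(n_batches - 1):
        # Retained part: indices reused from the previous mini-batch.
        retained = rng.sample(previous, n_keep)
        retained_set = set(retained)
        # Fresh part: drawn from the indices outside the retained set.
        pool = [i for i in range(n_samples) if i not in retained_set]
        fresh = rng.sample(pool, batch_size - n_keep)
        previous = retained + fresh
        yield previous


batches = list(persistent_batches(n_samples=100, batch_size=8,
                                  persistency=0.5, n_batches=5))
```

With `persistency=0.5`, every batch shares at least half of its indices with its predecessor, which is the kind of data overlap the abstract refers to.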

Paper Structure

This paper contains 13 sections, 3 theorems, 31 equations, 6 figures, 2 tables, 2 algorithms.

Key Result

Proposition 5.1

Let $\mathcal{R}_{k-1}\subset\{1,\ldots,N\}$ and let $\mathcal{S}_k$ be a random subset of $\bar{\mathcal{R}}_{k-1} = \{1,\ldots,N\}\setminus\mathcal{R}_{k-1}$. Let $\zeta_k=\frac{N-|\mathcal{R}_{k-1}|}{|\mathcal{S}_{k}|}$. Then the quantities $f_k(x^k)$ and $g_k(x^k)$ are conditionally unbiased estimators of $f(x^k)$ and $\nabla f(x^k)$ given $x^k$, i.e., $\mathbb{E}_k[f_k(x^k)] = f(x^k)$ and $\mathbb{E}_k[g_k(x^k)] = \nabla f(x^k)$.
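
The role of $\zeta_k$ can be checked numerically. The sketch below assumes, since the defining formulas are not shown here, that $f(x) = \frac{1}{N}\sum_{i=1}^N f_i(x)$ and that the estimator has the form $f_k(x) = \frac{1}{N}\big(\sum_{i\in\mathcal{R}_{k-1}} f_i(x) + \zeta_k \sum_{i\in\mathcal{S}_k} f_i(x)\big)$, with $\mathcal{S}_k$ drawn uniformly among subsets of a fixed size. Both the form and the helper names are illustrative assumptions. The per-sample values $f_i(x^k)$ are replaced by arbitrary numbers, and the expectation is computed exactly by enumerating every possible $\mathcal{S}_k$.

```python
from fractions import Fraction
from itertools import combinations


def corrected_estimate(values, retained, fresh):
    """Assumed bias-corrected estimate: fresh terms are reweighted by zeta."""
    n = len(values)
    zeta = Fraction(n - len(retained), len(fresh))
    total = sum(values[i] for i in retained)
    total += zeta * sum(values[i] for i in fresh)
    return total / n


def naive_estimate(values, retained, fresh):
    """Plain mini-batch average, ignoring that `retained` is not random."""
    batch = list(retained) + list(fresh)
    return sum(values[i] for i in batch) / len(batch)


# Stand-ins for f_i(x^k), i = 1..N (arbitrary illustrative numbers).
values = [Fraction(v) for v in (3, -1, 4, 1, -5, 9, 2, 6)]
retained = (0, 5)  # fixed indices playing the role of R_{k-1}
complement = [i for i in range(len(values)) if i not in retained]

# Enumerate every equally likely choice of S_k with |S_k| = 3.
draws = list(combinations(complement, 3))
mean_corrected = sum(corrected_estimate(values, retained, s) for s in draws) / len(draws)
mean_naive = sum(naive_estimate(values, retained, s) for s in draws) / len(draws)
full_value = sum(values) / len(values)
```

Under these assumptions `mean_corrected` equals `full_value` exactly, while `mean_naive` does not. Because the same weights would apply term by term to the gradients $\nabla f_i(x^k)$, the argument carries over to $g_k$ by linearity.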

Figures (6)

  • Figure 1: Average angle per mini-batch GD epoch (in degrees) between the momentum term and the negative stochastic gradient we would obtain for a new mini-batch with 0%, 25%, 50%, 75% and 100% persistency.
  • Figure 2: Effect of a 50% mini-batch persistency with state-of-the-art algorithms (Minibatch-GD with momentum, Adam, PoNoS, MSL-SGDM) applied to the problem of training a multi-layer perceptron on the MNIST dataset. Time is specified in seconds.
  • Figure 3: Behavior of MBCG-DP and its unbiased variant when employed to train a multi-layer perceptron on MNIST dataset with batch size of 128 and 512.
  • Figure 4: Behavior of MBCG-DP when employed to train a kernel classifier on ijcnn and a multi-layer perceptron on MNIST dataset. The left column refers to the comparison of the FR, HS and PPR rules for selecting $\beta_k$; the central column refers to the comparison of rules for selecting $\alpha_0^k$ (constant, heuristic, SPS); the right column refers to different safeguard strategies for ensuring a descent direction is used (clipping, gradient, inversion, subspace).
  • Figure 5: Comparison of the proposed MBCG_FR algorithm with other algorithms in the literature. The training loss over time is shown for each problem.
  • ...and 1 more figure
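
The FR, HS and PPR rules named in the Figure 4 caption are, in their textbook deterministic form, the Fletcher-Reeves, Hestenes-Stiefel and nonnegative Polak-Ribière-Polyak conjugate-gradient formulas. The sketch below implements those classical formulas on plain Python lists; the stochastic variants used in the paper may differ in detail. The fallback to the negative gradient is only one plausible reading of the "gradient" safeguard in the caption, and all function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def beta_fr(g, g_prev, d_prev):
    """Fletcher-Reeves: ||g_k||^2 / ||g_{k-1}||^2."""
    return dot(g, g) / dot(g_prev, g_prev)


def beta_ppr(g, g_prev, d_prev):
    """Polak-Ribiere-Polyak, clipped at zero (often written PR+)."""
    y = [a - b for a, b in zip(g, g_prev)]
    return max(0.0, dot(g, y) / dot(g_prev, g_prev))


def beta_hs(g, g_prev, d_prev):
    """Hestenes-Stiefel: g_k^T y / d_{k-1}^T y, with y = g_k - g_{k-1}."""
    y = [a - b for a, b in zip(g, g_prev)]
    return dot(g, y) / dot(d_prev, y)


def momentum_direction(g, g_prev, d_prev, beta_rule):
    """Return d_k = -g_k + beta_k * d_{k-1}, with a descent safeguard.

    Assumption: if d_k is not a descent direction (g_k^T d_k >= 0),
    fall back to the negative gradient. Denominators are assumed nonzero.
    """
    beta = beta_rule(g, g_prev, d_prev)
    d = [-gi + beta * di for gi, di in zip(g, d_prev)]
    if dot(g, d) >= 0.0:
        d = [-gi for gi in g]
    return d


g, g_prev, d_prev = [1.0, 2.0], [2.0, 0.0], [-2.0, 0.0]
d_fr = momentum_direction(g, g_prev, d_prev, beta_fr)
```

The descent check matters in a line search setting because an Armijo-type backtracking procedure is only guaranteed to terminate along a direction that has a negative inner product with the current gradient estimate.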

Theorems & Definitions (8)

  • Remark 1
  • Proposition 5.1
  • proof
  • Remark 2
  • Proposition 5.2
  • proof
  • Theorem 5.3
  • proof