Beyond adaptive gradient: Fast-Controlled Minibatch Algorithm for large-scale optimization

Corrado Coppola, Lorenzo Papa, Irene Amerini, Laura Palagi

TL;DR

This paper addresses the memory and theoretical limitations of adaptive gradient methods in large-scale deep learning by introducing the Fast-Controlled Mini-batch Algorithm (F-CMA), which combines random reshuffling with a sufficient decrease condition and a derivative-free line search to ensure loss reduction per epoch. The method comes with a deterministic global convergence guarantee to a stationary point for smooth, possibly non-convex objectives, and it limits computational overhead because the line search requires at most two full evaluations of the true objective per epoch. Empirically, F-CMA outperforms several baselines on CIFAR-10/100 across CNNs and a vision transformer, achieving up to 68% faster training, up to 20% higher per-epoch efficiency, and up to 5% gains in accuracy, while remaining robust to hyper-parameter settings. The method requires no architectural changes, and the work lays groundwork for extending fast-controlled minibatching to broader deep learning tasks and architectures.

Abstract

Adaptive gradient methods have been increasingly adopted by the deep learning community due to their fast convergence and reduced sensitivity to hyper-parameters. However, these methods come with limitations, such as increased memory requirements for elements like moving averages and a poorly understood convergence theory. To overcome these challenges, we introduce F-CMA, a Fast-Controlled Mini-batch Algorithm: a random reshuffling method featuring a sufficient decrease condition and a line-search procedure to ensure loss reduction per epoch, along with a deterministic proof of its global convergence to a stationary point. To evaluate F-CMA, we integrate it into conventional training protocols for classification tasks involving both convolutional neural networks and vision transformer models, allowing for a direct comparison with popular optimizers. Computational tests show significant improvements, including a decrease in overall training time of up to 68%, an increase in per-epoch efficiency of up to 20%, and an increase in model accuracy of up to 5%.
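The general idea of a reshuffled epoch guarded by a sufficient decrease test can be illustrated with a small sketch. This is not the paper's algorithm: the test `f_trial <= f_w - gamma * ||d||^2`, the parameters `gamma` and `shrink`, the single shortened step used as a fallback, and the toy linear-regression problem are all assumptions made for illustration. The sketch only mirrors two properties stated above: the full objective never increases from one epoch to the next, and at most two full evaluations are spent per epoch.

```python
import random

def full_loss(w, data):
    """Mean squared error of the model y ~ w[0]*x + w[1] over all samples."""
    return sum((w[0] * x + w[1] - y) ** 2 for x, y in data) / len(data)

def sample_grad(w, x, y):
    """Gradient of the squared error on a single sample."""
    r = w[0] * x + w[1] - y
    return [2.0 * r * x, 2.0 * r]

def controlled_epoch(w, f_w, data, step, rng, gamma=1e-4, shrink=0.5):
    """One reshuffled pass plus a safeguard on the full objective.

    Hypothetical helper, not the authors' procedure.
    Returns (new_w, new_f, new_step, number_of_full_evaluations).
    """
    order = list(range(len(data)))
    rng.shuffle(order)  # random reshuffling: every sample is used once per epoch
    trial = list(w)
    for i in order:
        g = sample_grad(trial, *data[i])
        trial = [t - step * gi for t, gi in zip(trial, g)]

    d = [t - wi for t, wi in zip(trial, w)]  # net displacement of the epoch
    d_sq = sum(di * di for di in d)

    f_trial = full_loss(trial, data)  # first full evaluation
    if f_trial <= f_w - gamma * d_sq:  # assumed sufficient decrease test
        return trial, f_trial, step, 1

    # Fallback: try one shorter step along d, using function values only.
    half = [wi + shrink * di for wi, di in zip(w, d)]
    f_half = full_loss(half, data)  # second (and last) full evaluation
    if f_half <= f_w - gamma * (shrink ** 2) * d_sq:
        return half, f_half, step * shrink, 2

    # Reject the epoch: keep the old point and reduce the learning rate.
    return list(w), f_w, step * shrink, 2

if __name__ == "__main__":
    rng = random.Random(0)
    xs = [-1 + 2 * i / 49 for i in range(50)]
    data = [(x, 3 * x + 1 + rng.gauss(0, 0.1)) for x in xs]
    w, step = [0.0, 0.0], 0.05
    f = full_loss(w, data)
    for epoch in range(10):
        w, f, step, evals = controlled_epoch(w, f, data, step, rng)
        print(epoch, round(f, 4), step, evals)
```

Because the fallback relies only on objective values, no full gradient is ever computed; the price is one or two extra passes over the data per epoch to evaluate the full loss.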

Paper Structure

This paper contains 13 sections, 6 theorems, 54 equations, 2 tables, 3 algorithms.

Key Result

Proposition 1

Assume that the sequence of points $\{w^k\}$ produced by Algorithm alg:innercycle1 is bounded and that $\lim_{k\rightarrow\infty}\zeta^k = 0$. Then, for any limit point $\bar{w}$ of $\{w^k\}$, a subset of indices $K$ exists such that

Theorems & Definitions (12)

  • Proposition 1: Convergence assuming $\{w^k\}$ bounded
  • proof
  • Proposition 2: DFL well-defined
  • proof
  • Lemma 3: $| \tilde{f}^{k+1} - f(w^k) |$ bounded
  • proof
  • Proposition 4: $\{w^k\}$ bounded
  • proof
  • Proposition 5: $\zeta^k \rightarrow 0$
  • proof
  • ...and 2 more