Training in reverse: How iteration order influences convergence and stability in deep learning

Benoit Dherin, Benny Avelin, Anders Karlsson, Hanna Mazzawi, Javier Gonzalvo, Michael Munn

TL;DR

The paper investigates how the order in which gradient updates are applied affects convergence and stability in deep learning under a constant learning rate and small batches. It introduces a backward contraction principle showing that backward SGD converges to a fixed point in contractive regions, and it connects this point convergence to forward SGD's stationary distribution. Two explicit SGD examples and extensive experiments across CNNs and MLPs demonstrate that backward trajectories are more stable and converge faster, while forward trajectories oscillate or converge to distributions. The authors also propose practical approximations and potential applications, such as windowed backward updates and Lie-bracket corrections, highlighting a new avenue for leveraging iteration order to improve training dynamics.

Abstract

Despite exceptional achievements, training neural networks remains computationally expensive and is often plagued by instabilities that can degrade convergence. While learning rate schedules can help mitigate these issues, finding optimal schedules is time-consuming and resource-intensive. This work explores theoretical issues concerning training stability in the constant-learning-rate (i.e., without schedule) and small-batch-size regime. Surprisingly, we show that the order of gradient updates affects stability and convergence in gradient-based optimizers. We illustrate this new line of thinking using backward-SGD, which processes batch gradient updates like SGD but in reverse order. Our theoretical analysis shows that in contractive regions (e.g., around minima) backward-SGD converges to a point while the standard forward-SGD generally only converges to a distribution. This leads to improved stability and convergence which we demonstrate experimentally. While full backward-SGD is computationally intensive in practice, it highlights opportunities to exploit reverse training dynamics (or more generally alternate iteration orders) to improve training. To our knowledge, this represents a new and unexplored avenue in deep learning optimization.
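
The difference between the two iteration orders can be written compactly. The notation below is an assumption introduced for illustration and is not taken from the paper: $T_i$ denotes the SGD update on the $i$-th batch $B_i$, $h$ the constant learning rate, $L_{B_i}$ the loss on that batch, and $\theta_0$ the initialization.

$$
T_i(\theta) = \theta - h\,\nabla L_{B_i}(\theta), \qquad
\theta_n^{\mathrm{fwd}} = T_n \circ T_{n-1} \circ \cdots \circ T_1(\theta_0), \qquad
\theta_n^{\mathrm{bwd}} = T_1 \circ T_2 \circ \cdots \circ T_n(\theta_0).
$$

In the forward order the newest update $T_n$ is applied last, to the current training state. In the backward order it is applied first, to the initialization, and all older updates then act on its result.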

Paper Structure

This paper contains 37 sections, 6 theorems, 71 equations, 19 figures.

Key Result

Theorem 2.2

(Backward contraction mappings principle) Let $T_{i}$ be a sequence of continuous self-maps of a complete metric space. Assume $T_i$'s are uniform contractions, with a certain $k<1$ in common, and for some $\theta$ there is a constant $D$ such that, Then for any $\theta \in \Omega$ the backward iterates converge to a point $\theta^*$ as $n\rightarrow \infty$. Moreover, the convergence rate is ex
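
A small numerical sketch may help make the statement concrete. It uses random affine contractions $T_i(x) = kx + b_i$ on the real line with a shared $k = 0.5$. The maps, the function names, and the bound $|b_i| \le 1$ are toy assumptions chosen for illustration; the bound stands in for the condition on $D$, which is not reproduced above, and is not the paper's hypothesis.

```python
import random

def make_maps(n, k=0.5, seed=0):
    """Random affine contractions T_i(x) = k*x + b_i with |b_i| <= 1 (toy assumption)."""
    rng = random.Random(seed)
    return [(k, rng.uniform(-1.0, 1.0)) for _ in range(n)]

def forward_iterate(maps, x0):
    """Apply T_1 first and T_n last: T_n(...T_1(x0))."""
    x = x0
    for k, b in maps:
        x = k * x + b
    return x

def backward_iterate(maps, x0):
    """Apply T_n first and T_1 last: T_1(...T_n(x0))."""
    x = x0
    for k, b in reversed(maps):
        x = k * x + b
    return x

maps = make_maps(60)
fwd = [forward_iterate(maps[:n], 0.0) for n in range(1, 61)]
bwd = [backward_iterate(maps[:n], 0.0) for n in range(1, 61)]

# Largest change between consecutive iterates over the last 20 steps.
fwd_late = max(abs(fwd[i + 1] - fwd[i]) for i in range(39, 59))
bwd_late = max(abs(bwd[i + 1] - bwd[i]) for i in range(39, 59))
print(f"largest late forward step:  {fwd_late:.2e}")
print(f"largest late backward step: {bwd_late:.2e}")
```

In this toy setting the backward iterate changes by exactly $k^{n} b_{n+1}$ when the $(n+1)$-th map is added, so its steps shrink geometrically, whereas each forward step injects a fresh offset $b_n$ that is never damped.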

Figures (19)

  • Figure 1: Naive implementation of the backward dynamics: Forward iterations (left) and backward iterations (right). The training steps are represented by Pac-men consuming batches. Forward iterations maintain a training state and consume a new batch at each step, while backward iterations restart the training and consume all the batches received so far in reverse order.
  • Figure 2: Backward SGD exhibits decreased variance and increased stability compared to forward SGD for a ResNet-18 model trained on CIFAR-10. The additional seeds are in the appendix.
  • Figure 3: Backward SGD converges toward a different minimum after the initialization point is reset at step 1000 ("intermittent backward"), while forward SGD oscillates between minima, for an MLP trained on FashionMNIST. Top: On the first seed, backward SGD switches from a higher test-performance trajectory to a lower test-performance trajectory at the reset step 1000. Bottom: On the second seed, backward SGD instead switches from a trajectory converging to a lower test-performance point to one converging to a higher test-performance point. The other seeds, including all the learning curves, can be found in the appendix.
  • Figure 4: Decreased variance and increased stability in train (left) and test (right) losses for backward SGD compared to forward SGD for all 5 seeds. The data was sampled from $f(x) = x^2$ and the training performed with batch size 1 and learning rate 0.05 for 1400 steps.
  • Figure 5: Decreased variance and increased stability in train (left) and test (right) losses for backward SGD compared to forward SGD for all 5 seeds. The data was sampled from $f(x) = \cos(10x)$ and the training performed with batch size 1 and learning rate 0.02 for 1400 steps.
  • ...and 14 more figures
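
The naive scheme described in the Figure 1 caption can be sketched as follows. This is a minimal illustration on a hypothetical one-parameter least-squares problem with batch size 1; the model, data, learning rate, and function names are all assumptions made for the example and do not reproduce the paper's experiments.

```python
import random

def sgd_step(w, batch, lr):
    """One SGD step on the squared error 0.5*(w*x - y)**2 for a single example."""
    x, y = batch
    return w - lr * (w * x - y) * x

def forward_sgd(w0, batches, lr):
    """Standard SGD: keep a running state and consume each new batch once."""
    w, traj = w0, []
    for batch in batches:
        w = sgd_step(w, batch, lr)
        traj.append(w)
    return traj

def naive_backward_sgd(w0, batches, lr):
    """Naive backward SGD: at step n, restart from w0 and consume batches n, n-1, ..., 1."""
    traj = []
    for n in range(1, len(batches) + 1):
        w = w0
        for batch in reversed(batches[:n]):
            w = sgd_step(w, batch, lr)
        traj.append(w)
    return traj

rng = random.Random(0)
# Hypothetical noisy data y = 2x + noise, one example per batch.
batches = []
for _ in range(300):
    x = rng.uniform(0.5, 1.5)
    batches.append((x, 2.0 * x + rng.gauss(0.0, 0.5)))

fwd = forward_sgd(0.0, batches, lr=0.2)
bwd = naive_backward_sgd(0.0, batches, lr=0.2)

def spread(values):
    """Range of a trajectory segment, used as a crude measure of oscillation."""
    return max(values) - min(values)

print("forward spread over last 100 steps: ", spread(fwd[-100:]))
print("backward spread over last 100 steps:", spread(bwd[-100:]))
```

Because every backward step replays all batches seen so far, the naive scheme costs a number of updates that grows quadratically with the number of steps, which is consistent with the abstract's remark that full backward-SGD is computationally intensive.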

Theorems & Definitions (14)

  • Example 2.1
  • Theorem 2.2
  • proof
  • Remark 2.3
  • Example 2.4
  • Lemma 2.5
  • Theorem 2.6
  • Lemma C.1
  • proof
  • Theorem C.2
  • ...and 4 more