Robust Learning Rate Selection for Stochastic Optimization via Splitting Diagnostic

Matteo Sordello, Niccolò Dalmasso, Hangfeng He, Weijie Su

TL;DR

The paper introduces SplitSGD, a stochastic optimization algorithm that dynamically adjusts the learning rate by detecting stationary phases with a Splitting Diagnostic. The diagnostic runs two parallel SGD threads from the same point and uses the gradient coherence $Q_i=\langle \bar g_i^{(1)}, \bar g_i^{(2)}\rangle$, the inner product of the two threads' averaged gradients, to decide when to decrease the learning rate by a factor $\gamma$ while increasing the thread length by a factor $1/\gamma$. The authors provide theoretical guarantees under standard convexity and smoothness assumptions, showing controlled detection error and convergence as the number of diagnostics grows, and demonstrate through extensive experiments that SplitSGD is robust to hyperparameters and can outperform adaptive methods like Adam on neural networks. The approach yields improved generalization in several deep-learning tasks and offers a practical, low-overhead enhancement to SGD for both convex and nonconvex objectives.

Abstract

This paper proposes SplitSGD, a new dynamic learning rate schedule for stochastic optimization. This method decreases the learning rate for better adaptation to the local geometry of the objective function whenever a stationary phase is detected, that is, when the iterates are likely to bounce around in the vicinity of a local minimum. The detection is performed by splitting the single thread into two and using the inner product of the gradients from the two threads as a measure of stationarity. Owing to this simple yet provably valid stationarity detection, SplitSGD is easy to implement and incurs essentially no additional computational cost compared with standard SGD. Through a series of extensive experiments, we show that this method is appropriate for both convex problems and training (non-convex) neural networks, with performance that compares favorably to other stochastic optimization methods. Importantly, this method is observed to be very robust with a set of default parameters for a wide range of problems and, moreover, can yield better generalization performance than other adaptive gradient methods such as Adam.
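
The splitting step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that each thread's gradients are averaged over $w$ windows of length $l$, that stationarity is declared when at least a fraction $q$ of the $Q_i$ are negative, and that the two threads are averaged afterwards; the decision rule, the averaging step, and the names `splitting_diagnostic` and `quadratic_noisy_grad` are assumptions made for illustration.

```python
import numpy as np

def splitting_diagnostic(theta, noisy_grad, eta, w, l, q, rng):
    """Run two SGD threads from theta and compare their window-averaged gradients.

    Returns (stationary, Q, theta_avg). The rule "stationary if at least
    q * w of the Q_i are negative" is an assumption of this sketch.
    """
    theta = np.asarray(theta, dtype=float)
    threads = [theta.copy(), theta.copy()]
    Q = np.empty(w)
    for i in range(w):
        window_means = []
        for k in range(2):
            g_sum = np.zeros_like(theta)
            for _ in range(l):
                g = noisy_grad(threads[k], rng)      # independent noise per thread
                threads[k] = threads[k] - eta * g    # plain SGD step
                g_sum += g
            window_means.append(g_sum / l)           # averaged gradient of window i
        Q[i] = window_means[0] @ window_means[1]     # gradient coherence Q_i
    stationary = bool(np.sum(Q < 0) >= q * w)
    theta_avg = 0.5 * (threads[0] + threads[1])      # assumed restart point
    return stationary, Q, theta_avg

def quadratic_noisy_grad(theta, rng):
    """Hypothetical test problem: F(theta) = 0.5 * ||theta||^2 plus unit Gaussian noise."""
    return theta + rng.normal(size=theta.shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    far_from_minimum = 100.0 * np.ones(5)
    stationary, Q, _ = splitting_diagnostic(
        far_from_minimum, quadratic_noisy_grad, eta=0.01, w=20, l=10, q=0.4, rng=rng
    )
    print("stationary:", stationary, "| negative Q_i:", int(np.sum(Q < 0)))
```

Far from the minimum the true gradient dominates the noise, so both threads move in nearly the same direction and the $Q_i$ are positive. Near a minimum with a large learning rate the averaged gradients are mostly noise, so negative inner products become frequent.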

Paper Structure

This paper contains 18 sections, 8 theorems, 63 equations, 13 figures, 2 algorithms.

Key Result

Theorem 2

If Assumptions smoothness, errors_assumption and bounded_noisy_gradient with $m=4$ hold, $\|\nabla F(\theta_0)\| > 0$ and we run $t_1$ iterations before a Splitting Diagnostic with $w$ windows of length $l$, then for any $i \in \{1, \dots, w\}$ we can set $\eta$ small enough to guarantee that where $C_1(\eta, l) = O(1/\sqrt{l}) + O(\sqrt{\eta(t_1+l)})$. $C_1(\eta, l)$ also depends on $\|\nabla F(\theta

Figures (13)

  • Figure 1: Normalized dot product of averaged noisy gradients over $100$ iterations. Stationarity depends on the learning rate: $\eta = 1$ corresponds to stationarity (purple), while $\eta = 0.1$ corresponds to non-stationarity (orange). Details in Section \ref{sec:splitt-diagn-splitsg}.
  • Figure 2: The architecture of SplitSGD. The initial learning rate is $\eta$ and the length of the first single thread is $t_1$. If the diagnostic does not detect stationarity, the length and learning rate of the next thread remain unchanged. If stationarity is detected, we decrease the learning rate by a factor $\gamma$ and proportionally increase the length.
  • Figure 3: Histogram of the gradient coherence $Q_i$ (for the second pair of windows, normalized) of the Splitting Diagnostic for linear and logistic regression. The two top panels show the behavior in Theorem \ref{asymptotic_eta}, the two bottom panels the one in Theorem \ref{asymptotic_t}. Orange indicates non-stationarity, while purple indicates a distribution that will return stationarity for an appropriate choice of $w$ and $q$.
  • Figure 4: The probability of making a type I error using the Splitting Diagnostic (thick lines) closely matches the respective theoretical probability (thin lines) in both the linear and logistic regression settings. In both settings we ran 1000 experiments for each value of $w$ and initialized $\theta_0$ close to $\theta^*$, with $l = 10$ and $\eta$ sufficiently large to guarantee stationarity.
  • Figure 5: (Top) Comparison between the Splitting and Pflug Diagnostics on linear and logistic regression. The y-axis represents the epochs and the red bands are the epochs where stationarity should be detected, while the boxplots represent the distribution of when each method actually detects stationarity. The Pflug Diagnostic risks waiting too long after stationarity is reached, while the Splitting Diagnostic does not, since a checkpoint is set every fixed number of iterations. (Bottom) Comparison of the log(loss) achieved after $100$ epochs between SplitSGD, SGD$^{1/2}$ (Half) and SGD with constant or decreasing learning rate on linear and logistic regression. More details are in Section \ref{convex_section}.
  • ...and 8 more figures
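
The schedule in the Figure 2 caption can be traced with a short Python sketch. It is an illustration under stated assumptions rather than the authors' code: the names `next_thread` and `trace_schedule`, the default $\gamma = 0.5$, and rounding the new length down to an integer are all assumptions made here.

```python
import math

def next_thread(eta, t, stationary, gamma=0.5):
    """Return (learning rate, length) of the next single thread.

    If the diagnostic reported stationarity, shrink the learning rate by
    gamma and lengthen the thread by 1 / gamma; otherwise keep both.
    Rounding the length down is an assumption of this sketch.
    """
    if stationary:
        return eta * gamma, math.floor(t / gamma)
    return eta, t

def trace_schedule(eta, t1, outcomes, gamma=0.5):
    """List the (learning rate, length) pairs produced by a sequence of diagnostic outcomes."""
    schedule = [(eta, t1)]
    t = t1
    for stationary in outcomes:
        eta, t = next_thread(eta, t, stationary, gamma)
        schedule.append((eta, t))
    return schedule

if __name__ == "__main__":
    # One non-stationary diagnostic followed by two stationary ones.
    for eta, t in trace_schedule(1.0, 100, [False, True, True]):
        print(f"eta = {eta}, thread length = {t}")
```

With $\eta = 1$, $t_1 = 100$ and $\gamma = 0.5$, the learning rate stays at $1$ after the first diagnostic and then halves twice, while the thread length doubles from $100$ to $200$ to $400$.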

Theorems & Definitions (12)

  • Definition 1
  • Theorem 2
  • Theorem 3
  • Lemma 4
  • Proposition 5
  • Lemma 6
  • Remark 7
  • Lemma 8
  • Remark 9
  • Lemma 10
  • ...and 2 more