Robust Learning Rate Selection for Stochastic Optimization via Splitting Diagnostic
Matteo Sordello, Niccolò Dalmasso, Hangfeng He, Weijie Su
TL;DR
The paper introduces SplitSGD, a stochastic optimization algorithm that dynamically adjusts the learning rate by detecting stationary phases with a Splitting Diagnostic. The diagnostic runs two parallel SGD threads from the same starting point and uses the gradient coherence $Q_i=\langle \bar g_i^{(1)}, \bar g_i^{(2)}\rangle$, the inner product of the two threads' averaged gradients, to decide when to decrease the learning rate by a factor $\gamma$ while increasing the thread length by a factor $1/\gamma$. The authors provide theoretical guarantees under standard convexity and smoothness assumptions, showing controlled detection error and convergence with increasing diagnostics, and demonstrate through extensive experiments that SplitSGD is robust to hyperparameter choices and can outperform adaptive methods such as Adam on neural networks. The approach yields improved generalization in several deep-learning tasks and offers a practical, low-overhead enhancement to SGD for both convex and nonconvex objectives.
Abstract
This paper proposes SplitSGD, a new dynamic learning rate schedule for stochastic optimization. The method decreases the learning rate, for better adaptation to the local geometry of the objective function, whenever a stationary phase is detected, that is, whenever the iterates are likely to be bouncing around in the vicinity of a local minimum. The detection is performed by splitting the single thread into two and using the inner product of the gradients from the two threads as a measure of stationarity. Owing to this simple yet provably valid stationarity detection, SplitSGD is easy to implement and incurs essentially no additional computational cost compared with standard SGD. Through a series of extensive experiments, we show that this method is appropriate both for convex problems and for training (non-convex) neural networks, with performance that compares favorably to other stochastic optimization methods. Importantly, the method is observed to be very robust with a single set of default parameters across a wide range of problems and, moreover, can yield better generalization performance than adaptive gradient methods such as Adam.
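The splitting idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the number of windows, the window length, the threshold q on the fraction of negative inner products, the averaging of the two threads' final iterates, and all function names are assumptions made for illustration. Only the use of two threads, the inner product of their averaged gradients, the decay factor gamma, and the 1/gamma increase in thread length come from the text.

```python
import numpy as np


def sgd_thread(theta, grad_fn, lr, n_windows, window_len, rng):
    """Run n_windows * window_len SGD steps from a copy of theta.

    Returns the final iterate and the mean stochastic gradient of each window.
    """
    theta = theta.copy()
    mean_grads = []
    for _ in range(n_windows):
        g_sum = np.zeros_like(theta)
        for _ in range(window_len):
            g = grad_fn(theta, rng)
            theta -= lr * g
            g_sum += g
        mean_grads.append(g_sum / window_len)
    return theta, mean_grads


def splitting_diagnostic(theta, grad_fn, lr, rng,
                         n_windows=4, window_len=25, q=0.5):
    """Split into two threads and compare their averaged gradients.

    Assumption: stationarity is declared when at least a fraction q of the
    inner products Q_i is negative (hypothetical default values).
    """
    theta1, g1 = sgd_thread(theta, grad_fn, lr, n_windows, window_len, rng)
    theta2, g2 = sgd_thread(theta, grad_fn, lr, n_windows, window_len, rng)
    Q = [float(np.dot(a, b)) for a, b in zip(g1, g2)]
    n_negative = sum(x < 0 for x in Q)
    stationary = n_negative >= q * n_windows
    # Assumption: continue from the average of the two threads.
    return stationary, 0.5 * (theta1 + theta2), Q


def split_sgd(theta0, grad_fn, lr=0.5, gamma=0.5,
              thread_len=100, n_rounds=8, seed=0):
    """Alternate single-thread SGD with the diagnostic defined above."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    history = []  # (learning rate, thread length, stationary?) per round
    for _ in range(n_rounds):
        for _ in range(thread_len):
            theta -= lr * grad_fn(theta, rng)
        stationary, theta, _ = splitting_diagnostic(theta, grad_fn, lr, rng)
        history.append((lr, thread_len, stationary))
        if stationary:
            lr *= gamma                           # decrease the learning rate
            thread_len = int(thread_len / gamma)  # lengthen the next thread
    return theta, lr, history


def noisy_quadratic_grad(theta, rng):
    """Stochastic gradient of 0.5 * ||theta||^2 with unit Gaussian noise."""
    return theta + rng.normal(size=theta.shape)


if __name__ == "__main__":
    theta, lr, history = split_sgd(np.array([5.0, -3.0]), noisy_quadratic_grad)
    for step_size, length, stationary in history:
        print(f"lr={step_size:.4f}  thread_len={length}  stationary={stationary}")
```

Far from the minimum the two threads follow nearly the same descent direction, so their averaged gradients tend to have a positive inner product. Near the minimum the noise dominates, the inner products are negative roughly half of the time, and the sketch responds by shrinking the learning rate.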
