When to restart? Exploring escalating restarts on convergence

Ayush K. Varshney; Šarūnas Girdzijauskas; Konstantinos Vandikas; Aneta Vulgarakis Feljan

TL;DR

Stochastic Gradient Descent with Escalating Restarts adaptively increases the learning rate upon convergence, demonstrating the benefit of convergence-aware escalating restarts for better local optima.

Abstract

Learning rate scheduling plays a critical role in the optimization of deep neural networks, directly influencing convergence speed, stability, and generalization. While existing schedulers such as cosine annealing, cyclical learning rates, and warm restarts have shown promise, they often rely on fixed or periodic triggers that are agnostic to the training dynamics, such as stagnation or convergence behavior. In this work, we propose a simple yet effective strategy, which we call Stochastic Gradient Descent with Escalating Restarts (SGD-ER). It adaptively increases the learning rate upon convergence. Our method monitors training progress and triggers restarts when stagnation is detected, linearly escalating the learning rate to escape sharp local minima and explore flatter regions of the loss landscape. We evaluate SGD-ER across CIFAR-10, CIFAR-100, and TinyImageNet on a range of architectures including ResNet-18/34/50, VGG-16, and DenseNet-101. Compared to standard schedulers, SGD-ER improves test accuracy by 0.5-4.5%, demonstrating the benefit of convergence-aware escalating restarts for better local optima.
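The restart rule described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class name `EscalatingRestartSchedule`, the patience-based stagnation test, the exponential within-cycle decay, and every default value below are assumptions chosen only to show the idea of restarting at a linearly escalated learning rate once progress stalls.

```python
from dataclasses import dataclass, field


@dataclass
class EscalatingRestartSchedule:
    """Toy convergence-aware schedule: decay the LR, restart higher on stagnation."""

    base_lr: float = 0.1      # assumed initial learning rate
    escalation: float = 0.05  # assumed linear increment added per restart
    decay: float = 0.9        # assumed per-epoch exponential decay factor
    patience: int = 3         # assumed epochs without improvement before a restart
    min_delta: float = 1e-4   # assumed minimum loss drop that counts as progress
    restarts: int = field(default=0, init=False)
    best_loss: float = field(default=float("inf"), init=False)
    stale_epochs: int = field(default=0, init=False)
    lr: float = field(default=0.0, init=False)

    def __post_init__(self) -> None:
        self.lr = self.base_lr

    def step(self, loss: float) -> float:
        """Record this epoch's loss and return the learning rate for the next epoch."""
        if loss < self.best_loss - self.min_delta:
            self.best_loss = loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1

        if self.stale_epochs >= self.patience:
            # Stagnation detected: restart at a linearly escalated learning rate.
            self.restarts += 1
            self.lr = self.base_lr + self.restarts * self.escalation
            self.stale_epochs = 0
        else:
            # No restart: keep decaying within the current cycle.
            self.lr *= self.decay
        return self.lr


# Hypothetical loss trace: three stalled epochs trigger one restart.
schedule = EscalatingRestartSchedule()
losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.9, 0.6]
learning_rates = [schedule.step(loss) for loss in losses]
```

In this sketch the k-th restart begins at `base_lr + k * escalation`, so each restart starts from a higher learning rate than the previous one, unlike periodic warm restarts that return to the same peak on a fixed timetable.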

Paper Structure

This paper contains 6 sections, 2 theorems, 11 equations, 5 figures, and 9 tables.

Introduction
Stochastic Gradient Descent with Escalating Restarts
Experimental Analysis
Conclusion
Additional Experimental Results
Theoretical Analysis

Key Result

Theorem 1

Let $f:\mathbb{R}^d \to \mathbb{R}$ be an $L$-smooth function. Let $\theta^\star$ be a strict saddle point where $\nabla f(\theta^\star) = 0$ and $\lambda_{\min}(\nabla^2 f(\theta^\star)) = -\gamma < 0$. Let $\theta_0^{(k)}$ be the starting point for restart $k$. Assume there exists a non-zero resid Then, for any neighborhood radius $\delta > |x_0|$, the number of iterations $T_k$ required to esca
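For intuition about why a larger restart learning rate could shorten the time spent near a strict saddle, consider a toy one-dimensional model. It is an illustration only, not the bound stated in Theorem 1: the function `escape_iterations`, the quadratic $f(x) = -\tfrac{\gamma}{2}x^2$, and the parameter names are assumptions made for this sketch. Here `x0` plays the role of a non-zero residual along the negative-curvature direction and `delta` is the neighborhood radius.

```python
def escape_iterations(lr: float, gamma: float, x0: float, delta: float) -> int:
    """Count gradient-descent steps until |x| >= delta on f(x) = -(gamma / 2) * x**2."""
    if lr <= 0.0 or gamma <= 0.0:
        raise ValueError("lr and gamma must be positive for the iterate to escape")
    if x0 == 0.0:
        raise ValueError("x0 must be non-zero: with no residual, x never leaves the saddle")

    x, steps = x0, 0
    while abs(x) < delta:
        # The gradient is -gamma * x, so each step multiplies x by (1 + lr * gamma).
        x -= lr * (-gamma * x)
        steps += 1
    return steps


# Hypothetical values: same saddle and residual, two different learning rates.
slow = escape_iterations(lr=0.1, gamma=1.0, x0=1e-3, delta=1.0)
fast = escape_iterations(lr=0.2, gamma=1.0, x0=1e-3, delta=1.0)
```

In this toy model the iterate grows geometrically, $|x_t| = (1 + \eta\gamma)^t |x_0|$, so the step count is approximately $\log(\delta / |x_0|) / \log(1 + \eta\gamma)$ and decreases as the learning rate $\eta$ increases.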

Figures (5)

Figure 1: Comparison of learning rate schedulers and test accuracy on CIFAR-100 with a ResNet-18 architecture and a training budget of 2000 epochs. (Left) Learning rate trajectories for our proposed scheduler (ours_exp, in red) alongside four baselines: SGD with exponential decay, SGD with Warmup-Stable-Decay-Simplified (SGD_WSDS), SGD with Cyclical Learning Rate (SGD_CLR), and SGD with Cosine Annealing and Warm Restarts (SGD_cosA). (Right) Comparison of test accuracy on CIFAR-100 under identical training budgets. The results highlight that our approach converges to a better local optimum and terminates early when no further improvement is found.
Figure 2: (Left) Training loss trajectories for our proposed scheduler (ours_exp, in red) alongside four baselines: SGD with exponential decay, SGD with Warmup-Stable-Decay-Simplified (SGD_WSDS), SGD with Cyclical Learning Rate (SGD_CLR), and SGD with Cosine Annealing and Warm Restarts (SGD_cosA). (Right) Test accuracy on CIFAR-100 under a training budget of 750 epochs; our approach (in red) finds better local optima.
Figure 3: Comparison of learning rate schedulers and test accuracy on CIFAR-100 with a ResNet-18 architecture and a training budget of 2000 epochs. (Left) Learning rate trajectories for our proposed schedulers (ours_exp, with exponential decay; ours_lin, with linear decay) alongside the following baselines: SGD with exponential decay, SGD with linear decay (SGD_lin), SGD with Warmup-Stable-Decay-Simplified (SGD_WSDS), Adam, SGD with Cyclical Learning Rate (SGD_CLR), and SGD with Cosine Annealing and Warm Restarts (SGD_cosA). (Right) Comparison of test accuracy on CIFAR-100 under identical training budgets. The results highlight that our approaches converge to better solutions.
Figure 4: (Left) Training loss trajectories for our proposed scheduler (ours_exp, in red) alongside four baselines: SGD with exponential decay, SGD with Warmup-Stable-Decay-Simplified (SGD_WSDS), SGD with Cyclical Learning Rate (SGD_CLR), and SGD with Cosine Annealing and Warm Restarts (SGD_cosA). (Right) Test accuracy on CIFAR-10 under a training budget of 500 epochs; our approach (in red) finds better local optima.
Figure 5: Training loss trajectories for our proposed scheduler (ours_exp, in red) compared against four baselines: standard SGD with exponential decay, SGD with Warmup‑Stable‑Decay‑Simplified (SGD_WSDS), SGD with Cyclical Learning Rate (SGD_CLR), and SGD with Cosine Annealing and Warm Restarts (SGD_cosA). While the baseline methods largely converge and stop improving with additional training, SGD‑ER (ours, in red) continues to make steady progress, ultimately reaching higher test accuracies and demonstrating its advantage in prolonged‑training regimes.

Theorems & Definitions (2)

Theorem 1
Theorem 2
