Relationship between Batch Size and Number of Steps Needed for Nonconvex Optimization of Stochastic Gradient Descent using Armijo Line Search

Yuki Tsukada; Hideaki Iiduka

TL;DR

The paper studies how batch size $b$ affects the number of steps $K$ needed for nonconvex optimization when SGD uses learning rates chosen by an Armijo line search. It derives a convergence bound on the smallest expected squared gradient norm over the first $K$ steps, $\min_{k< K} \mathbb{E}[\|\nabla f(\bm{\theta}_k)\|^2] \le C_1/K + C_2/b$, and proves that the number of steps $K(b)$ needed to reach accuracy $\epsilon$ is monotone decreasing and convex in $b$, while the SFO complexity $N(b)=K(b)b$ is convex with a unique minimizer $b^* = \frac{2 C_2}{\epsilon^2}$. This $b^*$ is a critical batch size: increasing the batch beyond it still reduces the number of steps, but the total number of stochastic gradient evaluations grows. The theory is supported by numerical experiments on ResNets and MLPs that identify near-optimal $b^*$ values. Overall, the work provides a principled way to select the batch size that minimizes gradient evaluations for a given accuracy in nonconvex SGD with Armijo line search.
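The step from the bound to the critical batch size can be worked out as follows. This is a sketch, assuming that accuracy $\epsilon$ means the bound is at most $\epsilon^2$ and that $K(b)$ is the value of $K$ at which the bound equals $\epsilon^2$; that reading is consistent with the stated $b^*$, but the constants $C_1$, $C_2$ are left unspecified here.

$$
\frac{C_1}{K} + \frac{C_2}{b} = \epsilon^2
\;\Longrightarrow\;
K(b) = \frac{C_1 b}{\epsilon^2 b - C_2}, \qquad b > \frac{C_2}{\epsilon^2},
$$

$$
K'(b) = -\frac{C_1 C_2}{(\epsilon^2 b - C_2)^2} < 0, \qquad
K''(b) = \frac{2 C_1 C_2 \epsilon^2}{(\epsilon^2 b - C_2)^3} > 0,
$$

$$
N(b) = K(b)\, b = \frac{C_1 b^2}{\epsilon^2 b - C_2}, \qquad
N'(b) = \frac{C_1 b\,(\epsilon^2 b - 2 C_2)}{(\epsilon^2 b - C_2)^2} = 0
\;\Longleftrightarrow\;
b^* = \frac{2 C_2}{\epsilon^2}.
$$

Under these assumptions, $N(b^*) = 4 C_1 C_2 / \epsilon^4$, and $N'(b)$ changes sign from negative to positive at $b^*$, so $b^*$ is the minimizer.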

Abstract

While stochastic gradient descent (SGD) can use various learning rates, such as constant or diminishing rates, previous numerical results showed that SGD performs better than other deep learning optimizers when it uses learning rates given by line search methods. In this paper, we perform a convergence analysis of SGD with a learning rate given by an Armijo line search for nonconvex optimization, indicating that the upper bound of the expectation of the squared norm of the full gradient becomes small when the number of steps and the batch size are large. Next, we show that, for SGD with the Armijo-line-search learning rate, the number of steps needed for nonconvex optimization is a monotone decreasing convex function of the batch size; that is, the number of steps needed for nonconvex optimization decreases as the batch size increases. Furthermore, we show that the stochastic first-order oracle (SFO) complexity, which is the stochastic gradient computation cost, is a convex function of the batch size; that is, there exists a critical batch size that minimizes the SFO complexity. Finally, we provide numerical results that support our theoretical results. The numerical results indicate that the number of steps needed for training deep neural networks decreases as the batch size increases and that there exist critical batch sizes that can be estimated from the theoretical results.
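To make the method concrete, here is a minimal sketch of mini-batch SGD whose learning rate is chosen by backtracking until the Armijo condition holds on the current mini-batch. It is an illustration under stated assumptions, not the paper's algorithm or experimental setup: the toy 1-D regression problem (which is convex), the function names, and the values `c=0.1`, `gamma0=1.0`, and `shrink=0.5` are all hypothetical choices made for this example.

```python
import random

def loss_and_grad(theta, batch):
    """Mean squared-error loss and its gradient for y ~ w*x + bias on a batch."""
    w, bias = theta
    loss, gw, gb = 0.0, 0.0, 0.0
    for x, y in batch:
        r = w * x + bias - y
        loss += 0.5 * r * r
        gw += r * x
        gb += r
    n = len(batch)
    return loss / n, (gw / n, gb / n)

def armijo_step(theta, batch, c=0.1, gamma0=1.0, shrink=0.5, max_backtracks=30):
    """One SGD step; the learning rate is shrunk until the Armijo condition
    f_B(theta - gamma*g) <= f_B(theta) - c*gamma*||g||^2 holds on the batch."""
    loss, grad = loss_and_grad(theta, batch)
    sq_norm = sum(g * g for g in grad)
    gamma = gamma0
    for _ in range(max_backtracks):
        trial = tuple(t - gamma * g for t, g in zip(theta, grad))
        trial_loss, _ = loss_and_grad(trial, batch)
        if trial_loss <= loss - c * gamma * sq_norm:
            return trial, gamma
        gamma *= shrink
    return theta, 0.0  # no acceptable step found; keep the current iterate

def sgd_armijo(data, batch_size, steps, seed=0):
    """Run mini-batch SGD with an Armijo-backtracked learning rate."""
    rng = random.Random(seed)
    theta = (0.0, 0.0)
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        theta, _ = armijo_step(theta, batch)
    return theta

# Hypothetical toy data: y = 2x + 1 plus small Gaussian noise.
data_rng = random.Random(1)
data = [(x, 2.0 * x + 1.0 + data_rng.gauss(0.0, 0.1))
        for x in (data_rng.uniform(-1.0, 1.0) for _ in range(200))]

initial_loss, _ = loss_and_grad((0.0, 0.0), data)
theta = sgd_armijo(data, batch_size=16, steps=200)
final_loss, _ = loss_and_grad(theta, data)
print(f"full loss: {initial_loss:.4f} -> {final_loss:.4f}, theta = {theta}")
```

The Armijo test here is evaluated on the mini-batch loss and mini-batch gradient, which is one common way to combine line search with SGD; rerunning with different `batch_size` values gives a rough feel for how the number of steps to a target loss changes with the batch size.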

Paper Structure (28 sections, 5 theorems, 68 equations, 8 figures, 1 table, 2 algorithms)

Introduction
Background
Motivation
Contribution
Convergence analysis of SGD with Armijo-line-search learning rates
Steps needed for $\epsilon$-approximation of SGD with Armijo-line-search learning rates
Critical batch size minimizing SFO complexity of SGD with Armijo-line-search learning rates
Numerical results supporting our theoretical results
Mathematical Preliminaries
Definitions
Assumptions and problem
Stochastic gradient descent using Armijo line search
Armijo condition
Stochastic gradient descent under Armijo condition
Analysis of SGD using Armijo Line Search
...and 13 more sections

Key Result

Proposition 2.3

Let $f \colon \mathbb{R}^d \to \mathbb{R}$ be continuously differentiable. Let $\bm{\theta}_k \in \mathbb{R}^d$ and let $\bm{d}_k (\neq \bm{0})$ have the descent property defined by $\langle \nabla f(\bm{\theta}_k), \bm{d}_k \rangle < 0$. Let $c \in (0,1)$. Then, there exists $\gamma_k > 0$ su

Theorems & Definitions (5)

Proposition 2.3
Lemma 2.4
Theorem 3.1: Upper bound of the squared norm of the full gradient
Theorem 3.2: Steps needed for nonconvex optimization of SGD using Armijo line search
Theorem 3.3: Existence of critical batch size for SGD using Armijo line search
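The relationship between Theorems 3.1 to 3.3 can be checked numerically with a short sketch. It assumes that $K(b)$ is obtained by setting the bound $C_1/K + C_2/b$ equal to $\epsilon^2$, which gives $K(b) = C_1 b / (\epsilon^2 b - C_2)$ for $b > C_2/\epsilon^2$; the function names and the constants `C1 = 10`, `C2 = 0.5`, `EPS = 0.1` are hypothetical values chosen for illustration, not values from the paper.

```python
def steps_needed(b, c1, c2, eps):
    """K(b) = C1*b / (eps^2*b - C2); only defined for b > C2/eps^2."""
    denom = eps ** 2 * b - c2
    if denom <= 0:
        raise ValueError("batch size must exceed C2 / eps^2")
    return c1 * b / denom

def sfo_complexity(b, c1, c2, eps):
    """N(b) = K(b) * b, the total number of stochastic gradient evaluations."""
    return steps_needed(b, c1, c2, eps) * b

def critical_batch_size(c2, eps):
    """Closed-form minimizer b* = 2*C2 / eps^2 of N(b)."""
    return 2.0 * c2 / eps ** 2

# Hypothetical constants for illustration only.
C1, C2, EPS = 10.0, 0.5, 0.1

b_star = critical_batch_size(C2, EPS)

# Brute-force search over integer batch sizes above the threshold C2/eps^2 = 50.
candidates = range(51, 1001)
best_b = min(candidates, key=lambda b: sfo_complexity(b, C1, C2, EPS))

print(f"closed-form b* = {b_star:.2f}, grid-search minimizer = {best_b}")
for b in (60, 100, 200, 400):
    print(b, round(steps_needed(b, C1, C2, EPS)), round(sfo_complexity(b, C1, C2, EPS)))
```

In this toy setting the steps needed keep falling as `b` grows, while the SFO complexity falls until the critical batch size and rises afterwards, which is the diminishing-returns behavior the theorems describe.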
