AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods

Tim Tsz-Kit Lau; Han Liu; Mladen Kolar

TL;DR

This work unveils the potential of adaptive batch size strategies for adaptive gradient optimizers in large-scale model training and proves that AdAdaGradNorm converges with high probability at a rate of $\mathscr{O}(1/K)$ to find a first-order stationary point of smooth nonconvex functions within $K$ iterations.

Abstract

The choice of batch sizes in minibatch stochastic gradient optimizers is critical in large-scale model training for both optimization and generalization performance. Although large-batch training is arguably the dominant training paradigm for large-scale deep learning due to hardware advances, the generalization performance of the model deteriorates compared to small-batch training, leading to the so-called "generalization gap" phenomenon. To mitigate this, we investigate adaptive batch size strategies derived from adaptive sampling methods, originally developed only for stochastic gradient descent. Given the significant interplay between learning rates and batch sizes, and considering the prevalence of adaptive gradient methods in deep learning, we emphasize the need for adaptive batch size strategies in these contexts. We introduce AdAdaGrad and its scalar variant AdAdaGradNorm, which progressively increase batch sizes during training, while model updates are performed using AdaGrad and AdaGradNorm. We prove that AdAdaGradNorm converges with high probability at a rate of $\mathscr{O}(1/K)$ to find a first-order stationary point of smooth nonconvex functions within $K$ iterations. AdAdaGrad also demonstrates similar convergence properties when integrated with a novel coordinate-wise variant of our adaptive batch size strategies. We corroborate our theoretical claims by performing image classification experiments, highlighting the merits of the proposed schemes in terms of both training efficiency and model generalization. Our work unveils the potential of adaptive batch size strategies for adaptive gradient optimizers in large-scale model training.
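The scheme described in the abstract can be illustrated with a minimal sketch. It assumes the approximate norm test from the adaptive sampling literature (increase the batch when the sample variance of per-sample gradients, divided by the batch size, exceeds $\eta^2$ times the squared norm of the minibatch gradient) and the usual AdaGrad-Norm scalar step size. The function names, the least-squares objective, and all default values below are hypothetical choices for illustration, not taken from the paper.

```python
import math
import numpy as np


def norm_test_batch_size(per_sample_grads, eta):
    """Batch size suggested by an approximate norm test (illustrative).

    per_sample_grads: array of shape (b, d) with b >= 2.
    Returns b if the test passes, otherwise a larger batch size.
    """
    b = per_sample_grads.shape[0]
    mean_grad = per_sample_grads.mean(axis=0)
    # Sample variance of the per-sample gradients, summed over coordinates.
    var = float(((per_sample_grads - mean_grad) ** 2).sum()) / (b - 1)
    grad_sq = float(mean_grad @ mean_grad)
    if grad_sq == 0.0 or var / b <= eta ** 2 * grad_sq:
        return b  # test satisfied (or gradient is zero): keep the batch size
    # Smallest batch size for which the test would hold with these estimates.
    return max(b, math.ceil(var / (eta ** 2 * grad_sq)))


def adadagrad_norm_sketch(A, y, eta=0.5, alpha=1.0, b0=1e-2,
                          batch0=4, steps=200, seed=0):
    """AdaGrad-Norm updates with norm-test batch sizes on least squares.

    Objective (an assumption for this demo): mean of 0.5 * (a_i^T x - y_i)^2.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    acc = b0 ** 2          # accumulated squared gradient norms
    batch = batch0
    history = []
    for _ in range(steps):
        idx = rng.choice(n, size=batch, replace=False)
        resid = A[idx] @ x - y[idx]
        per_sample = resid[:, None] * A[idx]   # per-sample gradients, (batch, d)
        g = per_sample.mean(axis=0)
        acc += float(g @ g)
        x = x - alpha / math.sqrt(acc) * g     # AdaGrad-Norm scalar step
        history.append(batch)
        # Batch size never decreases and is capped at the dataset size.
        batch = min(n, norm_test_batch_size(per_sample, eta))
    return x, history
```

Calling `adadagrad_norm_sketch(A, y)` on a synthetic regression problem returns the final iterate and the sequence of batch sizes used, which is non-decreasing by construction. A full implementation would compute the test statistics across devices and apply the coordinate-wise variant for AdaGrad; this sketch covers only the scalar case.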

Paper Structure

This paper contains 40 sections, 12 theorems, 90 equations, 8 figures, 10 tables, 1 algorithm.

Introduction
Contributions.
Related Work
Large-batch training.
Adaptive sampling methods.
Problem Formulation
Notation.
Problem setting.
Adaptive Sampling Methods
Norm Test
Inner Product Test
Adaptive Sampling Methods for Adaptive Gradient Methods
Convergence Analysis
Convergence Results
Numerical Experiments
...and 25 more sections

Key Result

Proposition 5.1

For every iteration $k\in\mathbb{N}^*$, if the conditions of the exact variance norm test hold with constant $\eta\in(0,1)$ and the conditions of the exact variance augmented inner product test hold with constants $(\vartheta, \nu) \in\mathbb{R}_{++}^2$ respectively, then the E-SG condi

Figures (8)

Figure 1: Training loss, validation accuracy and batch sizes of AdaSGD, AdAdaGrad and AdAdaGrad-Norm for a three-layer CNN on the MNIST dataset.
Figure 2: AdaGrad and AdAdaGrad for ResNet-18 on the CIFAR-10 dataset.
Figure 3: Adam and AdAdam for ResNet-18 on the CIFAR-10 dataset.
Figure 4: Training loss, validation accuracy and batch size curves (vs. number of training samples) of AdaSGD, AdAdaGrad and AdAdaGrad-Norm for logistic regression on the MNIST dataset.
Figure 5: Training loss, validation accuracy and batch size curves (vs. number of training samples) of AdaSGD, AdAdaGrad and AdAdaGrad-Norm for three-layer CNN on the MNIST dataset.
...and 3 more figures

Theorems & Definitions (23)

Definition 1: Expected strong growth
Proposition 5.1: Informal
Theorem 5.1: AdAdaGrad-Norm
Proposition 5.2: Coordinate-wise expected strong growth
Theorem 5.2: AdAdaGrad
Theorem 5.3: $(L_0, L_1)$-smooth AdAdaGrad-Norm
proof
Proposition C.1: Exact variance norm test
proof
Proposition C.2: Exact variance inner product test and orthogonality test
...and 13 more
