AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods

Tim Tsz-Kit Lau, Han Liu, Mladen Kolar

TL;DR

This work unveils the potential of adaptive batch size strategies for adaptive gradient optimizers in large-scale model training and proves that AdAdaGradNorm converges with high probability at a rate of $\mathscr{O}(1/K)$ to find a first-order stationary point of smooth nonconvex functions within $K$ iterations.

Abstract

The choice of batch sizes in minibatch stochastic gradient optimizers is critical in large-scale model training for both optimization and generalization performance. Although large-batch training is arguably the dominant training paradigm for large-scale deep learning due to hardware advances, the generalization performance of the model deteriorates compared to small-batch training, leading to the so-called "generalization gap" phenomenon. To mitigate this, we investigate adaptive batch size strategies derived from adaptive sampling methods, originally developed only for stochastic gradient descent. Given the significant interplay between learning rates and batch sizes, and considering the prevalence of adaptive gradient methods in deep learning, we emphasize the need for adaptive batch size strategies in these contexts. We introduce AdAdaGrad and its scalar variant AdAdaGradNorm, which progressively increase batch sizes during training, while model updates are performed using AdaGrad and AdaGradNorm. We prove that AdAdaGradNorm converges with high probability at a rate of $\mathscr{O}(1/K)$ to find a first-order stationary point of smooth nonconvex functions within $K$ iterations. AdAdaGrad also demonstrates similar convergence properties when integrated with a novel coordinate-wise variant of our adaptive batch size strategies. We corroborate our theoretical claims by performing image classification experiments, highlighting the merits of the proposed schemes in terms of both training efficiency and model generalization. Our work unveils the potential of adaptive batch size strategies for adaptive gradient optimizers in large-scale model training.
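The idea of growing the batch size while updating with AdaGradNorm can be sketched on a toy least-squares problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, hyperparameter defaults, and the least-squares objective are all hypothetical, and the batch-size rule is the approximate norm test from the adaptive sampling literature, which may differ in detail from the schemes analyzed in the paper.

```python
import math
import numpy as np


def norm_test_batch_size(per_sample_grads, eta, max_batch):
    """Approximate norm test (illustrative). Keep the current batch size if
    (sample variance) / batch <= eta^2 * ||mean gradient||^2; otherwise
    return the smallest batch size that would satisfy it, capped at max_batch."""
    batch = per_sample_grads.shape[0]
    g = per_sample_grads.mean(axis=0)
    g_sq = float(g @ g)
    if batch < 2 or g_sq == 0.0:
        return batch  # variance estimate or ratio is undefined; leave unchanged
    # Trace of the sample covariance of the per-sample gradients.
    var = float(per_sample_grads.var(axis=0, ddof=1).sum())
    if var / batch <= eta**2 * g_sq:
        return batch
    return min(max_batch, max(batch, math.ceil(var / (eta**2 * g_sq))))


def adagrad_norm_adaptive_batch(A, y, steps=200, alpha=1.0, b0=1e-2,
                                eta=0.5, batch0=4, seed=0):
    """AdaGradNorm on f(x) = mean_i 0.5 * (a_i^T x - y_i)^2 with a batch size
    that is re-evaluated by the norm test after every step (toy sketch)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    b_sq = b0**2          # accumulated squared gradient norms (scalar)
    batch = batch0
    history = []          # batch size used at each step
    for _ in range(steps):
        idx = rng.choice(n, size=batch, replace=False)
        residual = A[idx] @ x - y[idx]
        per_sample_grads = residual[:, None] * A[idx]
        g = per_sample_grads.mean(axis=0)
        # AdaGradNorm: one scalar step size shared by all coordinates.
        b_sq += float(g @ g)
        x = x - alpha / math.sqrt(b_sq) * g
        history.append(batch)
        batch = norm_test_batch_size(per_sample_grads, eta, max_batch=n)
    return x, history
```

By construction the batch size in this sketch never decreases, which mirrors the "progressively increase batch sizes" behavior described in the abstract. A smaller `eta` makes the test stricter and so tends to grow the batch faster.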

Paper Structure (40 sections, 12 theorems, 90 equations, 8 figures, 10 tables, 1 algorithm)

Key Result

Proposition 5.1

For every iteration $k\in\mathbb{N}^*$, if the conditions of the exact variance norm test hold with constant $\eta\in(0,1)$ and the conditions of the exact variance augmented inner product test hold with constants $(\vartheta, \nu) \in\mathbb{R}_{++}^2$ respectively, then the E-SG condition …
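
For intuition on why a norm-test-type condition leads to a strong-growth-type bound, consider the standard calculation below. It is a sketch under assumptions introduced here for illustration: $g_k$ denotes a minibatch gradient that is an unbiased estimate of $\nabla F(x_k)$, and the norm test is written in its usual form from the adaptive sampling literature. The proposition's own statement and constants, including the role of $(\vartheta, \nu)$, are not reproduced.

$$
\mathbb{E}_k\|g_k - \nabla F(x_k)\|^2 \le \eta^2\,\|\nabla F(x_k)\|^2
\;\Longrightarrow\;
\mathbb{E}_k\|g_k\|^2
= \|\nabla F(x_k)\|^2 + \mathbb{E}_k\|g_k - \nabla F(x_k)\|^2
\le (1+\eta^2)\,\|\nabla F(x_k)\|^2 .
$$

The equality is the bias-variance decomposition, which uses only $\mathbb{E}_k[g_k] = \nabla F(x_k)$.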

Figures (8)

  • Figure 1: Training loss, validation accuracy and batch sizes of AdaSGD, AdAdaGrad and AdAdaGrad-Norm for a three-layer CNN on the MNIST dataset.
  • Figure 2: AdaGrad and AdAdaGrad for ResNet-18 on the CIFAR-10 dataset.
  • Figure 3: Adam and AdAdam for ResNet-18 on the CIFAR-10 dataset.
  • Figure 4: Training loss, validation accuracy and batch size curves (vs. number of training samples) of AdaSGD, AdAdaGrad and AdAdaGrad-Norm for logistic regression on the MNIST dataset.
  • Figure 5: Training loss, validation accuracy and batch size curves (vs. number of training samples) of AdaSGD, AdAdaGrad and AdAdaGrad-Norm for three-layer CNN on the MNIST dataset.
  • ...and 3 more figures

Theorems & Definitions (23)

  • Definition 1: Expected strong growth
  • Proposition 5.1: Informal
  • Theorem 5.1: AdAdaGrad-Norm
  • Proposition 5.2: Coordinate-wise expected strong growth
  • Theorem 5.2: AdAdaGrad
  • Theorem 5.3: $(L_0, L_1)$-smooth AdAdaGrad-Norm
  • proof
  • Proposition C.1: Exact variance norm test
  • proof
  • Proposition C.2: Exact variance inner product test and orthogonality test
  • ...and 13 more