AdaGrad under Anisotropic Smoothness

Yuxing Liu, Rui Pan, Tong Zhang

TL;DR

Under anisotropic smoothness and noise conditions, AdaGrad achieves faster convergence guarantees, with better dimensional dependence, than algorithms that use uniform step sizes across all coordinates.

Abstract

Adaptive gradient methods have been widely adopted for training large-scale deep neural networks, especially large foundation models. Despite their huge practical success, their theoretical advantages over classical gradient methods with uniform step sizes across all coordinates (e.g., SGD) are not fully understood, especially in the large-batch setting commonly used in practice. The reason is that the only theoretical result demonstrating this benefit was obtained in the original AdaGrad paper for convex nonsmooth objective functions, which is insufficient for large-batch algorithms. In this work, we attempt to close this gap between theory and practice by proposing a novel anisotropic generalized smoothness assumption and providing corresponding analyses of AdaGrad. We show that under anisotropic smoothness and noise conditions, AdaGrad achieves faster convergence guarantees, in the sense of better dimensional dependence, than algorithms with uniform step sizes across all coordinates. Experiments on logistic regression and instruction-following fine-tuning tasks provide strong evidence supporting our assumption and theoretical analysis.
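To make the contrast between coordinate-wise and uniform step sizes concrete, here is a minimal sketch on a toy problem. It is not the paper's algorithm or experiment: it assumes a separable quadratic with hand-picked curvatures, uses exact (noise-free) gradients, and omits the projection step. The names `adagrad_step`, `run`, and `curvatures` are hypothetical and chosen for illustration only.

```python
import numpy as np


def adagrad_step(w, g, v, eta, eps=1e-8):
    """One diagonal AdaGrad update: returns the new iterate and accumulator."""
    v = v + g * g                          # per-coordinate sum of squared gradients
    w = w - eta * g / (np.sqrt(v) + eps)   # each coordinate gets its own step size
    return w, v


def run(curvatures, steps=200, eta=0.5):
    """Minimize f(w) = 0.5 * sum_j h_j * w_j**2 from w = (1, ..., 1).

    Compares diagonal AdaGrad with gradient descent using one uniform
    step size 1 / max_j h_j. Toy setting assumed for illustration.
    """
    h = np.asarray(curvatures, dtype=float)

    def f(w):
        return 0.5 * np.sum(h * w * w)

    w_ada = np.ones_like(h)
    v = np.zeros_like(h)
    w_gd = np.ones_like(h)
    gd_eta = 1.0 / h.max()  # uniform step is limited by the largest curvature

    for _ in range(steps):
        w_ada, v = adagrad_step(w_ada, h * w_ada, v, eta)
        w_gd = w_gd - gd_eta * (h * w_gd)

    return f(w_ada), f(w_gd)


if __name__ == "__main__":
    f_ada, f_gd = run([100.0, 1.0, 0.01])
    print(f"AdaGrad loss: {f_ada:.3e}, uniform-step GD loss: {f_gd:.3e}")
```

In this toy case the uniform step size is dictated by the steepest coordinate, so the flat coordinates barely move, while the AdaGrad normalization makes progress roughly independent of each coordinate's curvature scale. This only illustrates the intuition; the paper's guarantees concern the stochastic setting under its stated assumptions.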

Paper Structure

This paper contains 29 sections, 19 theorems, 119 equations, 2 figures, 4 tables, 3 algorithms.

Key Result

Theorem 4.1

Under Assumptions asm:convex_set, asm:smooth, asm:unbiased_gradient, and asm:anisotropic_noise with $\mathbf{L}_1 = 0$, for the sequence $\{\mathbf{w}_t\}_{t=1}^T$ generated by AdaGrad (Algorithm alg:adagrad with option I) with constant step size $\eta_t \equiv \eta = D_\infty$, it holds that for ...

Figures (2)

  • Figure 1: Verification of Assumption asm:generalized_smooth in GPT-2 on the Alpaca dataset. x-axis: $|\partial_j f(\mathbf{w})|$, y-axis: $|\partial_j f(\mathbf{w}) - \partial_j f(\mathbf{w}')| / |\mathbf{w} - \mathbf{w}'|_j$, where the color represents the iteration index of $\mathbf{w}$. We run Adam with full gradients for 100 steps and randomly select nearby points $\mathbf{w}$ and $\mathbf{w}'$ along the trajectory to plot the scatter points.
  • Figure 2: Training loss curves of SGD and AdaGrad for instruction-following tasks on Alpaca with GPT-2. Left: batch size 256. Right: batch size 512.
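
The two quantities plotted in Figure 1 can be sketched for any differentiable function as follows. This is a hedged illustration, not the authors' measurement code: it assumes that $|\mathbf{w} - \mathbf{w}'|_j$ denotes the absolute difference in coordinate $j$, and it uses a toy quadratic in place of the GPT-2 loss. The function name `coordinate_smoothness_ratio` and the `tiny` guard are hypothetical.

```python
import numpy as np


def coordinate_smoothness_ratio(grad_fn, w, w_prime, tiny=1e-12):
    """Return (x, y) per coordinate for a Figure-1-style scatter plot.

    x_j = |d_j f(w)|
    y_j = |d_j f(w) - d_j f(w')| / |w_j - w'_j|
    Reading |w - w'|_j as |w_j - w'_j| is an assumption made here.
    """
    g = grad_fn(w)
    g_prime = grad_fn(w_prime)
    x = np.abs(g)
    y = np.abs(g - g_prime) / (np.abs(w - w_prime) + tiny)  # tiny avoids 0/0
    return x, y


if __name__ == "__main__":
    # Toy stand-in for the loss: f(w) = 0.5 * sum_j h_j * w_j**2, gradient h * w.
    h = np.array([100.0, 1.0, 0.01])
    rng = np.random.default_rng(0)
    w = rng.normal(size=3)
    w_prime = w + 1e-3 * rng.normal(size=3)  # a nearby point
    x, y = coordinate_smoothness_ratio(lambda u: h * u, w, w_prime)
    print("gradient magnitudes:", x)
    print("per-coordinate ratios:", y)
```

For the toy quadratic the ratio recovers each coordinate's curvature $h_j$, so it does not depend on the gradient magnitude; on a real model, the figure examines how this ratio varies with $|\partial_j f(\mathbf{w})|$.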

Theorems & Definitions (40)

  • Theorem 4.1: Convex convergence of AdaGrad
  • Remark 4.2
  • Theorem 4.3: Nonconvex convergence of AdaGrad
  • Remark 5.2
  • Theorem 5.3: Convergence of AdaGrad with generalized smoothness
  • Remark 5.4
  • Example B.1
  • Lemma C.1: Projection
  • proof
  • Lemma C.2: Variance reduced by batch size
  • ...and 30 more