AdaGrad under Anisotropic Smoothness

Yuxing Liu; Rui Pan; Tong Zhang

TL;DR

It is shown that under anisotropic smoothness and noise conditions, AdaGrad can achieve faster convergence guarantees in terms of better dimensional dependence than algorithms with uniform step sizes across all coordinates.

Abstract

Adaptive gradient methods have been widely adopted in training large-scale deep neural networks, especially large foundation models. Despite their huge success in practice, their theoretical advantages over classical gradient methods with uniform step sizes across all coordinates (e.g., SGD) are not fully understood, especially in the large batch-size setting commonly used in practice. This is because the only theoretical result that demonstrates this benefit was obtained in the original AdaGrad paper for convex nonsmooth objective functions, which is insufficient for large-batch algorithms. In this work, we attempt to close this gap between theory and practice by proposing a novel anisotropic generalized smoothness assumption and providing corresponding analyses of AdaGrad. We show that under anisotropic smoothness and noise conditions, AdaGrad can achieve faster convergence guarantees, in terms of better dimensional dependence, than algorithms with uniform step sizes across all coordinates. Experiments on logistic regression and instruction-following fine-tuning tasks provide strong evidence to support our novel assumption and theoretical analysis.

Paper Structure (29 sections, 19 theorems, 119 equations, 2 figures, 4 tables, 3 algorithms)

Introduction
Related Work
Adaptive gradient methods.
Convergence results of SGD.
Convergence results of Adagrad.
Theoretical understanding of adaptive gradient methods:
Large batch training:
Preliminaries
Notations
Problem Settings and Assumptions
AdaGrad with Anisotropic Assumptions
Convex Cases
Nonconvex Cases
AdaGrad with Generalized Anisotropic Smoothness
Experimental Results
...and 14 more sections

Key Result

Theorem 4.1

Under Assumptions asm:convex_set, asm:smooth, asm:unbiased_gradient, asm:anisotropic_noise with $\mathbf{L}_1 = 0$, for the sequence $\{\mathbf{w}_t\}_{t=1}^T$ generated by Adagrad (Algorithm alg:adagrad with option I) with constant step size $\eta_t\equiv\eta = D_\infty$, it holds that for $\Bar{\m
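
As a rough illustration of the kind of iteration the theorem refers to, the sketch below runs diagonal AdaGrad with a constant step size on a toy separable quadratic and compares it with gradient descent using one uniform step size. Everything here is an assumption for illustration and is not taken from the paper: the function names, the toy objective, the choice of a box as the feasible set (for a box, projection in a diagonal weighted norm reduces to per-coordinate clipping), and reading $D_\infty$ as the $\ell_\infty$ diameter of that box. The details of "option I" in the paper's algorithm are not visible in this excerpt, so this is not a reproduction of it.

```python
import math

def adagrad_diag(grad, w0, eta, steps, radius, eps=1e-12):
    """Diagonal AdaGrad with clipping onto the box [-radius, radius]^d (illustrative)."""
    w = list(w0)
    v = [0.0] * len(w)  # per-coordinate running sum of squared gradients
    for _ in range(steps):
        g = grad(w)
        for j in range(len(w)):
            v[j] += g[j] ** 2
            # Per-coordinate step size eta / sqrt(v_j): large where gradients are small.
            w[j] -= eta * g[j] / (math.sqrt(v[j]) + eps)
            # Clipping equals the weighted projection because the set is a box.
            w[j] = max(-radius, min(radius, w[j]))
    return w

def gradient_descent(grad, w0, eta, steps):
    """Plain gradient descent with one step size shared by all coordinates."""
    w = list(w0)
    for _ in range(steps):
        g = grad(w)
        w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w

# Hypothetical anisotropic objective: f(w) = 0.5 * sum_j c_j * w_j^2.
curvatures = [100.0, 1.0, 0.01]

def grad(w):
    return [c * wj for c, wj in zip(curvatures, w)]

radius = 1.0
d_inf = 2.0 * radius  # assumed meaning of D_infty: l_inf diameter of the box
w0 = [1.0, 1.0, 1.0]

w_ada = adagrad_diag(grad, w0, eta=d_inf, steps=50, radius=radius)
# Uniform step size limited by the largest curvature.
w_gd = gradient_descent(grad, w0, eta=1.0 / max(curvatures), steps=50)

print("AdaGrad:", w_ada)
print("GD     :", w_gd)
```

On this toy problem the AdaGrad update is nearly invariant to the per-coordinate curvature, so all three coordinates shrink at about the same rate, while the uniform step size leaves the low-curvature coordinate almost unchanged. The example is deterministic and noise-free, so it says nothing about the stochastic rates in the theorem.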

Figures (2)

Figure 1: Verification of Assumption asm:generalized_smooth in GPT-2 on the Alpaca dataset. x-axis: $|\partial_j f(\mathbf{w})|$; y-axis: $|\partial_j f(\mathbf{w}) - \partial_j f(\mathbf{w}')| / |\mathbf{w} - \mathbf{w}'|_j$; the color represents the iteration index of $\mathbf{w}$. We run Adam with full gradients for 100 steps and randomly select nearby points $\mathbf{w}$ and $\mathbf{w}'$ along the trajectory to plot the scatter points.
Figure 2: Training loss curves of SGD and AdaGrad for instruction-following tasks on Alpaca with GPT-2. Left: batch size 256. Right: batch size 512.
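
The quantities on the axes of Figure 1 can be computed for any pair of nearby points once a gradient function is available. The sketch below is a minimal version under stated assumptions: the helper name `coordinate_smoothness_pairs` is hypothetical, $|\mathbf{w} - \mathbf{w}'|_j$ is read as $|w_j - w'_j|$, and a toy separable quadratic stands in for the GPT-2 loss used in the paper.

```python
def coordinate_smoothness_pairs(grad, w, w_prime, tiny=1e-12):
    """Return (|d_j f(w)|, |d_j f(w) - d_j f(w')| / |w_j - w'_j|) for each coordinate j."""
    g, g_prime = grad(w), grad(w_prime)
    pairs = []
    for j in range(len(w)):
        gap = abs(w[j] - w_prime[j])
        if gap < tiny:
            continue  # skip coordinates where the two points coincide
        pairs.append((abs(g[j]), abs(g[j] - g_prime[j]) / gap))
    return pairs

# Toy stand-in objective (assumption): f(w) = 0.5 * sum_j c_j * w_j^2.
toy_curvatures = [100.0, 1.0, 0.01]

def toy_grad(w):
    return [c * wj for c, wj in zip(toy_curvatures, w)]

w_a = [1.0, -2.0, 0.5]
w_b = [1.1, -1.9, 0.6]
pairs = coordinate_smoothness_pairs(toy_grad, w_a, w_b)
for grad_abs, ratio in pairs:
    print(grad_abs, ratio)
```

For this quadratic the ratio recovers each coordinate's curvature regardless of the gradient magnitude; the scatter plot in the figure examines how the same ratio behaves against $|\partial_j f(\mathbf{w})|$ for a real model.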

Theorems & Definitions (40)

Theorem 4.1: Convex convergence of Adagrad
Remark 4.2
Theorem 4.3: Nonconvex convergence of Adagrad
Remark 5.2
Theorem 5.3: Convergence of Adagrad with generalized smoothness
Remark 5.4
Example B.1
Lemma C.1: Projection
proof
Lemma C.2: Variance reduced by batch size
...and 30 more
