The Implicit Bias of Adam on Separable Data

Chenyang Zhang; Difan Zou; Yuan Cao

TL;DR

This work analyzes the implicit bias of Adam in linear binary classification with linearly separable data. It proves that when the stability constant $\epsilon$ is neglected, Adam converges to a linear classifier that maximizes the $\ell_\infty$-margin, with polynomial-time convergence for a broad class of decaying learning rates, distinguishing it from gradient-descent dynamics that maximize the $\ell_2$-margin. Theoretical results are complemented by experiments showing distinct margin trajectories for Adam versus GD/GDM and corroborating the polynomial-rate behavior for sublinear learning-rate schedules. The findings deepen the theoretical understanding of adaptive optimizers and suggest directions for extending these insights to homogeneous networks and stochastic settings.

Abstract

Adam has become one of the most favored optimizers in deep learning. Despite its success in practice, numerous mysteries persist regarding its theoretical understanding. In this paper, we study the implicit bias of Adam in linear logistic regression. Specifically, we show that when the training data are linearly separable, Adam converges towards a linear classifier that achieves the maximum $\ell_\infty$-margin. Notably, for a general class of diminishing learning rates, this convergence occurs within polynomial time. Our results shed light on the difference between Adam and (stochastic) gradient descent from a theoretical perspective.
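The contrast between the two margins can be illustrated on a toy problem. The following is a minimal sketch, not the authors' experimental setup: the two-point data set `Z` (rows are $y_i\cdot\mathbf{x}_i$), the schedule $\eta_t = 0.1/\sqrt{t+1}$, the values of `beta1` and `beta2`, and the Adam variant without bias correction and with $\epsilon = 0$ are all assumptions made for this example. For this `Z`, the maximum $\ell_\infty$-margin is $1.2$, attained at $\mathbf{w} = (1, 1)$.

```python
import numpy as np

def normalized_linf_margin(w, Z):
    """min_i <w, z_i> / ||w||_inf, where each row z_i of Z is y_i * x_i."""
    return float(np.min(Z @ w) / np.max(np.abs(w)))

def logistic_grad(w, Z):
    """Gradient of the mean logistic loss (1/n) * sum_i log(1 + exp(-<w, z_i>))."""
    coeff = 1.0 / (1.0 + np.exp(Z @ w))  # equals sigmoid(-<w, z_i>)
    return -(Z * coeff[:, None]).mean(axis=0)

def run_adam_no_eps(Z, steps=2000, eta0=0.1, beta1=0.9, beta2=0.999):
    """Adam with the stability constant dropped (eps = 0), no bias correction (assumed variant)."""
    w = np.zeros(Z.shape[1])
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(steps):
        g = logistic_grad(w, Z)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        w = w - eta0 / np.sqrt(t + 1) * m / np.sqrt(v)
    return w

def run_gd(Z, steps=2000, eta0=0.1):
    """Plain gradient descent with the same decaying step size."""
    w = np.zeros(Z.shape[1])
    for t in range(steps):
        w = w - eta0 / np.sqrt(t + 1) * logistic_grad(w, Z)
    return w

# Hypothetical toy data: z_1 = (1, 0.2), z_2 = (1, 0.5); max l_inf-margin gamma = 1.2.
Z = np.array([[1.0, 0.2], [1.0, 0.5]])
gamma = 1.2
adam_margin = normalized_linf_margin(run_adam_no_eps(Z), Z)
gd_margin = normalized_linf_margin(run_gd(Z), Z)
print(f"Adam: {adam_margin:.4f}  GD: {gd_margin:.4f}  gamma: {gamma}")
```

On this data set every gradient-descent iterate is a positive combination of $(1, 0.2)$ and $(1, 0.5)$, so its normalized $\ell_\infty$-margin cannot exceed $1.1$, whereas the coordinate-wise normalization in Adam pushes the iterate towards the direction $(1, 1)$ and a margin closer to $\gamma$.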

Paper Structure (20 sections, 20 theorems, 82 equations, 2 figures)

Introduction
Additional Related Work
Problem Settings
Main Results
Experiments
Proof Sketch for Theorem \ref{thm:convergence_rate}
Conclusion and Future Work
Proof in Section \ref{section:proof_overview}
Proof of Lemma \ref{lemma:inf_norm_upper_boundI}
Proof of Lemma \ref{lemma:difference_momentum_gradient}
Proof of Lemma \ref{lemma:smoothness}
Proof of Lemma \ref{lemma:margin_iterates}
Proof of Lemma \ref{lemma:inf_norm_upper_boundII}
Complete Proof for Theorem \ref{thm:convergence_rate} and Calculation Details for Corollary \ref{crlry:margin_rate}
Complete Proof for Theorem \ref{thm:convergence_rate}
...and 5 more sections

Key Result

Theorem 1.1

Let $\{\eta_t\}_{t=0}^\infty$, $\{\mathbf{w}_t\}_{t=0}^\infty$ be the sequences of learning rates and iterates of Adam, respectively. Suppose that the data set $\{(\mathbf{x}_i,y_i)\}_{i=1}^n$ is linearly separable, and that $\lim_{t\rightarrow\infty} \eta_t = 0$, $\sum_{t=0}^\infty \eta_t = \infty$. Then u... where $\gamma := \max_{\|\mathbf{w}\|_{\infty}\leq 1}\min_{i\in[n]}\langle\mathbf{w}, y_i\cdot\mathbf{x}_i\rangle$...
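As a worked check of the two learning-rate conditions in the theorem, consider the polynomially decaying schedule $\eta_t = (t+2)^{-a}$ with $a > 0$. This schedule is an assumption chosen for illustration; the statement above does not single it out. The first condition holds for every $a > 0$, and comparing the partial sums with an integral shows when the second holds:

```latex
\begin{align*}
\eta_t = (t+2)^{-a},\ a > 0
  &\;\Longrightarrow\; \lim_{t\to\infty} \eta_t = 0, \\
\sum_{t=0}^{T-1} (t+2)^{-a} \;\ge\; \int_{2}^{T+2} s^{-a}\,\mathrm{d}s
  &= \begin{cases}
       \dfrac{(T+2)^{1-a} - 2^{1-a}}{1-a}, & 0 < a < 1, \\[2ex]
       \log\dfrac{T+2}{2}, & a = 1.
     \end{cases}
\end{align*}
```

Both right-hand sides diverge as $T \to \infty$, so $\sum_{t=0}^\infty \eta_t = \infty$ whenever $0 < a \le 1$. For $a > 1$ the series converges, so such a schedule falls outside the conditions of the theorem.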

Figures (2)

Figure 1: Normalized $\ell_\infty$-margins and $\ell_2$-margins achieved by GD, GDM, and Adam with/without the stability constant $\epsilon$ during training. (a) gives the results of normalized $\ell_\infty$-margins, while (b) shows the results of normalized $\ell_2$-margins.
Figure 2: Log-log plots of the normalized $\ell_\infty$-margin gaps $|\min_{i\in[n]}\langle\mathbf{w}_{t}, y_i\cdot\mathbf{x}_i\rangle/ \|\mathbf{w}_t\|_\infty - \gamma |$ versus training iterations. (a) presents the results for Adam with the stability constant $\epsilon$, and (b) presents the results for Adam without the stability constant $\epsilon$.
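A log-log plot of this kind is typically read through its slope: a margin gap decaying like $C\,t^{-p}$ appears as a straight line with slope $-p$. The sketch below shows one way to estimate that slope by least squares. The helper name `loglog_slope` and the gap sequence are hypothetical; the gaps are synthetic values generated for this example, not measurements from the paper.

```python
import numpy as np

def loglog_slope(iterations, gaps):
    """Least-squares slope of log(gap) against log(iteration)."""
    slope, _intercept = np.polyfit(np.log(iterations), np.log(gaps), 1)
    return float(slope)

# Synthetic, illustrative gaps only: gap_t = 3 * t^(-1/2).
iterations = np.arange(1, 10001, dtype=float)
synthetic_gaps = 3.0 * iterations ** -0.5

slope = loglog_slope(iterations, synthetic_gaps)
print(f"estimated slope: {slope:.3f}")
```

With real training curves the fit would usually be restricted to the later iterations, since the early phase need not follow the asymptotic rate.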

Theorems & Definitions (21)

Theorem 1.1: Simplified version of Theorem \ref{thm:convergence_rate}
Theorem 4.5
Corollary 4.6
Corollary 4.7
Lemma 6.1
Lemma 6.2
Lemma 6.3
Lemma 6.4
Lemma 6.5: Lemma A.2 in DBLP:conf/iclr/Zou0LG23
Lemma A.1
...and 11 more
