Implicit Bias of Per-sample Adam on Separable Data: Departure from the Full-batch Regime

Beomhan Baek, Minhak Song, Chulhee Yun

TL;DR

This work reveals that the implicit bias of Adam is not universal but depends critically on batching and data structure. By analyzing incremental Adam (Inc-Adam) and contrasting it with full-batch Adam, the authors show that mini-batch updates can shift the limiting classifier away from the $\ell_\infty$-max-margin toward the $\ell_2$-max-margin on certain structured datasets, a phenomenon not present in the full-batch setting. They introduce AdamProxy as a data-dependent dual-optimization framework to characterize limiting directions via a fixed-point on the probability simplex, and demonstrate with GR, Gaussian, and shifted-diagonal data that limit directions are data-driven and algorithm-dependent. Signum, in contrast, preserves the $\ell_\infty$-max-margin bias under mini-batch regimes when momentum is near $1$, highlighting a robust geometric property that Adam may lose in stochastic training. Overall, the paper uncovers a nuanced picture of implicit bias in Adam-like optimizers, showing the interplay between batching, momentum, and data geometry, and it lays groundwork for a duality-based analysis of limiting predictors in more general settings.
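
For reference, the two max-margin directions contrasted above are usually defined as follows. This is the standard formulation, written under the assumption of a linearly separable dataset $\{({\mathbf x}_i, y_i)\}_{i=1}^N$ with labels $y_i \in \{\pm 1\}$; the symbols $\hat{\mathbf w}_p$, $y_i$, and $N$ are illustrative, and the paper's own notation and normalization may differ.

$$
\hat{\mathbf w}_{p} \in \arg\max_{\|{\mathbf w}\|_p \le 1} \; \min_{1 \le i \le N} \; y_i \langle {\mathbf w}, {\mathbf x}_i \rangle, \qquad p \in \{2, \infty\}.
$$

The constraint set is a Euclidean ball for $p=2$ and a hypercube for $p=\infty$, so the two maximizers generally point in different directions. Cosine similarity to each of them can therefore tell the two biases apart.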

Abstract

Adam [Kingma and Ba, 2015] is the de facto optimizer in deep learning, yet its theoretical understanding remains limited. Prior analyses show that Adam favors solutions aligned with $\ell_\infty$-geometry, but these results are restricted to the full-batch regime. In this work, we study the implicit bias of incremental Adam (using one sample per step) for logistic regression on linearly separable data, and we show that its bias can deviate from the full-batch behavior. To illustrate this, we construct a class of structured datasets where incremental Adam provably converges to the $\ell_2$-max-margin classifier, in contrast to the $\ell_\infty$-max-margin bias of full-batch Adam. For general datasets, we develop a proxy algorithm that captures the limiting behavior of incremental Adam as $\beta_2 \to 1$ and we characterize its convergence direction via a data-dependent dual fixed-point formulation. Finally, we prove that, unlike Adam, Signum [Bernstein et al., 2018] converges to the $\ell_\infty$-max-margin classifier for any batch size by taking $\beta$ close enough to 1. Overall, our results highlight that the implicit bias of Adam crucially depends on both the batching scheme and the dataset, while Signum remains invariant.
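
The difference between the two batching schemes can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: it assumes the textbook Adam update with bias correction and a small `eps`, a constant step size, labels folded into the data points, and a cyclic sample order for the per-sample variant. The function names and default hyperparameters are hypothetical.

```python
import math

def logistic_grad(w, x):
    """Gradient of log(1 + exp(-<w, x>)) in w; the label is assumed folded into x."""
    margin = sum(wi * xi for wi, xi in zip(w, x))
    # Numerically stable evaluation of -1 / (1 + exp(margin)).
    if margin >= 0:
        e = math.exp(-margin)
        coeff = -e / (1.0 + e)
    else:
        coeff = -1.0 / (1.0 + math.exp(margin))
    return [coeff * xi for xi in x]

def adam(data, steps, lr=0.01, beta1=0.9, beta2=0.95, eps=1e-12, per_sample=True):
    """Adam on the logistic loss, started from w = 0.

    per_sample=True  : one sample per step, visited cyclically (assumed order).
    per_sample=False : gradient of the average loss over all samples.
    """
    n, d = len(data), len(data[0])
    w, m, v = [0.0] * d, [0.0] * d, [0.0] * d
    for t in range(steps):
        if per_sample:
            g = logistic_grad(w, data[t % n])
        else:
            grads = [logistic_grad(w, x) for x in data]
            g = [sum(col) / n for col in zip(*grads)]
        for k in range(d):
            m[k] = beta1 * m[k] + (1 - beta1) * g[k]        # first moment
            v[k] = beta2 * v[k] + (1 - beta2) * g[k] ** 2   # second moment
            m_hat = m[k] / (1 - beta1 ** (t + 1))           # bias correction
            v_hat = v[k] / (1 - beta2 ** (t + 1))
            w[k] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Four-point dataset taken from the Figure 2 caption (labels assumed +1).
gr_data = [(1, 1, 1, 1), (2, 2, 2, -2), (3, 3, -3, -3), (4, -4, 4, -4)]
w_incremental = adam(gr_data, steps=1, per_sample=True)
w_full_batch = adam(gr_data, steps=1, per_sample=False)
```

With bias correction, the very first step of either variant is a sign step on the gradient it sees, so the two runs already diverge after one update: the per-sample run follows the sign pattern of the first sample, the full-batch run that of the averaged gradient. Longer runs are needed to observe the limiting directions discussed in the paper.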

Paper Structure

This paper contains 53 sections, 40 theorems, 160 equations, 11 figures, 4 algorithms.

Key Result

Proposition 2.3

Let $\{{\mathbf w}_t\}_{t=0}^\infty$ be the iterates of Det-Adam with $\beta_1 \leq \beta_2$. Then, under Assumptions ass:nonzero and ass:lr, if $\lim_{t\rightarrow\infty}\frac{\eta_t^{1/2}\mathcal{L}({\mathbf w}_t)}{|\nabla\mathcal{L}({\mathbf w}_t)[k]|}=0$, then the update of the $k$-th coordinate ${\mathbf w}_{t+1}[k]$ … for some $\epsilon_t$ with $\lim_{t\rightarrow\infty} \epsilon_t=0$.

Figures (11)

  • Figure 1: Mini-batch Adam loses the $\ell_\infty$-max-margin bias of full-batch Adam. Cosine similarity between the weight vector and the $\ell_2$-max-margin (left) and $\ell_\infty$-max-margin (right) solutions in a linear classification task on $10$ data points drawn from the $50$-dimensional standard Gaussian. Full-batch Adam with $(\beta_1, \beta_2)=(0.9, 0.95)$ converges to the $\ell_\infty$-max-margin solution, whereas mini-batch variants with batch size $1$ converge closer to the $\ell_2$-max-margin direction. See the paper's appendix for experimental details.
  • Figure 2: Mini-batch Adam converges to the $\ell_2$-max-margin solution on the GR dataset. We train on the dataset ${\mathbf x}_0=(1,1,1,1)$, ${\mathbf x}_1 = (2,2,2,-2)$, ${\mathbf x}_2 = (3,3,-3,-3)$, and ${\mathbf x}_3 = (4,-4,4,-4)$. Variants of mini-batch Adam with batch size $1$ consistently converge to the $\ell_2$-max-margin direction, while full-batch Adam converges to the $\ell_\infty$-max-margin direction.
  • Figure 3: Mini-batch Adam converges to the fixed-point solution on Gaussian data. We train on the same Gaussian data as in Figure 1 and plot the cosine similarity of the weight vector with the $\ell_2$-max-margin solution (left) and the fixed-point solution (right). The results show that variants of mini-batch Adam with batch size $1$ converge to the fixed-point solution obtained by the paper's fixed-point algorithm, consistent with our theoretical prediction (the fixed-point theorem).
  • Figure 4: Mini-batch Adam converges to the $\ell_\infty$-max-margin solution on a shifted-diagonal dataset. We train on the dataset ${\mathbf x}_0=(1,\delta,\delta,\delta)$, ${\mathbf x}_1=(\delta,2,\delta,\delta)$, ${\mathbf x}_2=(\delta,\delta,4,\delta)$, and ${\mathbf x}_3=(\delta,\delta,\delta,8)$ with $\delta=0.1$. Variants of mini-batch Adam with batch size $1$ converge to the $\ell_\infty$-max-margin direction.
  • Figure 5: Mini-batch Signum converges to the $\ell_\infty$-max-margin solution. We train on the same Gaussian data ($N=10$, $d=50$) as in Figure 1, using full-batch Signum and incremental Signum with $\beta=0.99$, for batch sizes $b\in \{5,2,1\}$. Across all batch sizes, incremental Signum consistently converges to the $\ell_\infty$-max-margin solution, in sharp contrast to incremental Adam.
  • ...and 6 more figures
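
The cosine-similarity metric used in these captions, and the Signum update they refer to, can be sketched as follows. This is a hedged illustration rather than the authors' experimental code: it assumes the usual Signum rule (sign of an exponential moving average of gradients), a constant step size, labels folded into the data, and cyclic mini-batches. All function names and defaults are hypothetical.

```python
import math

def logistic_grad(w, x):
    """Gradient of log(1 + exp(-<w, x>)) in w; the label is assumed folded into x."""
    margin = sum(wi * xi for wi, xi in zip(w, x))
    if margin >= 0:
        e = math.exp(-margin)
        coeff = -e / (1.0 + e)
    else:
        coeff = -1.0 / (1.0 + math.exp(margin))
    return [coeff * xi for xi in x]

def sign(z):
    """Return -1, 0, or +1."""
    return (z > 0) - (z < 0)

def signum(data, steps, batch_size=1, lr=0.01, beta=0.99):
    """Signum with cyclic mini-batches (assumed batching scheme), started from w = 0."""
    n, d = len(data), len(data[0])
    w, m = [0.0] * d, [0.0] * d
    for t in range(steps):
        batch = [data[(t * batch_size + j) % n] for j in range(batch_size)]
        grads = [logistic_grad(w, x) for x in batch]
        g = [sum(col) / batch_size for col in zip(*grads)]
        for k in range(d):
            m[k] = beta * m[k] + (1 - beta) * g[k]  # momentum buffer
            w[k] -= lr * sign(m[k])                 # step uses only the sign
    return w

def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Shifted-diagonal dataset from the Figure 4 caption, with delta = 0.1.
delta = 0.1
shifted_diag = [
    (1, delta, delta, delta),
    (delta, 2, delta, delta),
    (delta, delta, 4, delta),
    (delta, delta, delta, 8),
]
linf_direction = [1.0, 1.0, 1.0, 1.0]
similarities = {
    b: cosine_similarity(signum(shifted_diag, steps=200, batch_size=b), linf_direction)
    for b in (1, 2, 4)
}
```

On this particular dataset every entry of every point is positive, so each per-sample gradient is negative in every coordinate and the sign step always moves along $(1,1,1,1)$. For the same reason, $(1,1,1,1)$ maximizes every margin over the unit $\ell_\infty$ ball, so it is the $\ell_\infty$-max-margin direction here. The example is therefore deliberately simple; the Gaussian data of Figure 5 is the non-trivial case.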

Theorems & Definitions (70)

  • Proposition 2.3
  • Proposition 2.3
  • Definition 3.1
  • Corollary 3.1
  • Theorem 3.2
  • Proposition 4.0
  • Definition 4.1
  • Proposition 4.1: Loss convergence
  • Lemma 4.2
  • Definition 4.3
  • ...and 60 more