The Rich and the Simple: On the Implicit Bias of Adam and SGD

Bhavya Vasudeva, Jung Whan Lee, Vatsal Sharan, Mahdi Soltanolkotabi

TL;DR

The paper investigates how the implicit bias of Adam differs from that of SGD when training two-layer ReLU networks on Gaussian mixture data. By deriving population gradients and analyzing both gradient flow and Adam-like updates, it shows that SGD tends toward simple, linear decision boundaries, while Adam learns nonlinear boundaries closer to the Bayes' optimal predictor and so generalizes better under certain distribution shifts. Experiments on synthetic data, MNIST-based spurious-feature tasks, and subgroup-robustness benchmarks corroborate that Adam's richer feature learning improves worst-group accuracy and core-feature decoding, which suggests a practical advantage when data contain spurious correlations. The work offers a principled contrast between the two optimizers, informs expectations about generalization in the presence of spurious features, and points to future work on broader architectures and regularization effects.

Abstract

Adam is the de facto optimization algorithm for several deep learning applications, but an understanding of its implicit bias and how it differs from other algorithms, particularly standard first-order methods such as (stochastic) gradient descent (GD), remains limited. In practice, neural networks (NNs) trained with SGD are known to exhibit simplicity bias -- a tendency to find simple solutions. In contrast, we show that Adam is more resistant to such simplicity bias. First, we investigate the differences in the implicit biases of Adam and GD when training two-layer ReLU NNs on a binary classification task with Gaussian data. We find that GD exhibits a simplicity bias, resulting in a linear decision boundary with a suboptimal margin, whereas Adam leads to much richer and more diverse features, producing a nonlinear boundary that is closer to the Bayes' optimal predictor. This richer decision boundary also allows Adam to achieve higher test accuracy both in-distribution and under certain distribution shifts. We theoretically prove these results by analyzing the population gradients. Next, to corroborate our theoretical findings, we present extensive empirical results showing that this property of Adam leads to superior generalization across various datasets with spurious correlations where NNs trained with SGD are known to show simplicity bias and do not generalize well under certain distributional shifts.
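For orientation, the two update rules contrasted in the abstract can be written side by side. This is the textbook form of GD and Adam, not necessarily the exact variant analyzed in the paper (the summary only mentions "Adam-like updates"). The symbols $\theta_t$ (parameters), $g_t$ (gradient at step $t$), $\eta$ (learning rate), and $\epsilon$ (stability constant) are notation assumed here; $\beta_1$ and $\beta_2$ match the figure captions. Squares, square roots, and divisions act coordinate-wise.

```latex
% Standard textbook updates; notation (\theta_t, g_t, \eta, \epsilon) is assumed,
% and the paper's analyzed variant may differ (e.g. in \epsilon or bias correction).
\begin{align*}
\text{GD:}\quad   & \theta_{t+1} = \theta_t - \eta\, g_t, \\
\text{Adam:}\quad & m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
                    v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}, \\
                  & \hat m_t = \frac{m_t}{1-\beta_1^{t}}, \qquad
                    \hat v_t = \frac{v_t}{1-\beta_2^{t}}, \\
                  & \theta_{t+1} = \theta_t - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}.
\end{align*}
```

The key structural difference is the coordinate-wise normalization by $\sqrt{\hat v_t}$: GD steps are proportional to the gradient magnitude, whereas Adam's steps are rescaled per coordinate.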

Paper Structure

This paper contains 44 sections, 12 theorems, 61 equations, 12 figures, 17 tables.

Key Result

Proposition 1

The optimal predictor for the data in (\ref{eq:data}) with $d=2$ is:

Figures (12)

  • Figure 1: Illustration of the synthetic dataset considered in this work, and comparison of the Bayes' optimal predictor with the decision boundaries of two-layer NNs trained with Adam and GD.
  • Figure 2: Comparison of test accuracy and agreement with a linear model for a two-layer NN trained on MNIST with spurious correlation.
  • Figure 3: Evolution of the decision boundary (top row) and the neurons (bottom row) over time, for GD (left) and Adam (right) with learning rates $0.1$ and $10^{-4}$, respectively, over $20\,000$ epochs of training a width-$100$ NN with small initialization on the Gaussian data setting (\ref{eq:data}) with $\mu=0.3,\omega=2,\sigma=0.1$. Training uses population gradients (the samples are plotted for illustration only), and neurons are colored by the quadrant in which they were initialized. GD leads to a linear decision boundary, with neurons mostly aligned with the directions $[\pm1,0]^\top$. Adam (with $\beta_1=\beta_2=0.9999$) leads to a nonlinear decision boundary that is closer to the Bayes' optimal predictor, with neurons aligned with three main directions $[-1,0]^\top, [1,1]^\top, [1,-1]^\top$.
  • Figure 4: Comparison of the Bayes' optimal predictor and the predictors learned by two-layer NNs trained with GD, Adam ($\beta_1=\beta_2\approx 0$) or signGD, and Adam with $\beta_1=\beta_2\approx 1$ on the toy dataset (Gaussian dataset with $\sigma\rightarrow 0$).
  • Figure 5: Distribution of margins of training-set samples from the MNIST dataset with spurious correlation for a two-layer NN trained using SGD (left) and Adam (right).
  • ...and 7 more figures
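
The Figure 4 caption groups Adam with $\beta_1=\beta_2\approx 0$ together with signGD. A minimal NumPy sketch of why, assuming the standard bias-corrected Adam update (the paper's exact variant is not given in this summary): with both decay rates at zero, each step reduces to $-\eta\, g/(|g|+\epsilon)$, which is a sign step up to $\epsilon$. The function names, the toy gradient sequence, and the step size below are all hypothetical and chosen only for illustration.

```python
import numpy as np

def adam_updates(grads, lr, beta1, beta2, eps=1e-8):
    """Parameter increments produced by standard bias-corrected Adam."""
    m = np.zeros_like(grads[0])
    v = np.zeros_like(grads[0])
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g       # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g**2    # second-moment estimate
        m_hat = m / (1.0 - beta1**t)            # bias correction
        v_hat = v / (1.0 - beta2**t)
        updates.append(-lr * m_hat / (np.sqrt(v_hat) + eps))
    return np.stack(updates)

def sign_gd_updates(grads, lr):
    """Parameter increments produced by sign gradient descent."""
    return np.stack([-lr * np.sign(g) for g in grads])

# Hypothetical gradient sequence (not taken from the paper).
grads = [np.array([0.5, -2.0]), np.array([-0.1, -1.0]), np.array([0.3, 4.0])]
lr = 0.1

memoryless = adam_updates(grads, lr, beta1=0.0, beta2=0.0)
long_memory = adam_updates(grads, lr, beta1=0.9999, beta2=0.9999)
sign_steps = sign_gd_updates(grads, lr)

matches_sign = np.allclose(memoryless, sign_steps, atol=1e-6)
differs_from_sign = not np.allclose(long_memory, sign_steps, atol=1e-6)
print(matches_sign, differs_from_sign)
```

With decay rates near 1, the moment estimates average over past gradients, so the step no longer depends only on the sign of the current gradient. That is the regime the caption lists separately.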

Theorems & Definitions (19)

  • Proposition 1: Bayes' Optimal Predictor
  • Proposition 2: Population Gradient
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • proof
  • proof
  • Theorem 5
  • proof
  • ...and 9 more