On the Implicit Bias of Adam
Matias D. Cattaneo, Jason M. Klusowski, Boris Shigida
TL;DR
This paper uses backward error analysis to derive an ODE that approximates Adam and RMSProp, in both mini-batch and full-batch form, globally to second order in the step size $h$. The ODE shows that Adam typically anti-penalizes the perturbed one-norm of the loss gradient, $\|\nabla E(\boldsymbol{\theta})\|_{1,\varepsilon}$, when $\sqrt{\varepsilon}$ is small and the squared-gradient decay parameter $\rho$ exceeds the momentum parameter $\beta$. This is implicit anti-regularization, and it can worsen generalization. Other hyperparameter regimes recover GD-like regularization. Numerical experiments on vision architectures (ResNets, CNNs, ViTs) and standard datasets corroborate the predicted anti-regularization and link it to generalization performance. Overall, the work provides a principled framework for understanding the implicit bias of adaptive optimizers and motivates further study of their generalization behavior across architectures.
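To make the notation concrete, the following is a sketch of what the perturbed one-norm looks like and why the size of $\varepsilon$ matters. The definition below is an assumption, since this summary does not state it: we take $\|\mathbf{v}\|_{1,\varepsilon}$ to be the sum of $\sqrt{v_i^2 + \varepsilon}$ over the $p$ components of $\mathbf{v}$. The two limiting regimes follow from that assumed definition by elementary expansion and are not quoted from the paper.

```latex
\begin{align*}
% Assumed definition of the perturbed one-norm for v in R^p
\|\mathbf{v}\|_{1,\varepsilon} &:= \sum_{i=1}^{p} \sqrt{v_i^2 + \varepsilon} \\[4pt]
% Regime sqrt(eps) << |v_i| for all i: the ordinary one-norm
\|\mathbf{v}\|_{1,\varepsilon} &\approx \sum_{i=1}^{p} |v_i| = \|\mathbf{v}\|_1 \\[4pt]
% Regime sqrt(eps) >> |v_i| for all i: first-order Taylor expansion
\sqrt{v_i^2 + \varepsilon}
  &= \sqrt{\varepsilon}\,\sqrt{1 + v_i^2/\varepsilon}
  \approx \sqrt{\varepsilon} + \frac{v_i^2}{2\sqrt{\varepsilon}} \\[4pt]
\|\mathbf{v}\|_{1,\varepsilon}
  &\approx p\sqrt{\varepsilon} + \frac{\|\mathbf{v}\|_2^2}{2\sqrt{\varepsilon}}
\end{align*}
```

Under this reading, a large $\varepsilon$ turns the perturbed one-norm into a constant plus a scaled squared two-norm, which is consistent with the statement that some hyperparameter regimes recover GD-like regularization.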
Abstract
In previous literature, backward error analysis was used to find ordinary differential equations (ODEs) approximating the gradient descent trajectory. It was found that finite step sizes implicitly regularize solutions because terms appearing in the ODEs penalize the two-norm of the loss gradients. We prove that the existence of similar implicit regularization in RMSProp and Adam depends on their hyperparameters and the training stage, but with a different "norm" involved: the corresponding ODE terms either penalize the (perturbed) one-norm of the loss gradients or, conversely, impede its reduction (the latter case being typical). We also conduct numerical experiments and discuss how the proven facts can influence generalization.
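The two quantities contrasted in the abstract can be sketched numerically. The example below rests on two assumptions that the abstract does not state: the gradient-descent modified loss is taken in the form commonly cited from the earlier implicit-regularization literature, $E(\boldsymbol{\theta}) + \frac{h}{4}\|\nabla E(\boldsymbol{\theta})\|_2^2$, and the perturbed one-norm is taken to be $\sum_i \sqrt{v_i^2 + \varepsilon}$. The function names and the toy quadratic loss are hypothetical and chosen only for illustration.

```python
import numpy as np


def perturbed_one_norm(v, eps):
    """Assumed definition: sum_i sqrt(v_i^2 + eps)."""
    v = np.asarray(v, dtype=float)
    return float(np.sum(np.sqrt(v ** 2 + eps)))


def gd_modified_loss(loss, grad, theta, h):
    """Assumed GD modified loss: E(theta) + (h / 4) * ||grad E(theta)||_2^2."""
    g = grad(theta)
    return float(loss(theta) + 0.25 * h * np.dot(g, g))


# Hypothetical toy problem: E(theta) = 0.5 * theta^T A theta.
A = np.diag([1.0, 10.0])


def loss(theta):
    return 0.5 * theta @ A @ theta


def grad(theta):
    return A @ theta


theta = np.array([1.0, -0.5])
g = grad(theta)  # gradient at theta

# Two-norm penalty that finite-step GD implicitly adds to the loss.
modified = gd_modified_loss(loss, grad, theta, h=0.1)

# Perturbed one-norm of the same gradient for a small and a large eps.
p_small = perturbed_one_norm(g, eps=1e-8)
p_large = perturbed_one_norm(g, eps=1.0)
plain_one_norm = float(np.sum(np.abs(g)))

print("GD modified loss:", modified)
print("one-norm:", plain_one_norm)
print("perturbed one-norm (eps=1e-8):", p_small)
print("perturbed one-norm (eps=1.0):", p_large)
```

With a small `eps` the perturbed one-norm is numerically indistinguishable from the ordinary one-norm of the gradient, while a larger `eps` inflates it. Whether Adam's ODE penalizes or anti-penalizes this quantity depends on the hyperparameters, as the abstract states. This sketch only evaluates the quantities and does not simulate that effect.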
