Adam Simplified: Bias Correction Debunked

Sam Laing, Antonio Orvieto

TL;DR

The paper questions the necessity of bias correction in Adam by conducting controlled ablations across language and vision tasks. It demonstrates that bias correction acts as an implicit learning-rate schedule, captured by $\rho(t;\beta_1,\beta_2) = \frac{\sqrt{1-\beta_2^{t}}}{1-\beta_1^{t}}$, and its effects depend on the scheduling regime. Under LM-optimal settings with $\beta_1=\beta_2=0.95$ and with appropriate LR schedules, Adam achieves equivalent final performance with or without bias correction; without scheduling, bias correction can hurt. The study recommends removing bias correction from practice and theory in favor of explicit LR scheduling for simplicity and interpretability.

Abstract

The Adam optimizer is a cornerstone of modern deep learning, yet the empirical necessity of each of its individual components is often taken for granted. This paper presents a focused investigation into the role of bias-correction, a feature whose contribution remains poorly understood. Through a series of systematic ablations on vision and language modelling tasks, we demonstrate that the conventional wisdom surrounding bias correction is misleading. In particular, we demonstrate that in the optimal hyper-parameter configuration, the inclusion of bias correction leads to no improvement in final test performance. Moreover, unless appropriate learning rate scheduling is implemented, the inclusion of bias correction can sometimes be detrimental to performance. We further reinterpret bias correction as a form of implicit learning rate scheduling whose behaviour is strongly dependent on the choice of smoothing hyper-parameters $\beta_1, \beta_2 \in [0,1)$. Our findings challenge the universal inclusion of this component.
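
The reinterpretation of bias correction as an implicit learning-rate schedule can be checked numerically. The sketch below is not the authors' code: it is a minimal scalar Adam loop, assuming $\epsilon = 0$ and no weight decay, with hypothetical helper names (`rho`, `adam_updates`) and a made-up gradient sequence. Under those assumptions, dividing the moments by $1-\beta_1^{t}$ and $1-\beta_2^{t}$ should give the same update as leaving them uncorrected and multiplying the learning rate by $\rho(t;\beta_1,\beta_2)$.

```python
import math


def rho(t, beta1, beta2):
    """Bias-correction factor rho(t; beta1, beta2) for step t >= 1."""
    return math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)


def adam_updates(grads, lr, beta1, beta2, bias_correction, lr_scale=None):
    """Scalar Adam with eps = 0 (an assumption); returns the update at each step.

    lr_scale is an optional function of the step t that multiplies lr
    when bias correction is switched off.
    """
    m = v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g        # first-moment EMA
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment EMA
        if bias_correction:
            m_hat = m / (1.0 - beta1 ** t)
            v_hat = v / (1.0 - beta2 ** t)
            step = lr * m_hat / math.sqrt(v_hat)
        else:
            scale = lr_scale(t) if lr_scale is not None else 1.0
            step = lr * scale * m / math.sqrt(v)
        updates.append(step)
    return updates


grads = [0.5, -1.0, 2.0, 0.3, -0.7]  # made-up gradients, nonzero at step 1
b1, b2 = 0.9, 0.999
corrected = adam_updates(grads, 1e-3, b1, b2, bias_correction=True)
rescaled = adam_updates(grads, 1e-3, b1, b2, bias_correction=False,
                        lr_scale=lambda t: rho(t, b1, b2))
print(all(math.isclose(x, y) for x, y in zip(corrected, rescaled)))
```

With a nonzero $\epsilon$ the two variants differ slightly, because $\epsilon$ is added after the second-moment correction; the sketch sets it to zero to isolate the learning-rate effect.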

Paper Structure

This paper contains 14 sections, 4 equations, 7 figures.

Figures (7)

  • Figure 1: Sensitivity to learning rate for AdamW with and without bias correction (orange and blue respectively). The plots show final validation perplexity (y-axis) across a range of learning rates (x-axis, log-scale). Results are averaged over $3$ random seeds. With warm-up cosine scheduling, removing bias correction increases sensitivity for default hyperparameters $(\beta_1,\beta_2)=(0.9,0.999)$ but with identical optimal performance. For the LM-optimal setting $(\beta_1, \beta_2) = (0.95, 0.95)$, performance is identical. With a fixed learning rate, the inclusion of bias correction has a more pronounced effect. In the default torch setting $(0.9, 0.999)$, excluding bias correction has a detrimental effect whereas for the LM-optimal setting $(0.95, 0.95)$, bias correction slightly degrades optimal performance.
  • Figure 2: Comparison of the effective learning rate when bias correction is applied for $(\beta_1, \beta_2) = (0.9, 0.999)$ (green) and $(\beta_1, \beta_2) = (0.95, 0.95)$ (red) under both warm-up cosine scheduling (left) and a constant learning rate (right). With warm-up cosine scheduling, the bias correction factor is effectively absorbed for the LM-optimal setting $(0.95, 0.95)$ (the true warm-up cosine schedule is indistinguishable from the red curve), whereas for the default setting $(0.9, 0.999)$ it substantially modifies the effective learning rate, lowering the peak value. Without scheduling, the torch default configuration exhibits a very gradual warm-up effect on the effective learning rate, while the LM-optimal setting produces an initial spike that quickly decays to the nominal learning rate.
  • Figure 3: ResNet9 on CIFAR-10.
  • Figure 4: ResNet50 on Tiny ImageNet.
  • Figure 5: ViT on Tiny ImageNet.
  • ...and 2 more figures
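
The qualitative behaviour described in the Figure 2 caption can be reproduced with a short calculation of the effective learning rate, i.e. the nominal rate multiplied by $\rho(t;\beta_1,\beta_2)$. This is a sketch only: the step count, warm-up length, peak learning rate, and the exact shape of the warm-up cosine schedule (linear warm-up, cosine decay to zero) are illustrative assumptions, not values taken from the paper, and the function names are hypothetical.

```python
import math


def rho(t, beta1, beta2):
    """Bias-correction factor rho(t; beta1, beta2) for step t >= 1."""
    return math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)


def warmup_cosine(t, total_steps, warmup_steps, peak_lr):
    """Linear warm-up to peak_lr, then cosine decay to zero (assumed shape)."""
    if t <= warmup_steps:
        return peak_lr * t / warmup_steps
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Illustrative values, not the paper's experimental configuration.
STEPS, WARMUP, PEAK = 1000, 100, 1e-3
SETTINGS = {"torch default": (0.9, 0.999), "LM-optimal": (0.95, 0.95)}

for name, (b1, b2) in SETTINGS.items():
    # Constant nominal learning rate: effective rate relative to nominal.
    constant = {t: rho(t, b1, b2) for t in (1, 10, 100, 1000)}
    # Warm-up cosine: highest effective rate reached, relative to the nominal peak.
    scheduled_peak = max(
        warmup_cosine(t, STEPS, WARMUP, PEAK) * rho(t, b1, b2)
        for t in range(1, STEPS + 1)
    ) / PEAK
    print(name)
    print("  constant LR, effective/nominal:",
          {t: round(r, 3) for t, r in constant.items()})
    print("  warm-up cosine, effective peak/nominal peak:",
          round(scheduled_peak, 3))
```

For $(0.9, 0.999)$ the factor starts well below 1 and approaches it slowly, which acts like an extra warm-up and lowers the effective peak under the scheduled run. For $(0.95, 0.95)$ the factor simplifies to $1/\sqrt{1-0.95^{t}}$, which starts above 1 and decays quickly, so a warm-up phase of this length largely absorbs it.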