Butterfly Effects of SGD Noise: Error Amplification in Behavior Cloning and Autoregression

Adam Block; Dylan J. Foster; Akshay Krishnamurthy; Max Simchowitz; Cyril Zhang

TL;DR

The paper studies training instabilities in offline deep behavior cloning (BC) and identifies gradient variance amplification (GVA) as the mechanism behind oscillations in long-horizon rollout rewards. It shows that minibatch SGD noise propagates through unstable closed-loop dynamics, causing large reward fluctuations that persist throughout training even though the BC loss is barely affected. A key finding is that an exponential moving average (EMA) of iterates robustly mitigates GVA in both continuous control and autoregressive language generation, often removing the need for learning-rate decay. The paper also provides theoretical vignettes that illustrate EMA’s variance-reduction benefits and examine how far classical convex models can explain iterate averaging in nonconvex deep learning. Overall, EMA emerges as a practical, broadly applicable stabilizer for training neural networks in feedback-loop settings, with implications for sample efficiency and generalization.

Abstract

This work studies training instabilities of behavior cloning with deep neural networks. We observe that minibatch SGD updates to the policy network during training result in sharp oscillations in long-horizon rewards, despite negligibly affecting the behavior cloning loss. We empirically disentangle the statistical and computational causes of these oscillations, and find them to stem from the chaotic propagation of minibatch SGD noise through unstable closed-loop dynamics. While SGD noise is benign in the single-step action prediction objective, it results in catastrophic error accumulation over long horizons, an effect we term gradient variance amplification (GVA). We show that many standard mitigation techniques do not alleviate GVA, but find an exponential moving average (EMA) of iterates to be surprisingly effective at doing so. We illustrate the generality of this phenomenon by showing the existence of GVA and its amelioration by EMA in both continuous control and autoregressive language generation. Finally, we provide theoretical vignettes that highlight the benefits of EMA in alleviating GVA and shed light on the extent to which classical convex models can help in understanding the benefits of iterate averaging in deep learning.
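The mismatch between one-step loss and long-horizon behavior described above can be illustrated with a toy scalar system. The sketch below is not the paper's construction: the marginally stable closed loop, the gain perturbation `delta`, and the helper names `bc_loss` and `rollout_deviation` are all assumptions made for illustration. The expert keeps the state fixed at `x0`, while a learner whose gain is off by `delta` follows `x_{t+1} = (1 + delta) * x_t`.

```python
import numpy as np

def bc_loss(delta, x_expert=1.0):
    """One-step squared action error, evaluated on the expert's state."""
    return (delta * x_expert) ** 2

def rollout_deviation(delta, horizon, x0=1.0):
    """Distance of the final state from the expert trajectory, which stays at x0."""
    x = x0
    for _ in range(horizon):
        x = (1.0 + delta) * x  # closed loop under the perturbed policy (toy assumption)
    return abs(x - x0)

# Small random gain errors stand in for minibatch SGD noise (illustrative only).
rng = np.random.default_rng(0)
for d in rng.normal(0.0, 0.01, size=5):
    print(f"delta={d:+.4f}  bc_loss={bc_loss(d):.2e}  "
          f"deviation at H=1000: {rollout_deviation(d, 1000):.3e}")
```

In this toy setting the one-step loss scales as `delta ** 2`, while the rollout deviation scales as `(1 + delta) ** H`, so perturbations that are nearly invisible in the BC loss can dominate the long-horizon outcome.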

Paper Structure (77 sections, 22 theorems, 124 equations, 22 figures, 1 table)

Introduction
Contributions
Related work
Preliminaries
MDP formalism.
Imitation learning and behavior cloning.
Notation.
Diagnosis of rollout oscillations: gradient variance amplification
Instabilities in behavior cloning of MuJoCo tasks
Experimental setup.
Instability is caused by gradient variance amplification
Understanding GVA: mismatch between BC loss and rollout reward
Working model for GVA.
Mitigating GVA: stabilizers for unstable optimizers
The outsized benefit of iterate averaging
...and 62 more sections

Key Result

Proposition 3.1

Let $\mathcal{B}_{\delta}$ denote the set of $\delta$-Lipschitz functions $\Delta: \mathcal{S} \to \mathcal{A}$ with $\Delta(\mathbf 0) = \mathbf 0$. For any $\delta > 0$, there exists a deterministic MDP with horizon $H$ and an expert policy $\pi_{\bm{\theta}^\star}$ such that the dynamics are Lips yet $\sup_{\Delta \in \mathcal{B}_\delta} \ell_{\mathrm{BC}}(\pi_{\bm{\theta}^\star} + \Delta) \leq

Figures (22)

Figure 1: Typical reward instabilities over long-horizon ($H = 1000$) rollouts of neural behavior cloners for the Walker2d-v4 MuJoCo locomotion task. Left: Rollout rewards (blue training curves) oscillate dramatically over the course of training (evaluated every $5000$ iterations), while BC loss is stable. Center: Zoomed-in view of the highlighted region in (left). Large reward fluctuations are evident even between consecutive gradient iterates. Right: Exhaustive evaluation of small neighborhoods (in stochastic gradient directions) around iterates 115K and 120K, revealing a fractal reward landscape $\bm{\theta} \mapsto J_H(\pi_{\bm{\theta}})$; this jaggedness is invisible in the 1-step behavior cloning objective $\ell_{\mathrm{BC}}(\pi_{\bm{\theta}})$. Iterate averaging (EMA) drastically mitigates these effects (green training curves). Details are provided in \ref{subsubsec:appendix-freqeval}.
Figure 2: Highlights from a large suite of experiments, suggesting an algorithmic (rather than statistical) origin of reward oscillations. All plots use the 4-layer MLP architecture unless otherwise specified. Blue curves show mean rewards over 20 initial conditions, while teal dots show disaggregated per-episode rewards (such that each point represents the rollout reward of a fixed initial condition of the policy at the current iterate). These oscillations persist across dataset sizes, architectures, model scales, and choices of regularizers, and diminish toward the end of training as the learning rate decays to 0. They are most strongly mitigated by variance reduction strategies. Here, we opt for direct visualizations, providing a qualitative demonstration of GVA and its mitigations. We accompany these with quantitative comparisons in \ref{app:gva-quantitative}.
Figure 3: Iterate averaging significantly mitigates GVA-induced reward oscillations, without needing to change the learning rate schedule or batch size. These improvements hold across architectures, dataset sizes, and some tasks. Column 2, bottom: Algorithmic instabilities are more pronounced at smaller sample sizes; thus, stabilization can lead to improved sample efficiency. Column 3: We recommend updating the EMA at every iterate, with an initial burn-in phase, and with a tuned $\gamma^{(t)} = t^{-\alpha}$ decay, to avoid divergence or slower progress. Columns 4-5: We verify that the benefits of EMA are not exclusive to the Walker2d-v4 task; for some other tasks (including the higher-dimensional Humanoid-v4), oscillations are more benign.
Figure 4: GVA in natural language generation, with 270M-parameter Transformer models trained on TinyStories. (Top row) Left: Validation loss curves with and without EMA. Center: Zooming in on (left), evaluations at every update demonstrate small per-iterate loss fluctuations, which are even smaller if EMA is applied; note that the green "lines" are also scatter plots. Right: Training paths in (model loss, EMA loss) space. EMA enables training without learning rate decay; this mitigates overfitting, resulting in the lowest-perplexity model. (Bottom) Examples of autoregressively generated text (with argmax decoding), where nearby training iterates can bifurcate. See \ref{subsec:appendix-nlp} for full results, including quantitative evaluations of these "butterfly effects" in generation.
Figure 5: Training curves of the SAC experts for the various MuJoCo continuous control agents, along with final-iterate mean rewards. Top: Online reinforcement learning training curves; these exhibit training instabilities, but not of the same nature as those encountered in our offline behavior cloning settings. Bottom: Unnormalized reward distributions (throughout the rest of this paper, we divide by these means). Outliers are marked by $\times$ symbols.
...and 17 more figures
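
The EMA recipe mentioned in the Figure 3 caption (update at every iterate, an initial burn-in phase, and a $\gamma^{(t)} = t^{-\alpha}$ decay) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes $\gamma^{(t)}$ is the weight placed on the newest iterate, that the average simply copies the iterate during burn-in, and that `burn_in=100` and `alpha=0.67` are placeholder values. The noisy iterates around a fixed `target` stand in for SGD updates.

```python
import numpy as np

def ema_weight(t, burn_in, alpha):
    """Weight on the newest iterate at step t (1-indexed); assumed convention."""
    if t <= burn_in:
        return 1.0  # during burn-in the average just tracks the current iterate
    return float(t) ** (-alpha)

def ema_update(theta_ema, theta, t, burn_in, alpha):
    """One EMA step: convex combination of the running average and the new iterate."""
    gamma = ema_weight(t, burn_in, alpha)
    return (1.0 - gamma) * theta_ema + gamma * theta

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0])      # hypothetical optimum
theta_ema = np.zeros(2)
raw_hist, ema_hist = [], []
for t in range(1, 5001):
    theta = target + 0.1 * rng.standard_normal(2)  # noisy stand-in for an SGD iterate
    theta_ema = ema_update(theta_ema, theta, t, burn_in=100, alpha=0.67)
    raw_hist.append(theta)
    ema_hist.append(theta_ema)

raw_std = np.std(np.array(raw_hist[-1000:]), axis=0)
ema_std = np.std(np.array(ema_hist[-1000:]), axis=0)
print("raw iterate std:", raw_std)
print("EMA iterate std:", ema_std)
```

Under these assumptions the averaged parameters fluctuate far less than the raw iterates, which is the variance-reduction effect the caption attributes to EMA; in a real training loop `theta` would be the flattened policy parameters after each optimizer step.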

Theorems & Definitions (39)

Proposition 3.1: Example of exponential error amplification
Proposition 4.1: Informal version of \ref{prop:cliff_loss_formal}
Remark B.1
Proposition C.1: GVA in linear dynamical systems
Proposition C.2: GVA does not occur in sufficiently stable linear systems
Proposition C.3
Lemma C.4
proof
Proposition C.5
Proposition C.6
...and 29 more
