Gradients are Not All You Need

Luke Metz, C. Daniel Freeman, Samuel S. Schoenholz, Tal Kachman

TL;DR

This work analyzes why differentiating through iterative dynamical systems can fail when the dynamics are chaotic. It traces these failures to the spectrum of the recurrent Jacobian and demonstrates gradient explosion across neural, physical, and learning-to-learn domains. It surveys a range of remedies, from designing well-behaved systems and employing proxy objectives to truncation, gradient clipping, and ergodic-system methods, and it also highlights black-box gradient alternatives. The key contribution is a spectrum-based diagnostic plus a practical toolbox for mitigating gradient pathologies in differentiable simulations, with empirical demonstrations in physics, meta-learning, and molecular dynamics. The findings urge practitioners to assess the Jacobian spectrum before applying end-to-end differentiation and to adopt alternative gradient strategies when chaos dominates.

Abstract

Differentiable programming techniques are widely used in the community and are responsible for the machine learning renaissance of the past several decades. While these methods are powerful, they have limits. In this short report, we discuss a common chaos-based failure mode which appears in a variety of differentiable circumstances, ranging from recurrent neural networks and numerical physics simulation to training learned optimizers. We trace this failure to the spectrum of the Jacobian of the system under study, and provide criteria for when a practitioner might expect this failure to spoil their differentiation-based optimization algorithms.
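
As a rough illustration of the failure mode described in the abstract, the sketch below differentiates through an unrolled scalar dynamical system. The logistic map, the parameter values, and the function names are illustrative assumptions and do not come from the paper. In one dimension the recurrent Jacobian is a single number, so its "spectrum" is just its magnitude, and the derivative of the final state with respect to the initial state is the product of the per-step Jacobians.

```python
def logistic_step(s, r):
    """One step of the logistic map s -> r * s * (1 - s) (illustrative system)."""
    return r * s * (1.0 - s)


def logistic_jacobian(s, r):
    """Derivative of one step with respect to the current state s."""
    return r * (1.0 - 2.0 * s)


def unrolled_derivative(s0, r, num_steps):
    """d s_T / d s_0 via the chain rule: the product of per-step Jacobians."""
    s, d = s0, 1.0
    for _ in range(num_steps):
        d *= logistic_jacobian(s, r)
        s = logistic_step(s, r)
    return d


# r = 2.5 settles to a fixed point where |Jacobian| < 1 (assumed stable setting).
stable = unrolled_derivative(0.3, r=2.5, num_steps=40)
# r = 4.0 is a standard chaotic setting of the logistic map.
chaotic = unrolled_derivative(0.3, r=4.0, num_steps=40)

print("stable  |ds_T/ds_0| =", abs(stable))
print("chaotic |ds_T/ds_0| =", abs(chaotic))
```

In the stable setting the per-step Jacobian magnitude settles below one and the unrolled derivative shrinks toward zero; in the chaotic setting it averages above one and the derivative grows by many orders of magnitude over the same number of steps.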

Paper Structure

This paper contains 28 sections, 40 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Black box gradient estimates can sometimes have lower variance than reparameterization gradients. On the left, we plot $l(x) = 0.1 \sin(xw/\pi) + (x / 10)^2 + 0.1$ for different values of $w$ in red, as well as the loss smoothed by convolving with a 0.3 std Gaussian. On the right, we show the max gradient variance computed over all $x \in [-10, 10]$. As the frequency of oscillations grows, the reparameterization gradient variance also grows, while the black box gradient variance remains constant.
  • Figure 2: Loss surface and gradient variance for a stochastic policy on a robotics control task -- the Ant environment from Brax. (a): We show a 1D projection of the loss surface along a random direction. All randomness in this plot is fixed. Color denotes different lengths of unroll when computing the loss. For small numbers of iterations the loss is smooth. For higher numbers of steps the underlying loss becomes highly curved. (b): Instead of fixing randomness as done in the left plot, we average over multiple random samples for the 8 step unroll (average is in black, samples are in colors). We find that averaging greatly smooths the underlying loss surface. (c): We look at the variance of gradients computed over multiple random samples from the stochastic policy. We show three different parameter values (shifts along the x-axis of the first two plots, marked there with vertical dashed lines of the same color). Despite the seemingly smooth loss surface, the gradient variance grows exponentially.
  • Figure 3: Loss surface and gradient variance calculations for meta-learning an optimizer. (a): We show a 1D projection of the meta-loss surfaces (loss with respect to learned optimizer parameters) for different length unrolls -- in this case, different numbers of applications of the learned optimizer. For small numbers of steps, we find a smooth loss surface, but for higher numbers of steps we see a mix of smooth and high-curvature regions. (b): We show an average of the meta-loss over Gaussian perturbed learned optimizer weights. The average is shown in black, and the losses averaged over are shown in color. We find this averaged loss is smooth and appears well behaved. (c): We plot gradient variance over the different perturbations of the learned optimizer weights. These perturbations are shifts corresponding to the x-axis in the first two figures and are marked there with colored dashed vertical lines. For some settings of the learned optimizer weights we find well behaved gradient variance. For others, e.g. red, we find exponential growth in variance.
  • Figure 4: Energy for packings of bi-disperse disks varying the diameter of the small disk, $D$, and the number of optimization steps. (a): The energy of the system as a function of $D$ for different numbers of optimization steps. We see that the energy decreases with more steps of optimization. (b): The energy for the maximum number of optimization steps considered (256). Each individual curve is the energy for one random configuration and the black line indicates the energy averaged over many random seeds. (c): The variance of the gradient estimate for different values of $D$ as a function of the number of steps of optimization.
  • Figure 5: Exploration of the eigenspectrum of the recurrent Jacobians of the Brax Ant experiment. We show two parameter values: init1, which is initialized in an unstable regime, and init2, which is in a stable regime. (a): We show the spectrum of the recurrent Jacobian taken from the 90th iteration ($\frac{\partial s_{90}}{\partial s_{89}}$). (b): We plot the magnitude of the maximum norm eigenvalue of each recurrent Jacobian along the sequence ($\frac{\partial s_{i}}{\partial s_{i-1}}$ for each $i$). (c): We plot the magnitude of the maximum norm eigenvalue of $\frac{\partial s_{i}}{\partial s_{0}}$ for each $i$ and find that it grows exponentially for the unstable initialization. (d): We plot the gradient norms of each initialization and find exploding gradients in the unstable initialization.
  • ...and 2 more figures
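
The comparison described in the Figure 1 caption can be approximated with a small Monte Carlo sketch. The loss $l(x)$ and the 0.3 std Gaussian smoothing are taken from the caption; everything else (the function names, the single evaluation point $x = 0$ instead of a maximum over $x \in [-10, 10]$, the sample count, and the two values of $w$) is an assumption made for illustration. Both estimators target the gradient of the smoothed loss $\mathbb{E}_{\epsilon}[l(x + \epsilon)]$: the reparameterization estimator averages $l'(x + \epsilon)$, while the black box estimator averages $l(x + \epsilon)\,\epsilon / \sigma^2$ and never differentiates $l$.

```python
import math
import random

SIGMA = 0.3  # std of the Gaussian smoothing, as stated in the caption


def loss(x, w):
    """l(x) from the Figure 1 caption."""
    return 0.1 * math.sin(x * w / math.pi) + (x / 10.0) ** 2 + 0.1


def loss_grad(x, w):
    """Analytic derivative of l(x) with respect to x."""
    return 0.1 * (w / math.pi) * math.cos(x * w / math.pi) + x / 50.0


def variance(values):
    """Population variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def estimator_variances(x, w, num_samples=20000, seed=0):
    """Per-sample variance of the two gradient estimators at the point x."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    reparam, black_box = [], []
    for _ in range(num_samples):
        eps = rng.gauss(0.0, SIGMA)
        # Reparameterization: differentiate through the perturbed loss.
        reparam.append(loss_grad(x + eps, w))
        # Black box (evolution-strategies style): uses only loss values.
        black_box.append(loss(x + eps, w) * eps / SIGMA ** 2)
    return variance(reparam), variance(black_box)


rp_low, bb_low = estimator_variances(0.0, w=1.0)
rp_high, bb_high = estimator_variances(0.0, w=100.0)

print("w=1:   reparam var =", rp_low, " black box var =", bb_low)
print("w=100: reparam var =", rp_high, " black box var =", bb_high)
```

Under these assumptions the reparameterization variance scales roughly with $w^2$, because $l'$ carries a factor of $w$, while the black box variance depends only on the bounded values of $l$ and so stays of similar size as $w$ increases.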