Gradient Flossing: Improving Gradient Descent through Dynamic Control of Jacobians

Rainer Engelken

TL;DR

The paper addresses gradient instability in RNNs caused by long-range dependencies, linking gradient propagation to Lyapunov exponents of the forward dynamics via the long-term Jacobian $\mathbf{T}_t$. It proposes gradient flossing, a differentiable QR-based regularization that pushes selected Lyapunov exponents toward zero, improving both gradient norm and the conditioning of $\mathbf{T}_t$ to enhance temporal credit assignment. The authors present theoretical connections to the long-term Jacobian’s condition number $\kappa_2$ and provide empirical evidence across tasks with varying temporal complexity, showing gains when flossing is used before and/or during training. They also discuss limitations, computational costs, and the method’s compatibility with orthogonal initializations and gated units, highlighting potential extensions to broader architectures and dynamical systems models.

Abstract

Training recurrent neural networks (RNNs) remains a challenge due to the instability of gradients across long time horizons, which can lead to exploding and vanishing gradients. Recent research has linked these problems to the values of Lyapunov exponents for the forward dynamics, which describe the growth or shrinkage of infinitesimal perturbations. Here, we propose gradient flossing, a novel approach to tackling gradient instability by pushing Lyapunov exponents of the forward dynamics toward zero during learning. We achieve this by regularizing Lyapunov exponents through backpropagation using differentiable linear algebra. This enables us to "floss" the gradients, stabilizing them and thus improving network training. We demonstrate that gradient flossing controls not only the gradient norm but also the condition number of the long-term Jacobian, facilitating multidimensional error feedback propagation. We find that applying gradient flossing prior to training enhances both the success rate and convergence speed for tasks involving long time horizons. For challenging tasks, we show that gradient flossing during training can further increase the time horizon that can be bridged by backpropagation through time. Moreover, we demonstrate the effectiveness of our approach on various RNN architectures and tasks of variable temporal complexity. Additionally, we provide a simple implementation of our gradient flossing algorithm that can be used in practice. Our results indicate that gradient flossing via regularizing Lyapunov exponents can significantly enhance the effectiveness of RNN training and mitigate the exploding and vanishing gradient problem.
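
To make the quantities in the abstract concrete, the following is a minimal sketch of how the first $k$ Lyapunov exponents of an RNN can be estimated with repeated QR re-orthonormalization, together with a flossing-style penalty (the sum of squared exponents). It is not the paper's implementation. The autonomous update $\mathbf{h}_{t+1}=\tanh(W\mathbf{h}_t)$, the function names, and the pure-Python Gram-Schmidt QR are assumptions made for illustration, and the sketch only evaluates the penalty, whereas the paper differentiates through the QR steps. `log_condition_number` uses the standard asymptotic estimate that singular values of the long-term Jacobian scale like $e^{\lambda_i (t-\tau)}$, which is also an assumption here.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def gram_schmidt_qr(M):
    """Thin QR by modified Gram-Schmidt. Returns Q (n x k) and the diagonal of R."""
    n, k = len(M), len(M[0])
    q_cols, r_diag = [], []
    for j in range(k):
        v = [M[i][j] for i in range(n)]
        for q in q_cols:  # remove components along earlier directions
            proj = sum(a * b for a, b in zip(q, v))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        r_diag.append(norm)
        q_cols.append([a / norm for a in v])
    Q = [[q_cols[j][i] for j in range(k)] for i in range(n)]
    return Q, r_diag


def lyapunov_exponents(W, k, steps, seed=0):
    """Estimate the first k Lyapunov exponents of h_{t+1} = tanh(W h_t).

    The update rule is an illustrative assumption, not the paper's model.
    """
    rng = random.Random(seed)
    n = len(W)
    h = [rng.gauss(0.0, 1.0) for _ in range(n)]
    Q, _ = gram_schmidt_qr([[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n)])
    log_sums = [0.0] * k
    for _ in range(steps):
        h = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W]
        # Single-step Jacobian dh_{t+1}/dh_t = diag(1 - h_{t+1}^2) W
        J = [[(1.0 - hi * hi) * w for w in row] for hi, row in zip(h, W)]
        Q, r_diag = gram_schmidt_qr(matmul(J, Q))
        for i, r in enumerate(r_diag):
            log_sums[i] += math.log(r)
    return [s / steps for s in log_sums]


def flossing_loss(exponents):
    """Penalty that is zero when all selected exponents are zero."""
    return sum(lam * lam for lam in exponents)


def log_condition_number(exponents, horizon):
    """Asymptotic estimate of log kappa_2 over `horizon` steps (assumed scaling)."""
    return (max(exponents) - min(exponents)) * horizon


if __name__ == "__main__":
    rng = random.Random(1)
    n, g = 16, 1.0  # Gaussian weights with variance g^2 / N
    W = [[rng.gauss(0.0, g / math.sqrt(n)) for _ in range(n)] for _ in range(n)]
    lams = lyapunov_exponents(W, k=4, steps=500)
    print("first exponents:", lams)
    print("flossing loss:", flossing_loss(lams))
    print("log condition number over 100 steps:", log_condition_number(lams, 100))
```

In an automatic-differentiation framework, the same loop would be written with a differentiable QR routine so that the penalty can be minimized by gradient descent on the recurrent weights.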

Paper Structure

This paper contains 28 sections, 26 equations, 16 figures, 2 tables, 2 algorithms.

Figures (16)

  • Figure 1: Gradient flossing controls Lyapunov exponents and gradient signal propagation. A) Exploding and vanishing gradients in backpropagation through time arise from amplification/attenuation of the product of Jacobians that form the long-term Jacobian $\mathbf{T}_{t}(\mathbf{h}_\tau)=\prod_{\tau'=\tau}^{t-1}\frac{\partial\mathbf{h}_{\tau'+1}}{\partial \mathbf{h}_{\tau'}}$. B) First Lyapunov exponent of Vanilla RNN as a function of training epochs, obtained by minimizing the mean squared error between the estimated first Lyapunov exponent and a target Lyapunov exponent $\lambda_1\in\{-1,-0.5,0\}$ by gradient descent. 10 Vanilla RNNs were initialized with Gaussian recurrent weights $W_{ij}\sim \mathcal{N}(0,\,g^2/N)$, where values of $g$ were drawn as $g\sim \textrm{Unif}(0,1)$. C) Gradient flossing minimizes the square of Lyapunov exponents over epochs. D) Full Lyapunov spectrum of Vanilla RNN after a different number of Lyapunov exponents are pushed to zero via gradient flossing. Note the variability of the Lyapunov exponents that were not flossed. Parameters: network size $N=32$ with 10 network realizations. Error bars in C indicate the 25th and 75th percentiles, and the solid line shows the median.
  • Figure 2: Gradient flossing reduces condition number of the long-term Jacobian. A) Condition number $\kappa_2$ of long-term Jacobian $\mathbf{T}_t(\mathbf{h}_\tau)$ as a function of time horizon $t-\tau$ at initialization (blue) and after gradient flossing (orange). Direct numerical simulations are done with arbitrary-precision floating-point arithmetic with 256 bits per float (transparent lines); asymptotic theory is based on Lyapunov exponents (dashed lines) (Eq. \ref{eq:-condition-number}). B) Condition number for different numbers of tangent space dimensions $m$. Simulations (dots) and Lyapunov exponent-based theory (dashed lines) at initialization (blue) and after gradient flossing (orange). Gradient flossing increases the number of tangent space dimensions available for backpropagation for a given condition number (grey dotted line as a guide to the eye at $\kappa_2=10^5$). The first $15$ Lyapunov exponents were flossed. C) Comparison of condition number obtained via direct numerical simulations vs. Lyapunov exponent-based theory. Colors denote the number of flossed Lyapunov exponents $k$. Parameters: $g=1$, batch size $b=1$, $N=80$, $\text{epochs}=500$, $T=500$, gradient flossing for $E_f=500$ epochs. Input $\mathbf{x}_s$ identical to delayed XOR task in Fig. 3D.
  • Figure 3: Gradient flossing improves trainability on tasks that involve long time horizons. A) Test error for Vanilla RNNs trained on delayed copy task $y_t=x_{t-d}$ for $d=40$ with and without gradient flossing. Solid lines are medians across 5 network realizations. B) Same as A for delayed XOR task with $y_t=|x_{t-d/2}-x_{t-d}|$. C) Mean final test loss as a function of task difficulty (delay $d$) for delayed copy task. D) Mean final test loss as a function of task difficulty (delay $d$) for delayed XOR task. Parameters: $g=1$, batch size $b=16$, $N=80$, $\text{epochs}=10^4$, $T=300$, gradient flossing for $E_f=500$ epochs on $k=75$ before training. Shaded regions in C and D indicate the 20th and 80th percentiles, and the solid line shows the mean. Dots are individual runs. Task loss: MSE($y,\hat{y}$).
  • Figure 4: Gradient flossing during training further improves trainability. A) Test accuracy for Vanilla RNNs trained on delayed temporal binary XOR task $y_t=x_{t-d/2} \oplus x_{t-d}$ with gradient flossing during training (green), preflossing (gradient flossing before training) (orange), and with no gradient flossing (blue) for $d=70$. Solid lines are means across 20 network realizations; individual network realizations are shown as transparent fine lines. B) Same as A for delayed spatial XOR task with $y_t=x^1_{t-d} \oplus x^2_{t-d} \oplus x^3_{t-d}$. Parameters ($g=1$, batch size $b=16$). C) Test accuracy as a function of task difficulty (delay $d$) for delayed temporal XOR task. D) Test accuracy as a function of task difficulty (delay $d$) for delayed spatial XOR task. Parameters: $g=1$, batch size $b=16$, $N=80$, $\text{epochs}=10^4$, $T=300$, gradient flossing for $E_f=500$ epochs on $k=75$ before training and during training for green lines, and only before training for orange lines. Same plotting conventions as previous figure. Task loss: cross-entropy between $y$ and $\hat{y}$.
  • Figure 5: Gradient flossing for different numbers of flossed Lyapunov exponents. A) Test accuracy for delayed temporal XOR task as a function of delay $d$ with different numbers of flossed Lyapunov exponents $k$. B) Same data as A, but here test accuracy is shown as a function of the number of flossed Lyapunov exponents $k$. Parameters: $g=1$, batch size $b=16$, $N=80$, $\text{epochs}=10^4$ for delayed temporal XOR, $\text{epochs}=5000$ for delayed spatial XOR, $T=300$, gradient flossing for $E_f=500$ epochs before training and during training for A, B. Shaded areas indicate the 25th and 75th percentiles, solid lines are means, transparent dots are individual simulations. Task loss: cross-entropy between $y$ and $\hat{y}$.
  • ...and 11 more figures
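
The target definitions quoted in the captions of Figures 3 and 4 can be written out as short functions. This is a hedged sketch rather than the paper's data pipeline: the function names are hypothetical, inputs are assumed to be binary sequences, $d$ is assumed even for the temporal XOR, and marking the first $d$ targets as `None` (because $x_{t-d}$ is undefined there) is a choice made here for illustration.

```python
import random


def delayed_copy_targets(x, d):
    """y_t = x_{t-d}; targets before step d are undefined and set to None."""
    return [x[t - d] if t >= d else None for t in range(len(x))]


def delayed_temporal_xor_targets(x, d):
    """y_t = x_{t-d/2} XOR x_{t-d} for a binary sequence x (d assumed even)."""
    half = d // 2
    return [x[t - half] ^ x[t - d] if t >= d else None for t in range(len(x))]


def delayed_spatial_xor_targets(x1, x2, x3, d):
    """y_t = x1_{t-d} XOR x2_{t-d} XOR x3_{t-d} for three binary input channels."""
    return [x1[t - d] ^ x2[t - d] ^ x3[t - d] if t >= d else None
            for t in range(len(x1))]


if __name__ == "__main__":
    rng = random.Random(0)
    T, d = 12, 4  # small illustrative values, not the paper's T=300
    x = [rng.randint(0, 1) for _ in range(T)]
    print("input:       ", x)
    print("copy:        ", delayed_copy_targets(x, d))
    print("temporal XOR:", delayed_temporal_xor_targets(x, d))
```

For binary inputs, the absolute difference $|x_{t-d/2}-x_{t-d}|$ used in Figure 3 takes the same values as the XOR used in Figure 4, so one function covers both when the inputs are 0/1.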