Curl Descent: Non-Gradient Learning Dynamics with Sign-Diverse Plasticity

Hugo Ninou, Jonathan Kadmon, N. Alex Cayco-Gajic

TL;DR

The paper investigates learning dynamics in neural networks with sign-diverse, non-gradient "curl" terms arising from biologically plausible plasticity rules. Using a two-layer student-teacher framework, it introduces curl descent updates and analyzes their fixed points, showing that the solution manifold can remain stable under moderate curl, while the origin often becomes a center, depending on architecture. In large networks, random-matrix theory reveals a phase transition: the solution manifold loses stability once the fraction of rule-flipped neurons crosses a threshold that depends on the compression ratio, with distinct thresholds for hidden- vs. readout-layer flips. Simulations in linear and nonlinear networks show chaotic dynamics when hidden-layer curl terms destabilize the manifold, whereas readout-layer curl terms can still yield low error and, in nonlinear networks, even faster convergence than gradient descent, highlighting architectures that can robustly leverage non-gradient learning rules. Overall, the work broadens the view of optimization in neural systems and suggests that sign-diverse plasticity can support effective learning beyond traditional gradient descent, with potential implications for both neuroscience and machine learning.

Abstract

Gradient-based algorithms are a cornerstone of artificial neural network training, yet it remains unclear whether biological neural networks use similar gradient-based strategies during learning. Experiments often discover a diversity of synaptic plasticity rules, but whether these amount to an approximation to gradient descent is unclear. Here we investigate a previously overlooked possibility: that learning dynamics may include fundamentally non-gradient "curl"-like components while still being able to effectively optimize a loss function. Curl terms naturally emerge in networks with inhibitory-excitatory connectivity or Hebbian/anti-Hebbian plasticity, resulting in learning dynamics that cannot be framed as gradient descent on any objective. To investigate the impact of these curl terms, we analyze feedforward networks within an analytically tractable student-teacher framework, systematically introducing non-gradient dynamics through neurons exhibiting rule-flipped plasticity. Small curl terms preserve the stability of the original solution manifold, resulting in learning dynamics similar to gradient descent. Beyond a critical value, strong curl terms destabilize the solution manifold. Depending on the network architecture, this loss of stability can lead to chaotic learning dynamics that destroy performance. In other cases, the curl terms can counterintuitively speed learning compared to gradient descent by allowing the weight dynamics to escape saddles by temporarily ascending the loss. Our results identify specific architectures capable of supporting robust learning via diverse learning rules, providing an important counterpoint to normative theories of gradient-based learning in neural networks.

Paper Structure

This paper contains 56 sections, 81 equations, 11 figures.

Figures (11)

  • Figure 1: Toy model analysis. a) Learning trajectories in weight space for gradient flow (dark purple curve) and curl flow (light purple curve). The heatmap represents the log loss, which determines the gradient descent dynamics. The hyperbolic solution manifold (dark red curves) is a global minimum. Curl descent reshapes the learning dynamics and adds a rotational field (flow-field overlain in light purple curves with arrows). b) Schematic of the toy model network. c) Log error vs. training epoch for the same learning trajectories shown in panel a. Inset: Same figure zoomed on the first 10 epochs, showing that curl descent initially ascends the loss function.
  • Figure 2: Analytical phase diagrams. Stability of the solution manifold as a function of the compression ratio $c$ and the fraction of rule-flipped neurons in each layer $\alpha_h$ (hidden) and $\alpha_r$ (readout).
  • Figure 3: Hidden layer curl terms lead to chaos. a) Test error as a function of the compression ratio $c$ and the fraction of rule-flipped neurons $\alpha_h$ (averaged over $10$ random seeds). Black curve: analytical stability boundary (cf. Fig. 2, top). Inset: Close-up for $c\in [0.1,0.7]$. b) Order parameter $q$ (averaged over $10$ seeds) plotted for varying $\alpha_h$ (top) and $c$ (bottom). Dashed lines indicate analytical transition to instability. c) Example weight dynamics in the unstable regime ($c=0.8$, $\alpha_h=0.6$). d) Example weight autocorrelation functions. Inset: Weight dynamics projected onto its first two principal components. Compute resources: 4 hours on $500$ CPUs (local cluster).
  • Figure 4: Readout layer curl terms result in low error even when the solution manifold is unstable. a) Low test error with readout curl terms. Same as Fig. 3a while varying $\alpha_r$. The black curve shows the analytical stability boundary. b) Peak error over learning (maximum over 20 random seeds, initialized near the solution manifold). Inset: Test error vs. epoch in the unstable regime, showing large weight transients that re-descend the loss ($c=1$, $\alpha_r=0.6$). c) Example weight dynamics in the unstable regime ($c=1$, $\alpha_r=0.6$). d) Example weight autocorrelation functions. Inset: Weight dynamics projected onto its first two principal components. Compute resources: 4 hours on $500$ CPUs (local cluster).
  • Figure 5: Nonlinear networks: curl descent leads to faster convergence in a broad parameter regime. a) Network schematic. b) Test error for gradient descent and curl descent with a single rule-flipped readout neuron ($N_\text{train}=2000$, $\text{weight initialization scale}=2$; error bars indicate mean ± sem, averaged over $10$ random seeds). Inset: activation function (tanh). c) Convergence speed of curl descent and gradient descent as a function of training set size ($\text{weight initialization scale}=2$). d) Same as c as a function of the weight initialization range ($N_\text{train}=10000$). e) Convergence speed as a function of the teacher weights initialization scale. f) Convergence speed as a function of the fraction of rule-flipped readout neurons. Compute resources: 12 hours on $500$ CPUs (local cluster).
  • ...and 6 more figures
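
The qualitative behavior described for the toy model in Figure 1 can be illustrated with a minimal sketch. It assumes a scalar two-layer linear student with weights `w1` and `w2`, a teacher weight of 1, and the loss `0.5 * (w1 * w2 - 1)^2`, so the solution manifold is the hyperbola `w1 * w2 = 1`. Which weight has its rule flipped (here the readout weight `w2`), the starting point, the learning rate, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def loss(w):
    """Squared error of the scalar student w2 * w1 against a teacher weight of 1."""
    return 0.5 * (w[0] * w[1] - 1.0) ** 2

def flow(w, flip_readout=False):
    """Weight velocity; flip_readout=True reverses the sign of the w2 update."""
    w1, w2 = w
    err = w1 * w2 - 1.0
    sign2 = -1.0 if flip_readout else 1.0  # +1: gradient rule, -1: flipped rule (assumed form)
    return np.array([-err * w2, -sign2 * err * w1])

def scalar_curl(w, flip_readout, eps=1e-5):
    """Central-difference estimate of d(flow_2)/d(w1) - d(flow_1)/d(w2).

    This quantity vanishes everywhere for a gradient field.
    """
    w = np.asarray(w, dtype=float)
    e1, e2 = np.array([eps, 0.0]), np.array([0.0, eps])
    d2_d1 = (flow(w + e1, flip_readout)[1] - flow(w - e1, flip_readout)[1]) / (2 * eps)
    d1_d2 = (flow(w + e2, flip_readout)[0] - flow(w - e2, flip_readout)[0]) / (2 * eps)
    return d2_d1 - d1_d2

def train(w0, flip_readout, lr=0.01, steps=5000):
    """Euler-integrate the flow; returns final weights and the loss at every step."""
    w = np.array(w0, dtype=float)
    losses = [loss(w)]
    for _ in range(steps):
        w = w + lr * flow(w, flip_readout)
        losses.append(loss(w))
    return w, np.array(losses)

if __name__ == "__main__":
    w0 = (2.0, 0.1)  # illustrative starting point
    for flip in (False, True):
        w, losses = train(w0, flip)
        print(f"flip={flip}: curl at w0={scalar_curl(w0, flip):+.3f}, "
              f"peak loss={losses.max():.3f}, final loss={losses[-1]:.2e}, w={w}")
```

In this sketch the unflipped flow has zero curl and its loss decreases at every step, while the flipped flow has nonzero curl, rotates around the origin (in continuous time it conserves `w1^2 + w2^2`), climbs the loss at first, and then settles on a part of the hyperbola where `|w2| > |w1|`. This mirrors the initial loss ascent noted in the Figure 1 inset, but only for this assumed setup.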