A ghost mechanism: An analytical model of abrupt learning

Fatih Dinc, Ege Cirakman, Yiqi Jiang, Mert Yuksekgonul, Mark J. Schnitzer, Hidenori Tanaka

TL;DR

We introduce a minimal dynamical system trained on a delayed-activation task and show analytically that even a one-dimensional system can exhibit abrupt learning through ghost points rather than bifurcations.

Abstract

Abrupt learning is commonly observed in neural networks, where long plateaus in network performance are followed by rapid convergence to a desirable solution. Yet, despite its common occurrence, the complex interplay of task, network architecture, and learning rule has made it difficult to understand the underlying mechanisms. Here, we introduce a minimal dynamical system trained on a delayed-activation task and demonstrate analytically how even a one-dimensional system can exhibit abrupt learning through ghost points rather than bifurcations. Through our toy model, we show that the emergence of a ghost point destabilizes learning dynamics. We identify a critical learning rate that prevents learning through two distinct loss landscape features: a no-learning zone and an oscillatory minimum. Testing these predictions in recurrent neural networks (RNNs), we confirm that ghost points precede abrupt learning and accompany the destabilization of learning. We demonstrate two complementary remedies: lowering the model output confidence prevents the network from getting stuck in no-learning zones, while increasing trainable ranks beyond task requirements (i.e., adding sloppy parameters) provides more stable learning trajectories. Our model reveals a bifurcation-free mechanism for abrupt learning and illustrates the importance of both deliberate uncertainty and redundancy in stabilizing learning dynamics.
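
The role of the ghost point can be made concrete with a short worked calculation. This is a sketch under an assumption: the abstract does not state the model equation, so we take the toy model to be the saddle-node normal form $\dot{x} = r + x^2$ with $x(0)=0$. That choice is consistent with the fixed point at $-\sqrt{-r}$ and the value $r^* = \frac{\pi^2}{4T^2}$ quoted in the figure captions, but it is not confirmed by the text here.

$$
\dot{x} = r + x^2,\quad x(0)=0,\quad r>0
\;\;\Longrightarrow\;\;
x(t) = \sqrt{r}\,\tan\!\left(\sqrt{r}\,t\right),
\qquad
t_\infty(r) = \frac{\pi}{2\sqrt{r}} .
$$

Setting $t_\infty(r) = T$ gives $r = \frac{\pi^2}{4T^2}$. For $r>0$ there is no fixed point, yet the flow is slowest near $x=0$, where the fixed points would appear at $r=0$ (the ghost), and the escape time diverges as $r^{-1/2}$ when $r \to 0^+$. Under this assumption, the escape time can be tuned continuously while $r$ stays positive, so a sharp change in the output at time $T$ does not require $r$ to cross the bifurcation at $r=0$.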

Paper Structure (2 sections, 19 equations, 6 figures)


Figures (6)

  • Figure 1: Visualization of the toy model with a single dynamical variable undergoing a saddle-node bifurcation. We initialize the variable at $x(0)=0$ (red dot). Left. For $r>0$, the system evolves towards $x \to \infty$ (red arrow). Right. For $r<0$, the system evolves towards a fixed point at $-\sqrt{-r}$. A pre-defined $x^*$ divides the model output into two states.
  • Figure 2: Our toy model trained on the delayed-activation task captures abrupt learning dynamics. A We compared the analytical loss function (black line) with loss values computed from realistic parameters (colored dots), in which the model output was defined via a sigmoid function $\hat{o}(x) = \sigma(c(x-x^*))$. The loss function had three distinct regimes: (1) a point of no return, (2) a minimum with non-zero gradient, and (3) abrupt decay of the loss function for $r\geq r^*$, where $r^*=\frac{\pi^2}{4T^2}$ is the global minimum. Parameters: $T = 100$, $\Delta t = 0.1$, and $x^* = 10$. B-C Initializing at $r = 10r^*$, we minimized the loss function using gradient descent with different learning rates, recapitulating all three regimes in learning dynamics. Notably, for $\alpha = 10^{-10}$, even though the loss function decreases abruptly around epoch 1500, the network does not undergo any bifurcation, as evident from $r$ not changing its sign during learning. D The toy model learned best with lower learning rates, but at the expense of more epochs of training. As predicted by the theory, learning is no longer possible for $\alpha \geq 9 \times 10^{-10}$. Solid lines: means. Error bars: s.e.m. over 10 training instances, in which $r$ was initialized from a normal distribution with mean $10r^*$ and standard deviation $\frac{r^{*}}{10}$. Parameters for B-D: $T = 100$ and $x^*,c \to \infty$ (analytical model).
  • Figure 3: A rank-one RNN trained on the delayed-activation (DA) task reproduced the main findings of the toy model. We trained a rank-one RNN on the DA task, in which the output of the RNN was defined as $\hat{o}(\kappa) = \sigma(c(\kappa - 1))$. Here, $\kappa$ is the latent variable and $c$ is the confidence level. A The RNN trained with a relatively low learning rate showed the abrupt jump in the loss function (between (i) and (ii)), and had oscillatory behavior before converging to a minimum (between (ii) and (iii)). The resulting network learned local ghost points with a small, but non-zero, distance from the $y=0$ line. B When training the same network with a higher learning rate, a saddle-node bifurcation occurred, putting the network beyond the point of no return. The network could no longer recover, as indicated by the practically zero gradient after the bifurcation. Parameters: $\tau = 10\,\mathrm{ms}$, $\Delta t = 5\,\mathrm{ms}$, $T = 100\,\mathrm{ms}$, $N=100$ neurons, $c=10$. We initialized all units to be $x(0) = -0.3$ and used stochastic gradient descent. Red dots correspond to the initial values of $\kappa(t)$ for the final networks.
  • Figure 4: Lowering the confidence allows RNNs to recover from the no-learning zone. We trained 100 RNNs with a confidence $c = 10$ and the learning rate $\alpha = 0.02$ for 6000 epochs, otherwise using the same parameters as in Fig. 3. Out of 100, 74 RNNs learned the task at some point, whereas 54 of them entered the no-learning zone (quantified as having training accuracy of $\leq 0.5$ for the last $50$ epochs). We further trained these RNNs after lowering the confidence levels, which allowed them to recover. Solid lines: means. Error bars: s.e.m. over 54 networks.
  • Figure S1: We performed the analysis in Fig. 2D for the rank-one RNNs trained on the DA task. Parameters: $\tau = 10\,\mathrm{ms}$, $\Delta t = 5\,\mathrm{ms}$, $T = 100\,\mathrm{ms}$, $N=100$ neurons, $c=10$. We initialized all units to be $x(0) = -0.3$ and used stochastic gradient descent.
  • ...and 1 more figure
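
The captions for Figures 1 and 2 can be checked numerically with a small simulation. The sketch below is illustrative, not the authors' code: it assumes the toy model is the saddle-node normal form $\dot{x} = r + x^2$ started at $x(0)=0$ (inferred from the fixed point at $-\sqrt{-r}$ and from $r^*=\frac{\pi^2}{4T^2}$), uses forward Euler as a stand-in integrator, and borrows $T = 100$, $\Delta t = 0.1$, and $x^* = 10$ from the Figure 2 caption. The function names `escape_time` and `escape_time_exact` are hypothetical.

```python
import math

def escape_time(r, x_star=10.0, dt=0.1, t_max=2000.0):
    """Forward-Euler integration of dx/dt = r + x**2 from x(0) = 0.

    Returns the first time at which x >= x_star, or None if the
    threshold is not reached before t_max (e.g. when r < 0 and the
    state settles at the stable fixed point -sqrt(-r)).
    """
    x, t = 0.0, 0.0
    while t < t_max:
        if x >= x_star:
            return t
        x += dt * (r + x * x)
        t += dt
    return None

def escape_time_exact(r, x_star=10.0):
    """Closed-form threshold-crossing time for r > 0.

    Follows from x(t) = sqrt(r) * tan(sqrt(r) * t).
    """
    s = math.sqrt(r)
    return math.atan(x_star / s) / s

T = 100.0
r_star = math.pi ** 2 / (4 * T ** 2)  # value quoted in the Figure 2 caption

# Smaller positive r means a longer stay near the ghost at x = 0.
for factor in (10.0, 1.0, 0.1):
    r = factor * r_star
    print(factor, escape_time(r), escape_time_exact(r))

# For r < 0 the threshold is never reached.
print(escape_time(-r_star))
```

Under these assumptions, the crossing time at $r = r^*$ lands close to $T$, and it grows as $r$ shrinks toward zero while remaining positive, which is the slowdown attributed to the ghost point. The Euler estimate is slightly late compared with the closed form because the step size is coarse once $x$ grows large.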