Flexible task abstractions emerge in linear networks with fast and bounded units

Kai Sandbrink, Jan P. Bauer, Alexandra M. Proca, Andrew M. Saxe, Christopher Summerfield, Ali Hummos

TL;DR

This work investigates how cognitive flexibility and task abstractions can emerge in neural networks exposed to changing data distributions. It introduces Neural Task Abstractions (NTA), a linear gated network that jointly learns weights and gating variables on block-structured task curricula, with the gates constrained to be fast, nonnegative, and bounded. A key finding is the existence of a flexible learning regime in which weights self-organize into task-specific modules while gating representations switch between these modules, enabling rapid adaptation and compositional generalization; this regime is captured by effective two-dimensional dynamics in the teacher subspace and a symmetry-based exact solution. The results extend to deep linear and nonlinear networks, showing gating-based modularization and rapid task remapping in MNIST-like tasks and offering a mechanistic bridge to cognitive flexibility observed in humans.

Abstract

Animals survive in dynamic environments changing at arbitrary timescales, but such data distribution shifts are a challenge to neural networks. To adapt to change, neural systems may change a large number of parameters, which is a slow process involving forgetting past information. In contrast, animals leverage distribution changes to segment their stream of experience into tasks and associate them with internal task abstractions. Animals can then respond flexibly by selecting the appropriate task abstraction. However, how such flexible task abstractions may arise in neural systems remains unknown. Here, we analyze a linear gated network where the weights and gates are jointly optimized via gradient descent, but with neuron-like constraints on the gates including a faster timescale, nonnegativity, and bounded activity. We observe that the weights self-organize into modules specialized for tasks or sub-tasks encountered, while the gating layer forms unique representations that switch the appropriate weight modules (task abstractions). We analytically reduce the learning dynamics to an effective eigenspace, revealing a virtuous cycle: fast-adapting gates drive weight specialization by protecting previous knowledge, while weight specialization in turn increases the update rate of the gating layer. Task switching in the gating layer accelerates as a function of curriculum block size and task training, mirroring key findings in cognitive neuroscience. We show that the discovered task abstractions support generalization through both task and subtask composition, and we extend our findings to a non-linear network switching between two tasks. Overall, our work offers a theory of cognitive flexibility in animals as arising from joint gradient descent on synaptic and neural gating in a neural network architecture.
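To make the joint update described in the abstract concrete, the following is a minimal sketch of one gradient step on gates and weights, not the authors' implementation. It assumes a scalar-output network $y = \sum_p c^p \, (\bm{w}^p \cdot \bm{x})$ and a squared-error task loss. The function names, the Euler step `dt`, and the exact forms of the two gate regularizers (a quadratic penalty on negative gates and a penalty pulling the L1 norm of the gates toward 1) are assumptions introduced for illustration. The default timescales are the values quoted in the Figure 2 caption.

```python
def path_outputs(W, x):
    """Output of each student path p: h_p = w_p . x (scalar output is an assumption)."""
    return [sum(wi * xi for wi, xi in zip(wp, x)) for wp in W]


def gated_output(c, W, x):
    """Network output y = sum_p c_p * (w_p . x)."""
    return sum(cp * hp for cp, hp in zip(c, path_outputs(W, x)))


def nta_step(c, W, x, y_star, dt=0.01, tau_c=0.03, tau_w=1.3,
             lam_nonneg=1.0, lam_norm=1.0):
    """One Euler step of joint gradient descent on gates c and weights W.

    The task loss is 0.5 * (y - y_star)**2. The regularizers are assumed forms:
      nonnegativity: 0.5 * sum_p min(c_p, 0)**2
      boundedness:   0.5 * (sum_p |c_p| - 1)**2
    Gates move on the timescale tau_c, weights on the slower timescale tau_w.
    """
    h = path_outputs(W, x)
    err = sum(cp * hp for cp, hp in zip(c, h)) - y_star
    l1 = sum(abs(cp) for cp in c)

    new_c = []
    for cp, hp in zip(c, h):
        sign = (cp > 0) - (cp < 0)
        grad = err * hp + lam_nonneg * min(cp, 0.0) + lam_norm * (l1 - 1.0) * sign
        new_c.append(cp - (dt / tau_c) * grad)

    # Each weight vector is updated in proportion to its own gate, so a
    # path whose gate is off is protected from the current task's error.
    new_W = [[wi - (dt / tau_w) * err * cp * xi for wi, xi in zip(wp, x)]
             for cp, wp in zip(c, W)]
    return new_c, new_W


if __name__ == "__main__":
    c = [1.0, 0.0]
    W = [[1.0, 0.0], [0.0, 1.0]]
    # A sample the active path gets wrong: the gate moves far more than the weights.
    print(nta_step(c, W, x=[1.0, 0.0], y_star=0.0))
```

In this sketch the weight update for path $p$ is scaled by $c^p$, which is one way to read the abstract's statement that fast gates protect previous knowledge: once a gate switches off, its weight module stops receiving updates.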

Paper Structure

This paper contains 53 sections, 40 equations, 25 figures, 4 tables.

Figures (25)

  • Figure 1: The open-ended learning setting and the modeling approach. A. Example of the blocked curriculum with two tasks. B. Neural Task Abstraction (NTA) model updates $\bm{W}^p$ through gradient descent, but also the gating variables $c^p$, leading to task abstractions emerging in the gating layer.
  • Figure 2: Joint gradient descent on gates and weights enables fast adaptation through gradual specialization. Learning on the blocked curriculum from Figure 1 with $\tau_c=0.03$, $\tau_w=1.3$, and block length $\tau_B=1.0$. $x$-axis indicates time as multiples of $\tau_B$. (Black) Flexible NTA model (\ref{eq:lcs}), (gray) forgetful NTA model with $\tau_c=\tau_w$ and $\lambda_{\text{nonneg}} = \lambda_{\text{norm}} = 0$. Simulation averaged over 10 random seeds with standard error indicated. A. Loss of both models over time. B. Gate activity of flexible NTA. C. Student-teacher weight alignment $\bm{W}^{\star m} {\bm{W}^{p}}^{\intercal}$, normalized and averaged over rows (cosine similarity) for each student-teacher pair. D., E. Norm of updates to $\bm{W}^p$ and $\bm{c}$. Dashed: norm of students correlating with update size of $\bm{c}$. F. Time to $\mathcal{L}_\text{task} = 0.1$ for both models over blocks.
  • Figure 3: Flexible model generalizes to compositional tasks. A. Task composition consists of new tasks that sum sets of teachers previously encountered. B. Subtask composition consists of new tasks that concatenate alternating rows of sets of teachers previously encountered. Loss of models trained on generalization to task composition (C.) and subtask composition (D.) for the flexible (black) and forgetful (gray) NTA. 'New tasks' indicates the start of the generalization phase when the task curriculum is changed to cycle through the compositional tasks.
  • Figure 4: Mechanism of gradual task specialization in effective 2D subspace. A. Sketch of the reduced model and dynamic feedback. Out-of-subspace students gradually align to teacher axes. B. Trajectories of student weight matrices (blue, orange) in the teacher subspace during complete adaptation following a context switch from teacher 1 to teacher 2 in the flexible regime. Gray stripes indicate associated gate activation. The student weight matrices move little. C. Like (B), but for the forgetful regime. Student weight matrices entirely remap and gates do not turn off. D. Gradient of the task loss on $c^p$ as a function of the weight alignment. E. Trajectories in the specialization subspace as a function of gate timescale for values $\tau_c=0.1, 0.18, 0.32, 0.56, 1.00$ comparing (color) simulations and (dashed black) analytical predictions from exact solutions under symmetry in the flexible regime. Simulations begin from initial conditions of complete specialization and separation $w^p_m=\delta_{pm}$, $c^p = \delta_{p1}$ and follow a complete adaptation from teacher 1 to teacher 2 over the course of a block, reaching $\mathcal{L}_\text{task}<10^{-2}$ for all $\tau_c$.
  • Figure 5: Model specialization emerges as a function of block length, gate learning rate, and regularization strength. The colorbar indicates total alignment (cosine similarity) between all sets of students and teachers considered collectively.
  • ...and 20 more figures
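
The student-teacher alignment in the Figure 2C caption (the product of a teacher matrix with a transposed student matrix, normalized and averaged over rows) can be read as a mean row-wise cosine similarity. The sketch below follows that reading, which is an interpretation of the caption rather than the authors' code. The function name `row_cosine_alignment` and the convention of returning 0 for an all-zero row are assumptions.

```python
import math


def row_cosine_alignment(teacher, student):
    """Mean cosine similarity between corresponding rows of two matrices.

    `teacher` and `student` are lists of equal-length rows. Each term is a
    diagonal entry of teacher @ student.T divided by the two row norms.
    """
    sims = []
    for t_row, s_row in zip(teacher, student):
        dot = sum(t * s for t, s in zip(t_row, s_row))
        norm = math.sqrt(sum(t * t for t in t_row)) * math.sqrt(sum(s * s for s in s_row))
        # Assumed convention: a zero row contributes zero alignment.
        sims.append(dot / norm if norm > 0 else 0.0)
    return sum(sims) / len(sims)


if __name__ == "__main__":
    teacher = [[1.0, 0.0], [0.0, 1.0]]
    print(row_cosine_alignment(teacher, [[2.0, 0.0], [0.0, 3.0]]))  # rescaled copy of the teacher
    print(row_cosine_alignment(teacher, [[0.0, 1.0], [1.0, 0.0]]))  # rows orthogonal to the teacher
```

Because the measure is normalized per row, it is insensitive to the overall scale of the student weights, so a student that matches a teacher up to a positive row-wise rescaling still counts as fully aligned.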