Routing without Forgetting

Alessio Masano, Giovanni Bellitto, Dipam Goswami, Joost van de Weijer, Concetto Spampinato

TL;DR

Routing without Forgetting (RwF) is a transformer architecture augmented with energy-based associative retrieval layers inspired by Modern Hopfield Networks. Its results indicate that embedding energy-based associative routing directly within the transformer backbone provides a principled and effective foundation for Online Continual Learning (OCL).

Abstract

Continual learning in transformers is commonly addressed through parameter-efficient adaptation: prompts, adapters, or LoRA modules are specialized per task while the backbone remains frozen. Although effective in controlled multi-epoch settings, these approaches rely on gradual gradient-based specialization and struggle in Online Continual Learning (OCL), where data arrive as a non-stationary stream and each sample may be observed only once. We recast continual learning in transformers as a routing problem: under strict online constraints, the model must dynamically select the appropriate representational subspace for each input without explicit task identifiers or repeated optimization. We thus introduce Routing without Forgetting (RwF), a transformer architecture augmented with energy-based associative retrieval layers inspired by Modern Hopfield Networks. Instead of storing or merging task-specific prompts, RwF generates dynamic prompts through single-step associative retrieval over the transformer token embeddings at each layer. Retrieval corresponds to the closed-form minimization of a strictly convex free-energy functional, enabling input-conditioned routing within each forward pass, independently of iterative gradient refinement. Across challenging class-incremental benchmarks, RwF improves over existing prompt-based methods. On Split-ImageNet-R and Split-ImageNet-S, RwF outperforms prior prompt-based approaches by a large margin, even in few-shot learning regimes. These results indicate that embedding energy-based associative routing directly within the transformer backbone provides a principled and effective foundation for OCL.
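
For intuition on what "closed-form minimization of a strictly convex free-energy functional" can look like, the block below gives the standard entropy-regularized form associated with Modern Hopfield retrieval. It is a sketch, not necessarily the exact functional used in RwF. The symbols $q$ (query), $K \in \mathbb{R}^{N \times d}$ (stored keys as rows), $V \in \mathbb{R}^{N \times d}$ (values), and $\beta$ (inverse temperature) are assumptions introduced here for illustration.

```latex
\begin{aligned}
F(p) &= -\,p^\top K q \;+\; \frac{1}{\beta}\sum_{i=1}^{N} p_i \log p_i,
      \qquad p \in \Delta^{N-1},\\[4pt]
p^\star &= \arg\min_{p \in \Delta^{N-1}} F(p) \;=\; \operatorname{softmax}(\beta K q),\\[4pt]
\text{retrieved pattern} &= V^\top p^\star .
\end{aligned}
```

The negative-entropy term makes $F$ strictly convex on the simplex, so the minimizer is unique and is obtained in a single step, with no iterative gradient refinement.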

Paper Structure (20 sections, 8 equations, 2 figures, 4 tables)

Figures (2)

  • Figure 1: RwF layer: routing-augmented transformer block. Given input tokens $Z_\ell$, a Hopfield-based associative retrieval module generates input-conditioned routing prompts $P_\ell$ via energy-based pooling over token features. The retrieved $P_\ell$ are concatenated with the $Z_\ell$ tokens and processed by the standard Multi-Head Self-Attention (MHSA). After MHSA, only the backbone tokens $\tilde{Z}_\ell$ are propagated to the MLP blocks and then to the next RwF transformer layer $\ell+1$, while $\tilde{P}_\ell$ are discarded.
  • Figure 2: Performance scaling with increasing task fragmentation (Split-ImageNet-R). Final Average Accuracy ($\mathrm{A}_{\text{Final}}$) $(\uparrow)$ as the number of sequential tasks $t$ increases from 5 to 40.
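
The data flow described in the Figure 1 caption can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the learned query matrix `Q`, the inverse temperature `beta`, the use of the token features as both keys and values, and the projection-free single-head attention standing in for MHSA are all hypothetical choices. The MLP block and residual connections are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hopfield_prompts(Z, Q, beta=1.0):
    """Single-step associative retrieval over token features.

    Z: (n_tokens, d) input tokens, used here as both keys and values (assumption).
    Q: (n_prompts, d) hypothetical learned queries, one per routing prompt.
    Returns P: (n_prompts, d), each row a convex combination of the tokens.
    """
    weights = softmax(beta * (Q @ Z.T), axis=-1)  # (n_prompts, n_tokens)
    return weights @ Z

def self_attention(X):
    # Projection-free single-head attention, a simplified stand-in for MHSA.
    d = X.shape[-1]
    A = softmax(X @ X.T / np.sqrt(d), axis=-1)
    return A @ X

def rwf_layer(Z, Q, beta=1.0):
    """One routing-augmented block: retrieve prompts, attend, drop prompt outputs."""
    P = hopfield_prompts(Z, Q, beta)       # input-conditioned routing prompts
    X = np.concatenate([P, Z], axis=0)     # prompts are concatenated with the tokens
    Y = self_attention(X)
    Z_tilde = Y[P.shape[0]:]               # keep backbone tokens, discard prompt outputs
    return Z_tilde, P

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 4))   # 6 tokens of dimension 4
Q = rng.normal(size=(2, 4))   # 2 hypothetical prompt queries
Z_tilde, P = rwf_layer(Z, Q, beta=2.0)
```

Because the prompts are recomputed from the current tokens in every forward pass, nothing task-specific has to be stored or merged in this sketch; only `Q` would be learned.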