Learning to Remember, Learn, and Forget in Attention-Based Models
Djohan Bonnet, Jamie Lohoff, Jan Finkbeiner, Elidona Skhikerujah, Emre Neftci
TL;DR
This work treats In-Context Learning (ICL) in transformers as a continual-learning problem in which a fixed-size memory is prone to interference. It proposes Palimpsa, a Bayesian metaplasticity-based attention mechanism that adapts the plasticity of each memory state through a per-state importance (I_t) and a forgetting gate tied to a memory window N_t, so that the model can forget while still preserving critical past information. The authors derive Palimpsa from a variational Bayesian objective, show that Mamba2 is a special case of Palimpsa, and provide a continuum that allows metaplastic fine-tuning of pre-trained models. Empirically, Palimpsa improves performance on the Multi-Query Associative Recall (MQAR) benchmark and on Commonsense Reasoning tasks, with larger gains as sequence length grows and with fine-tuning at scale, which points to practical memory improvements for edge-friendly, fixed-memory transformers.
Abstract
In-Context Learning (ICL) in transformers acts as an online associative memory and is believed to underpin their high performance on complex sequence processing tasks. However, in gated linear attention models, this memory has a fixed capacity and is prone to interference, especially for long sequences. We propose Palimpsa, a self-attention model that views ICL as a continual learning problem that must address a stability-plasticity dilemma. Palimpsa uses Bayesian metaplasticity, where the plasticity of each attention state is tied to an importance state grounded by a prior distribution that captures accumulated knowledge. We demonstrate that various gated linear attention models emerge as specific architecture choices and posterior approximations, and that Mamba2 is a special case of Palimpsa where forgetting dominates. This theoretical link enables the transformation of any non-metaplastic model into a metaplastic one, significantly expanding its memory capacity. Our experiments show that Palimpsa consistently outperforms baselines on the Multi-Query Associative Recall (MQAR) benchmark and on Commonsense Reasoning tasks.
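To make the fixed-capacity and interference claims concrete, the following is a minimal sketch of a generic gated linear attention memory with a scalar decay, in the style commonly associated with Mamba2. It is not Palimpsa's update rule, which this section does not spell out; the recurrence S_t = a_t S_{t-1} + v_t k_t^T, the function names `write_sequence` and `read`, and the toy vectors are all illustrative assumptions.

```python
# Illustrative sketch only: a scalar-gated linear attention memory.
# Assumed recurrence: S_t = a_t * S_{t-1} + v_t k_t^T, read-out o = S q.
# This is NOT the Palimpsa update; names and values are hypothetical.

def write_sequence(keys, values, decays):
    """Fold (key, value) pairs into a fixed-size d_v x d_k state matrix."""
    d_k, d_v = len(keys[0]), len(values[0])
    # The state size depends only on d_k and d_v, never on sequence length.
    state = [[0.0] * d_k for _ in range(d_v)]
    for k, v, a in zip(keys, values, decays):
        for i in range(d_v):
            for j in range(d_k):
                # Decay the old content, then add the new outer product.
                state[i][j] = a * state[i][j] + v[i] * k[j]
    return state


def read(state, query):
    """Retrieve the value associated with a query: o = S q."""
    return [sum(s_ij * q_j for s_ij, q_j in zip(row, query)) for row in state]


k1, k2, k3 = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]  # k3 overlaps with k1 and k2
v1, v2, v3 = [1.0], [2.0], [3.0]

# Orthogonal keys, no decay: the first association is recalled exactly.
clean = read(write_sequence([k1, k2], [v1, v2], [1.0, 1.0]), k1)

# A forgetting gate below 1 attenuates the older association.
decayed = read(write_sequence([k1, k2], [v1, v2], [1.0, 0.5]), k1)

# A non-orthogonal key written later corrupts the recall of k1.
interfered = read(write_sequence([k1, k2, k3], [v1, v2, v3], [1.0, 1.0, 1.0]), k1)

print(clean, decayed, interfered)
```

In this toy setting, a uniform decay trades recall of old items for room to store new ones, and overlapping keys corrupt each other even without decay. The abstract frames this trade-off as the stability-plasticity dilemma that per-state importance is meant to address.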
