Overcoming catastrophic forgetting with hard attention to the task
Joan Serrà, Dídac Surís, Marius Miron, Alexandros Karatzoglou
TL;DR
This work tackles catastrophic forgetting in sequential learning by introducing Hard Attention to the Task (HAT), a lightweight mechanism that conditions layer-wise unit masks on the current task via learnable embeddings. By accumulating past attentions and modulating gradient updates, HAT preserves previous task information while permitting adaptation to new tasks; it also introduces gradient compensation and an attention-weighted regularization to promote compact, reusable representations. Empirical results across eight image-classification datasets and multiple sequential setups show substantial reductions in forgetting (45 to 80%), robustness to hyperparameters, and practical benefits for online learning and model compression. The framework supports monitoring of capacity usage and weight reuse, and enables aggressive compression without significant accuracy loss, demonstrating both effectiveness and practicality for continual learning scenarios.
Abstract
Catastrophic forgetting occurs when a neural network loses the information learned in a previous task after training on subsequent tasks. This problem remains a hurdle for artificial intelligence systems with sequential learning capabilities. In this paper, we propose a task-based hard attention mechanism that preserves previous tasks' information without affecting the current task's learning. A hard attention mask is learned concurrently with every task, through stochastic gradient descent, and previous masks are exploited to condition such learning. We show that the proposed mechanism is effective for reducing catastrophic forgetting, cutting current rates by 45 to 80%. We also show that it is robust to different hyperparameter choices, and that it offers a number of monitoring capabilities. The approach makes it possible to control both the stability and compactness of the learned knowledge, which we believe also makes it attractive for online learning or network compression applications.
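
The masking and gradient-modulation ideas summarized above can be illustrated with a minimal NumPy sketch. It is a toy under stated assumptions, not the authors' implementation: it assumes a sigmoid gate with a positive scaling factor applied to a per-layer task embedding, an element-wise maximum to accumulate masks across tasks, and a gradient scaled by one minus the minimum of the accumulated masks of the two units a weight connects. The helper names (`task_mask`, `accumulate`, `modulate_gradient`), the scaling value, and the toy embeddings are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def task_mask(embedding, s):
    """Near-binary unit mask from a task embedding; a larger s gives a harder gate."""
    return sigmoid(s * embedding)

def accumulate(prev_cumulative, new_mask):
    """Element-wise max keeps every unit that any earlier task relied on."""
    return np.maximum(prev_cumulative, new_mask)

def modulate_gradient(grad, cum_out, cum_in):
    """Scale grad[i, j] (output unit i, input unit j) by
    1 - min(cum_out[i], cum_in[j]), so weights linking two units that
    earlier tasks both used receive (almost) no update."""
    return grad * (1.0 - np.minimum(cum_out[:, None], cum_in[None, :]))

# Toy layer with 3 input units and 2 output units, after one task was learned.
s = 50.0                              # assumed scaling factor (illustrative)
e_in = np.array([1.0, -1.0, 1.0])     # hypothetical embedding, input side
e_out = np.array([1.0, -1.0])         # hypothetical embedding, output side

cum_in = accumulate(np.zeros(3), task_mask(e_in, s))
cum_out = accumulate(np.zeros(2), task_mask(e_out, s))

grad = np.ones((2, 3))                # stand-in gradient for the next task
protected = modulate_gradient(grad, cum_out, cum_in)
print(np.round(protected, 3))
```

In this toy case only the weights that connect output unit 0 to input units 0 and 2 are frozen, since both endpoints were used by the earlier task; every other weight remains free to adapt to the new task.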
