Does RoBERTa Perform Better than BERT in Continual Learning: An Attention Sink Perspective
Xueying Bai, Yifan Sun, Niranjan Balasubramanian
TL;DR
The paper shows that high downstream capacity from pre-training does not guarantee strong continual learning (CL): attention sinks concentrate attention on ubiquitous tokens and can cause over-smoothing. It analyzes sink behavior and its link to cross-task interference, then introduces a pre-scaling mechanism that first uses a scaling layer in a probing stage to diversify token-wise attention over non-sink tokens, and then fine-tunes the model. Empirically, pre-scaling improves CL without experience replay and enables RoBERTa to surpass BERT in CL settings. The work offers practical CL improvements for pre-trained LMs and clarifies how attention distributions influence continual learning behavior.
Abstract
Continual learning (CL) aims to train models that can sequentially learn new tasks without forgetting previous tasks' knowledge. Although previous works observed that pre-training can benefit CL, it remains unclear whether a pre-trained model with higher downstream capacity also performs better in CL. In this paper, we observe that pre-trained models may allocate high attention scores to some 'sink' tokens, such as [SEP] tokens, which are ubiquitous across various tasks. Such attention sinks may lead to over-smoothing in single-task learning and to interference when tasks are learned sequentially, which may compromise the models' CL performance despite their high pre-trained capabilities. To reduce these effects, we propose a pre-scaling mechanism that encourages attention diversity across all tokens. Specifically, it first scales the task's attention to the non-sink tokens in a probing stage, and then fine-tunes the model with scaling. Experiments show that pre-scaling yields substantial improvements in CL without experience replay or progressively storing parameters from previous tasks.
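The sink effect and the idea of rescaling attention toward non-sink tokens can be illustrated with a small NumPy sketch. This is not the authors' implementation: the toy logits, the sink threshold (`factor`), and the hand-picked scale of 5.0 are all assumptions made for illustration, whereas the paper obtains its scaling in a probing stage. The helper names (`find_sinks`, `prescale_attention`, `row_entropy`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def find_sinks(attn, factor=2.0):
    """Flag key tokens whose average received attention exceeds
    `factor` times the uniform share 1/n (threshold is an assumption)."""
    n = attn.shape[-1]
    received = attn.mean(axis=0)  # column mean of a row-stochastic matrix
    return received > factor / n

def prescale_attention(attn, scales):
    """Multiply each key token's attention by a per-token scale,
    then renormalize every row so it sums to 1 again."""
    scaled = attn * scales[None, :]
    return scaled / scaled.sum(axis=1, keepdims=True)

def row_entropy(attn):
    """Entropy of each query's attention distribution (higher = more diverse)."""
    return -(attn * np.log(attn + 1e-12)).sum(axis=1)

# Toy single-head attention where every query strongly prefers the last token.
tokens = ["[CLS]", "the", "movie", "was", "great", "[SEP]"]
logits = np.zeros((len(tokens), len(tokens)))
logits[:, -1] = 3.0  # assumed bias that turns [SEP] into a sink
attn = softmax(logits)

sinks = find_sinks(attn)
# Hand-picked scales: boost non-sink tokens, leave sink tokens unchanged.
scales = np.where(sinks, 1.0, 5.0)
rescaled = prescale_attention(attn, scales)

print("sink tokens:", [t for t, s in zip(tokens, sinks) if s])
print("attention on [SEP] before/after:", attn[0, -1], rescaled[0, -1])
print("row entropy before/after:", row_entropy(attn)[0], row_entropy(rescaled)[0])
```

In this toy setting the rescaled rows place less mass on the sink token and have higher entropy, which is the kind of attention diversity the abstract describes; how the scales are actually obtained and used during fine-tuning is specified in the paper, not here.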
