Does RoBERTa Perform Better than BERT in Continual Learning: An Attention Sink Perspective

Xueying Bai; Yifan Sun; Niranjan Balasubramanian

TL;DR

The paper shows that high downstream capacity from pre-training does not guarantee strong continual learning (CL) performance: attention sinks over-attend to ubiquitous tokens, which can cause over-smoothing and cross-task interference. It analyzes sink behavior and its link to interference, then introduces a pre-scaling mechanism that first uses a scaling layer in a probing stage to diversify attention on non-sink tokens and then fine-tunes the model. Empirically, pre-scaling improves CL without experience replay and enables RoBERTa to surpass BERT in CL settings. The work provides practical CL improvements for pre-trained language models and clarifies how attention distributions influence continual learning behavior.

Abstract

Continual learning (CL) aims to train models that can sequentially learn new tasks without forgetting previous tasks' knowledge. Although previous works observed that pre-training can benefit CL, it remains unclear whether a pre-trained model with higher downstream capacity also performs better in CL. In this paper, we observe that pre-trained models may allocate high attention scores to some 'sink' tokens, such as [SEP] tokens, which are ubiquitous across various tasks. Such attention sinks may lead to over-smoothing in single-task learning and interference when learning sequential tasks, which may compromise the models' CL performance despite their high pre-trained capabilities. To reduce these effects, we propose a pre-scaling mechanism that encourages attention diversity across all tokens. Specifically, it first scales the task's attention to the non-sink tokens in a probing stage, and then fine-tunes the model with scaling. Experiments show that pre-scaling yields substantial improvements in CL without experience replay or progressively storing parameters from previous tasks.
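
The sink behavior described in the abstract can be illustrated with a small sketch. It assumes an attention map is a row-stochastic matrix (each query row sums to 1) and measures, for every token, the average attention it receives from all queries. The function names, the toy scores, and the top-k rule are illustrative assumptions; this is not the paper's implementation or its exact definition of a sink token.

```python
import math


def softmax(row):
    """Turn a list of raw scores into a probability distribution."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_received(attn):
    """Average attention each key token (column) receives over all query rows."""
    n_queries = len(attn)
    n_keys = len(attn[0])
    return [sum(attn[i][j] for i in range(n_queries)) / n_queries
            for j in range(n_keys)]


def top_sink_candidates(attn, k=1):
    """Indices of the k tokens that receive the most attention on average."""
    received = attention_received(attn)
    order = sorted(range(len(received)), key=lambda j: received[j], reverse=True)
    return order[:k]


# Toy input (hypothetical): every query assigns a large raw score to [SEP].
tokens = ["[CLS]", "the", "movie", "was", "great", "[SEP]"]
scores = [[0.0, 0.1, 0.2, 0.1, 0.3, 3.0] for _ in tokens]
attn = [softmax(row) for row in scores]

received = attention_received(attn)
sinks = top_sink_candidates(attn, k=1)
print([tokens[j] for j in sinks])  # prints ['[SEP]']
```

In this toy case a single column absorbs most of the attention mass, so the remaining tokens share very little. A real analysis would read attention maps from a pre-trained model and average over heads and layers, which this sketch leaves out.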

Paper Structure (19 sections, 13 equations, 9 figures, 4 tables)

This paper contains 19 sections, 13 equations, 9 figures, 4 tables.

Introduction
Related Work
Attention Sinks in Language Models
Empirical Analysis of Attention Sinks
Connection between Over-Smoothing and Attention Sinks
Attention Sink and Interference in Continual Learning
Interference in Continual Learning
Case Study: Attention Sink Can Cause Unnecessary Interference Between Tasks
Claim.
Transfer vs. Interference
Method: Pre-Scaling For Diverse Attention
Experiments
Experimental Settings
Results
Ablation Study
...and 4 more sections

Figures (9)

Figure 1: Attention maps averaged from all attention heads after pre-training, fine-tuning, and our pre-scaling mechanism. Sink tokens (i.e., with dark blue columns) in pre-trained models obtain similar high attention scores. After fine-tuning, models (especially RoBERTa) have drastic attention changes, which may indicate feature distortion. After pre-scaling, models have diverse attention on sink tokens and preserve the pre-trained attention patterns.
Figure 2: Average outer degrees and attention deviations on MNLI and SST data. (a) The cumulative average outer degrees of tokens with the top-1, top-3 and top-5 largest outer degrees. (b) The attention deviation of sink tokens with the top-1 largest outer degrees.
Figure 3: The left shows models' over-smoothing and attention deviation on sink tokens. The right shows the ratio of sink tokens that are common tokens across tasks. Here we consider special tokens, the punctuation '.' and the second token in the input as the common tokens. 'PT' stands for pre-training, and 'FT' stands for fine-tuning on 3k MNLI data.
Figure 4: (a) After fine-tuning, sink tokens' representations can be close to data ([CLS]) representations, even though the sink tokens are irrelevant to the task. (b) Models trained on 3k MNLI data and evaluated on MNLI and SNLI data, with and without attention on common sink tokens. Sink tokens ensure models' capacity and their transfer to similar tasks.
Figure 5: The scaling and regular model.
...and 4 more figures
