Mitigating Interference in the Knowledge Continuum through Attention-Guided Incremental Learning

Prashant Bhat; Bharath Renjith; Elahe Arani; Bahram Zonooz

TL;DR

This work tackles catastrophic forgetting in continual learning by introducing AGILE, a rehearsal-based approach that uses a shared task-attention module augmented with lightweight per-task projection vectors to reduce inter-task interference. By adding a task projection vector for each new task and employing a pairwise discrepancy loss alongside EMA-based consistency, AGILE achieves strong within-task prediction (WP) and task-id prediction (TP), scales to many tasks with minimal overhead, and exhibits improved calibration and reduced recency bias. Extensive experiments on Seq-CIFAR10/100 and Seq-TinyImageNet show AGILE outperforming rehearsal-based baselines in both Class-IL and Task-IL settings, with robust performance in low-buffer regimes. The findings demonstrate the effectiveness of task-attention mechanisms in continual learning and point to future work on extending AGILE to transformer architectures and further reducing forgetting in shared components.
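The summary mentions EMA-based consistency without giving its form. As a rough illustration only, the sketch below keeps an exponential moving average (EMA) of a model's parameters and penalizes disagreement between the working model and its EMA copy. The decay value, the mean-squared penalty, and the function names `ema_update` and `consistency_penalty` are assumptions made for this example, not details taken from the paper.

```python
def ema_update(ema_params, params, decay=0.999):
    """Blend current parameters into their running average (one EMA step)."""
    # decay is a hypothetical value; the source does not state one.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


def consistency_penalty(model_out, ema_out):
    """Mean squared difference between the working model's and the EMA model's outputs."""
    # A squared-error penalty is an assumption; other distances are possible.
    return sum((m - t) ** 2 for m, t in zip(model_out, ema_out)) / len(model_out)


# Usage: the EMA copy drifts slowly toward the working parameters.
ema = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.9)
penalty = consistency_penalty([1.0, 2.0], [1.0, 4.0])
```

Because the EMA copy changes slowly, pulling the working model toward its outputs on buffered samples is one plausible way to discourage abrupt drift from earlier tasks.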

Abstract

Continual learning (CL) remains a significant challenge for deep neural networks, as they are prone to forgetting previously acquired knowledge. Several approaches have been proposed in the literature, such as experience rehearsal, regularization, and parameter isolation, to address this problem. Although almost zero forgetting can be achieved in task-incremental learning, class-incremental learning remains highly challenging due to the problem of inter-task class separation. Limited access to previous task data makes it difficult to discriminate between classes of current and previous tasks. To address this issue, we propose "Attention-Guided Incremental Learning" (AGILE), a novel rehearsal-based CL approach that incorporates compact task attention to effectively reduce interference between tasks. AGILE utilizes lightweight, learnable task projection vectors to transform the latent representations of a shared task-attention module toward the task distribution. Through extensive empirical evaluation, we show that AGILE significantly improves generalization performance by mitigating task interference and outperforms rehearsal-based approaches in several CL scenarios. Furthermore, AGILE scales well to a large number of tasks with minimal overhead while remaining well-calibrated with reduced task-recency bias.
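To make the idea of a shared module with per-task projection vectors more concrete, here is a minimal pure-Python sketch. It assumes a bottleneck encoder, an elementwise product with the task's projection vector in the latent space, and a sigmoid gate over the input features; these choices, the random initialization, and the class name `TaskAttentionSketch` are illustrative assumptions rather than the authors' implementation.

```python
import math
import random


def _matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def _random_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class TaskAttentionSketch:
    """Shared bottleneck module plus one projection vector per task (hypothetical)."""

    def __init__(self, feat_dim, latent_dim, seed=0):
        self.rng = random.Random(seed)
        self.encoder = _random_matrix(latent_dim, feat_dim, self.rng)  # shared by all tasks
        self.decoder = _random_matrix(feat_dim, latent_dim, self.rng)  # shared by all tasks
        self.latent_dim = latent_dim
        self.projections = []  # grows by one vector per task

    def add_task(self):
        """Expand the model by a single latent_dim-sized vector for a new task."""
        self.projections.append(
            [self.rng.uniform(0.5, 1.5) for _ in range(self.latent_dim)]
        )

    def forward(self, features):
        """Return one gated feature vector per task seen so far."""
        latent = _matvec(self.encoder, features)  # shared latent representation
        outputs = []
        for proj in self.projections:
            shifted = [l * p for l, p in zip(latent, proj)]  # task-specific transform
            gate = [1.0 / (1.0 + math.exp(-g)) for g in _matvec(self.decoder, shifted)]
            outputs.append([f * g for f, g in zip(features, gate)])  # attended features
        return outputs


module = TaskAttentionSketch(feat_dim=8, latent_dim=3)
module.add_task()
module.add_task()
per_task_features = module.forward([1.0] * 8)  # one entry per task
```

In this sketch each new task adds only `latent_dim` parameters, which reflects the minimal-overhead claim; each entry of `per_task_features` would then go to that task's classifier.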

Paper Structure (31 sections, 1 theorem, 13 equations, 7 figures, 7 tables, 3 algorithms)

Introduction
Related Works
Rehearsal-based Approaches:
Task Attention:
Proposed Method
Motivation
Preliminary
Shared task-attention module
Network expansion
Implementation Details
Results
Experimental results
How AGILE facilitates a good WP and TP?
Ablation study
Parameter growth
...and 16 more sections

Key Result

Theorem 3

If $\mathcal{H}_{WP}(x) \leq \epsilon$ and $\mathcal{H}_{TP}(x) \leq \xi$, then $\mathcal{H}_{\text{Class-IL}}(x) \leq \epsilon + \xi$ (Kim et al., 2022).
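A short numeric check can make the bound easier to read. The sketch assumes that $\mathcal{H}$ denotes cross-entropy loss and that the Class-IL probability of the correct class factorizes into a within-task probability times a task-id probability; under that assumption the two losses simply add. The probabilities 0.9 and 0.8 are made-up values chosen for illustration.

```python
import math


def cross_entropy(prob_of_true_label):
    """Cross-entropy loss for a single sample given the probability of the true label."""
    return -math.log(prob_of_true_label)


def class_il_probability(p_within_task, p_task):
    """Assumed factorization: P(class | x) = P(class | x, task) * P(task | x)."""
    return p_within_task * p_task


p_wp = 0.9  # hypothetical probability of the right class given the right task
p_tp = 0.8  # hypothetical probability of identifying the right task

h_wp = cross_entropy(p_wp)
h_tp = cross_entropy(p_tp)
h_cil = cross_entropy(class_il_probability(p_wp, p_tp))

# The log of a product is the sum of logs, so h_cil equals h_wp + h_tp.
print(f"H_WP={h_wp:.4f}  H_TP={h_tp:.4f}  H_Class-IL={h_cil:.4f}")
```

Under this factorization, any bounds $\epsilon$ and $\xi$ on the two component losses carry over to their sum, which is why improving both WP and TP tightens the Class-IL loss.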

Figures (7)

Figure 1: Attention-Guided Incremental Learning (AGILE) consists of a shared task-attention module and a set of task-specific projection vectors, one for each task. Each sample is passed through the task-attention module once for each projection vector, and the outputs are fed into task-specific classifiers. AGILE effectively reduces task interference and facilitates accurate task-id prediction (TP) and within-task prediction (WP).
Figure 2: Comparison of AGILE with task-specific learning approaches in the Task-IL setting. We report the accuracy on all tasks at the end of CL training, with the average across all tasks in the legend. AGILE outperforms other baselines with little memory overhead.
Figure 3: Latent features and task projection vectors after training on Seq-CIFAR100 with 5 tasks. (Left) t-SNE visualization of the latent features of the shared task-attention module in the absence of task projection vectors; (Middle) Task projection vectors along the leading principal components; (Right) t-SNE visualization of the latent features of the shared task-attention module in the presence of task projection vectors. Task projection vectors specialize in transforming the latent representations of the shared task-attention module towards the task distribution, thereby reducing interference.
Figure 4: (Left) Confusion matrix for various CL models. ER and DER++ show high recency bias, while AGILE makes evenly distributed predictions. (Right) Reliability diagram with ECE, indicating that AGILE is well calibrated and attains the lowest ECE value. The dashed line denotes perfect calibration. All models are trained on Seq-CIFAR100 with 5 tasks.
Figure 5: Task-wise performance of CL models trained on Seq-CIFAR100 with buffer size 500. The performances of ER and DER++ mainly emanate from the most recent task, while that of AGILE comes more evenly from all the tasks.
...and 2 more figures
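
The Figure 4 caption reports the expected calibration error (ECE). For readers unfamiliar with the metric, the sketch below shows the common binned definition: predictions are grouped by confidence, and the gap between accuracy and mean confidence in each bin is averaged with bin-size weights. The number of bins and the function name are assumptions for this example; the source does not state how the paper computed ECE.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[index].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Usage with made-up predictions: 95% confidence but only 75% accuracy.
ece = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [True, True, True, False])
```

A perfectly calibrated model has an ECE of zero, which corresponds to the diagonal of a reliability diagram.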

Theorems & Definitions (3)

Definition 1
Definition 2
Theorem 3
