An Attention-based Representation Distillation Baseline for Multi-Label Continual Learning

Martin Menabue; Emanuele Frascaroli; Matteo Boschini; Lorenzo Bonicelli; Angelo Porrello; Simone Calderara

TL;DR

This paper tackles multi-label continual learning (MLCL) and its forgetting challenges, showing that state-of-the-art prompting methods struggle in MLCL. It introduces Selective Class Attention Distillation (SCAD), which uses a frozen pretrained teacher and adapters to selectively transfer class-token attention from teacher to student, complemented by experience replay and adapter-mask replay. Empirical results on IIRC CIFAR-100 and Incremental WebVision show that SCAD achieves higher final average PWJS and lower forgetting than strong baselines, highlighting the value of pretraining-aligned, selective knowledge transfer in MLCL. The work provides a practical baseline that integrates rehearsal, distillation, and attention-based filtering for real-world MLCL applications. The code is available at https://github.com/aimagelab/SCAD-LOD-2024.

Abstract

The field of Continual Learning (CL) has inspired numerous researchers over the years, leading to increasingly advanced countermeasures to the issue of catastrophic forgetting. Most studies have focused on the single-class scenario, where each example comes with a single label. The recent literature has successfully tackled such a setting, with impressive results. In contrast, we shift our attention to the multi-label scenario, which we consider more representative of real-world open problems. In our work, we show that existing state-of-the-art CL methods fail to achieve satisfactory performance in this setting, which calls into question the progress claimed in recent years. We therefore assess both old-style and novel strategies and propose, on top of them, an approach called Selective Class Attention Distillation (SCAD). It relies on a knowledge transfer technique that seeks to align the representations of the student network, which trains continuously and is subject to forgetting, with those of the teacher network, which is pretrained and kept frozen. Importantly, our method selectively transfers the relevant information from the teacher to the student, thereby preventing irrelevant information from harming the student's performance during online training. To demonstrate the merits of our approach, we conduct experiments on two different multi-label datasets, showing that our method outperforms the current state-of-the-art Continual Learning methods. Our findings highlight the importance of addressing the unique challenges posed by multi-label environments in the field of Continual Learning. The code of SCAD is available at https://github.com/aimagelab/SCAD-LOD-2024.

Paper Structure (23 sections, 11 equations, 1 figure, 3 tables)

Introduction
Related works
Method
Multi-Label Continual Learning
Pretraining and forgetting
Selective Class Attention Distillation
Backbone
Knowledge transfer technique
Adapter networks
Experience Replay
Adapter mask replay
Experiments
Experimental setup
Datasets
Metrics
...and 8 more sections

Figures (1)

Figure 1: Our proposal involves two pretrained Vision Transformers: one is frozen and referred to as the teacher, while the other is trainable and referred to as the student. An image is provided as input to both the teacher and the student, and intermediate representations are extracted. From these representations, using equation (\ref{eq:attn}) and the indexing indicated in equation (\ref{eq:dist}), we derive the attention vector that contains the relationships between the class token and the other tokens. These operations are summarized in the image by the operator $\mathcal{R}$. The attention vector is used by the attention modules (adapters) to derive binary vectors that filter the knowledge transfer. The feature propagation loss $\mathcal{L}_{FP}$ maintains alignment between the attention vectors of the teacher and the student.
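
The pipeline in the caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the referenced equations are not reproduced on this page, so the scaled dot-product attention, the linear-plus-threshold adapter, and the masked squared error used as a stand-in for $\mathcal{L}_{FP}$ are all assumptions, and every function and parameter name is hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def class_token_attention(q, k):
    """Attention from the class token to all other tokens.

    q, k: arrays of shape (heads, tokens, dim); token 0 is assumed to be
    the class token. Returns an array of shape (heads, tokens - 1).
    """
    dim = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dim)  # (heads, tokens, tokens)
    attn = softmax(scores, axis=-1)
    return attn[:, 0, 1:]  # class-token row, without its self-attention entry


def binary_mask(teacher_attn, weight, bias):
    """Hypothetical adapter: a linear map followed by a hard threshold.

    The source only says that adapters derive binary vectors from the
    attention vector; this particular form is an assumption.
    """
    logits = teacher_attn @ weight + bias
    return (logits > 0).astype(teacher_attn.dtype)


def masked_distillation_loss(student_attn, teacher_attn, mask):
    """Stand-in for the feature propagation loss: squared error between
    student and teacher attention, averaged over the entries kept by mask."""
    diff = mask * (student_attn - teacher_attn)
    return float((diff ** 2).sum() / max(mask.sum(), 1.0))


# Usage with random queries and keys standing in for one transformer layer.
rng = np.random.default_rng(0)
heads, tokens, dim = 2, 5, 4
teacher = class_token_attention(rng.normal(size=(heads, tokens, dim)),
                                rng.normal(size=(heads, tokens, dim)))
student = class_token_attention(rng.normal(size=(heads, tokens, dim)),
                                rng.normal(size=(heads, tokens, dim)))
mask = binary_mask(teacher, rng.normal(size=(tokens - 1, tokens - 1)),
                   np.zeros(tokens - 1))
loss = masked_distillation_loss(student, teacher, mask)
```

Entries where the mask is zero contribute nothing to the loss, which is how a binary vector can filter what is transferred from the teacher to the student.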
