Read Between the Layers: Leveraging Multi-Layer Representations for Rehearsal-Free Continual Learning with Pre-Trained Models

Kyra Ahrens; Hans Hergen Lehmann; Jae Hee Lee; Stefan Wermter

TL;DR

This work tackles rehearsal-free continual learning with pre-trained models by leveraging intermediate representations from multiple layers. It introduces LayUP, a class-prototype method that builds prototypes from concatenated multi-layer features and decorrelates them using second-order (Gram matrix) statistics, paired with first-session adaptation through parameter-efficient transfer learning (PETL). LayUP surpasses the state of the art on four of seven class-incremental learning (CIL) benchmarks, all three domain-incremental learning (DIL) benchmarks, and six of seven online continual learning (OCL) benchmarks while reducing memory and compute costs compared to existing baselines; it can also serve as a plug-in to enhance other prototype methods. The results indicate that exploiting pre-trained representations across layers, not only the final embedding, can improve domain transfer and continual learning in limited-data scenarios.

Abstract

We address the Continual Learning (CL) problem, wherein a model must learn a sequence of tasks from non-stationary distributions while preserving prior knowledge upon encountering new experiences. With the advancement of foundation models, CL research has pivoted from the initial learning-from-scratch paradigm towards utilizing generic features from large-scale pre-training. However, existing approaches to CL with pre-trained models primarily focus on separating class-specific features from the final representation layer and neglect the potential of intermediate representations to capture low- and mid-level features, which are more invariant to domain shifts. In this work, we propose LayUP, a new prototype-based approach to CL that leverages second-order feature statistics from multiple intermediate layers of a pre-trained network. Our method is conceptually simple, does not require access to prior data, and works out of the box with any foundation model. LayUP surpasses the state of the art in four of the seven class-incremental learning benchmarks, all three domain-incremental learning benchmarks, and six of the seven online continual learning benchmarks, while significantly reducing memory and computational requirements compared to existing baselines. Our results demonstrate that fully exhausting the representational capacities of pre-trained models in CL goes well beyond their final embeddings.
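
To make the idea of prototypes built from concatenated multi-layer features and decorrelated with second-order statistics more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: this page does not reproduce the paper's equations, so the ridge-regularized form `(G + ridge * I)^{-1} C`, the class name `GramPrototypeClassifier`, and the `ridge` parameter are all assumptions made for illustration. Features are assumed to be already extracted from the chosen layers of a frozen backbone.

```python
import numpy as np


class GramPrototypeClassifier:
    """Hypothetical sketch: class prototypes decorrelated by a Gram matrix.

    Assumed form (not taken from the paper): scores = f^T (G + ridge*I)^{-1} C,
    where f is the concatenation of features from several layers,
    G accumulates f f^T and C accumulates per-class feature sums.
    """

    def __init__(self, dim, num_classes, ridge=1.0):
        self.gram = np.zeros((dim, dim))            # second-order statistics G
        self.protos = np.zeros((dim, num_classes))  # class-wise feature sums C
        self.ridge = ridge                          # assumed regularization strength

    def update(self, layer_feats, labels):
        # layer_feats: list of (n, d_l) arrays, one per selected layer.
        feats = np.concatenate(layer_feats, axis=1)
        onehot = np.eye(self.protos.shape[1])[labels]
        # Both statistics are running sums, so no past data needs to be stored.
        self.gram += feats.T @ feats
        self.protos += feats.T @ onehot

    def predict(self, layer_feats):
        feats = np.concatenate(layer_feats, axis=1)
        reg = self.gram + self.ridge * np.eye(self.gram.shape[0])
        weights = np.linalg.solve(reg, self.protos)  # (dim, num_classes)
        return (feats @ weights).argmax(axis=1)
```

Because `update` only adds to running sums, calling it once per task yields the same statistics as a single pass over all tasks jointly, which is what makes this kind of classifier compatible with a rehearsal-free setting.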

Paper Structure (32 sections, 9 equations, 10 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Continual learning
Continual learning with pre-trained models
Preliminaries
Continual learning problem formulation
Class-prototype methods for CL with pre-trained models
LayUP
Class prototyping from second-order intermediate features
Combination with parameter-efficient model adaptation
Memory and runtime complexity comparisons
Experiments
Datasets and implementation details
Performance in the CIL and DIL settings
Performance in the OCL setting
...and 17 more sections

Figures (10)

Figure 1: Prior works
Figure 2: LayUP
Figure 4: Classification performance of intermediate layers. Comparison for two different pre-training paradigms of a ViT-B/16 (Dosovitskiy et al., 2021) backbone on the split ImageNet-R (left) and CIFAR-100 (right) datasets. For each intermediate layer $l \in \{1, \dots, L-1\}$ (where $L=12$ denotes the final representation layer), the bars represent the percentage of classes for which a classifier using Eq. \ref{eq:gram-classifier} at layer $l$ surpasses the accuracy of the classifier at the $L$th layer.
Figure 5: Comparison of techniques to integrate intermediate representations. LayUP implementations for different values of $k$ using a shared representation and Gram matrix as in Eq. \ref{eq:layup-classifier}, versus averaging over separate ridge (or Gram) classifiers for each layer using Eq. \ref{eq:gram-classifier}. Results are reported as average accuracy scores over CL training on the split ImageNet-R (left) and CIFAR-100 (right) datasets following phase B of Algorithm \ref{alg:method-summary}.
Figure 6: Choice of $k$ for different dataset characteristics. Multi-layer representation depth $k$ that yields the highest accuracy vs. normalized MMD between each dataset and the ImageNet pre-training domain, represented by the miniImageNet dataset (left), intra-class similarity (center), and inter-class similarity (right).
...and 5 more figures
