RanPAC: Random Projections and Pre-trained Models for Continual Learning

Mark D. McDonnell, Dong Gong, Amin Parveneh, Ehsan Abbasnejad, Anton van den Hengel

TL;DR

The paper tackles forgetting in continual learning with pre-trained models by introducing RanPAC, which inserts a frozen, training-free Random Projection (RP) layer between the pre-trained feature extractor and a Class-Prototype (CP) head. By expanding feature interactions via nonlinear RP and decorrelating class prototypes through second-order statistics, RanPAC enables a simple, rehearsal-free CP-based classifier to approach joint training performance. Across seven class-incremental benchmarks with ViT-B/16 backbones, RanPAC reduces final error rates by between 20% and 62% relative to previous methods, without any rehearsal memory, and is compatible with Parameter-Efficient Transfer Learning (PETL) methods for first-session adaptation. The work highlights the practical potential of CP strategies when augmented with RP and decorrelation, offering a simple, scalable, and fast alternative to full fine-tuning in the era of foundation models.

Abstract

Continual learning (CL) aims to incrementally learn different tasks (such as classification) in a non-stationary data stream without forgetting old ones. Most CL works focus on tackling catastrophic forgetting under a learning-from-scratch paradigm. However, with the increasing prominence of foundation models, pre-trained models equipped with informative representations have become available for various downstream requirements. Several CL methods based on pre-trained models have been explored, either utilizing pre-extracted features directly (which makes bridging distribution gaps challenging) or incorporating adaptors (which may be subject to forgetting). In this paper, we propose a concise and effective approach for CL with pre-trained models. Given that forgetting occurs during parameter updating, we contemplate an alternative approach that exploits training-free random projectors and class-prototype accumulation, which thus bypasses the issue. Specifically, we inject a frozen Random Projection layer with nonlinear activation between the pre-trained model's feature representations and output head, which captures interactions between features with expanded dimensionality, providing enhanced linear separability for class-prototype-based CL. We also demonstrate the importance of decorrelating the class-prototypes to reduce the distribution disparity when using pre-trained representations. These techniques prove to be effective and circumvent the problem of forgetting for both class- and domain-incremental continual learning. Compared to previous methods applied to pre-trained ViT-B/16 models, we reduce final error rates by between 20% and 62% on seven class-incremental benchmarks, despite not using any rehearsal memory. We conclude that the full potential of pre-trained models for simple, effective, and fast CL has not hitherto been fully tapped. Code is at github.com/RanPAC/RanPAC.
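
The frozen Random Projection layer described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the Gaussian weight initialization, the ReLU activation, the projection size `M = 2000`, and the function names `make_random_projection` and `project` are all assumptions chosen for illustration. The input dimension of 768 matches the feature size of a ViT-B/16 backbone.

```python
import numpy as np

def make_random_projection(L, M, seed=0):
    """Draw a fixed L x M weight matrix once; it is never trained afterwards.

    Gaussian initialization is an assumption made for this sketch.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((L, M))

def project(features, W):
    """Map (batch, L) features to (batch, M) using h = phi(f W).

    ReLU is used as the nonlinearity phi here; this choice is an assumption.
    """
    return np.maximum(features @ W, 0.0)

# Stand-in for features extracted from a frozen pre-trained model.
rng = np.random.default_rng(1)
f = rng.standard_normal((4, 768))       # 4 samples, L = 768
W = make_random_projection(768, 2000)   # hypothetical M = 2000 > L
h = project(f, W)
print(h.shape)
```

Because `W` is drawn once and never updated, no parameter in this layer can drift between tasks, which is the property the abstract relies on to bypass forgetting.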

Paper Structure

This paper contains 46 sections, 21 equations, 12 figures, 13 tables, 1 algorithm.

Figures (12)

  • Figure 1: The RP method can lead to a representation space with clear class separation. Colored points are 2D t-SNE visualizations for CIFAR-100 classes with features from a pre-trained ViT-B/16 transformer network.
  • Figure 2: Left: Histograms of similarities between class-prototypes and feature vectors from the ViT-B/16 transformer model pre-trained on ImageNet-1K, for the training samples of the ImageNet-R dataset. Right: Pearson correlation coefficients (CCs) for 10 pairs of class-prototypes. Reduced correlations between CPs of different classes (right) coincide with better class separability (left).
  • Figure 3: Impact of RP compared to alternatives. (a) Using only Phase 2 of Algorithm 1, we show average accuracy (see Results) after each of $T=10$ tasks in the split ImageNet-R dataset. The data illustrates the value of nonlinearity combined with large numbers of RPs, $M$. (b) Illustration that interaction terms created from feature vectors extracted from frozen pre-trained models contain important information that can be mostly recovered when RP and nonlinearity are used.
  • Figure : RanPAC Training
  • Figure A1: Overview of RanPAC for CL classification. In Phase 1 of Algorithm 1 we optionally inject parameters for a Parameter-Efficient Transfer Learning (PETL) method into a frozen pre-trained model (PTM). The PETL parameters are trained only on Task 1 (the 'first-session') of a set of $T$ continual learning tasks, to help bridge the domain gap, as in zhou2023revisiting and panos2023session. Then in Phase 2, $L$-dimensional feature vectors, ${\bf f}$, are first extracted from the network after completion of learning in Phase 1 (now frozen). The extracted feature vectors are then randomly projected to dimension $M$ (typically $M > L$) using frozen weights ${\bf W}$, followed by nonlinear activation, $\phi$, to obtain new feature vectors, ${\bf h}=\phi({\bf f}^\top{\bf W})$. Training in Phase 2 consists of continual iterative updating of class prototypes and the Gram matrix, followed by a matrix inversion to compute the $M\times N$ matrix ${\bf W}_{\rm o}$ at the end of each task's training. These weights can be thought of as decorrelated class prototypes that linearly weight the features in ${\bf h}$ to obtain class predictions in ${\bf y}$.
  • ...and 7 more figures
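
The Phase 2 procedure summarized in the Figure A1 caption (accumulate class prototypes and a Gram matrix, then invert to obtain ${\bf W}_{\rm o}$) can be sketched in NumPy as below. This is a hedged illustration under stated assumptions, not the authors' code: the class name `PrototypeHead`, the ReLU activation, the Gaussian projection weights, the ridge term added before inversion, and the toy data sizes are all hypothetical choices.

```python
import numpy as np

class PrototypeHead:
    """Accumulates second-order statistics of projected features, task by task."""

    def __init__(self, L, M, num_classes, ridge=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((L, M))   # frozen random projection
        self.G = np.zeros((M, M))              # Gram matrix: sum of h h^T
        self.C = np.zeros((M, num_classes))    # summed class prototypes
        self.ridge = ridge                     # assumed regularizer
        self.W_o = np.zeros((M, num_classes))  # M x N output weights

    def _project(self, f):
        # h = phi(f W), with ReLU assumed for phi
        return np.maximum(f @ self.W, 0.0)

    def update(self, f, labels):
        """Call once per task: add this task's statistics, then re-solve."""
        h = self._project(f)
        y = np.eye(self.C.shape[1])[labels]    # one-hot targets
        self.G += h.T @ h
        self.C += h.T @ y
        eye = np.eye(self.G.shape[0])
        # Solving the linear system replaces an explicit matrix inverse.
        self.W_o = np.linalg.solve(self.G + self.ridge * eye, self.C)

    def predict(self, f):
        return np.argmax(self._project(f) @ self.W_o, axis=1)

# Toy check: two sequential tasks versus one joint pass over the same data.
rng = np.random.default_rng(0)
f = rng.standard_normal((40, 8))
labels = rng.integers(0, 4, size=40)

sequential = PrototypeHead(L=8, M=32, num_classes=4)
for task_classes in ([0, 1], [2, 3]):
    mask = np.isin(labels, task_classes)
    sequential.update(f[mask], labels[mask])

joint = PrototypeHead(L=8, M=32, num_classes=4)
joint.update(f, labels)
same_weights = np.allclose(sequential.W_o, joint.W_o)
```

Since `G` and `C` are plain sums over samples, the order in which tasks arrive does not change them, so the sequentially trained weights match the jointly trained ones up to floating-point error. This illustrates why accumulating statistics, rather than updating parameters by gradient descent, sidesteps forgetting in this setting.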