Gradient Projection For Continual Parameter-Efficient Tuning

Jingyang Qiao; Zhizhong Zhang; Xin Tan; Yanyun Qu; Wensheng Zhang; Zhi Han; Yuan Xie

TL;DR

This work tackles catastrophic forgetting in parameter-efficient tuning (PET) for continual learning by introducing Parameter Efficient Gradient Projection (PEGP), a unified framework that constrains gradient updates to be orthogonal to the subspace spanned by old features. PEGP derives a gradient projection matrix from a sampled feature space via singular value decomposition and applies it to the Prompt, Prefix, Adapter, and LoRA paradigms (through their self-attention and residual-connection mechanisms) to resist forgetting with little extra memory and computation. The authors provide theoretical justification (Proposition 1) and show that anti-forgetting updates can be realized on both single-modality (ViT) and multi-modal (CLIP) backbones, reporting state-of-the-art accuracy and forgetting metrics across class, online class, domain, task, and cross-modality continual learning, including on the BITM dataset. The approach also shows potential for reducing zero-shot generalization collapse and hallucination in large pre-trained and multi-modal models.
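The projection step summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `energy` threshold used to choose how many singular vectors to keep, and the choice to right-multiply a linear layer's gradient are all assumptions made for illustration.

```python
import numpy as np


def projection_matrix(features, energy=0.95):
    """Projector onto the orthogonal complement of the dominant old-feature subspace.

    features: (d, n) array whose columns are sampled old-task feature vectors.
    energy:   hypothetical threshold, the fraction of squared singular values
              that the retained basis must cover (an assumption of this sketch).
    """
    U, S, _ = np.linalg.svd(features, full_matrices=False)
    ratio = np.cumsum(S**2) / np.sum(S**2)
    # Smallest k such that the first k singular vectors reach the threshold.
    k = min(int(np.searchsorted(ratio, energy)) + 1, U.shape[1])
    basis = U[:, :k]                      # (d, k) orthonormal basis of old features
    return np.eye(features.shape[0]) - basis @ basis.T


def project_gradient(grad, P):
    """Remove from each row of grad the component lying in the old-feature subspace.

    grad: (out, d) gradient of a weight matrix that acts on d-dimensional inputs.
    """
    return grad @ P


# Toy example: old features span the first three coordinate axes of a 6-d space.
old_features = np.zeros((6, 3))
old_features[0, 0] = old_features[1, 1] = old_features[2, 2] = 1.0

P = projection_matrix(old_features)
g = project_gradient(np.ones((2, 6)), P)

# The projected update leaves outputs on old features unchanged.
print(np.allclose(g @ old_features, 0.0))
```

Because the projected gradient annihilates every retained old-feature direction, a step along it changes the layer's response only on directions the old tasks did not use.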

Abstract

Parameter-efficient tunings (PETs) have demonstrated impressive performance and promising prospects in training large models, yet they still face a common problem: the trade-off between learning new content and protecting old knowledge, which leads to zero-shot generalization collapse and cross-modal hallucination. In this paper, we reformulate Adapter, LoRA, Prefix-tuning, and Prompt-tuning from the perspective of gradient projection and propose, for the first time, a unified framework called Parameter Efficient Gradient Projection (PEGP). We introduce orthogonal gradient projection into the different PET paradigms and theoretically demonstrate that the orthogonality condition on the gradient can effectively resist forgetting, even for large-scale models. PEGP therefore steers the gradient towards the direction that has less impact on the old feature space, at a small cost in extra memory and training time. We extensively evaluate our method with different backbones, including ViT and CLIP, on diverse datasets, and the experiments comprehensively demonstrate its effectiveness in reducing forgetting in class, online class, domain, task, and multi-modality continual settings. The project page is available at https://dmcv-ecnu-pegp.github.io/.
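To see why an orthogonality condition on the gradient protects old knowledge, consider a simplified linear-layer stand-in. This is an illustrative assumption, not the statement of the paper's Proposition 1; the symbols $W_t$, $x_t$, $G$, and $\eta$ are introduced here only for the example. Let $W_t$ be the tuned weights after task $t$, $x_t$ a feature from that task, and $W_{t+1} = W_t - \eta G$ the weights after a gradient step on a new task:

```latex
\begin{aligned}
W_{t+1}\, x_t &= W_t\, x_t - \eta\, G\, x_t, \\
W_{t+1}\, x_t = W_t\, x_t \;&\Longleftrightarrow\; G\, x_t = 0 .
\end{aligned}
```

Under this simplification, the output on an old feature is preserved exactly when every row of $G$ is orthogonal to $x_t$, which is the kind of condition a gradient projection matrix is meant to enforce over the whole old feature space.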

Paper Structure (19 sections, 34 equations, 15 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Continual Learning
PET-Based Continual Learning
Background of Gradient Projection Method
Differences From the Preliminary Version
Method
A Unified Parameter-Efficient Continual Method
Self-Attention Based Gradient Projection Method
Res-Connection Based Gradient Projection Method
Experiments
Evaluation Benchmarks and Protocol
Implementation Details
Class/Online Class/Task Incremental Learning
Experiments on ViT Backbone
...and 4 more sections

Figures (15)

Figure 1: Radar chart of continual learning results on multiple datasets with the ViT backbone under the (a) Adapter, (b) LoRA, (c) Prefix-Tuning, and (d) Prompt-Tuning paradigms. ACC refers to the average accuracy metric (higher is better). FOR refers to the forgetting metric (lower is better). For example, "ACC-CIFAR-10" denotes the average accuracy on the 10-Split-CIFAR100 dataset with tuning parameters of width 10.
Figure 2: Illustration of our motivations and methods. (a) Through the investigation of four PETs, we discover a unified anti-forgetting formula from two distinct mechanisms. (b) Implementation of the PEGP process: feature space sampling, singular value decomposition, computation of the gradient projection matrix, and gradient projection.
Figure 3: Flowchart for Prompt-based gradient projection.
Figure 4: Flowchart for Prefix-based gradient projection.
Figure 5: Comparison of how the gradient projection matrix is obtained in PGP and in this work.
...and 10 more figures
