BOFA: Bridge-Layer Orthogonal Low-Rank Fusion for CLIP-Based Class-Incremental Learning

Lan Li, Tao Hu, Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan

TL;DR

BOFA tackles exemplar-free class-incremental learning with CLIP by restricting adaptation to the cross-modal bridge-layer and stabilizing updates with Orthogonal Low-Rank Fusion. The method constrains updates to an Orthogonal Safe Subspace and leverages LoRA to implement the constraint, while enhancing classification with cross-modal hybrid prototypes that combine textual and visual cues. Empirical results across nine diverse benchmarks demonstrate state-of-the-art accuracy and efficiency, achieved without extra parameters or inference cost and without data replay. This approach provides a scalable, practical solution for continual learning in vision-language models.

Abstract

Class-Incremental Learning (CIL) aims to continually learn new categories without forgetting previously acquired knowledge. Vision-language models such as CLIP offer strong transferable representations via multi-modal supervision, making them promising for CIL. However, applying CLIP to CIL poses two major challenges: (1) adapting to downstream tasks often requires additional learnable modules, increasing model complexity and susceptibility to forgetting; and (2) while multi-modal representations offer complementary strengths, existing methods have yet to fully realize their potential in effectively integrating visual and textual modalities. To address these issues, we propose BOFA (Bridge-layer Orthogonal Fusion for Adaptation), a novel framework for CIL. BOFA confines all model adaptation exclusively to CLIP's existing cross-modal bridge-layer, thereby adding no extra parameters or inference cost. To prevent forgetting within this layer, it leverages Orthogonal Low-Rank Fusion, a mechanism that constrains parameter updates to a low-rank "safe subspace" mathematically constructed to be orthogonal to past task features. This ensures stable knowledge accumulation without data replay. Furthermore, BOFA employs a cross-modal hybrid prototype that synergizes stable textual prototypes with visual counterparts derived from our stably adapted bridge-layer, enhancing classification performance. Extensive experiments on standard benchmarks show that BOFA achieves superior accuracy and efficiency compared to existing methods.
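The abstract does not spell out how textual and visual prototypes are combined, so the following is only a minimal sketch of one plausible reading: each class gets a hybrid prototype formed as a convex combination of its normalized text prototype and its normalized visual prototype, and a sample is assigned to the class whose hybrid prototype has the highest cosine similarity. The mixing weight `alpha` and the function names `hybrid_prototypes` and `classify` are assumptions made for illustration, not the paper's formulation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit Euclidean length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def hybrid_prototypes(text_protos, visual_protos, alpha=0.5):
    """Blend per-class text and visual prototypes (both of shape (C, d)).

    `alpha` is a hypothetical mixing weight: 1.0 uses text only,
    0.0 uses visual prototypes only.
    """
    t = l2_normalize(text_protos)
    v = l2_normalize(visual_protos)
    return l2_normalize(alpha * t + (1.0 - alpha) * v)


def classify(features, prototypes):
    """Assign each feature (N, d) to the prototype with highest cosine similarity."""
    sims = l2_normalize(features) @ prototypes.T
    return sims.argmax(axis=1)


# Toy usage with three classes in a four-dimensional embedding space.
text_protos = np.eye(3, 4)
visual_protos = np.eye(3, 4) + 0.1
features = 2.0 * np.eye(3, 4)
predictions = classify(features, hybrid_prototypes(text_protos, visual_protos))
print(predictions)
```

In this sketch the text prototypes stay fixed while the visual prototypes would be recomputed from the adapted bridge-layer features; how the two are actually weighted in BOFA is not stated in the text above.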

Paper Structure

This paper contains 18 sections, 1 theorem, 13 equations, 4 figures, 2 tables.

Key Result

Proposition 1

The cumulative scatter matrix of past features is defined as $\mathbf{S}_{\text{old}} = \mathbf{X}_{\text{old}}^\top \mathbf{X}_{\text{old}}$, which, being symmetric positive semi-definite, admits an eigendecomposition. The optimal solution to the approximate null-space objective is the subspace spanned by the eigenvectors of $\mathbf{S}_{\text{old}}$ associated with its $k$ smallest eigenvalues.
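Proposition 1 can be illustrated numerically. The sketch below builds the scatter matrix from a matrix of past features, takes the eigenvectors with the $k$ smallest eigenvalues as a basis `P`, and projects a low-rank update onto that subspace so that it barely changes the layer's response to past features. The row-wise feature layout, the factor names `B` and `A`, and the helpers `orthogonal_safe_subspace` and `constrain_update` are assumptions for illustration; how BOFA applies the constraint inside LoRA training may differ.

```python
import numpy as np


def orthogonal_safe_subspace(X_old, k):
    """Return a (d, k) orthonormal basis for the k smallest-eigenvalue
    directions of S_old = X_old^T X_old, with X_old of shape (n, d)."""
    S_old = X_old.T @ X_old
    # eigh handles symmetric matrices and returns eigenvalues in ascending
    # order, so the first k columns belong to the k smallest eigenvalues.
    _, eigvecs = np.linalg.eigh(S_old)
    return eigvecs[:, :k]


def constrain_update(delta_W, P):
    """Project the input side of an update delta_W (d_out, d) onto span(P)."""
    return delta_W @ P @ P.T


rng = np.random.default_rng(0)

# Past features that occupy only 3 of 8 dimensions, so a null space exists.
X_old = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 8))
P = orthogonal_safe_subspace(X_old, k=4)

# A rank-2 update written as a product of two low-rank factors (assumed names).
B = rng.normal(size=(6, 2))
A = rng.normal(size=(2, 8))
raw_update = B @ A
safe_update = constrain_update(raw_update, P)

# Change in the layer output on past features, before and after projection.
print(np.abs(X_old @ raw_update.T).max())
print(np.abs(X_old @ safe_update.T).max())
```

Because the projection multiplies the update by `P @ P.T`, the rank of the update cannot increase, and its effect on past features is bounded by the size of the $k$ smallest eigenvalues of the scatter matrix.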

Figures (4)

  • Figure 1: Overview of Orthogonal Low-Rank Fusion, where an Orthogonal Safe Subspace (OSS) $\mathbf{P}^*$ is constructed from past task features to constrain the low-rank update for a new task, thereby minimizing interference with prior knowledge.
  • Figure 2: Incremental performance of different methods. Accuracy is reported at each incremental stage. BOFA consistently outperforms all baselines, with the final gap to the strongest competitor noted at the end of each curve. Additional results are available in the supplementary material.
  • Figure 3: $\bar{\mathcal{A}}$ comparison on four datasets for various ablation variants of BOFA.
  • Figure 4: t-SNE visualization of features (circles) and class prototypes (stars) on CIFAR100 B0 Inc5. We show the feature distributions of old classes (0–4) and new classes (5–9) with (left) and without (right) applying Orthogonal Low-Rank Fusion.

Theorems & Definitions (1)

  • Proposition 1