BOFA: Bridge-Layer Orthogonal Low-Rank Fusion for CLIP-Based Class-Incremental Learning
Lan Li, Tao Hu, Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan
TL;DR
BOFA tackles exemplar-free class-incremental learning with CLIP by restricting adaptation to the existing cross-modal bridge-layer and stabilizing updates with Orthogonal Low-Rank Fusion. The method constrains parameter updates to an Orthogonal Safe Subspace, using LoRA to implement the constraint, and improves classification with cross-modal hybrid prototypes that combine textual and visual cues. Results on nine diverse benchmarks show state-of-the-art accuracy and efficiency, with no extra parameters, no added inference cost, and no data replay, which makes the approach a scalable and practical option for continual learning in vision-language models.
Abstract
Class-Incremental Learning (CIL) aims to continually learn new categories without forgetting previously acquired knowledge. Vision-language models such as CLIP offer strong transferable representations via multi-modal supervision, making them promising for CIL. However, applying CLIP to CIL poses two major challenges: (1) adapting to downstream tasks often requires additional learnable modules, which increases model complexity and susceptibility to forgetting; and (2) although multi-modal representations offer complementary strengths, existing methods do not yet integrate the visual and textual modalities effectively. To address these issues, we propose BOFA (Bridge-layer Orthogonal Fusion for Adaptation), a novel framework for CIL. BOFA confines all model adaptation to CLIP's existing cross-modal bridge-layer, thereby adding no extra parameters or inference cost. To prevent forgetting within this layer, it leverages Orthogonal Low-Rank Fusion, a mechanism that constrains parameter updates to a low-rank "safe subspace" mathematically constructed to be orthogonal to past task features. This ensures stable knowledge accumulation without data replay. Furthermore, BOFA employs a cross-modal hybrid prototype that combines stable textual prototypes with visual counterparts derived from the stably adapted bridge-layer, enhancing classification performance. Extensive experiments on standard benchmarks show that BOFA achieves superior accuracy and efficiency compared to existing methods.
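
To make the two mechanisms in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the bridge-layer is a single linear map `w` applied as `y = w @ x`, that past-task features are available as a matrix `past_features`, and that the safe subspace is taken as the numerical null space of those features (the `rel_tol` threshold is a hypothetical choice). The function names and the convex combination with weight `alpha` in `hybrid_scores` are likewise illustrative assumptions; the abstract does not specify how the two prototype types are combined.

```python
import numpy as np


def safe_subspace_projector(past_features, rel_tol=1e-6):
    """Projector onto directions (numerically) orthogonal to past features.

    past_features: array of shape (n_samples, d_in).
    rel_tol is a hypothetical threshold relative to the top singular value.
    """
    _, s, vt = np.linalg.svd(past_features, full_matrices=True)
    # Number of directions actually occupied by past-task features.
    k = int(np.sum(s > rel_tol * s[0]))
    # Remaining right singular vectors span the orthogonal complement.
    u_safe = vt[k:].T                      # shape (d_in, d_in - k)
    return u_safe @ u_safe.T               # shape (d_in, d_in)


def fuse_low_rank_update(w, b, a, projector):
    """Fuse a LoRA-style update b @ a into w after projecting it.

    w: (d_out, d_in), b: (d_out, r), a: (r, d_in).
    The projected update maps past features to (approximately) zero,
    and the result has the same shape as w, so no parameters are added.
    """
    return w + b @ a @ projector


def hybrid_scores(feature, text_protos, visual_protos, alpha=0.5):
    """Assumed hybrid rule: weighted sum of cosine similarities."""
    def cosine(x, protos):
        x = x / np.linalg.norm(x)
        protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
        return protos @ x

    return alpha * cosine(feature, text_protos) + (1 - alpha) * cosine(
        feature, visual_protos
    )
```

Because the projected update vanishes on the span of `past_features`, the fused weight produces the same outputs as the original weight for old-task inputs, which is the sense in which the update is "safe" in this sketch. Fusing the update into `w` rather than keeping `b` and `a` as separate modules is what keeps the parameter count and inference cost unchanged.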
