Continual Learning with Vision-Language Models via Semantic-Geometry Preservation

Chiyuan He, Zihuan Qiu, Fanman Meng, Runtong Zhang, Linfeng Xu, Qingbo Wu, Hongliang Li

Abstract

Continual learning of pretrained vision-language models (VLMs) is prone to catastrophic forgetting, yet current approaches adapt to new tasks without explicitly preserving the cross-modal semantic geometry inherited from pretraining and previous stages, allowing new-task supervision to induce geometric distortion. We observe that the most pronounced drift tends to concentrate in vulnerable neighborhoods near the old-new semantic interface, where shared visual patterns are easily re-explained by new textual semantics. To address this under an exemplar-free constraint, we propose Semantic Geometry Preservation for Continual Learning (SeGP-CL). SeGP-CL first probes the drift-prone region by constructing a compact set of adversarial anchors with dual-targeted projected gradient descent (DPGD), which drives selected new-task seeds toward old-class semantics while remaining faithful in raw visual space. During training, we preserve cross-modal structure by anchor-guided cross-modal geometry distillation (ACGD), and stabilize the textual reference frame across tasks via a lightweight text semantic-geometry regularization (TSGR). After training, we estimate anchor-induced raw-space drift to transfer old visual prototypes and perform dual-path inference by fusing cross-modal and visual cues. Extensive experiments on five continual learning benchmarks demonstrate that SeGP-CL consistently improves stability and forward transfer, achieving state-of-the-art performance while better preserving semantic geometry of VLMs.
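The post-training step described above (drift estimation, prototype transfer, dual-path inference) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not specify the drift estimator or the fusion rule, so this sketch uses a uniform mean displacement of anchor features and a weighted sum of cosine scores with a hypothetical weight `alpha`. The function names are likewise hypothetical and are not taken from the authors' code.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def estimate_drift(anchor_feats_old, anchor_feats_new):
    """Assumed estimator: mean displacement of anchor features between the
    encoder before the update (old) and after the update (new)."""
    n = len(anchor_feats_old)
    dim = len(anchor_feats_old[0])
    return [
        sum(new[d] - old[d] for old, new in zip(anchor_feats_old, anchor_feats_new)) / n
        for d in range(dim)
    ]


def transfer_prototypes(prototypes, drift):
    """Shift every stored old-class prototype by the estimated drift."""
    return {c: [p + d for p, d in zip(proto, drift)] for c, proto in prototypes.items()}


def dual_path_predict(image_feat, text_embeds, prototypes, alpha=0.5):
    """Fuse a cross-modal score (image vs. text) with a visual score
    (image vs. prototype). `alpha` is a hypothetical fusion weight."""
    scores = {}
    for c in text_embeds:
        cross_modal = cosine(image_feat, text_embeds[c])
        visual = cosine(image_feat, prototypes[c])
        scores[c] = alpha * cross_modal + (1 - alpha) * visual
    return max(scores, key=scores.get), scores


# Toy usage with 2-D features for two classes.
drift = estimate_drift([[1.0, 0.0], [0.0, 1.0]], [[1.2, 0.1], [0.2, 1.1]])
protos = transfer_prototypes({"cat": [1.0, 0.0], "dog": [0.0, 1.0]}, drift)
texts = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
label, scores = dual_path_predict([0.9, 0.2], texts, protos)
print(label, scores)
```

Because no old-class images are stored under the exemplar-free constraint, the anchors are the only samples from which the shift of old prototypes can be estimated; the sketch makes that dependency explicit by taking only anchor features as input to `estimate_drift`.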

Paper Structure (21 sections, 34 equations, 11 figures, 6 tables)

Figures (11)

  • Figure 1: Boundary vulnerability in VLM-based continual learning and our remedy. (a) Shared visual patterns near the old-new semantic interface are re-explained by new-task texts, leading to cross-modal semantic-geometry distortion and forgetting. (b) We construct adversarial anchors toward old-class semantics and perform anchor-guided cross-modal geometry distillation (ACGD) to constrain the drift in this vulnerable region.
  • Figure 2: Empirical evidence of boundary vulnerability and comparison of distillation schemes. (a) We measure the distributional shift in cross-modal semantics with Jensen-Shannon divergence (JSD) [lin1991divergence]; markedly larger drift is observed around the semantic interface (where a visual feature has low similarity with its corresponding text embedding) after an incremental-task update. (b) We compare cross-modal distillation using different data sources: new-task data, reference data [changpinyo2021conceptual, zheng2023preventing], and our adversarial anchors.
  • Figure 3: Overview of the proposed SeGP-CL. 1) Anchor construction: Dual-targeted projected gradient descent (DPGD) iteratively perturbs seed samples to synthesize adversarial anchors that are simultaneously guided in raw visual space and CLIP feature space. 2) Continual learning: A LoRA-tuned VLM is optimized on task batches with CE loss, while anchor batches and history texts impose semantic-geometry preservation via ACGD and TSGR.
  • Figure 4: After training: We estimate raw-space drift and transfer visual prototypes for old classes, then perform dual-path inference by ensembling logits from visual and CLIP branches.
  • Figure 5: Comparison with state-of-the-art CL methods in terms of per-task accuracy and global transfer. All results are achieved on the same CLIP ViT-B/16 backbone.
  • ...and 6 more figures
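
The drift measurement named in the Figure 2(a) caption can be sketched as follows. This is a minimal illustration, assuming that "cross-modal semantics" for one image is the softmax distribution over its image-text similarities, computed once before and once after an incremental-task update. The logit values and the `temperature` parameter are hypothetical, and the natural logarithm is used, so the divergence is bounded by ln 2.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn similarity logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 are skipped."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded by ln 2 in nats."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical image-text similarity logits for one image over three classes,
# before and after an incremental-task update.
before = softmax([2.0, 0.5, 0.1])
after = softmax([1.0, 1.8, 0.1])
print(jsd(before, after))
```

A value near zero means the image's class distribution was left almost unchanged by the update, while values approaching ln 2 mean the probability mass moved to different classes.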