Advancing Cross-domain Discriminability in Continual Learning of Vision-Language Models

Yicheng Xu, Yuxin Chen, Jiahao Nie, Yusong Wang, Huiping Zhuang, Manabu Okumura

TL;DR

The paper tackles cross-domain continual learning for Vision-Language Models by addressing both catastrophic forgetting and the erosion of zero-shot capabilities. It introduces Regression-based Analytic Incremental Learning (RAIL), a ridge-regression adapter with primal and dual update forms that achieves absolute memorization of learned domains, together with a training-free fusion module that preserves zero-shot performance on unseen domains. A new X-TAIL setting is proposed to evaluate cross-domain discriminability without domain-identity hints, alongside MTIL comparisons, with theoretical guarantees and empirical state-of-the-art results across 10 domains and 1,100 classes. The approach demonstrates efficient, reference-data-free incremental adaptation of pre-trained VLMs, improving cross-domain discriminability while maintaining zero-shot transfer, which makes it highly relevant for deployment in dynamic, multi-domain environments.

Abstract

Continual learning (CL) with Vision-Language Models (VLMs) has overcome the constraints of traditional CL, which only focuses on previously encountered classes. During the CL of VLMs, we need not only to prevent catastrophic forgetting of incrementally learned knowledge but also to preserve the zero-shot ability of VLMs. However, existing methods require additional reference datasets to maintain such zero-shot ability and rely on domain-identity hints to classify images across different domains. In this study, we propose Regression-based Analytic Incremental Learning (RAIL), which utilizes a recursive ridge regression-based adapter to learn from a sequence of domains in a non-forgetting manner and decouples the cross-domain correlations by projecting features to a higher-dimensional space. Cooperating with a training-free fusion module, RAIL absolutely preserves the VLM's zero-shot ability on unseen domains without any reference data. Additionally, we introduce the Cross-domain Task-Agnostic Incremental Learning (X-TAIL) setting, in which a CL learner is required to incrementally learn from multiple domains and classify test images from both seen and unseen domains without any domain-identity hint. We theoretically prove RAIL's absolute memorization of incrementally learned domains. Experimental results affirm RAIL's state-of-the-art performance in both the X-TAIL and existing Multi-domain Task-Incremental Learning settings. The code is released at https://github.com/linghan1997/Regression-based-Analytic-Incremental-Learning.
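
To make the adapter concrete, the following is a minimal sketch of a ridge-regression classifier fitted on frozen image features after a random nonlinear projection to a higher-dimensional space, in the spirit of the primal form described above. All shapes, the ReLU projection, and the variable names (`W_proj`, `lam`, etc.) are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_projection(X, W_proj):
    """Project features to a higher-dimensional space with a frozen
    random layer followed by a nonlinearity (illustrative stand-in
    for the paper's projection module)."""
    return np.maximum(X @ W_proj, 0.0)  # ReLU

# Hypothetical shapes: N samples, d-dim CLIP features, D-dim expansion, C classes.
N, d, D, C = 512, 512, 2048, 10
X = rng.standard_normal((N, d))           # frozen image-encoder features
Y = np.eye(C)[rng.integers(0, C, N)]      # one-hot labels
W_proj = rng.standard_normal((d, D)) / np.sqrt(d)

Phi = random_projection(X, W_proj)        # expanded features Phi
lam = 1.0                                 # ridge regularizer lambda

# Closed-form ridge regression: W = (Phi^T Phi + lam * I)^-1 Phi^T Y
W = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ Y)

logits = random_projection(X, W_proj) @ W  # adapter logits at test time
```

The closed-form solve avoids iterative gradient training entirely, which is what makes the recursive, non-forgetting extension in the theorems below possible.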


Paper Structure

This paper contains 30 sections, 2 theorems, 24 equations, 10 figures, 3 tables, and 2 algorithms.

Key Result

Theorem 1

The parameter calculated by the recursive update is an optimal solution to the optimization problem of joint training on all $n$ domains (Eqn. eqn:n_th_solution), where $\mathbf{M}_p^{(n)}$ is obtained by the accompanying recursive memory update.
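
As a concrete illustration of what such a recursive update looks like, below is a minimal sketch of the standard recursive (regularized) least-squares step, where a memory matrix tracks the inverse regularized Gram matrix and is refreshed with each new domain via the Woodbury identity. This is the textbook recursion consistent with the theorem's structure; the paper's exact update equations and variable names may differ.

```python
import numpy as np

def rls_update(W, M, Phi_new, Y_new):
    """One recursive ridge-regression step (standard RLS form).

    W:       current weights, shape (D, C)
    M:       current memory, the inverse regularized Gram matrix, shape (D, D)
    Phi_new: projected features of the new domain, shape (n, D)
    Y_new:   one-hot labels of the new domain, shape (n, C)
    """
    n = Phi_new.shape[0]
    # Woodbury identity: refresh M with an (n, n) inverse instead of
    # re-inverting a (D, D) matrix over all data seen so far.
    S = np.linalg.inv(np.eye(n) + Phi_new @ M @ Phi_new.T)
    M_new = M - M @ Phi_new.T @ S @ Phi_new @ M
    # Correct the weights using only the new data and the updated memory;
    # the result equals joint ridge regression on all domains seen so far.
    W_new = W + M_new @ Phi_new.T @ (Y_new - Phi_new @ W)
    return W_new, M_new

# Initialization from the first domain (lam is the ridge regularizer).
D, C, lam = 256, 10, 1.0
rng = np.random.default_rng(0)
Phi0 = rng.standard_normal((128, D))
Y0 = np.eye(C)[rng.integers(0, C, 128)]
M = np.linalg.inv(Phi0.T @ Phi0 + lam * np.eye(D))
W = M @ Phi0.T @ Y0
```

Because each step reproduces the joint closed-form solution exactly, no information about earlier domains is lost, which is the sense in which memorization is "absolute" rather than approximate.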

Figures (10)

  • Figure 1: Comparison of different CL settings. (a) In CIL, models classify images within all previously encountered classes. (b) In MTIL, models classify images from both seen and unseen domains based on the given domain-identities. (c) In X-TAIL, models classify images from both seen and unseen domains without any domain-identity hint.
  • Figure 2: Metrics for X-TAIL setting.
  • Figure 3: Pearson correlation coefficients (CCs) for 10 pairs of domain-prototypes.
  • Figure 4: Comparison of in-domain accuracy (%) on each domain with three classifiers.
  • Figure 5: RAIL Overview. (a) During inference, the fusion module utilizes the Zero-shot logits to identify whether a test image is aligned with seen or unseen classes. If it is classified as a seen class, the Fusion logits combine the RAIL-Adapter logits and the Zero-shot logits; otherwise the prediction relies solely on the Zero-shot logits (a minimal sketch of this routing follows the figure list). (b) Primal: at the $n$-th learning step, features $\mathbf{X}_e$ extracted by CLIP's image encoder are projected to a higher-dimensional $\mathbf{\Phi}$ via the RHL, and the parameter $\mathbf{W}$ and memory $\mathbf{M}_p$ are then updated by Theorem 1. (c) Dual: features extracted by CLIP's image encoder update the kernel $\mathbf{K}$, parameter $\boldsymbol{\alpha}$, and memory $\mathbf{M}_d$ by Theorem 2.
  • ...and 5 more figures
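
The routing described in the Figure 5 caption can be summarized in a few lines. The sketch below is an illustration under stated assumptions: the seen/unseen detection criterion, the fusion weight `beta`, and the function name are hypothetical, and the paper's exact fusion rule may differ.

```python
import numpy as np

def fused_prediction(zs_logits, adapter_logits, seen_class_ids, beta=1.0):
    """Training-free fusion (illustrative): route by the zero-shot prediction.

    zs_logits:      zero-shot CLIP logits over all seen + unseen classes, shape (C,)
    adapter_logits: RAIL-Adapter logits over the seen classes only,
                    shape (len(seen_class_ids),)
    seen_class_ids: indices of classes learned so far
    beta:           hypothetical fusion weight, not the paper's exact rule
    """
    top = int(np.argmax(zs_logits))
    if top in seen_class_ids:
        # Test image aligned with a seen class: combine adapter and
        # zero-shot evidence on the seen classes.
        fused = zs_logits.copy()
        fused[list(seen_class_ids)] += beta * adapter_logits
        return int(np.argmax(fused))
    # Otherwise fall back to the pure zero-shot prediction, leaving
    # CLIP's zero-shot ability on unseen domains untouched.
    return top
```

Because the unseen branch never consults the adapter, zero-shot behavior on unseen domains is preserved exactly, with no reference data needed.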

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2