Advancing Analytic Class-Incremental Learning through Vision-Language Calibration

Binyu Zhao, Wei Zhang, Xingrui Yu, Zhaonian Zou, Ivor Tsang

TL;DR

This work tackles class-incremental learning with pre-trained models by addressing the brittleness of analytic learning due to representation rigidity. It introduces VILA, a dual-branch vision-language calibration framework that couples a task-adapted ViT-Adapter with a frozen CLIP-based universal branch, bridged by Unified Geometric Calibration and reinforced by Candidate Semantic Enhancement to refine decisions with semantic priors. The approach preserves the efficiency of analytic recursive updates while expanding the representational subspace to better accommodate long sequences and fine-grained distinctions, achieving state-of-the-art performance across eight benchmarks. Empirical results demonstrate improved stability, plasticity, and efficiency, highlighting the value of open-world semantic priors in continual learning. The work offers a practical pathway to scalable online CIL with foundation-model priors, while noting limitations such as increased parameter count and memory for the analytic solver, and suggesting future directions in model compression and adaptive domain handling.

Abstract

Class-incremental learning (CIL) with pre-trained models (PTMs) faces a critical trade-off between efficient adaptation and long-term stability. While analytic learning enables rapid, recursive closed-form updates, its efficacy is often compromised by accumulated errors and feature incompatibility. In this paper, we first conduct a systematic study to dissect the failure modes of PTM-based analytic CIL, identifying representation rigidity as the primary bottleneck. Motivated by these insights, we propose VILA, a novel dual-branch framework that advances analytic CIL via a two-level vision-language calibration strategy. Specifically, we fuse plastic, task-adapted features with a frozen, universal semantic anchor at the feature level through geometric calibration, and leverage cross-modal priors at the decision level to rectify prediction bias. This combination preserves the extreme efficiency of analytic learning while overcoming its inherent brittleness. Extensive experiments across eight benchmarks demonstrate that VILA consistently yields superior performance, particularly in fine-grained and long-sequence scenarios. Our framework harmonizes high-fidelity prediction with the simplicity of analytic learning. Our code is available at https://github.com/byzhaoAI/VILA
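
To make the phrase "recursive closed-form updates" concrete, the following is a minimal sketch of a generic recursive ridge-regression classifier of the kind used in analytic CIL. It is not taken from the VILA code base: the class name `RecursiveRidgeClassifier`, the regularizer `gamma`, and the random stand-in features are all assumptions made for illustration. The sketch only shows that updating task by task can reproduce the joint closed-form solution without revisiting old data.

```python
import numpy as np


class RecursiveRidgeClassifier:
    """Generic recursive ridge regression (illustrative, not the VILA implementation)."""

    def __init__(self, dim, gamma=1.0):
        # R tracks (X^T X + gamma * I)^{-1} over all data seen so far.
        self.R = np.eye(dim) / gamma
        # W has one column per class seen so far (none at the start).
        self.W = np.zeros((dim, 0))

    def fit_task(self, X, y, num_classes_total):
        # Add zero columns for classes that appear for the first time.
        pad = num_classes_total - self.W.shape[1]
        if pad > 0:
            self.W = np.hstack([self.W, np.zeros((self.W.shape[0], pad))])
        Y = np.eye(num_classes_total)[y]  # one-hot targets

        # Woodbury identity: update the inverse without re-inverting a d x d matrix.
        S = np.eye(X.shape[0]) + X @ self.R @ X.T
        K = np.linalg.solve(S, X @ self.R).T  # equals R X^T S^{-1} (R, S symmetric)
        self.R = self.R - K @ X @ self.R

        # Closed-form correction of the weights using only the new task's data.
        self.W = self.W + self.R @ X.T @ (Y - X @ self.W)

    def predict(self, X):
        return np.argmax(X @ self.W, axis=1)


# Hypothetical two-task stream; random vectors stand in for backbone features.
rng = np.random.default_rng(0)
dim, gamma = 16, 1.0
X1, y1 = rng.normal(size=(40, dim)), rng.integers(0, 3, size=40)  # classes 0-2
X2, y2 = rng.normal(size=(50, dim)), rng.integers(3, 5, size=50)  # classes 3-4

clf = RecursiveRidgeClassifier(dim, gamma)
clf.fit_task(X1, y1, num_classes_total=3)
clf.fit_task(X2, y2, num_classes_total=5)

# Joint ridge solution computed on all data at once, for comparison.
X_all = np.vstack([X1, X2])
Y_all = np.eye(5)[np.concatenate([y1, y2])]
W_joint = np.linalg.solve(X_all.T @ X_all + gamma * np.eye(dim), X_all.T @ Y_all)
max_gap = float(np.abs(clf.W - W_joint).max())
```

In this sketch `max_gap` is at the level of floating-point round-off, which is the sense in which the recursive update is equivalent to joint training on fixed features. The quality of the classifier therefore depends entirely on the features, which is where the representation rigidity discussed in the abstract becomes the limiting factor.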

Paper Structure

This paper contains 24 sections, 1 theorem, 11 equations, 11 figures, 4 tables, 2 algorithms.

Key Result

Proposition 3.1

Let $y \in \mathcal{Y}_t$ be the target function for a future task $t$. The expected approximation error of the analytic classifier is lower-bounded by the alignment gap between the Task 1 distribution $\mathcal{P}_1$ and the Task $t$ distribution $\mathcal{P}_t$: where $\mathbf{P}_{\mathcal{S}_1}$ is the orthogonal projection operator onto the feature space learned on Task 1. (See Appendix app:proof.)
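
As a small numerical illustration of the operator named in the proposition, the sketch below builds an orthogonal projector onto a low-dimensional subspace and measures how much of a vector falls outside it. It illustrates only what an orthogonal projection onto a "Task 1" subspace does; it does not reproduce the bound itself, and the helper `orthogonal_projector`, the dimensions, and the random basis are assumptions chosen for the example.

```python
import numpy as np


def orthogonal_projector(basis):
    """Return the orthogonal projector onto the column span of `basis`.

    Assumes `basis` (d x k, with k <= d) has full column rank.
    """
    Q, _ = np.linalg.qr(basis)  # reduced QR: Q is d x k with orthonormal columns
    return Q @ Q.T


rng = np.random.default_rng(0)
d, k = 8, 3

# Hypothetical stand-in for the feature subspace learned on Task 1.
B1 = rng.normal(size=(d, k))
P = orthogonal_projector(B1)

# A vector inside the subspace is reproduced exactly by the projector.
v_in = B1 @ rng.normal(size=k)
res_in = float(np.linalg.norm(v_in - P @ v_in))

# A generic vector keeps a component that the subspace cannot represent.
v_new = rng.normal(size=d)
res_new = float(np.linalg.norm(v_new - P @ v_new))
```

Here `res_in` is numerically zero while `res_new` is not. The nonzero residual is the part of a new direction that no linear classifier on the Task 1 subspace can recover, which is the intuition behind calling the bound a rigidity bound.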

Figures (11)

  • Figure 1: Performance and efficiency overview. (Left) Performance comparison across 8 diverse benchmarks. It demonstrates VILA's superior generality in both coarse- and fine-grained scenarios. (Right) Training time vs. Average accuracy. VILA occupies the optimal top-left corner, offering high-fidelity predictions with significantly lower latency.
  • Figure 2: From observation to solution. Left: feature space trilemma. The specialized ViT-Adapter branch collapses into a rigid subspace, which creates a geometric misalignment with the universal CLIP hypersphere and fails to cover future-task features. Right: VILA's asymmetric dual-branch architecture. It integrates a frozen universal branch to mitigate the stability-plasticity dilemma (Obs1 & Obs2). The UGC module projects heterogeneous features onto a unified manifold to address misalignment (Obs3). The CSE module leverages text priors to rectify decision boundaries. Training only on Task 1 and updating the classifier via analytic learning ensures extreme efficiency.
  • Figure 3: Incremental performance trajectory. Comparison of step-wise accuracy on SUN (Left, $T=30$) and Cars (Right, $T=20$). VILA demonstrates superior stability and resistance to forgetting compared to SOTA methods. See Appendix app:runs for more figures.
  • Figure 4: Comparison of total training vs. inference time averaged over 8 datasets ($T=20$ for all). VILA occupies the high efficiency zone, offering the best trade-off by being significantly faster than complex baselines while outperforming lighter methods.
  • Figure 5: Comparison to the baseline (left) and with UGC (right) on ImageNet-R. We visualize the density estimation of cosine similarities for intra-class (green/blue) and inter-class (orange/red) pairs. The shaded area (colored in purple) quantifies the confusion region.
  • ...and 6 more figures

Theorems & Definitions (1)

  • Proposition 3.1: Representation Rigidity Bound