Advancing Analytic Class-Incremental Learning through Vision-Language Calibration
Binyu Zhao, Wei Zhang, Xingrui Yu, Zhaonian Zou, Ivor Tsang
TL;DR
This work tackles class-incremental learning with pre-trained models by addressing the brittleness of analytic learning due to representation rigidity. It introduces VILA, a dual-branch vision-language calibration framework that couples a task-adapted ViT-Adapter with a frozen CLIP-based universal branch, bridged by Unified Geometric Calibration and reinforced by Candidate Semantic Enhancement to refine decisions with semantic priors. The approach preserves the efficiency of analytic recursive updates while expanding the representational subspace to better accommodate long sequences and fine-grained distinctions, achieving state-of-the-art performance across eight benchmarks. Empirical results demonstrate improved stability, plasticity, and efficiency, highlighting the value of open-world semantic priors in continual learning. The work offers a practical pathway to scalable online CIL with foundation-model priors, while noting limitations such as increased parameter count and memory for the analytic solver, and suggesting future directions in model compression and adaptive domain handling.
Abstract
Class-incremental learning (CIL) with pre-trained models (PTMs) faces a critical trade-off between efficient adaptation and long-term stability. While analytic learning enables rapid, recursive closed-form updates, its efficacy is often compromised by accumulated errors and feature incompatibility. In this paper, we first conduct a systematic study to dissect the failure modes of PTM-based analytic CIL, identifying representation rigidity as the primary bottleneck. Motivated by these insights, we propose VILA, a novel dual-branch framework that advances analytic CIL via a two-level vision-language calibration strategy. Specifically, at the feature level we fuse plastic, task-adapted features with a frozen, universal semantic anchor through geometric calibration, and at the decision level we leverage cross-modal priors to rectify prediction bias. This combination retains the extreme efficiency of analytic learning while overcoming its inherent brittleness. Extensive experiments across eight benchmarks demonstrate that VILA consistently yields superior performance, particularly in fine-grained and long-sequence scenarios. Our framework harmonizes high-fidelity prediction with the simplicity of analytic learning. Our code is available at https://github.com/byzhaoAI/VILA
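
To make the "recursive closed-form updates" of analytic learning concrete, the following is a minimal sketch of a generic recursive ridge-regression classifier over fixed features. It is not taken from the VILA codebase: the class name `RecursiveRidgeClassifier`, the regularizer `gamma`, and the assumption that features come from some frozen backbone are all illustrative. It shows only the standard recursive least-squares idea, in which each new task updates an inverse autocorrelation matrix and the classifier weights without revisiting earlier data.

```python
import numpy as np


class RecursiveRidgeClassifier:
    """Hypothetical analytic classifier updated task by task (illustrative only)."""

    def __init__(self, feature_dim, gamma=1.0):
        # R tracks (sum of X^T X + gamma * I)^{-1} over all data seen so far.
        self.R = np.eye(feature_dim) / gamma
        # W maps features to class scores; columns are added as classes arrive.
        self.W = np.zeros((feature_dim, 0))

    def fit_task(self, X, y):
        """Absorb one task. X: (n, d) float features, y: (n,) integer class ids."""
        num_classes = max(self.W.shape[1], int(y.max()) + 1)
        pad = num_classes - self.W.shape[1]
        if pad > 0:
            # Zero columns for classes that have not been seen before.
            self.W = np.hstack([self.W, np.zeros((self.W.shape[0], pad))])
        Y = np.eye(num_classes)[y]  # one-hot targets
        # Woodbury identity: update the inverse without storing old data.
        K = np.linalg.inv(np.eye(X.shape[0]) + X @ self.R @ X.T)
        self.R = self.R - self.R @ X.T @ K @ X @ self.R
        # Closed-form correction of the weights using the new residual.
        self.W = self.W + self.R @ X.T @ (Y - X @ self.W)

    def predict(self, X):
        return np.argmax(X @ self.W, axis=1)
```

Under these assumptions, calling `fit_task` once per task yields the same weights as a single ridge regression fitted on all tasks jointly, which is why this family of methods needs no replay of old samples. The quality of the result then depends entirely on the features fed in, which is the representation issue the abstract identifies as the bottleneck.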
