LADA: Scalable Label-Specific CLIP Adapter for Continual Learning

Mao-Lin Luo, Zi-Hao Zhou, Tong Wei, Min-Ling Zhang

TL;DR

This work tackles continual learning with CLIP, targeting two problems: forward forgetting and the error-prone selection of per-task parameters at inference. It introduces LADA, which appends label-specific memory units after the frozen image encoder $f_I$. These units produce discriminative label-specific features $\varphi^k(i)$ from cluster-derived centers $W_j^k$, and the features are aggregated across tasks. To preserve past knowledge, LADA employs distillation via prototypes $p_j^i(l)$ and distribution-preserved training with Gaussian components $\{\pi_j^i(l), p_j^i(l), \Sigma_j^i(l)\}$, while training updates are confined to the parameters of the new task. On the X-TAIL benchmark, LADA achieves state-of-the-art results in both the 16-shot and full-shot settings, improving the Transfer, Average, and Last metrics without backpropagating into $f_I$, which keeps continual learning for vision-language models scalable and efficient.
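The idea of label-specific features computed after a frozen encoder can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine similarity, the max-over-centers scoring rule, and the names `label_specific_features`, `predict`, and `task_centers` are all assumptions made for illustration. It only shows how centers from every seen task can be scored together, so that no task-specific parameter set has to be selected at inference.

```python
import math


def l2_normalize(v):
    """Return v scaled to unit length (unchanged if it is the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_hat, b_hat = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_hat, b_hat))


def label_specific_features(image_feature, task_centers):
    """Similarities between a frozen image feature and every stored center.

    task_centers[k][j] is the list of center vectors for label j of task k
    (hypothetical layout). The result concatenates the similarities over all
    tasks, so every seen task contributes to the representation.
    """
    feats = []
    for labels in task_centers:
        for centers in labels:
            feats.extend(cosine(image_feature, w) for w in centers)
    return feats


def predict(image_feature, task_centers):
    """Return the global label index whose best-matching center is closest.

    Scoring a label by its maximum similarity is an assumption of this sketch.
    """
    scores = []
    for labels in task_centers:
        for centers in labels:
            scores.append(max(cosine(image_feature, w) for w in centers))
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data: task 1 has two labels, task 2 adds a third label later.
task_centers = [
    [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0]]],  # task 1: labels 0 and 1
    [[[-1.0, 0.0]]],                           # task 2: label 2
]
feature = [0.1, 2.0]  # stands in for the output of the frozen encoder
print(label_specific_features(feature, task_centers))
print(predict(feature, task_centers))
```

Because the encoder output is treated as a fixed input here, anything trained on top of it would receive gradients only for the stored centers, which mirrors the claim that no backpropagation into the image encoder is needed.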

Abstract

Continual learning with vision-language models like CLIP offers a pathway toward scalable machine learning systems by leveraging their transferable representations. Existing CLIP-based methods adapt the pre-trained image encoder by adding multiple sets of learnable parameters, with each task using only a subset of them. This requires selecting the appropriate parameters for each input image during inference, a step that is prone to errors that degrade performance. To address this problem, we introduce LADA (Label-specific ADApter). Instead of partitioning parameters across tasks, LADA appends lightweight, label-specific memory units to the frozen CLIP image encoder, enabling discriminative feature generation by aggregating task-agnostic knowledge. To prevent catastrophic forgetting, LADA employs feature distillation for seen classes, so that their features are not interfered with by new classes. Because it is positioned after the image encoder, LADA prevents gradient flow to the frozen CLIP parameters, ensuring efficient training. Extensive results show that LADA achieves state-of-the-art performance in continual learning settings. The implementation code is available at https://github.com/MaolinLuo/LADA.

Paper Structure

This paper contains 16 sections, 15 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Comparison of CLIP tuning paradigms in continual learning. Our label-specific adapter leverages learned memory of all seen tasks and CLIP representations to generate label-specific features, eliminating the need for parameter selection.
  • Figure 2: Accuracy (%) changes across all tasks over all learning steps in the full-shot setting.
  • Figure 3: Comparison of LADA with and without zero-shot CLIP as a selector to distinguish between seen and unseen classes. Without the selector, LADA performs better in the continual learning process as it utilizes the learned knowledge to improve average task recall.