CLIP model is an Efficient Online Lifelong Learner

Leyuan Wang, Liuyu Xiang, Yujie Wei, Yunlong Wang, Zhaofeng He

TL;DR

This work investigates online lifelong learning (OLL) under non-stationary data and memory constraints, arguing that CLIP-based vision-language models are well suited to open-world adaptation. It introduces Symmetric Image-Text (SIT) tuning, a parameter-efficient strategy that restores symmetry between image and text updates during online tuning, supported by a gradient analysis showing reduced catastrophic forgetting. Extensive experiments on Si-Blurry benchmarks demonstrate that SIT with LoRA or Adapters outperforms replay-based and other baselines in both OLL and online class-incremental learning, and reveal that image-encoder tuning benefits OLL while text-encoder tuning enhances zero-shot generalization. The findings offer a practical pathway to continuous, ever-evolving learning systems, with future work exploring Mixture of Experts to accumulate knowledge without forgetting.

Abstract

Online Lifelong Learning (OLL) addresses the challenge of learning from continuous, non-stationary data streams. Existing OLL methods based on image classification models often require preset conditions, such as the total number of classes or a maximum memory capacity, which prevents truly never-ending learning and makes them impractical in real-world scenarios. In this work, we propose that vision-language models, such as Contrastive Language-Image Pretraining (CLIP), are more suitable candidates for online lifelong learning. We discover that maintaining symmetry between image and text is crucial during Parameter-Efficient Tuning (PET) of the CLIP model in online lifelong learning. To this end, we introduce the Symmetric Image-Text (SIT) tuning strategy. We conduct extensive experiments on multiple lifelong learning benchmark datasets and elucidate the effectiveness of SIT through gradient analysis. Additionally, we assess the impact of lifelong learning on the generalizability of CLIP and find that tuning the image encoder is beneficial for lifelong learning, while tuning the text encoder aids zero-shot learning.
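
The abstract does not spell out how SIT keeps image and text updates symmetric. The following is a minimal sketch under one assumed reading: the image-to-text cross-entropy is computed only over the text embeddings of classes that actually appear in the current batch, so that text embeddings of earlier classes are not repeatedly pushed away as negatives with no matching images. The function name `image_to_text_loss`, the toy feature vectors, and the temperature of 0.07 are illustrative assumptions, not the authors' implementation.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one sample."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def image_to_text_loss(image_feats, labels, text_feats, candidate_classes,
                       temperature=0.07):
    """Mean cross-entropy of each image against the candidate class texts.

    text_feats maps class id -> text embedding. candidate_classes decides
    which class texts appear in the softmax (hypothetical parameter).
    """
    classes = sorted(candidate_classes)
    column = {c: j for j, c in enumerate(classes)}
    texts = [normalize(text_feats[c]) for c in classes]
    total = 0.0
    for feat, label in zip(image_feats, labels):
        img = normalize(feat)
        # Cosine similarity scaled by the temperature, one logit per class text.
        logits = [sum(a * b for a, b in zip(img, t)) / temperature
                  for t in texts]
        total += cross_entropy(logits, column[label])
    return total / len(labels)


# Toy stream state: class 2 was seen earlier but is absent from this batch.
text_feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.7, 0.7]}
image_feats = [[0.9, 0.1], [0.2, 0.8]]
labels = [0, 1]

# Asymmetric variant: texts of all seen classes act as candidates.
asymmetric = image_to_text_loss(image_feats, labels, text_feats, {0, 1, 2})
# Assumed symmetric variant: only texts of classes present in the batch.
symmetric = image_to_text_loss(image_feats, labels, text_feats, set(labels))
print(asymmetric, symmetric)
```

In this sketch the absent class 2 contributes a negative term to every image's softmax in the asymmetric variant and none in the symmetric one, so the symmetric loss is strictly smaller and the text embedding of class 2 would receive no gradient from this batch.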

Paper Structure (16 sections, 3 equations, 2 figures, 6 tables)

This paper contains 16 sections, 3 equations, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Data in (a) are from Google Trends search interest. In (b-d), the horizontal axis is the time step (batch) and the vertical axis is the class index. Classes above the red solid line are blurry classes and those below are disjoint classes; both groups are sorted by their time of occurrence.
  • Figure 2: Analysis of the Symmetric Image-Text tuning strategy. On the coordinate axes, b denotes blurry classes and d denotes disjoint classes, with classes arranged by their time of occurrence. In the gradient timeline, the top subplot shows the gradient of the positive samples and the bottom subplot shows the gradient of the negative samples.