Caption, Create, Continue: Continual Learning with Pre-trained Generative Vision-Language Models
Indu Solomon, Aye Phyu Phyu Aung, Uttam Kumar, Senthilnath Jayavelu
TL;DR
CLTS addresses catastrophic forgetting in class-incremental learning (Class-IL) without replaying raw data, which sidesteps the associated storage and privacy concerns. Instead, it stores textual captions and uses text-conditioned diffusion to synthesize samples for training a Task Router. The architecture comprises Task Heads, which expand as new tasks arrive, and a Task Router that selects the appropriate head; the router is trained on images that Stable Diffusion generates from BLIP-produced captions. Key contributions include modular architectural expansion, a lightweight caption-based memory, and caption-conditioned image generation for router training. CLTS achieves state-of-the-art results on CIFAR10, CIFAR100, and TinyImageNet with far lower memory overhead. This work points toward scalable continual learning in real-world data streams where data storage and labeling are constrained.
Abstract
Continual learning (CL) enables models to adapt to evolving data streams without catastrophic forgetting, a fundamental requirement for real-world AI systems. However, current methods often depend on large replay buffers or heavily annotated datasets, which are impractical due to storage, privacy, and cost constraints. We propose CLTS (Continual Learning via Text-Image Synergy), a novel class-incremental framework that mitigates forgetting without storing real task data. CLTS leverages two pre-trained generative models: BLIP (Bootstrapping Language-Image Pre-training) for caption generation and Stable Diffusion for sample generation. Each task is handled by a dedicated Task Head, while a Task Router learns to assign inputs to the correct Task Head using the generated data. On three benchmark datasets, CLTS improves average task accuracy by up to 54% and is 63 times more memory-efficient than four recent continual learning baselines, demonstrating improved retention and adaptability. CLTS introduces a novel perspective by integrating generative text-image augmentation for scalable continual learning.
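The control flow described in the abstract (one Task Head per task, a caption-only memory, and a Task Router trained on samples regenerated from those captions) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the class names `TaskHead`, `NearestCentroidRouter`, and `CLTSSketch` are hypothetical, the `generate` callable is a placeholder for text-conditioned image generation, "images" are toy 1-D feature vectors, and a nearest-centroid rule stands in for whatever learned router the paper actually uses.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


class TaskHead:
    """Classifier for a single task (hypothetical stand-in for a trained network)."""

    def __init__(self, class_ids: List[int], score_fn: Callable[[Vector], List[float]]):
        self.class_ids = class_ids  # global class ids owned by this task
        self.score_fn = score_fn    # returns one score per entry in class_ids

    def predict(self, x: Vector) -> int:
        scores = self.score_fn(x)
        best = max(range(len(scores)), key=lambda i: scores[i])
        return self.class_ids[best]


class NearestCentroidRouter:
    """Toy Task Router: assigns an input to the task with the closest centroid."""

    def __init__(self) -> None:
        self.centroids: Dict[int, List[float]] = {}

    def fit_task(self, task_id: int, samples: List[Vector]) -> None:
        dim = len(samples[0])
        self.centroids[task_id] = [
            sum(s[d] for s in samples) / len(samples) for d in range(dim)
        ]

    def route(self, x: Vector) -> int:
        def sq_dist(c: Vector) -> float:
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(self.centroids, key=lambda t: sq_dist(self.centroids[t]))


class CLTSSketch:
    """Expanding heads + caption memory + router, under the assumptions stated above."""

    def __init__(self, generate: Callable[[str], Vector]):
        self.generate = generate  # placeholder for text-conditioned generation
        self.heads: List[TaskHead] = []
        self.caption_memory: Dict[int, List[str]] = {}  # only text is stored
        self.router = NearestCentroidRouter()

    def add_task(self, head: TaskHead, captions: List[str]) -> int:
        task_id = len(self.heads)
        self.heads.append(head)
        self.caption_memory[task_id] = list(captions)
        # Refit the router for every task seen so far, using stored captions only.
        for tid, caps in self.caption_memory.items():
            self.router.fit_task(tid, [self.generate(c) for c in caps])
        return task_id

    def predict(self, x: Vector) -> int:
        return self.heads[self.router.route(x)].predict(x)


# Toy usage: a lookup table plays the role of the caption-conditioned generator.
fake_generator = {"a photo of a cat": [0.2], "a photo of a dog": [0.8],
                  "a photo of a car": [5.1], "a photo of a truck": [5.9]}
model = CLTSSketch(generate=lambda caption: fake_generator[caption])
model.add_task(TaskHead([0, 1], lambda x: [-abs(x[0] - 0.2), -abs(x[0] - 0.8)]),
               ["a photo of a cat", "a photo of a dog"])
model.add_task(TaskHead([2, 3], lambda x: [-abs(x[0] - 5.1), -abs(x[0] - 5.9)]),
               ["a photo of a car", "a photo of a truck"])
print(model.predict([0.3]), model.predict([5.8]))
```

In this sketch the only state kept across tasks, besides the heads themselves, is the caption list, which reflects the text-only memory the abstract describes; in a real system the captions would come from a captioning model and the router samples from a diffusion model.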
