Annotation-Free Class-Incremental Learning

Hari Chandana Kuchibhotla, K S Ananth, Vineeth N Balasubramanian

TL;DR

The paper tackles annotation-free class-incremental learning by proposing CrossWorld-CL, which grounds learning in external world knowledge from ImageNet to guide unlabeled streams. It combines world-knowledge distillation, semantic expansion via LLMs, dual visual–semantic alignment between DINO and CLIP, and a privacy-preserving replay mechanism using ImageNet proxies. The approach employs a prompt-guided cross-domain framework with stage-wise losses, including $L_{DD}$, $L_{II}$, $L_{ID}$, and $L_{DI}$, plus KL regularization to stabilize prototypes across tasks. Across four diverse datasets, CrossWorld-CL outperforms CLIP-based, unlabeled, and conventional continual-learning baselines, with particularly strong gains on late tasks, highlighting the value of structured external knowledge for robust, annotation-free continual learning.

Abstract

Despite significant progress in continual learning, ranging from architectural novelty to clever strategies for mitigating catastrophic forgetting, most existing methods rest on a strong but unrealistic assumption: the availability of labeled data throughout the learning process. In real-world scenarios, however, data often arrives sequentially and without annotations, rendering conventional approaches impractical. In this work, we revisit the fundamental assumptions of continual learning and ask: Can current systems adapt when labels are absent and tasks emerge incrementally over time? To this end, we introduce Annotation-Free Class-Incremental Learning (AFCIL), a more realistic and challenging paradigm where unlabeled data arrives continuously, and the learner must incrementally acquire new classes without any supervision. To enable effective learning under AFCIL, we propose CrossWorld-CL, a Cross-Domain World-Guided Continual Learning framework that incorporates external world knowledge as a stable auxiliary source. The method retrieves semantically related ImageNet classes for each downstream category, maps downstream and ImageNet features through a cross-domain alignment strategy, and finally introduces a novel replay strategy. This design lets the model uncover semantic structure without annotations while keeping earlier knowledge intact. Across four datasets, CrossWorld-CL surpasses CLIP baselines and existing continual and unlabeled learning methods, underscoring the benefit of world knowledge for annotation-free continual learning.
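
The retrieval step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "semantically related" means nearest neighbors under cosine similarity between class-name embeddings; the abstract does not state the actual retrieval rule. The function names (`cosine`, `retrieve_related_classes`), the class names, and the toy 3-dimensional vectors are all hypothetical stand-ins for real text-encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_related_classes(query_emb, imagenet_embs, k=2):
    """Return the k class names whose embeddings are most similar to the query.

    Assumption: similarity is cosine similarity; the paper may use a
    different retrieval criterion.
    """
    ranked = sorted(
        imagenet_embs,
        key=lambda name: cosine(query_emb, imagenet_embs[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy embeddings standing in for encoded ImageNet class names.
imagenet_embs = {
    "tabby cat": [0.9, 0.1, 0.0],
    "Persian cat": [0.8, 0.2, 0.1],
    "water bottle": [0.0, 0.1, 0.9],
}

# Hypothetical embedding for a downstream class such as "Maine Coon".
maine_coon = [1.0, 0.15, 0.05]

neighbors = retrieve_related_classes(maine_coon, imagenet_embs, k=2)
print(neighbors)
```

In this toy setup the two cat classes rank above the bottle class, mirroring the kind of neighbor lists shown in Figure 1. The value of `k` here is arbitrary.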

Paper Structure

This paper contains 15 sections, 12 equations, 6 figures, 4 tables, 6 algorithms.

Figures (6)

  • Figure 1: Examples of retrieved ImageNet classes for four downstream dataset categories: Bottle, Globe Thistle, Braided, and Maine Coon. The retrieved neighbors show clear semantic similarity to each target class, illustrating how our method gathers meaningful auxiliary supervision from ImageNet.
  • Figure 2: An overview of our annotation-free continual learning framework, which integrates world knowledge, multimodal alignment for better label prediction, and replay from public ImageNet data to learn new classes over time without the need to store downstream samples. Purple indicates downstream data flow and green indicates ImageNet data flow. Stage 1: Task-aware world knowledge distillation retrieves semantically related ImageNet categories for each downstream class, forming an auxiliary supervision pool $\mathcal{A}^t$. Stage 2: An LLM expands downstream labels into diverse descriptions, which are encoded with CLIP to obtain robust text prototypes for both downstream ($t_D$) and ImageNet ($t_I$) sets. Stage 3: Dual-supervised visual–semantic alignment maps DINO features into CLIP space using pseudo-labeled downstream samples and supervised ImageNet pairs. Stage 4: Prompt-guided cross-domain alignment tunes the CLIP image encoder and text prototypes using multiple alignment losses ($L_{DD}, L_{DI}, L_{ID}, L_{II}$) across downstream and ImageNet branches. Stage 5: A replay strategy constructs a lightweight exemplar set $\mathcal{A}_R^t$ from aligned ImageNet images, enabling rehearsal without storing private downstream data.
  • Figure 3: Effect of varying the number of retrieved ImageNet classes ($k$) and images per class (img_num = 5, 10) on task-wise accuracy across all five tasks for the Oxford Pets dataset. Increasing img_num generally stabilizes performance, especially for the later tasks, Task-4 and Task-5.
  • Figure 4: t-SNE visualizations comparing prediction quality of Continual CLIP (top row) and our method (bottom row) across all five tasks for the CIFAR100 dataset. Continual CLIP produces highly entangled clusters with significant overlap between classes, indicating poor separation and increasing confusion as tasks progress. In contrast, our model yields well-formed, compact, and clearly separated clusters for every task, showing that cross-domain alignment and world-knowledge-driven supervision help.
  • Figure 5: We report forgetting values for each task on DTD, CIFAR-100, Oxford Pets, and Oxford Flowers. The trends highlight dataset-specific forgetting dynamics, with CIFAR-100 showing higher forgetting, Oxford Pets exhibiting negative forgetting in early tasks, and Oxford Flowers and DTD displaying moderate fluctuations.
  • ...and 1 more figure
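
The per-task forgetting values reported in Figure 5, including the "negative forgetting" noted for Oxford Pets, can be made concrete with a short sketch. It assumes the forgetting definition widely used in the continual-learning literature (best earlier accuracy on a task minus its final accuracy); the caption does not state which definition the paper uses. The function name `forgetting_per_task` and the accuracy numbers are hypothetical and are not results from the paper.

```python
def forgetting_per_task(acc):
    """Compute forgetting for every task except the last one.

    acc[i][j] is the accuracy on task j measured after training through
    task i (so row i has i + 1 entries).

    Assumed definition: forgetting_j = max over earlier steps i < last of
    acc[i][j], minus acc[last][j]. A negative value means accuracy on the
    task improved after later training.
    """
    last = len(acc) - 1
    result = []
    for j in range(last):
        best_earlier = max(acc[i][j] for i in range(j, last))
        result.append(best_earlier - acc[last][j])
    return result


# Hypothetical accuracy matrix for a three-task stream (illustrative only).
acc = [
    [80.0],
    [75.0, 70.0],
    [78.0, 72.0, 65.0],
]

forgetting = forgetting_per_task(acc)
print(forgetting)
```

Here the first task loses two points relative to its best earlier accuracy, while the second task gains two points after later training, which is the situation the caption calls negative forgetting.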