Hierarchical Semantic Tree Anchoring for CLIP-Based Class-Incremental Learning

Tao Hu, Lan Li, Zhen-Hao Xie, Da-Wei Zhou

TL;DR

HASTEN tackles catastrophic forgetting in CLIP-based class-incremental learning by injecting explicit hierarchical structure into a hyperbolic feature space. It constructs a GPT-5–driven semantic tree, learns per-task hierarchy-aware projections, and maps features into a shared hyperbolic space with a global mapper, while protecting past mappings via null-space gradient projection. Hierarchy-aware entailment constraints and a hyperbolic contrastive objective stabilize cross-modal alignment, and virtual-class anchoring preserves past structure without exemplars. Empirical results across nine benchmarks show strong, consistent improvements over prior methods, with robustness to seeds, backbones, and different LLMs for tree generation. The approach offers a principled way to fuse hierarchical semantics with continual learning in vision-language models, enabling scalable, structure-preserving incremental updates.
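To make "maps features into a shared hyperbolic space" concrete, here is a minimal NumPy sketch of one common way to do it: the exponential map at the origin of a Poincaré ball, together with the geodesic distance a hyperbolic contrastive objective could compare. The ball model is suggested by the $\mathcal{B}^2$ notation in Figure 2, but the curvature parameter `c`, the function names, and the exact mapping are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def expmap0(v, c=1.0, eps=1e-7):
    """Map Euclidean vectors v into the Poincare ball of curvature -c (assumed model)."""
    sqrt_c = np.sqrt(c)
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def poincare_dist(x, y, c=1.0, eps=1e-7):
    """Geodesic distance between two points inside the Poincare ball."""
    sq_diff = np.sum((x - y) ** 2, axis=-1)
    denom = (1 - c * np.sum(x ** 2, axis=-1)) * (1 - c * np.sum(y ** 2, axis=-1))
    return np.arccosh(1 + 2 * c * sq_diff / np.maximum(denom, eps)) / np.sqrt(c)

# Hypothetical 2-D Euclidean feature, e.g. the output of a projection layer.
v = np.array([0.3, 0.4])
z = expmap0(v)                       # lands strictly inside the unit ball
d = poincare_dist(np.zeros(2), z)    # distance from the origin
```

Under this convention the distance from the origin grows linearly with the Euclidean norm of `v`, while the mapped point itself never leaves the ball, which is what gives tree-like data room to spread out near the boundary.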

Abstract

Class-Incremental Learning (CIL) enables models to learn new classes continually while preserving past knowledge. Recently, vision-language models like CLIP offer transferable features via multi-modal pre-training, making them well-suited for CIL. However, real-world visual and linguistic concepts are inherently hierarchical: a textual concept like "dog" subsumes fine-grained categories such as "Labrador" and "Golden Retriever," and each category entails its images. Existing CLIP-based CIL methods fail to capture this hierarchy explicitly, so fine-grained class features drift during incremental updates, which ultimately leads to catastrophic forgetting. To address this challenge, we propose HASTEN (Hierarchical Semantic Tree Anchoring), which anchors hierarchical information into CIL to reduce catastrophic forgetting. First, we employ an external knowledge graph as supervision to embed visual and textual features in hyperbolic space, effectively preserving hierarchical structure as data evolves. Second, to mitigate catastrophic forgetting, we project gradients onto the null space of the shared hyperbolic mapper, preventing interference with prior tasks. These two steps work synergistically, enabling the model to resist forgetting by maintaining hierarchical relationships. Extensive experiments show that HASTEN consistently outperforms existing methods while providing a unified structured representation.
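The null-space idea in the second step can be illustrated with a small NumPy sketch. It follows a common recipe rather than the paper's exact procedure, which this summary does not spell out: estimate the uncentered covariance of the inputs a linear layer saw on old tasks, keep the eigenvectors with (near-)zero eigenvalues, and project each gradient onto their span so the layer's outputs on old inputs do not change. The names `null_space_projector`, `old_inputs`, and `tol`, and the synthetic data, are all hypothetical.

```python
import numpy as np

def null_space_projector(old_inputs, tol=1e-6):
    """Return P such that (G @ P) @ x is ~0 for every row x of old_inputs."""
    cov = old_inputs.T @ old_inputs / len(old_inputs)   # uncentered covariance (d, d)
    eigvals, eigvecs = np.linalg.eigh(cov)              # eigenvalues in ascending order
    null_basis = eigvecs[:, eigvals <= tol * eigvals.max()]
    return null_basis @ null_basis.T

rng = np.random.default_rng(0)
d_in, d_out = 8, 4
# Synthetic "old task" inputs that only span a 3-dimensional subspace.
old_inputs = rng.normal(size=(100, 3)) @ rng.normal(size=(3, d_in))

W = rng.normal(size=(d_out, d_in))      # weights of a linear mapper, y = W @ x
grad = rng.normal(size=(d_out, d_in))   # stand-in gradient from a new task

P = null_space_projector(old_inputs)
W_new = W - 0.1 * grad @ P              # projected gradient step

old_out_before = W @ old_inputs.T
old_out_after = W_new @ old_inputs.T    # unchanged up to numerical error
```

The weights do move, but only along directions that old-task inputs never excite. In practice the covariance is rarely exactly rank-deficient, so the threshold `tol` trades stability on old tasks against plasticity on new ones.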

Paper Structure

This paper contains 27 sections, 21 equations, 15 figures, 2 tables.

Figures (15)

  • Figure 1: Effect of feature hierarchy. Top: Without hierarchy, fine-grained features will drift as new classes emerge. Bottom: With a hierarchical constraint, features stay anchored and relations can be preserved in the incremental learning process.
  • Figure 2: Illustration of HASTEN. Left: Hierarchical semantic tree building and hyperbolic projection. We use GPT-5 to generate a tree-structured semantic hierarchy and design task-specific hierarchical perception modules to meet downstream knowledge requirements. Euclidean features are projected into hyperbolic space via a global hyperbolic mapping layer. Top-Right: Null space projection of the TP layer, ensuring TP does not interfere with the outputs of old tasks during updates. Bottom-Right: Illustration of the entailment loss in $\mathcal{B}^2$. This loss pushes the child embedding $\mathbf{z}_c$ into an imaginary cone projected by its paired parent embedding $\mathbf{z}_p$.
  • Figure 3: Incremental performance of different methods. At the end of each line, we report the performance gap between HASTEN and the runner-up method after the last incremental stage. All methods use the same pre-trained CLIP weights. More results are in the supplementary.
  • Figure 4: Ablation study and parameter sensitivity analysis.
  • Figure 5: t-SNE (van der Maaten and Hinton, 2008) visualizations of visual and textual features on CIFAR100 B0 Inc5. We show the feature distributions of old classes and new classes without (left) and with (right) hierarchy.
  • ...and 10 more figures
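
The entailment loss described in the Figure 2 caption can be sketched as follows. This uses the Poincaré-ball entailment-cone formulation of Ganea et al. (2018) as a stand-in, since the paper's own loss is not reproduced in this summary: each parent embedding defines a cone whose half-aperture shrinks as the parent moves toward the boundary, and a child is penalized by how far its exterior angle exceeds that aperture. The constant `K = 0.1` and all function names are assumptions.

```python
import numpy as np

def half_aperture(parent, K=0.1, eps=1e-7):
    """Half-aperture of the cone at `parent`; only meaningful away from the origin."""
    norm = np.maximum(np.linalg.norm(parent, axis=-1), eps)
    return np.arcsin(np.clip(K * (1 - norm ** 2) / norm, -1.0, 1.0))

def exterior_angle(parent, child, eps=1e-7):
    """Angle at `parent` between the cone axis and the geodesic to `child`."""
    dot = np.sum(parent * child, axis=-1)
    p2 = np.sum(parent ** 2, axis=-1)
    c2 = np.sum(child ** 2, axis=-1)
    num = dot * (1 + p2) - p2 * (1 + c2)
    den = (np.sqrt(p2) * np.linalg.norm(parent - child, axis=-1)
           * np.sqrt(np.maximum(1 + p2 * c2 - 2 * dot, 0.0)))
    return np.arccos(np.clip(num / np.maximum(den, eps), -1.0, 1.0))

def entailment_loss(parent, child, K=0.1):
    """Zero when `child` lies inside the parent's cone, positive otherwise."""
    return np.maximum(0.0, exterior_angle(parent, child) - half_aperture(parent, K))

z_p = np.array([0.5, 0.0])           # hypothetical parent embedding in B^2
z_inside = np.array([0.8, 0.0])      # same direction, farther out: inside the cone
z_outside = np.array([-0.8, 0.0])    # opposite side of the ball: outside the cone

loss_inside = entailment_loss(z_p, z_inside)
loss_outside = entailment_loss(z_p, z_outside)
```

Here `loss_inside` is zero because the child sits on the cone's axis, while `loss_outside` is large; minimizing such a loss would move the child toward its parent's cone.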