A streamlined Approach to Multimodal Few-Shot Class Incremental Learning for Fine-Grained Datasets

Thang Doan, Sima Behpour, Xin Li, Wenbin He, Liang Gou, Liu Ren

TL;DR

This work tackles Few-Shot Class-Incremental Learning (FSCIL) in fine-grained domains with a minimalist, parameter-efficient Vision-Language Model approach. It introduces two modules: Session-Specific Prompts (SSP), which enhance the cross-session separability of image-text embeddings, and a Hyperbolic distance, which tightens intra-class proximity while expanding inter-class separation in hyperbolic space. The method (CLIP-M$^3$) trains only a small set of prompts and freezes the vision prompts during incremental steps. It achieves an average 10-point improvement on fine-grained benchmarks with at least 8 times fewer trainable parameters, and is further validated on three new fine-grained datasets. Ablations show that SSP boosts performance on fine-grained tasks and that the Hyperbolic distance contributes to better metric learning, highlighting practical gains in real-world, data-scarce scenarios.

Abstract

Few-shot Class-Incremental Learning (FSCIL) poses the challenge of retaining prior knowledge while learning from limited new data streams, all without overfitting. The rise of Vision-Language models (VLMs) has unlocked numerous applications that leverage the models' existing knowledge by fine-tuning on custom data. However, training the whole model is computationally prohibitive, and VLMs, while versatile in general domains, still struggle with the fine-grained datasets that are crucial for many applications. We tackle these challenges with two simple modules. The first, Session-Specific Prompts (SSP), enhances the separability of image-text embeddings across sessions. The second, Hyperbolic distance, compresses representations of image-text pairs within the same class while expanding those from different classes, leading to better representations. Experimental results demonstrate an average 10-point increase compared to baselines while requiring at least 8 times fewer trainable parameters. This improvement is further underscored on our three newly introduced fine-grained datasets.
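
To make the Hyperbolic distance idea concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not state which model of hyperbolic space is used, so the sketch assumes the Poincaré ball with curvature -1. The function names (`expmap0`, `poincare_distance`, `distance_cross_entropy`) and the `temperature` value are hypothetical. Embeddings are projected into the ball, and negative pairwise distances serve as logits for a cross-entropy loss, so matching image-text pairs are pulled together while mismatched pairs are pushed apart.

```python
import numpy as np


def expmap0(v, eps=1e-7):
    """Project Euclidean vectors into the Poincare ball (assumed curvature -1)
    using the exponential map at the origin."""
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    return np.tanh(norm) * v / norm


def poincare_distance(x, y, eps=1e-7):
    """Pairwise geodesic distance between rows of x (n, d) and y (m, d)."""
    sq_diff = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    x_fac = 1.0 - np.sum(x ** 2, axis=-1)  # shape (n,)
    y_fac = 1.0 - np.sum(y ** 2, axis=-1)  # shape (m,)
    denom = np.maximum(x_fac[:, None] * y_fac[None, :], eps)
    # Clamp at 1 so rounding errors never push arccosh out of its domain.
    return np.arccosh(np.maximum(1.0 + 2.0 * sq_diff / denom, 1.0))


def distance_cross_entropy(image_emb, text_emb, labels, temperature=0.1):
    """Cross-entropy over negative hyperbolic distances.

    image_emb: (n, d) points in the ball; text_emb: (c, d) one point per class;
    labels: (n,) integer class index of each image. temperature is an
    illustrative value, not one taken from the paper.
    """
    logits = -poincare_distance(image_emb, text_emb) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


# Toy usage: two classes whose image and text embeddings coincide.
image_emb = expmap0(np.array([[1.0, 0.0], [0.0, 1.0]]))
text_emb = expmap0(np.array([[1.0, 0.0], [0.0, 1.0]]))
loss_matched = distance_cross_entropy(image_emb, text_emb, np.array([0, 1]))
loss_swapped = distance_cross_entropy(image_emb, text_emb, np.array([1, 0]))
```

In this toy setup the loss is lower when each image is labeled with its nearest text embedding (`loss_matched`) than when the labels are swapped (`loss_swapped`), which is the behavior the abstract describes: same-class pairs are compressed and different-class pairs are expanded.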

Paper Structure (29 sections, 10 equations, 10 figures, 7 tables)

Figures (10)

  • Figure 1: Overview of CLIP-M$^3$. During training, the text and feature prompts are interleaved within the transformer layers. Each output is then projected into Hyperbolic space and paired through a cross-entropy loss function (left). In the incremental session (right), the previously learned Session-Specific Prompts (dark green) and class prototypes (dark blue) are incorporated into the cross-entropy loss function. Note that the weights of the vision prompts (yellow) are frozen and only the text prompts are trained.
  • Figure 2: Accuracy evolution across the three fine-grained datasets.
  • Figure 3: Distinct Separability of Image-Text Pairs in Coarse-Grained Datasets Across Sessions.
  • Figure 4: Influence of $SSP$ Module on Image-Text Representation Across Sessions. Without this module (second and fourth columns), image-text embeddings across sessions tend to be more intertwined and closely packed. Adding the $SSP$ module (first and third columns) yields clearer separation between sessions.
  • Figure 5: Heatmap of distances between class prototypes and text features for CUB200 and StanfordCars at Session 9.
  • ...and 5 more figures