Towards Generative Class Prompt Learning for Fine-grained Visual Recognition

Soumitri Chattopadhyay, Sanket Biswas, Emanuele Vivoli, Josep Lladós

TL;DR

This work tackles fine-grained visual recognition under domain shift by moving beyond fixed CLIP prompts to generative class prompts learned with pre-trained text-to-image diffusion models. GCPL optimizes a learnable class token inside a fixed prompt and uses a frozen diffusion model to produce visually enriched class representations from few-shot exemplars, while CoMPLe adds a contrastive loss to encourage inter-class separation. The approach uses latent diffusion models and diffusion classifiers for few-shot inference, and improves on zero-shot CLIP and existing few-shot prompting baselines across six diverse datasets, including medical and abstract domains. The authors acknowledge high computational and memory costs, and suggest future work on the efficiency and scalability of generation-aided discriminative learning in vision-language systems.

Abstract

Although foundational vision-language models (VLMs) have proven to be very successful for various semantic discrimination tasks, they still struggle to perform faithfully for fine-grained categorization. Moreover, foundational models trained on one domain do not generalize well to a different domain without fine-tuning. We attribute these shortcomings to the limitations of the VLM's semantic representations and attempt to improve their fine-grained visual awareness using generative modeling. Specifically, we propose two novel methods: Generative Class Prompt Learning (GCPL) and Contrastive Multi-class Prompt Learning (CoMPLe). Utilizing text-to-image diffusion models, GCPL significantly improves the visio-linguistic synergy in class embeddings by conditioning on few-shot exemplars with learnable class prompts. CoMPLe builds on this foundation by introducing a contrastive learning component that encourages inter-class separation during the generative optimization process. Our empirical results demonstrate that such a generative class prompt learning approach substantially outperforms existing methods, offering a better alternative for few-shot image recognition challenges. The source code will be made available at: https://github.com/soumitri2001/GCPL.
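
To make the idea of learning a class prompt through a frozen generative model concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a stand-in "denoiser" made of fixed linear maps (`W_x`, `W_c`), a single assumed noise level `alpha_bar`, and hypothetical sizes `D`, `E`, `N`; a real setup would pass the learnable token through a text encoder and a latent diffusion model. The only trainable quantity is the class embedding `c`, updated by gradient descent on a noise-prediction loss computed from a few exemplars.

```python
import numpy as np

rng = np.random.default_rng(0)
D, E, N = 8, 4, 5  # toy latent size, embedding size, number of exemplars (all assumed)

# Frozen stand-in for the denoiser: fixed linear maps, NOT a real latent diffusion model.
W_x = 0.1 * rng.normal(size=(D, D))
W_c = rng.normal(size=(D, E))

def predict_noise(x_t, c):
    """Toy noise prediction conditioned on the class embedding c."""
    return x_t @ W_x.T + W_c @ c

def denoising_loss(c, x_t, eps):
    """Mean squared error between predicted and true noise."""
    return float(np.mean(np.sum((predict_noise(x_t, c) - eps) ** 2, axis=1)))

# Few-shot "exemplars" and one set of noise draws, kept fixed so the run is deterministic.
x0 = rng.normal(size=(N, D))
eps = rng.normal(size=(N, D))
alpha_bar = 0.5  # single assumed noise level
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

c = np.zeros(E)  # the only trainable quantity: the class embedding
# Step size 1/L, where L = 2 * sigma_max(W_c)^2 bounds the curvature of this quadratic loss.
lr = 1.0 / (2.0 * np.linalg.norm(W_c, 2) ** 2)
losses = [denoising_loss(c, x_t, eps)]
for _ in range(200):
    residual = predict_noise(x_t, c) - eps        # shape (N, D)
    grad = 2.0 * np.mean(residual @ W_c, axis=0)  # gradient with respect to c only
    c = c - lr * grad                             # W_x and W_c stay frozen
    losses.append(denoising_loss(c, x_t, eps))

print(f"loss before: {losses[0]:.3f}, after: {losses[-1]:.3f}")
```

The design point the sketch tries to capture is that the generative model's weights never change; all class-specific information has to be absorbed by the embedding that conditions it.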

Paper Structure

This paper contains 16 sections, 5 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of our approach compared to existing VLM adaptation methods. (a) Zero-shot inference with CLIP; (b) Contextual prompt token learning; (c) Adapter-based tuning with handcrafted prompts on frozen CLIP representations; (d) Our setup (GCPL): generatively learning the [CLASS] token by prompting a frozen text-to-image latent diffusion model (LDM) [ldm].
  • Figure 2: Contrastive multi-class prompt learning (CoMPLe) framework. CoMPLe learns class prompts by optimizing the LDM loss with respect to the trainable class token: it minimizes the noise reconstruction error for the ground-truth noise while maximizing it for the noises of other classes. Red arrows denote "maximize" and blue arrows denote "minimize". The diffusion classifier [zsdc_pathak] uses our few-shot learned [CLASS] embeddings for inference.
  • Figure 3: Few-shot performance across varying numbers of shots per class, over the 7 datasets used in this work. Note that the first sub-figure depicts the mean few-shot performance over all datasets (following prior works [maple, cocoop]).
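
The "minimize for the true class, maximize for the others" objective in the Figure 2 caption, and the diffusion-classifier inference step, can be illustrated with a small sketch. This is one possible reading under stated assumptions, not the paper's formulation: the contrastive term is written here as a softmax cross-entropy over negative per-class noise-prediction errors with an assumed temperature `tau`, and inference picks the class whose embedding gives the lowest average error. The function names and the toy numbers are hypothetical.

```python
import numpy as np

def class_errors(pred_noise, eps):
    """Squared noise-prediction error per class.

    pred_noise: (K, D) array, one prediction per candidate class embedding.
    eps: (D,) array, the noise actually added to the image latent.
    """
    return np.sum((pred_noise - eps) ** 2, axis=1)

def contrastive_loss(errors, label, tau=1.0):
    """Cross-entropy over negative errors (an assumed contrastive form).

    Lowering this pushes the true class's error down and the others' up.
    """
    logits = -errors / tau
    logits = logits - logits.max()  # subtract the max for numerical stability
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return float(-log_probs[label])

def classify(errors_per_trial):
    """Pick the class with the lowest mean error over trials; input shape (T, K)."""
    return int(np.argmin(errors_per_trial.mean(axis=0)))

# Toy example: three classes; the first prediction is closest to the true noise.
eps = np.zeros(3)
pred = np.array([[0.1, 0.0, 0.0],
                 [1.0, 1.0, 1.0],
                 [2.0, 0.0, 0.0]])
errors = class_errors(pred, eps)
predicted = classify(errors[None, :])
print("predicted class:", predicted)
print("loss if label is 0:", contrastive_loss(errors, 0))
print("loss if label is 1:", contrastive_loss(errors, 1))
```

In this toy case the loss is small when the labeled class already has the lowest error and large otherwise, which is the behavior a contrastive term of this kind is meant to reward.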