Towards Generative Class Prompt Learning for Fine-grained Visual Recognition
Soumitri Chattopadhyay, Sanket Biswas, Emanuele Vivoli, Josep Lladós
TL;DR
This work tackles fine-grained visual recognition under domain shift by moving beyond fixed CLIP prompts to generative class prompts learned with pre-trained text-to-image diffusion models. GCPL optimizes a learnable class token inside an otherwise fixed prompt and uses a frozen diffusion model to produce visually enriched class representations from few-shot exemplars, while CoMPLe adds a contrastive loss to encourage inter-class separation. The approach uses latent diffusion models as diffusion classifiers for few-shot inference, and improves on zero-shot CLIP and existing few-shot prompting baselines across six diverse datasets, including medical and abstract domains. Despite the strong performance, the authors acknowledge high computational and memory costs, and suggest future work on the efficiency and scalability of generation-aided discriminative learning in vision-language systems.
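The diffusion-classifier inference mentioned above can be illustrated with a minimal sketch. It shows only the general decision rule of a diffusion classifier: noise the input, ask the conditional denoiser to predict that noise under each class prompt, and pick the class with the lowest average prediction error. Everything concrete here is an assumption for illustration and not the authors' implementation: `toy_predict_noise` is a hypothetical stand-in for a frozen latent diffusion denoiser, the class embeddings are plain lists rather than learned prompt tokens, and the noise schedule is simplified.

```python
import math
import random


def noise_prediction_error(predict_noise, latent, class_embedding, num_trials=64, seed=0):
    """Monte Carlo estimate of the mean squared noise-prediction error for one class.

    The same seed is reused for every class so that all classes are scored
    on identical (noise level, noise sample) pairs.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_trials):
        t = rng.uniform(0.1, 0.9)  # simplified noise level in (0, 1)
        noise = [rng.gauss(0.0, 1.0) for _ in latent]
        noisy = [math.sqrt(1.0 - t) * x + math.sqrt(t) * e for x, e in zip(latent, noise)]
        predicted = predict_noise(noisy, t, class_embedding)
        total += sum((p - e) ** 2 for p, e in zip(predicted, noise)) / len(latent)
    return total / num_trials


def classify(predict_noise, latent, class_embeddings, **kwargs):
    """Return the class whose conditioning yields the lowest denoising error."""
    errors = {
        name: noise_prediction_error(predict_noise, latent, embedding, **kwargs)
        for name, embedding in class_embeddings.items()
    }
    return min(errors, key=errors.get), errors


def toy_predict_noise(noisy, t, class_embedding):
    """Hypothetical denoiser: assumes the clean latent equals the class embedding."""
    return [(n - math.sqrt(1.0 - t) * c) / math.sqrt(t) for n, c in zip(noisy, class_embedding)]


# Usage with two made-up classes and a latent that lies close to the first one.
class_embeddings = {"sparrow": [1.0, 0.0, 0.5], "finch": [0.0, 1.0, -0.5]}
label, errors = classify(toy_predict_noise, [0.9, 0.1, 0.4], class_embeddings)
print(label, errors)
```

In a real system the denoiser would be a frozen text-conditioned diffusion model and each class embedding would come from the prompt containing that class's learned token. The cost of scoring every class with many noise samples is consistent with the computational overhead the authors acknowledge.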
Abstract
Although foundational vision-language models (VLMs) have proven very successful on various semantic discrimination tasks, they still struggle to perform faithfully on fine-grained categorization. Moreover, foundational models trained on one domain do not generalize well to a different domain without fine-tuning. We attribute these shortcomings to the limitations of the VLM's semantic representations and attempt to improve their fine-grained visual awareness using generative modeling. Specifically, we propose two novel methods: Generative Class Prompt Learning (GCPL) and Contrastive Multi-class Prompt Learning (CoMPLe). Utilizing text-to-image diffusion models, GCPL significantly improves the visio-linguistic synergy in class embeddings by conditioning on few-shot exemplars with learnable class prompts. CoMPLe builds on this foundation by introducing a contrastive learning component that encourages inter-class separation during the generative optimization process. Our empirical results demonstrate that this generative class prompt learning approach substantially outperforms existing methods, offering a better alternative for few-shot image recognition. The source code will be made available at: https://github.com/soumitri2001/GCPL.
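To make the idea of adding a contrastive component to a generative objective more concrete, here is a small sketch of one plausible form. The abstract does not specify the CoMPLe loss, so this is an assumption and not the paper's formulation: the per-class denoising errors are treated as negative logits, and a softmax cross-entropy term is added to the denoising error of the true class. The function name, the `weight` and `temperature` parameters, and the example numbers are all hypothetical.

```python
import math


def contrastive_prompt_loss(errors, true_class, weight=1.0, temperature=1.0):
    """Generative term plus a contrastive term over per-class denoising errors.

    errors: dict mapping class name -> denoising error of one image under
            that class's prompt (lower means a better fit).
    This combination is an illustrative assumption, not the paper's exact loss.
    """
    generative = errors[true_class]
    # Lower error should mean a higher score, so negate the errors.
    logits = {name: -err / temperature for name, err in errors.items()}
    max_logit = max(logits.values())
    log_partition = max_logit + math.log(
        sum(math.exp(value - max_logit) for value in logits.values())
    )
    # Cross-entropy of the true class: small when the true class has the
    # lowest error by a clear margin, large when another class fits better.
    contrastive = log_partition - logits[true_class]
    return generative + weight * contrastive


# Usage: the same made-up errors scored against each candidate label.
errors = {"sparrow": 0.1, "finch": 0.9}
print(contrastive_prompt_loss(errors, "sparrow"))
print(contrastive_prompt_loss(errors, "finch"))
```

Under this assumed form, minimizing the loss lowers the denoising error under the correct class prompt while raising it, relatively, under the other class prompts, which is one way to encourage the inter-class separation described above.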
