Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning
Hairui Ren, Fan Tang, He Zhao, Zixuan Wang, Dandan Guo, Yi Chang
TL;DR
This paper tackles the lack of high-quality pseudo-labels in unsupervised prompt learning (UPL) for vision–language models by introducing AiR, a diffusion-guided framework that augments discriminative richness. AiR builds an auxiliary image-based classifier from synthetic samples generated by a LoRA-fine-tuned Stable Diffusion model and fuses its predictions with those of a text-based CLIP classifier, producing more accurate pseudo-labels and stronger semantic–visual alignment. The approach delivers consistent improvements over state-of-the-art UPL methods across five datasets and three learning paradigms (UL, SSL, TRZSL), highlighting the value of bridging text–image and image–image discrimination through diverse, domain-aligned synthetic data. The work demonstrates that diffusion-generated images, when properly aligned with the downstream domain, improve pseudo-label quality and prompt learning, offering a practical way to reduce labeling costs when adapting VLMs to new tasks.
Abstract
Fine-tuning vision-language models (VLMs) with large amounts of unlabeled data has recently garnered significant interest. However, a key challenge remains: the lack of high-quality pseudo-labeled data. Current pseudo-labeling strategies often struggle with mismatches between semantic and visual information, leading to sub-optimal performance of unsupervised prompt learning (UPL) methods. In this paper, we introduce a simple yet effective approach called Augmenting Discriminative Richness via Diffusions (AiR), which learns a richer, more discriminative representation of each class and thus facilitates classification. Specifically, our approach includes a pseudo-label generation module that leverages high-fidelity synthetic samples to create an auxiliary classifier, which captures richer visual variation and bridges text-image-pair classification to a more robust image-image-pair classification. Additionally, we exploit the diversity of diffusion-based synthetic samples to enhance prompt learning, providing more information for semantic-visual alignment. Extensive experiments on five public benchmarks, including RESISC45 and Flowers102, and across three learning paradigms (UL, SSL, and TRZSL) demonstrate that AiR achieves substantial and consistent performance improvements over state-of-the-art unsupervised prompt learning methods.
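The pseudo-label generation idea above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the auxiliary classifier is built or how the two predictions are fused. Here we assume, purely for illustration, that the auxiliary classifier uses per-class prototypes (mean features of synthetic images) and that fusion is a weighted average of softmax probabilities. All function names, the weight `alpha`, and the temperature `scale` are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cosine_logits(feature, class_vectors, scale):
    """Scaled cosine similarity between one feature and each class vector."""
    f = normalize(feature)
    return [scale * sum(a * b for a, b in zip(f, normalize(c)))
            for c in class_vectors]

def class_prototypes(synthetic_features):
    """Assumed auxiliary classifier: mean synthetic-image feature per class."""
    protos = []
    for feats in synthetic_features:
        dim = len(feats[0])
        protos.append([sum(f[d] for f in feats) / len(feats) for d in range(dim)])
    return protos

def fused_pseudo_label(image_feature, text_embeddings, prototypes,
                       alpha=0.5, scale=100.0):
    """Fuse text-image and image-image predictions (assumed weighted average)."""
    p_text = softmax(cosine_logits(image_feature, text_embeddings, scale))
    p_image = softmax(cosine_logits(image_feature, prototypes, scale))
    fused = [alpha * t + (1.0 - alpha) * i for t, i in zip(p_text, p_image)]
    label = max(range(len(fused)), key=fused.__getitem__)
    return label, fused[label]

# Toy 2-class example with made-up 2-D features (stand-ins for CLIP features).
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
synthetic_features = [
    [[0.9, 0.1], [1.0, 0.2]],  # synthetic images for class 0
    [[0.1, 0.9], [0.2, 1.0]],  # synthetic images for class 1
]
prototypes = class_prototypes(synthetic_features)
label, confidence = fused_pseudo_label([0.8, 0.3], text_embeddings, prototypes)
```

In this toy setup the unlabeled feature lies close to both the class-0 text embedding and the class-0 prototype, so the two classifiers agree on the pseudo-label. The fusion matters most when they disagree, in which case `alpha` controls how much weight the text-based prediction receives.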
