Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning

Hairui Ren, Fan Tang, He Zhao, Zixuan Wang, Dandan Guo, Yi Chang

TL;DR

This paper tackles the challenge of obtaining high-quality pseudo-labels in unsupervised prompt learning (UPL) for vision–language models by introducing AiR, a diffusion-guided framework that augments discriminative richness. AiR builds an auxiliary image-based classifier from synthetic samples generated by a LoRA-fine-tuned Stable Diffusion model and fuses its predictions with those of a text-based CLIP classifier to yield more accurate pseudo-labels and stronger semantic–visual alignment. The approach yields consistent, state-of-the-art improvements across five datasets and three learning paradigms (UL, SSL, TRZSL), highlighting the value of bridging text–image and image–image discrimination through diverse, domain-aligned synthetic data. The work demonstrates that diffusion-generated images, when properly aligned with the downstream domain, improve both pseudo-label quality and prompt learning, offering a practical path to reducing labeling costs when adapting VLMs to new tasks.

Abstract

Fine-tuning vision-language models (VLMs) with large amounts of unlabeled data has recently garnered significant interest. However, a key challenge remains the lack of high-quality pseudo-labeled data. Current pseudo-labeling strategies often struggle with mismatches between semantic and visual information, leading to sub-optimal performance of unsupervised prompt learning (UPL) methods. In this paper, we introduce a simple yet effective approach called Augmenting Discriminative Richness via Diffusions (AiR), which learns a richer, more discriminative representation of each class and thereby facilitates classification. Specifically, our approach includes a pseudo-label generation module that leverages high-fidelity synthetic samples to create an auxiliary classifier, which captures richer visual variation, bridging text-image-pair classification to a more robust image-image-pair classification. Additionally, we exploit the diversity of diffusion-based synthetic samples to enhance prompt learning, providing greater information for semantic-visual alignment. Extensive experiments on five public benchmarks, including RESISC45 and Flowers102, and across three learning paradigms (UL, SSL, and TRZSL) demonstrate that AiR achieves substantial and consistent performance improvements over state-of-the-art unsupervised prompt learning methods.
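
The pseudo-label fusion described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the convex-combination rule, the weight `alpha`, the temperature of 0.01, and all function names are assumptions made for illustration, since this summary does not state the exact fusion rule. The toy 2-D vectors stand in for CLIP features.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_logits(feature, class_vectors, temperature=0.01):
    """Cosine similarity to each class vector, divided by an assumed temperature."""
    f = l2_normalize(feature)
    return [sum(a * b for a, b in zip(f, l2_normalize(c))) / temperature
            for c in class_vectors]

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fused_pseudo_label(image_feat, text_embeds, image_prototypes, alpha=0.5):
    """Blend text-classifier and auxiliary image-classifier probabilities.

    alpha is a hypothetical mixing weight, not a value taken from the paper.
    """
    p_text = softmax(cosine_logits(image_feat, text_embeds))
    p_img = softmax(cosine_logits(image_feat, image_prototypes))
    fused = [alpha * t + (1 - alpha) * i for t, i in zip(p_text, p_img)]
    label = max(range(len(fused)), key=fused.__getitem__)
    return label, fused

# Toy example: the text classifier prefers class 0, while the
# image prototypes built from synthetic samples prefer class 1.
image_feat = [0.6, 0.8]
text_embeds = [[0.7, 0.7], [0.8, 0.6]]        # one text embedding per class
image_prototypes = [[1.0, 0.0], [0.0, 1.0]]   # one image prototype per class

p_text_only = softmax(cosine_logits(image_feat, text_embeds))
text_label = max(range(len(p_text_only)), key=p_text_only.__getitem__)
label, fused = fused_pseudo_label(image_feat, text_embeds, image_prototypes)
print(text_label, label)  # text alone picks 0, the fused prediction picks 1
```

In this toy case the confident image-image prediction overrides a weakly wrong text-image prediction, which is the kind of correction the auxiliary classifier is meant to provide.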

Paper Structure

This paper contains 19 sections, 10 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Left: Confusion matrix for ground-truth labels vs. text classifier predictions on the Flowers102 dataset, highlighting the persistent generation of incorrect pseudo-labels. Right: An example of misclassification in Flowers102, where "Thorn apple" is misidentified as "Giant white arum lily" because its white petals resemble the latter’s semantic information. This misclassification can be alleviated by jointly considering text and image predictions.
  • Figure 2: Overview of our proposed AiR, which consists of the ACG (Auxiliary Classifier Generation) module and the PLG (Pseudo Label Generation) module. ACG generates synthetic samples using the LoRA-fine-tuned SD model and selects representative samples by cosine similarity to build an auxiliary classifier. PLG generates pseudo-labels by fusing the predictions of the auxiliary classifier built on synthetic images with those of the text classifier. The overall loss for training the prompts consists of the classification loss on real images with pseudo-labels, $\mathcal{L}_{r}$, and the classification loss on synthetic images with their corresponding categories, $\mathcal{L}_{s}$; our method can learn both visual and textual prompts.
  • Figure 3: Comparison of top-1 test accuracy (%) across varying numbers of synthetic samples in unsupervised learning on the RESISC45 and EuroSAT datasets.
  • Figure 4: Comparison of top-1 accuracy (%) of pseudo-labels in unsupervised learning. 'Ours' denotes our approach, AiR, while 'CLIP' indicates the use of CLIP's text encoder.
  • Figure 5: Visualization of the activation regions for the text classifier and the synthetic-sample-based auxiliary classifier using CAM. 'T & S' represents the activation after merging the outputs of both classifiers.
  • ...and 2 more figures
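
The Figure 2 caption says that ACG selects representative synthetic samples by cosine similarity to build the auxiliary classifier. A rough sketch of one way this could work is given below. The caption does not say what the samples are compared against, so the `reference` vector (here the mean of the class's synthetic features), the top-`k` rule, the averaged prototype, and all function names are assumptions for illustration only.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

def select_representatives(synthetic_feats, reference, k):
    """Keep the k synthetic features most similar to the (assumed) reference."""
    ranked = sorted(synthetic_feats,
                    key=lambda f: cosine(f, reference),
                    reverse=True)
    return ranked[:k]

def build_prototype(synthetic_feats, reference, k):
    """Average the selected features into one unit-length class prototype."""
    selected = select_representatives(synthetic_feats, reference, k)
    return l2_normalize(mean_vector(selected))

# Toy features for one class; the last one is an off-domain outlier.
class_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
reference = mean_vector(class_feats)  # assumed reference: the class mean

selected = select_representatives(class_feats, reference, k=2)
prototype = build_prototype(class_feats, reference, k=2)
print(selected)
print(prototype)
```

Stacking one such prototype per class gives the weight matrix of a simple image-image classifier, which is how the auxiliary classifier is treated in this sketch.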