Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners

Xuehai He, Weixi Feng, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, William Yang Wang, Xin Eric Wang

TL;DR

The paper tackles the challenge of few-shot image-text matching by repurposing pre-trained diffusion models, specifically Stable Diffusion, for discriminative tasks. It introduces Discffusion, which uses cross-attention scores to quantify image-text alignment and employs attention-based prompts to fine-tune the model in a data-efficient manner. Experiments on ComVG and RefCOCOg (few-shot) and Winoground/VL-checklist (zero-shot), along with VQA evaluations, show that Discffusion outperforms CLIP-based baselines and demonstrates competitive generalization. The work highlights the potential of diffusion models for discriminative vision-language tasks and provides practical techniques like LogSumExp pooling and multi-layer cross-attention utilization to enhance performance.

Abstract

Diffusion models, such as Stable Diffusion, have shown incredible performance on text-to-image generation. Since text-to-image generation often requires models to generate visual concepts with fine-grained details and attributes specified in text prompts, can we leverage the powerful representations learned by pre-trained diffusion models for discriminative tasks such as image-text matching? To answer this question, we propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners. Our approach mainly uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information, and fine-tunes the model via efficient attention-based prompt learning to perform image-text matching. By comparing DSD with state-of-the-art methods on several benchmark datasets, we demonstrate the potential of using pre-trained diffusion models for discriminative tasks, with superior results on few-shot image-text matching.
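
To make the notion of a cross-attention score concrete, the following is a minimal sketch of standard scaled dot-product attention weights between image-side queries and text-side keys. It is not the authors' implementation: the function names, the toy feature vectors, and the choice of plain Python lists are assumptions made purely for illustration, and a real Stable Diffusion U-Net computes these maps per layer and per head on learned projections.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_map(queries, keys):
    """Scaled dot-product attention weights.

    queries: one vector per image location (hypothetical latent features)
    keys:    one vector per text token (hypothetical text features)
    Returns a matrix with one row per image location and one column per token;
    each row is a probability distribution over the text tokens.
    """
    d = len(keys[0])
    scale = 1.0 / math.sqrt(d)
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]

# Toy inputs (assumed): 3 image locations, 2 text tokens, feature dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
attn = cross_attention_map(queries, keys)
```

In this toy case the first image location attends mostly to the first token, the second to the second token, and the third splits its attention evenly. Turning such a map into a single image-text matching score requires a further pooling step, which the paper handles with its own scoring design.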

Paper Structure (42 sections, 6 equations, 11 figures, 12 tables, 1 algorithm)

Figures (11)

  • Figure 1: The upper subfigure in the teaser image illustrates the ability of Stable Diffusion to generate realistic images given a text prompt. The bottom subfigure illustrates the process of our proposed method, Discriminative Stable Diffusion (Discffusion), for utilizing Stable Diffusion for the image-text matching task. Discffusion can output a matching score for a given text prompt and image, with a higher score indicating a stronger match.
  • Figure 2: Overview of our Discriminative Stable Diffusion framework, which measures how well the given images and texts match using the cross-attention mechanism in Stable Diffusion. Discriminative Stable Diffusion adds prompt embeddings over the attention matrices (red boxes). We then fine-tune the weights under the few-shot setting.
  • Figure 3: Ablation study on the number of attention maps used from layers of the U-Net (x-axis). The y-axis represents the accuracy on the ComVG dataset. Tests cover two variants of Stable Diffusion v2, trained as standard noise-prediction models on 512x512 images and on 768x768 images.
  • Figure 4: Ablation study on the number of attention heads in the U-Net (x-axis; five in total in Stable Diffusion) against few-shot performance on the ComVG dataset (y-axis) under two scenarios: using the average of all attention maps and using our dynamic attention head weighting method. The results illustrate the superiority of our weighting method.
  • Figure 5: Ablation study comparing cosine similarity, the maximum value from each column of the attention map, and the smoothed maximum (LogSumExp pooling), and comparing the amount of noise added during the diffusion process: consistent noise levels of $0.4$ and $0.8$, and ensembling.
  • ...and 6 more figures
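
The Figure 5 caption contrasts taking the maximum of each attention-map column with a smoothed maximum via LogSumExp pooling. The sketch below shows only the generic relationship between the two operators on a toy map. The attention values and the helper names `logsumexp` and `column_scores` are illustrative assumptions, not the paper's code, and the paper's exact pooling axis, scaling, and aggregation across tokens are not reproduced here.

```python
import math

def logsumexp(values):
    """Smooth maximum: log(sum(exp(v))), computed stably by shifting by the max."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def column_scores(attn, pool):
    """Pool each column (one per text token) over all rows (image locations)."""
    n_cols = len(attn[0])
    return [pool([row[j] for row in attn]) for j in range(n_cols)]

# Toy attention map (assumed values): 3 image locations x 2 text tokens.
attn = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
]

hard = column_scores(attn, max)          # only the single strongest location counts
smooth = column_scores(attn, logsumexp)  # every location contributes to the score
```

For a column of n values, LogSumExp always lies between the hard maximum and the hard maximum plus log(n). Unlike the hard maximum, it depends on every entry in the column, which is the usual motivation for using it as a differentiable stand-in for max during fine-tuning.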