VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders

Xuyang Liu, Siteng Huang, Yachen Kang, Honggang Chen, Donglin Wang

TL;DR

Zero-shot visual grounding localizes the object described by a natural-language expression without task-specific annotations. This work shows that pre-trained text-to-image diffusion models can be repurposed for grounding through a two-stage framework of Noise Injection and Noise Prediction, scoring each proposal with $e_\text{total}=e_\text{mask}+e_\text{crop}$. VGDiffZero isolates each proposal in a global (masked) and a local (cropped) context, and uses Faster R-CNN proposals and CLIP text embeddings to achieve strong zero-shot results on RefCOCO, RefCOCO+, and RefCOCOg. The results illustrate that diffusion-based vision-language models are viable for discriminative tasks and can reduce the need for costly fine-tuning.
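
As a hedged reading of the score above: only the sum $e_\text{total}=e_\text{mask}+e_\text{crop}$ comes from the text. The squared-error form of each term, the symbols $\epsilon_\theta$ (the UNet noise predictor), $c$ (the CLIP text embedding), $t$ (the noise timestep), and the arg-min selection over proposals $k$ are assumptions based on the standard noise-prediction objective of diffusion models, not formulas quoted from the paper.

$$
\begin{aligned}
e_\text{mask}^{(k)} &= \left\lVert \epsilon - \epsilon_\theta\!\left(Z_\text{noised}^{\text{mask},(k)},\, t,\, c\right) \right\rVert_2^2, \\
e_\text{crop}^{(k)} &= \left\lVert \epsilon - \epsilon_\theta\!\left(Z_\text{noised}^{\text{crop},(k)},\, t,\, c\right) \right\rVert_2^2, \\
e_\text{total}^{(k)} &= e_\text{mask}^{(k)} + e_\text{crop}^{(k)}, \qquad
\hat{k} = \arg\min_k \; e_\text{total}^{(k)}.
\end{aligned}
$$

Under this reading, the proposal whose masked and cropped views make the injected noise easiest to predict, given the text, is taken as the grounded region.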

Abstract

Large-scale text-to-image diffusion models have shown impressive capabilities for generative tasks by leveraging strong vision-language alignment from pre-training. However, most vision-language discriminative tasks require extensive fine-tuning on carefully labeled datasets to acquire such alignment, at great cost in time and computing resources. In this work, we explore directly applying a pre-trained generative diffusion model to the challenging discriminative task of visual grounding without any fine-tuning or additional training data. Specifically, we propose VGDiffZero, a simple yet effective zero-shot visual grounding framework based on text-to-image diffusion models. We also design a comprehensive region-scoring method that considers both the global and local contexts of each isolated proposal. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg show that VGDiffZero achieves strong performance on zero-shot visual grounding. Our code is available at https://github.com/xuyang-liu16/VGDiffZero.

Paper Structure

This paper contains 8 sections, 3 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Illustration of two types of vision-language tasks. Motivated by the strong abilities of text-to-image diffusion models, we propose VGDiffZero for zero-shot visual grounding.
  • Figure 2: Overview of our VGDiffZero. Given an input image, isolated proposals are generated via cropping and masking, and then encoded individually into latent vectors $Z_0$. Gaussian noise $\epsilon$ sampled from $\mathcal{N}(0, 1)$ is injected into each latent vector to obtain noised latent representations $Z_\text{noised}$. Subsequently, each noised latent, together with the text embeddings, is fed into the UNet for noise prediction, and the best-matching proposal is selected as the final prediction.
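
The pipeline in Figure 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `encode` and `predict_noise` are hypothetical callables standing in for the image encoder and the text-conditioned UNet, `alpha_bar` is an assumed single noise level for a DDPM-style forward step, the error is assumed to be mean squared error between sampled and predicted noise, and "best matching" is assumed to mean the lowest total error. Images are plain nested lists so the sketch needs only the standard library.

```python
import math
import random


def mask_proposal(image, box):
    """Global context: keep pixels inside the box, zero out everything else."""
    x0, y0, x1, y1 = box
    return [
        [px if (y0 <= y < y1 and x0 <= x < x1) else 0.0 for x, px in enumerate(row)]
        for y, row in enumerate(image)
    ]


def crop_proposal(image, box):
    """Local context: keep only the region inside the box."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def noise_error(latent, text_emb, predict_noise, rng, alpha_bar=0.5):
    """Inject Gaussian noise into a latent, then score the noise prediction."""
    eps = [rng.gauss(0.0, 1.0) for _ in latent]
    # Assumed DDPM-style forward step at one fixed noise level.
    noised = [
        math.sqrt(alpha_bar) * z + math.sqrt(1.0 - alpha_bar) * e
        for z, e in zip(latent, eps)
    ]
    eps_hat = predict_noise(noised, text_emb)  # hypothetical UNet stand-in
    return sum((e - p) ** 2 for e, p in zip(eps, eps_hat)) / len(eps)


def select_proposal(image, boxes, text_emb, encode, predict_noise, seed=0):
    """Return (index of the lowest-error proposal, list of e_total values)."""
    rng = random.Random(seed)
    totals = []
    for box in boxes:
        e_mask = noise_error(encode(mask_proposal(image, box)), text_emb, predict_noise, rng)
        e_crop = noise_error(encode(crop_proposal(image, box)), text_emb, predict_noise, rng)
        totals.append(e_mask + e_crop)  # e_total = e_mask + e_crop
    best = min(range(len(totals)), key=totals.__getitem__)
    return best, totals
```

In a real setting, `boxes` would come from a proposal detector such as Faster R-CNN, `text_emb` from a CLIP text encoder, and each masked or cropped view would typically be resized to the encoder's input resolution before encoding; those steps are left out here to keep the sketch self-contained.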