Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model

Danni Yang, Ruohan Dong, Jiayi Ji, Yiwei Ma, Haowei Wang, Xiaoshuai Sun, Rongrong Ji

TL;DR

This work investigates phrase-level grounding using text-to-image diffusion models by reformulating Panoptic Narrative Grounding (PNG) as a zero-shot localization-segmentation-refinement problem. It introduces DiffPNG, a pipeline that leverages cross-attention for anchor localization, self-attention for segmentation, and a SAM-based refinement step to produce high-quality masks without training data. The key contributions include the Locate-to-Segment Processor (LSP), the Subject-Focused Feature Aggregator, and the SAM-based Mask Refinement (SMR), along with comprehensive ablations demonstrating significant gains on PNG under zero-shot settings. The results show that diffusion models can achieve context-aware, phrase-level visual understanding, offering a practical path toward efficient, language-guided segmentation without annotated data.

Abstract

Recently, diffusion models have increasingly demonstrated their capabilities in vision understanding. By leveraging prompt-based learning to construct sentences, these models have shown proficiency in classification and visual grounding tasks. However, existing approaches primarily showcase their ability to perform sentence-level localization, leaving the potential for leveraging contextual information for phrase-level understanding largely unexplored. In this paper, we utilize Panoptic Narrative Grounding (PNG) as a proxy task to investigate this capability further. PNG aims to segment object instances mentioned by multiple noun phrases within a given narrative text. Specifically, we introduce the DiffPNG framework, a straightforward yet effective approach that fully capitalizes on the diffusion model's architecture for segmentation by decomposing the process into a sequence of localization, segmentation, and refinement steps. The framework first identifies anchor points using cross-attention mechanisms and then performs segmentation with self-attention to achieve zero-shot PNG. Moreover, we introduce a refinement module based on SAM to enhance the quality of the segmentation masks. Our extensive experiments on the PNG dataset demonstrate that DiffPNG achieves strong performance in the zero-shot PNG task setting, conclusively proving the diffusion model's capability for context-aware, phrase-level understanding. Source code is available at https://github.com/nini0919/DiffPNG.
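The locate-then-segment idea described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the function name `locate_then_segment`, the argmax anchor rule, the min-max normalization, and the `threshold` parameter are all assumptions chosen for illustration, and the attention maps are hand-written arrays rather than outputs of a real diffusion model. The SAM-based refinement step is omitted.

```python
import numpy as np

def locate_then_segment(cross_attn, self_attn, threshold=0.5):
    """Toy locate-then-segment step (illustrative assumption, not DiffPNG itself).

    cross_attn: shape (N,), attention of each image location to one noun phrase.
    self_attn:  shape (N, N), row i holds attention from location i to all locations.
    Returns the anchor index and a boolean mask of shape (N,).
    """
    # Localization: pick the location that responds most strongly to the phrase.
    anchor = int(np.argmax(cross_attn))
    # Segmentation: read the anchor's self-attention row as an affinity map.
    affinity = self_attn[anchor]
    lo, hi = affinity.min(), affinity.max()
    normalized = (affinity - lo) / (hi - lo + 1e-8)
    # Keep locations whose normalized affinity reaches the (assumed) threshold.
    mask = normalized >= threshold
    return anchor, mask

# Hand-made 2x2 image flattened to 4 locations (hypothetical values).
cross_attn = np.array([0.1, 0.7, 0.1, 0.1])
self_attn = np.array([
    [0.50, 0.40, 0.05, 0.05],
    [0.40, 0.50, 0.05, 0.05],
    [0.05, 0.05, 0.50, 0.40],
    [0.05, 0.05, 0.40, 0.50],
])
anchor, mask = locate_then_segment(cross_attn, self_attn)
print(anchor, mask.reshape(2, 2))
```

In this toy example, the anchor falls on the second location and the mask covers the top row of the grid, since those two locations attend strongly to each other. A real pipeline would read both attention maps from the diffusion model's layers and pass the coarse mask to a refinement stage.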

Paper Structure

This paper contains 27 sections, 15 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: A comparison between the previous fully-supervised PNG paradigm and our proposed Zero-Shot Diffusion-based Paradigm. Motivated by the strong image-text alignment of text-to-image diffusion models, we apply these generative models to the PNG task in a zero-shot manner, with the aim of exploring their ability to perform phrase-level grounding.
  • Figure 2: Overview of the proposed DiffPNG framework and its components: Feature Extraction, Locate-to-Segment Processor, Subject-Focused Feature Aggregator, and SAM-based Mask Refinement.
  • Figure 3: Qualitative comparison of our proposed DiffPNG* with the ground truth.
  • Figure 4: An example demonstrating how our model mitigates error accumulation in this multi-step approach.
  • Figure 5: An example illustrating the working principle of the SAM-based Mask Refinement (SMR) strategy.
  • ...and 3 more figures