Dynamic Prompting of Frozen Text-to-Image Diffusion Models for Panoptic Narrative Grounding

Hongyu Li, Tianrui Hui, Zihan Ding, Jing Zhang, Bin Ma, Xiaoming Wei, Jizhong Han, Si Liu

TL;DR

This paper tackles panoptic narrative grounding by aiming for fine-grained pixel-phrase alignment in complex narratives. It introduces dynamic prompting through an Extractive-Injective Phrase Adapter (EIPA) and a Multi-Level Mutual Aggregation (MLMA) module to enable bidirectional vision-language interaction and multi-scale feature fusion within a frozen diffusion backbone. The approach achieves state-of-the-art results on the PNG benchmark, demonstrating effective transfer of generative diffusion pretraining to a discriminative grounding task. The findings highlight the potential of dynamic prompts and cross-modal fusion for open-vocabulary, narrative-grounded segmentation.

Abstract

Panoptic narrative grounding (PNG), whose core target is fine-grained image-text alignment, requires a panoptic segmentation of referred objects given a narrative caption. Previous discriminative methods achieve only weak or coarse-grained alignment by panoptic segmentation pretraining or CLIP model adaptation. Given the recent progress of text-to-image Diffusion models, several works have shown their capability to achieve fine-grained image-text alignment through cross-attention maps and improved general segmentation performance. However, the direct use of phrase features as static prompts to apply frozen Diffusion models to the PNG task still suffers from a large task gap and insufficient vision-language interaction, yielding inferior performance. Therefore, we propose an Extractive-Injective Phrase Adapter (EIPA) bypass within the Diffusion UNet to dynamically update phrase prompts with image features and inject the multimodal cues back, which more fully leverages the fine-grained image-text alignment capability of Diffusion models. In addition, we design a Multi-Level Mutual Aggregation (MLMA) module to reciprocally fuse multi-level image and phrase features for segmentation refinement. Extensive experiments on the PNG benchmark show that our method achieves new state-of-the-art performance.
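
The extract-then-inject idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the single-head attention without learned query/key/value projections, the shared bottleneck matrices `w_down` and `w_up`, and the residual updates are all assumptions made for illustration. It only shows the data flow: phrase features first query image features (extract), then image features query the updated phrases (inject), with both steps running in a reduced dimension.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    # Single-head scaled dot-product cross-attention.
    # Assumption: no learned projections, keys and values are the same tensor.
    scale = np.sqrt(queries.shape[-1])
    attn = softmax(queries @ keys_values.T / scale, axis=-1)
    return attn @ keys_values

def extractive_injective_step(image_feats, phrase_feats, w_down, w_up):
    """One hypothetical adapter step.

    image_feats:  (num_pixels, dim)  features from the frozen backbone
    phrase_feats: (num_phrases, dim) current phrase prompts
    w_down:       (dim, bottleneck)  down-projection (assumed shared)
    w_up:         (bottleneck, dim)  up-projection (assumed shared)
    """
    img_low = image_feats @ w_down
    phr_low = phrase_feats @ w_down

    # Extract: phrases gather visual cues from the image.
    new_phrases = phrase_feats + cross_attend(phr_low, img_low) @ w_up

    # Inject: image features gather cues from the updated phrases.
    new_phr_low = new_phrases @ w_down
    new_image = image_feats + cross_attend(img_low, new_phr_low) @ w_up
    return new_image, new_phrases

rng = np.random.default_rng(0)
image_feats = rng.normal(size=(16, 8))   # e.g. a 4x4 feature map, flattened
phrase_feats = rng.normal(size=(3, 8))   # three noun phrases
w_down = rng.normal(size=(8, 2))
w_up = rng.normal(size=(2, 8)) * 0.1
new_image, new_phrases = extractive_injective_step(
    image_feats, phrase_feats, w_down, w_up
)
print(new_image.shape, new_phrases.shape)
```

In this sketch only `w_down` and `w_up` would be trainable, which mirrors the idea of keeping the backbone frozen and tuning a small bypass.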

Paper Structure

This paper contains 17 sections, 14 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Static prompting of frozen Diffusion models suffers from a large task gap and insufficient vision-language interaction, leading to sub-optimal generalization on the PNG task. We propose a dynamic prompting scheme via Phrase Adapters which bidirectionally update image and text features to better leverage the fine-grained image-text alignment capability of Diffusion models.
  • Figure 2: The overall architecture of our pipeline. Input image and caption are first processed by Diffusion UNet and text encoder. An additional bypass composed of our proposed Extractive-Injective Phrase Adapter (EIPA) is introduced to update phrase features with image features, forming a bidirectional vision-language interaction. Multi-level image and phrase features obtained are further fed into our designed Multi-Level Mutual Aggregation (MLMA) module to integrate multi-level semantic information. Finally, the segmentation mask of each phrase is predicted by a Transformer decoder.
  • Figure 3: The detailed structure of Extractive-Injective Phrase Adapter (EIPA). Feature dimensions in adapters are zoomed in and out to reduce the number of tuned parameters.
  • Figure 4: Average Recall curves of our model ablations in Table \ref{tab:ablation:components}, (a) comparing four component analysis ablations, disaggregated into (b) things and stuff categories, and (c) singular and plural noun phrases.
  • Figure 5: Visualization of cross-attention maps in different layers ($L$) of our EIPA. We assign the most matched phrase label to each pixel to illustrate the overall effect of cross-attention.
  • ...and 1 more figure
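
The Figure 5 caption describes assigning the most matched phrase label to each pixel. A minimal sketch of that step, assuming the cross-attention maps are available as an array of shape (num_phrases, H, W), is a per-pixel argmax over the phrase axis. The function name and the toy values below are hypothetical.

```python
import numpy as np

def phrase_label_map(attn_maps):
    # attn_maps: (num_phrases, H, W) attention score of each phrase at each pixel.
    # Returns an (H, W) integer map holding the index of the best-matching phrase.
    return np.argmax(attn_maps, axis=0)

# Toy example: two phrases on a 2x2 grid.
attn_maps = np.array([
    [[0.9, 0.2],
     [0.4, 0.1]],   # phrase 0
    [[0.1, 0.8],
     [0.6, 0.3]],   # phrase 1
])
labels = phrase_label_map(attn_maps)
print(labels)
```

Here phrase 0 wins only at the top-left pixel and phrase 1 wins at the other three, so the label map is `[[0, 1], [1, 1]]`.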