Dynamic Prompting of Frozen Text-to-Image Diffusion Models for Panoptic Narrative Grounding
Hongyu Li, Tianrui Hui, Zihan Ding, Jing Zhang, Bin Ma, Xiaoming Wei, Jizhong Han, Si Liu
TL;DR
This paper tackles panoptic narrative grounding (PNG), which requires fine-grained pixel-phrase alignment between an image and a narrative caption. It introduces dynamic prompting of a frozen diffusion backbone through two components: an Extractive-Injective Phrase Adapter (EIPA), which updates phrase prompts with image features and injects the resulting multimodal cues back into the UNet, and a Multi-Level Mutual Aggregation (MLMA) module, which reciprocally fuses multi-level image and phrase features. The approach achieves state-of-the-art results on the PNG benchmark, showing that generative diffusion pretraining can transfer to a discriminative grounding task. The findings highlight the potential of dynamic prompts and cross-modal fusion for open-vocabulary, narrative-grounded segmentation.
Abstract
Panoptic narrative grounding (PNG), whose core target is fine-grained image-text alignment, requires a panoptic segmentation of the objects referred to in a narrative caption. Previous discriminative methods achieve only weak or coarse-grained alignment through panoptic segmentation pretraining or CLIP model adaptation. Following recent progress in text-to-image diffusion models, several works have shown that these models can achieve fine-grained image-text alignment through cross-attention maps and can improve general segmentation performance. However, directly using phrase features as static prompts for a frozen diffusion model on the PNG task still suffers from a large task gap and insufficient vision-language interaction, yielding inferior performance. We therefore propose an Extractive-Injective Phrase Adapter (EIPA) bypass within the diffusion UNet that dynamically updates phrase prompts with image features and injects the multimodal cues back, exploiting the fine-grained image-text alignment capability of diffusion models more fully. We also design a Multi-Level Mutual Aggregation (MLMA) module that reciprocally fuses multi-level image and phrase features for segmentation refinement. Extensive experiments on the PNG benchmark show that our method achieves new state-of-the-art performance.
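The extract-then-inject idea behind EIPA can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: it assumes single-head dot-product attention with no learned projections and no UNet, and every name in it (`cross_attend`, `extract_inject_step`, the toy feature values) is hypothetical. It only shows the direction of information flow: phrase prompts first gather image evidence, then image features gather the updated phrase cues.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention (assumption: no learned projections).

    Each query row is replaced by an attention-weighted mean of the
    rows in keys_values, which serve as both keys and values.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * kv[d] for w, kv in zip(weights, keys_values))
             for d in range(dim)]
        )
    return attended


def residual_add(base, update):
    """Element-wise sum of two equally shaped lists of rows."""
    return [[x + y for x, y in zip(rb, ru)] for rb, ru in zip(base, update)]


def extract_inject_step(phrases, pixels):
    """One hypothetical extract-then-inject round.

    Extract: phrase prompts query the image features.
    Inject: image features query the updated phrase prompts.
    """
    phrases = residual_add(phrases, cross_attend(phrases, pixels))
    pixels = residual_add(pixels, cross_attend(pixels, phrases))
    return phrases, pixels


# Toy inputs (hypothetical): 2 phrase prompts and 3 pixel features, dim 2.
phrases = [[1.0, 0.0], [0.0, 1.0]]
pixels = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
new_phrases, new_pixels = extract_inject_step(phrases, pixels)
```

In this toy setup the prompts are no longer static: each phrase vector shifts toward the pixel features it attends to most strongly, and the pixel features in turn absorb the updated phrase cues. A real adapter would add learned projections and operate on intermediate UNet activations.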
