Enriching Phrases with Coupled Pixel and Object Contexts for Panoptic Narrative Grounding
Tianrui Hui, Zihan Ding, Junshi Huang, Xiaoming Wei, Xiaolin Wei, Jiao Dai, Jizhong Han, Si Liu
TL;DR
This work addresses Panoptic Narrative Grounding (PNG), in which the image regions described by the noun phrases of a caption must be segmented at the pixel level. It introduces a Phrase-Pixel-Object Transformer Decoder (PPO-TD) that jointly aggregates fine-grained pixel contexts and coarse-grained object cues: phrase features are concatenated with learnable object tokens and used as queries against multi-scale image features, with masked cross-attention and interleaved self-attention. To refine the association between phrases and object contexts, a Phrase-Object Contrastive Loss (POCL) uses Hungarian matching to align phrases with ground-truth object tokens and applies a BCE-based loss that encourages matched pairs and discourages unmatched ones. The approach sets a new state of the art on the PNG benchmark, outperforming prior methods by large margins by improving visual-linguistic interaction and object-context grounding. The richer and more accurate cross-modal representations it provides are relevant to applications that require precise, phrase-level segmentation guided by natural language, such as image editing and human–robot interaction.
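The query construction and masked cross-attention described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the single-head dot-product attention, the residual update, and the fallback for an empty mask are all assumptions made for illustration, and learned projections, multi-scale features, and the interleaved self-attention are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def masked_cross_attention(queries, pixels, allowed):
    """Single-head masked cross-attention (illustrative sketch only).

    queries: list of d-dim vectors, phrase features followed by object tokens.
    pixels:  list of d-dim pixel features, used as both keys and values.
    allowed: allowed[i][k] is True if query i may attend to pixel k.
    Returns the queries updated with their aggregated pixel context.
    """
    d = len(pixels[0])
    scale = 1.0 / math.sqrt(d)
    updated = []
    for q, row in zip(queries, allowed):
        # Assumption for this sketch: an empty mask falls back to full attention.
        if not any(row):
            row = [True] * len(pixels)
        idx = [k for k, ok in enumerate(row) if ok]
        scores = [scale * sum(a * b for a, b in zip(q, pixels[k])) for k in idx]
        weights = softmax(scores)
        context = [sum(w * pixels[k][c] for w, k in zip(weights, idx))
                   for c in range(d)]
        # Residual update: each query is enriched with its pixel context.
        updated.append([a + b for a, b in zip(q, context)])
    return updated


# Hypothetical toy inputs: one phrase feature and one object token share
# a single query sequence, as in the concatenated-query design.
phrase_features = [[0.0, 0.0]]
object_tokens = [[0.0, 0.0]]
queries = phrase_features + object_tokens
pixels = [[1.0, 0.0], [0.0, 1.0]]
allowed = [[True, False], [True, True]]
enriched = masked_cross_attention(queries, pixels, allowed)
```

In this toy setup the phrase query is restricted to the first pixel, while the object token attends to both, so the two queries pick up different contexts from the same image features.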
Abstract
Panoptic narrative grounding (PNG) aims to segment the things and stuff objects in an image that are described by the noun phrases of a narrative caption. As a multimodal task, an essential aspect of PNG is the visual-linguistic interaction between image and caption. The previous two-stage method aggregates visual contexts from offline-generated mask proposals into phrase features, and these proposals tend to be noisy and fragmentary. The recent one-stage method aggregates only pixel contexts from image features into phrase features, which may incur semantic misalignment due to the lack of object priors. To realize more comprehensive visual-linguistic interaction, we propose to enrich phrases with coupled pixel and object contexts by designing a Phrase-Pixel-Object Transformer Decoder (PPO-TD), in which both fine-grained part details and coarse-grained entity clues are aggregated into phrase features. We also propose a Phrase-Object Contrastive Loss (POCL) that pulls matched phrase-object pairs closer and pushes unmatched ones apart, so that more precise object contexts are aggregated from more phrase-relevant object tokens. Extensive experiments on the PNG benchmark show that our method achieves new state-of-the-art performance by large margins.
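The pull-and-push behaviour of a phrase-object loss can be sketched as a binary cross-entropy over pairwise similarities. The sketch below is an illustration under stated assumptions, not the paper's formulation: the function names are hypothetical, similarity is taken to be a plain dot product passed through a sigmoid, and the binary `matched` matrix is assumed to be given (for example, derived beforehand from a matching step).

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def phrase_object_bce(phrases, objects, matched):
    """Mean BCE over all phrase-object pairs (illustrative sketch only).

    phrases: list of d-dim phrase features.
    objects: list of d-dim object tokens.
    matched: matched[i][j] is 1 if phrase i and object token j form a
             matched pair, else 0 (assumed to be precomputed).
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    count = 0
    for i, p in enumerate(phrases):
        for j, o in enumerate(objects):
            prob = sigmoid(sum(a * b for a, b in zip(p, o)))
            y = matched[i][j]
            total += -(y * math.log(prob + eps)
                       + (1 - y) * math.log(1.0 - prob + eps))
            count += 1
    return total / count


# Hypothetical toy inputs: each phrase is aligned with exactly one token.
phrases = [[4.0, 0.0], [0.0, 4.0]]
objects = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = phrase_object_bce(phrases, objects, [[1, 0], [0, 1]])
loss_swapped = phrase_object_bce(phrases, objects, [[0, 1], [1, 0]])
```

With these toy inputs, the loss is lower when the labels agree with the pairs that already have high similarity than when the labels are swapped, which is the gradient signal that pulls matched pairs together and pushes unmatched pairs apart.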
