Enriching Phrases with Coupled Pixel and Object Contexts for Panoptic Narrative Grounding

Tianrui Hui, Zihan Ding, Junshi Huang, Xiaoming Wei, Xiaolin Wei, Jiao Dai, Jizhong Han, Si Liu

TL;DR

This work addresses Panoptic Narrative Grounding (PNG), where phrases describing image regions must be segmented at the pixel level. It introduces a Phrase-Pixel-Object Transformer Decoder (PPO-TD) that jointly aggregates fine-grained pixel contexts and coarse-grained object cues by using concatenated phrase features and learnable object tokens as queries against multi-scale image features, with masked cross-attention and interleaved self-attention. To further refine object context associations, a Phrase-Object Contrastive Loss (POCL) leverages Hungarian matching to align phrases with ground-truth object tokens and applies a BCE-based loss to encourage matched pairs and discourage unmatched pairs. The approach sets a new state of the art on the PNG benchmark, outperforming prior methods by large margins through better visual-linguistic interaction and object-context grounding. Richer and more accurate cross-modal representations of this kind are relevant to applications that need precise, phrase-level segmentation guided by natural language, such as image editing and human–robot interaction.

Abstract

Panoptic narrative grounding (PNG) aims to segment things and stuff objects in an image described by noun phrases of a narrative caption. As a multimodal task, an essential aspect of PNG is the visual-linguistic interaction between image and caption. The previous two-stage method aggregates visual contexts from offline-generated mask proposals to phrase features, which tend to be noisy and fragmentary. The recent one-stage method aggregates only pixel contexts from image features to phrase features, which may incur semantic misalignment due to lacking object priors. To realize more comprehensive visual-linguistic interaction, we propose to enrich phrases with coupled pixel and object contexts by designing a Phrase-Pixel-Object Transformer Decoder (PPO-TD), where both fine-grained part details and coarse-grained entity clues are aggregated to phrase features. In addition, we propose a Phrase-Object Contrastive Loss (POCL) to pull closer the matched phrase-object pairs and push away unmatched ones for aggregating more precise object contexts from more phrase-relevant object tokens. Extensive experiments on the PNG benchmark show our method achieves new state-of-the-art performance with large margins.
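
As a rough illustration of the Phrase-Object Contrastive Loss idea, the sketch below computes a binary cross-entropy over phrase-object similarity scores. The summary states only that the loss is BCE-based and that Hungarian matching decides which pairs count as matched. Everything else here is an assumption made for illustration and not the paper's implementation: the function name `phrase_object_bce`, the dot-product similarity passed through a sigmoid, the mean reduction, and the toy features. The 0/1 `matched` matrix is taken as given and stands in for the output of the matching step.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def phrase_object_bce(phrases, objects, matched):
    """Mean binary cross-entropy over all phrase-object pairs.

    phrases: list of P feature vectors (lists of floats)
    objects: list of O object-token vectors of the same dimension
    matched: P x O matrix of 0/1 targets, assumed to come from a
             separate matching step that is not computed here
    """
    eps = 1e-12  # guards against log(0)
    total, count = 0.0, 0
    for phrase, target_row in zip(phrases, matched):
        for obj, target in zip(objects, target_row):
            # Dot-product similarity is an assumption of this sketch.
            sim = sum(a * b for a, b in zip(phrase, obj))
            prob = sigmoid(sim)
            total -= target * math.log(prob + eps)
            total -= (1 - target) * math.log(1.0 - prob + eps)
            count += 1
    return total / count


# Toy features: each phrase is most similar to the object with the same index.
phrases = [[2.0, 0.0], [0.0, 2.0]]
objects = [[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]]
aligned = [[1, 0, 0], [0, 1, 0]]  # targets agree with the similarities
swapped = [[0, 1, 0], [1, 0, 0]]  # targets contradict the similarities

loss_aligned = phrase_object_bce(phrases, objects, aligned)
loss_swapped = phrase_object_bce(phrases, objects, swapped)
```

With these toy inputs, `loss_aligned` is smaller than `loss_swapped`. Minimizing such a loss therefore raises similarity for matched pairs and lowers it for unmatched ones, which is the pull-closer, push-away behavior the abstract describes.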

Paper Structure (16 sections, 11 equations, 4 figures, 4 tables)

This paper contains 16 sections, 11 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparison of visual-linguistic interaction schemes between previous methods and ours. (a) The previous two-stage method aggregates proposal contexts from offline-generated mask region proposals, which tend to be fragmentary and noisy. (b) The previous one-stage method aggregates pixel contexts directly from image features, but the lack of object-level contexts tends to cause semantic misalignment between phrases and pixels. (c) Our method enriches phrases with coupled pixel and object contexts containing both fine-grained part details and coarse-grained entity clues, forming a more comprehensive visual-linguistic interaction.
  • Figure 2: Overview of our method. Image and phrase features are extracted by vision and language encoders. A pixel decoder further refines the multi-scale image features. Our proposed Phrase-Pixel-Object Transformer Decoder takes the concatenation of phrase features and learnable object tokens as queries and multi-scale image features as keys and values, where phrases are enriched with coupled pixel and object contexts to form a more comprehensive visual-linguistic interaction. A Phrase-Object Contrastive Loss is also proposed to increase feature similarities between matched phrase-object pairs and decrease those between unmatched ones so that more phrase-relevant object contexts are aggregated. The final mask predictions are obtained by the inner product between phrase and image features. The parameters of the language encoder, vision encoder, and pixel decoder are fixed during training.
  • Figure 3: Average recall curves for our method (a) compared with state-of-the-art methods, and disaggregated into (b) things and stuff categories and (c) singular and plural noun phrases.
  • Figure 4: Qualitative results of our method. The colors of phrases correspond to those of segments.
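
The Figure 2 caption describes two mechanical steps that a small sketch can make concrete: the decoder queries are the concatenation of phrase features and learnable object tokens, and the final masks come from the inner product between phrase and image features. The code below is a minimal pure-Python sketch of just those two steps. The names `build_queries` and `predict_masks`, the toy shapes, and the rule of binarizing at a logit of zero (equivalent to a sigmoid threshold of 0.5) are assumptions for illustration. The attention layers of the actual decoder are not modeled.

```python
def build_queries(phrase_feats, object_tokens):
    """Concatenate phrase features and object tokens along the query axis."""
    return list(phrase_feats) + list(object_tokens)


def predict_masks(phrase_feats, pixel_feats):
    """Return one binary mask per phrase from phrase-pixel inner products.

    phrase_feats: list of P vectors with C channels each
    pixel_feats:  H x W grid (nested lists) of C-channel vectors
    A pixel is assigned to a phrase when its logit is positive, which is
    the same as sigmoid(logit) > 0.5 (a thresholding assumption).
    """
    masks = []
    for phrase in phrase_feats:
        mask = []
        for row in pixel_feats:
            mask.append([
                1 if sum(a * b for a, b in zip(phrase, pixel)) > 0 else 0
                for pixel in row
            ])
        masks.append(mask)
    return masks


# Toy example: 2 phrases, 3 object tokens, a 2 x 2 feature map, 2 channels.
phrase_feats = [[1.0, -1.0], [-1.0, 1.0]]
object_tokens = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # placeholder values
pixel_feats = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [-1.0, -1.0]],
]

queries = build_queries(phrase_feats, object_tokens)  # 2 + 3 = 5 queries
masks = predict_masks(phrase_feats, pixel_feats)
```

In this toy setup the first phrase selects only the top-left pixel and the second only the top-right one. The object tokens take part in the query set but do not produce output masks here, mirroring the caption's statement that masks are predicted from phrase and image features.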