Vision-Aware Text Features in Referring Image Segmentation: From Object Understanding to Context Understanding

Hai Nguyen-Truong, E-Ro Nguyen, Tuan-Anh Vu, Minh-Triet Tran, Binh-Son Hua, Sai-Kit Yeung

TL;DR

This paper proposes a framework, inspired by human cognitive processes, that uses Vision-Aware Text Features to emphasize object and context comprehension. It reports significant performance improvements on three benchmark datasets: RefCOCO, RefCOCO+, and G-Ref.

Abstract

Referring image segmentation is a challenging task that involves generating pixel-wise segmentation masks based on natural language descriptions. The complexity of this task increases with the intricacy of the sentences provided. Existing methods have relied mostly on visual features to generate the segmentation masks while treating text features as supporting components. However, this under-utilization of text understanding limits the model's capability to fully comprehend the given expressions. In this work, we propose a novel framework that specifically emphasizes object and context comprehension inspired by human cognitive processes through Vision-Aware Text Features. Firstly, we introduce a CLIP Prior module to localize the main object of interest and embed the object heatmap into the query initialization process. Secondly, we propose a combination of two components: Contextual Multimodal Decoder and Meaning Consistency Constraint, to further enhance the coherent and consistent interpretation of language cues with the contextual understanding obtained from the image. Our method achieves significant performance improvements on three benchmark datasets: RefCOCO, RefCOCO+, and G-Ref. Project page: https://vatex.hkustvgd.com/
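
The abstract describes localizing the main object with a heatmap and embedding that heatmap into query initialization. The sketch below illustrates one way such a prior could work, and it is not the authors' implementation. It assumes that patch-level image embeddings and a sentence embedding already live in a shared space (as CLIP-style encoders provide). It scores each patch by cosine similarity and pools the patches with those scores to form a single initial query. The function names, the min-max rescaling, and the weighted-average pooling are all illustrative assumptions.

```python
import numpy as np


def similarity_heatmap(patch_embeds, text_embed, eps=1e-8):
    """Cosine similarity between every patch embedding and one sentence embedding.

    patch_embeds: (H, W, D) array of hypothetical patch-level image embeddings.
    text_embed:   (D,) array, a hypothetical sentence embedding in the same space.
    Returns an (H, W) heatmap rescaled to roughly [0, 1].
    """
    patches = patch_embeds / (np.linalg.norm(patch_embeds, axis=-1, keepdims=True) + eps)
    text = text_embed / (np.linalg.norm(text_embed) + eps)
    sim = patches @ text  # (H, W) cosine similarities
    lo, hi = sim.min(), sim.max()
    return (sim - lo) / (hi - lo + eps)  # min-max rescaling (an assumption)


def heatmap_weighted_query(patch_embeds, heatmap, eps=1e-8):
    """Pool patch embeddings with heatmap weights into one (D,) query vector."""
    weights = heatmap / (heatmap.sum() + eps)
    return np.tensordot(weights, patch_embeds, axes=([0, 1], [0, 1]))


# Toy usage with random stand-ins for real encoder outputs.
rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 4, 8))
text = patches[1, 2].copy()  # pretend the expression matches patch (1, 2)
heat = similarity_heatmap(patches, text)
query = heatmap_weighted_query(patches, heat)
```

In this toy setup the heatmap peaks at the patch whose embedding equals the text embedding, and the pooled query leans toward that region. A real system would take the embeddings from pretrained image and text encoders rather than random arrays.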

Paper Structure

This paper contains 33 sections, 8 equations, 13 figures, 18 tables.

Figures (13)

  • Figure 1: Qualitative comparison between LAVT and our method. Yellow boxes indicate incorrect segmentation results. Both object understanding and context understanding are required to handle complex and ambiguous language expressions.
  • Figure 2: The overall architecture of VATEX, which processes input images and language expressions through two concurrent pathways. Initially, the CLIP Prior module generates object queries, while simultaneously, traditional Visual and Text Encoders create multiscale visual feature maps and word-level text features. These visual and text features are passed into the Contextual Multimodal Decoder to enable multimodal interactions, yielding vision-aware text features and text-enhanced visual features. We then harness vision-aware text features to ensure semantic consistency across varied textual descriptions that reference the same object by employing sentence-level contrastive learning, as described in the Meaning Consistency Constraint section. On the other hand, the text-enhanced visual features and the object queries generated by the CLIP Prior are refined through a Masked-attention Transformer Decoder to produce the final output segmentation masks.
  • Figure 3: Our CLIP Prior exploits the alignment of CLIP-Image and CLIP-Text embeddings for better query initialization. Best viewed with zoom.
  • Figure 4: Illustration of the Meaning Consistency Constraint. Vision-aware text embeddings of different expressions are passed through a contrastive learning module in sentence-level feature space. Embeddings referring to the same object are pulled closer together, while embeddings referring to other objects are pushed apart. Best viewed in color.
  • Figure 5: Results on RefCOCO(+/g) datasets. We compare our results with CRIS and LAVT. Our method excels at segmenting objects in complex scenarios, such as distinguishing between similar objects and localizing specific instances within a scene. The last two columns of the results show failure cases. Best viewed in color.
  • ...and 8 more figures
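
The Figure 4 caption describes sentence-level contrastive learning in which embeddings of expressions referring to the same object are pulled together and the rest are pushed apart. The sketch below is a generic supervised contrastive (InfoNCE-style) loss over sentence embeddings. It is offered only as an illustration of that idea, since the exact loss used in the paper is not given in this listing. The function name, the temperature value, and the use of integer object ids as labels are assumptions.

```python
import numpy as np


def meaning_consistency_loss(embeds, object_ids, temperature=0.1):
    """Generic supervised contrastive loss over sentence-level embeddings.

    embeds:     (N, D) hypothetical vision-aware sentence embeddings.
    object_ids: (N,) integer id of the object each expression refers to.
    Expressions sharing an id are positives; all others are negatives.
    """
    embeds = np.asarray(embeds, dtype=np.float64)
    object_ids = np.asarray(object_ids)
    n = len(embeds)
    self_mask = np.eye(n, dtype=bool)
    pos_mask = (object_ids[:, None] == object_ids[None, :]) & ~self_mask
    pos_counts = pos_mask.sum(axis=1)
    has_pos = pos_counts > 0
    if not has_pos.any():
        return 0.0  # no pair of expressions shares an object

    z = embeds / np.linalg.norm(embeds, axis=1, keepdims=True)
    logits = z @ z.T / temperature
    logits = np.where(self_mask, -np.inf, logits)  # never contrast with itself

    # Numerically stable log-softmax over the other sentences in the batch.
    row_max = logits.max(axis=1, keepdims=True)
    shifted = logits - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Average log-probability of the positives for each anchor that has any.
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[has_pos] / pos_counts[has_pos]).mean())


# Toy usage: two expressions per object, embeddings clustered by object.
embeds = np.array([[1.0, 0.0], [1.0, 0.05], [0.0, 1.0], [0.05, 1.0]])
loss_consistent = meaning_consistency_loss(embeds, [0, 0, 1, 1])
loss_inconsistent = meaning_consistency_loss(embeds, [0, 1, 0, 1])
```

When the embeddings already cluster by referred object, the loss is small. Assigning the same embeddings to mismatched object ids yields a much larger loss, and that gap is the signal such a constraint would use to push paraphrases of the same referent together.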