Instance-Aware Generalized Referring Expression Segmentation

E-Ro Nguyen, Hieu Le, Dimitris Samaras, Michael Ryoo

TL;DR

InstAlign incorporates object-level reasoning into the segmentation process and advances state-of-the-art GRES performance on the gRefCOCO and Ref-ZOM benchmarks.

Abstract

Recent works on Generalized Referring Expression Segmentation (GRES) struggle to handle complex expressions that refer to multiple distinct objects. This is because these methods typically employ end-to-end foreground-background segmentation and lack a mechanism to explicitly differentiate object instances and associate them with the text query. To address this, we propose InstAlign, a method that incorporates object-level reasoning into the segmentation process. Our model leverages both text and image inputs to extract a set of object-level tokens that capture both the semantic information in the input prompt and the objects within the image. By modeling the text-object alignment via instance-level supervision, each token uniquely represents an object segment in the image, while also aligning with relevant semantic information from the text. Extensive experiments on the gRefCOCO and Ref-ZOM benchmarks demonstrate that our method significantly advances state-of-the-art performance, setting a new standard for precise and flexible GRES.

Paper Structure

This paper contains 28 sections, 11 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: (a) Previous GRES methods typically output a single foreground mask in an end-to-end manner and struggle with complex cases involving multiple referred object instances. (b) In contrast, our proposed method automatically localizes the relevant object instances associated with different parts of the input prompt before aggregating them to produce the final mask.
  • Figure 2: Overview of InstAlign. Our proposed method identifies object queries that produce only instance masks of objects specified in the input prompt. To achieve this, we begin with a set of initial object queries and progressively refine them, utilizing both image and text features to associate each query with a targeted object instance in the image as well as a phrase extracted from the input text.
  • Figure 3: Phrase-Object Transformer. We employ an Object-Text Cross Attention layer whose bidirectional attention mechanism allows both the text features and the object queries to be transformed based on information from the other side.
  • Figure 4: Phrase-Object Alignment Loss. Given an object query, we first compute a text feature embedding representing a text phrase that best aligns with the query (highlighted in red). Then, our phrase-object alignment loss penalizes the cosine difference between the object query and this phrase feature.
  • Figure 5: Adaptive Instance Aggregation. We use $\sigma$ for the PReLU activation.
  • ...and 2 more figures
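
The Phrase-Object Alignment Loss described in the Figure 4 caption can be illustrated with a minimal sketch. The caption only states that a phrase feature best aligned with an object query is computed and that the cosine difference between the two is penalized. The details below are assumptions made for illustration and are not taken from the paper: the phrase feature is formed as a softmax-weighted sum of word features (weights from dot-product scores), and the penalty is written as 1 minus the cosine similarity. All function names are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + eps)


def phrase_embedding(query, word_features):
    """Soft phrase feature for one object query.

    Assumption: each word is weighted by softmax(query . word), and the
    phrase feature is the weighted sum of the word features.
    """
    weights = softmax([dot(query, w) for w in word_features])
    dim = len(query)
    return [
        sum(wt * w[i] for wt, w in zip(weights, word_features))
        for i in range(dim)
    ]


def phrase_object_alignment_loss(query, word_features):
    """Assumed form of the penalty: 1 - cos(query, phrase feature)."""
    phrase = phrase_embedding(query, word_features)
    return 1.0 - cosine_similarity(query, phrase)


# Toy example: one 2-D object query and two word features.
query = [1.0, 0.0]
aligned_words = [[2.0, 0.0], [2.0, 0.0]]      # same direction as the query
mixed_words = [[1.0, 0.0], [0.0, 1.0]]        # one matching, one orthogonal
orthogonal_words = [[0.0, 1.0], [0.0, 1.0]]   # orthogonal to the query

loss_aligned = phrase_object_alignment_loss(query, aligned_words)
loss_mixed = phrase_object_alignment_loss(query, mixed_words)
loss_orthogonal = phrase_object_alignment_loss(query, orthogonal_words)
```

Under these assumptions, the loss is close to zero when the words point in the same direction as the object query and grows as the best-matching phrase drifts away from it, which matches the intent stated in the caption.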