Bring Adaptive Binding Prototypes to Generalized Referring Expression Segmentation

Weize Li, Zhicheng Zhao, Haochen Bai, Fei Su

TL;DR

This work targets Generalized Referring Expression Segmentation (GRES), where an expression may refer to multiple objects or to none at all. It introduces the Model with Adaptive Binding Prototypes (MABP), which combines region-based queries produced by a region-aware query generator, a mixed-modal decoder for iterative multimodal reasoning, and a regional supervision head whose main and no-target branches adapt prototypes to region patches. MABP reports state-of-the-art results on gRefCOCO, RefCOCO+, and G-Ref, and competitive results on RefCOCO. Ablations and visualizations illustrate the region-prototype binding and the language-guided attention. The method improves robustness to complex referents and shows potential for video extensions.

Abstract

Referring Expression Segmentation (RES) has attracted rising attention, aiming to identify and segment objects based on natural language expressions. While substantial progress has been made in RES, the emergence of Generalized Referring Expression Segmentation (GRES) introduces new challenges by allowing expressions to describe multiple objects or to lack a specific object reference. Existing RES methods usually rely on sophisticated encoder-decoder and feature fusion modules, and struggle to generate class prototypes that match each instance individually when confronted with the complex referents and binary labels of GRES. In this paper, reevaluating the differences between RES and GRES, we propose a novel Model with Adaptive Binding Prototypes (MABP) that adaptively binds queries to object features in the corresponding region. It enables different query vectors to match instances of different categories or different parts of the same instance, significantly expanding the decoder's flexibility, dispersing global pressure across all queries, and easing the demands on the encoder. Experimental results demonstrate that MABP significantly outperforms state-of-the-art methods on all three splits of the gRefCOCO dataset. Meanwhile, MABP also surpasses state-of-the-art methods on the RefCOCO+ and G-Ref datasets, and achieves very competitive results on RefCOCO. Code is available at https://github.com/buptLwz/MABP
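
The idea of binding each query to the object features of its own region can be illustrated with a small sketch. This is not the released MABP code: the function name `region_bound_mask_logits`, the uniform grid partition, and the plain dot-product scoring are all assumptions made for illustration, and the main and no-target branches of the actual head are not reproduced here.

```python
import numpy as np


def region_bound_mask_logits(features, queries, grid):
    """Score every pixel against the query bound to its region.

    Hypothetical helper (not from the paper's code base).
    features: (C, H, W) visual feature map
    queries:  (gh * gw, C), one prototype vector per region, row-major
    grid:     (gh, gw) number of regions along height and width
    Returns a (H, W) array of mask logits.
    """
    c, h, w = features.shape
    gh, gw = grid
    if h % gh != 0 or w % gw != 0:
        raise ValueError("feature map must divide evenly into the grid")
    if queries.shape != (gh * gw, c):
        raise ValueError("expected one C-dimensional query per region")

    ph, pw = h // gh, w // gw  # patch height and width
    logits = np.empty((h, w), dtype=np.float64)
    for r in range(gh):
        for s in range(gw):
            query = queries[r * gw + s]  # prototype for region (r, s)
            rows = slice(r * ph, (r + 1) * ph)
            cols = slice(s * pw, (s + 1) * pw)
            patch = features[:, rows, cols]  # (C, ph, pw)
            # Dot product between the prototype and every pixel in its patch.
            logits[rows, cols] = np.einsum("c,chw->hw", query, patch)
    return logits


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 12, 12))
    protos = rng.normal(size=(9, 8))  # 3 x 3 grid of region prototypes
    print(region_bound_mask_logits(feats, protos, (3, 3)).shape)
```

Because each query only scores pixels inside its own patch, a loss computed per region constrains that query alone, which is the sense in which the pressure is dispersed across queries rather than carried by a single global classifier.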

Paper Structure

This paper contains 18 sections, 5 equations, 8 figures, and 7 tables.

Figures (8)

  • Figure 1: RES vs. GRES. The classic RES is designed to handle expressions that specify a single target object. In contrast, GRES extends this capability by supporting expressions that indicate an arbitrary number of target objects. For example, GRES accommodates multi-target expressions such as (b) and (c), as well as expressions indicating no target, as shown in (d). Notably, some multi-target expressions in GRES may even describe instances belonging to different classes, such as (c).
  • Figure 2: Comparison between the proposed adaptive binding prototypes and previous methods. (a) shows the naive pixelwise classification approach commonly used in segmentation, exemplified by a linear layer serving as the classification head. (b) shows ReLA's mask head (liu2023gres), which uses downsampled ground truths (GTs) as weights to aggregate the prediction masks generated by multiple queries. (c) shows the proposed adaptive binding prototype method. We divide the feature map into various regions and compute the loss separately, thereby constraining the queries to become more learnable class prototypes compared with the above two approaches.
  • Figure 3: The overall architecture of the proposed MABP. Initially, we utilize a feature extractor to obtain the linguistic features and visual features. The linguistic features are then combined with learnable region embeddings to generate region-text-specific queries via a query generator. Next, a set of mixed modal decoders (MMDs) is employed so that these queries interact gradually with the visual features for reasoning. Finally, the decoded queries, along with the visual and linguistic features, are fed into the Regional Supervision Head (RSH) to obtain prediction masks and no-target indicators. (b) and (c) illustrate the two branches of the RSH, while (d) shows the detailed structure of the MMD.
  • Figure 4: The structure of the proposed query generator. Unlike traditional random initialization, our initialization query first undergoes cross-attention processing with linguistic features. For example, in the description "right cute dog and guy in blue jacket", the dog is obviously on the right side of the guy, and the guy is on the left. Therefore, when our query generator is used, the query in the left region will integrate more information about the guy, whereas the query on the right will focus on information about the dog.
  • Figure 5: Visualizations of the attention maps for the third-layer cross-attention module in the decoder. We input the same no-target sample into both the No. 2 model in the ablation table (tab:ablation) and our model, visualizing the cross-attention matrices of the decoder's third layer, i.e., the decoding module before the first mask head. As our mixed modal decoder incorporates linguistic features as placeholders, our model can learn a more easily interpretable non-uniform attention map, achieving better recognition for no-target samples.
  • ...and 3 more figures
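
The query generator described in the Figure 4 caption, where region embeddings attend to linguistic features before decoding, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the name `generate_region_queries` is hypothetical, attention is single-head with no learned projections, and the residual connection is an assumption.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def generate_region_queries(region_emb, text_feats):
    """Initialize region queries by cross-attending to word features.

    Hypothetical helper (not from the paper's code base).
    region_emb: (R, D) learnable embedding per image region
    text_feats: (T, D) feature vector per word token
    Returns (queries, attn) with shapes (R, D) and (R, T).
    """
    dim = region_emb.shape[-1]
    # Scaled dot-product scores between each region and each word.
    scores = region_emb @ text_feats.T / np.sqrt(dim)
    attn = softmax(scores, axis=-1)
    # Assumed residual: keep the region identity, add attended text content.
    queries = region_emb + attn @ text_feats
    return queries, attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    regions = rng.normal(size=(4, 16))  # e.g. a 2 x 2 grid of regions
    words = rng.normal(size=(7, 16))    # e.g. a 7-token expression
    queries, attn = generate_region_queries(regions, words)
    print(queries.shape, attn.shape)
```

In this sketch, a region whose embedding aligns with the features of "guy" puts more attention weight on that token than on "dog", which mirrors the left/right behaviour described in the caption.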