Mask Grounding for Referring Image Segmentation

Yong Xien Chng, Henry Zheng, Yizeng Han, Xuchong Qiu, Gao Huang

TL;DR

The paper tackles the modality gap in Referring Image Segmentation by introducing Mask Grounding, an auxiliary task that teaches fine-grained word–object associations through masked token prediction grounded in visual and segmentation cues. It further improves cross-modal alignment with a Cross-modal Alignment Module (CAM) and a holistic Cross-modal Alignment Loss (CAL); together these form MagNet. Results on RefCOCO, RefCOCO+, and G-Ref show state-of-the-art performance, and ablations validate both the individual and the combined contributions. The approach is modular: Mask Grounding can be added to prior RIS methods, and it may apply to other multi-modal dense prediction tasks.

Abstract

Referring Image Segmentation (RIS) is a challenging task that requires an algorithm to segment objects referred to by free-form language expressions. Despite significant progress in recent years, most state-of-the-art (SOTA) methods still suffer from a considerable language-image modality gap at the pixel and word level. These methods generally 1) rely on sentence-level language features for language-image alignment and 2) lack explicit training supervision for fine-grained visual grounding. Consequently, they exhibit weak object-level correspondence between visual and language features. Without well-grounded features, prior methods struggle to understand complex expressions that require strong reasoning over relationships among multiple objects, especially when dealing with rarely used or ambiguous clauses. To tackle this challenge, we introduce a novel Mask Grounding auxiliary task that significantly improves visual grounding within language features by explicitly teaching the model to learn fine-grained correspondence between masked textual tokens and their matching visual objects. Mask Grounding can be applied directly to prior RIS methods and consistently brings improvements. Furthermore, to holistically address the modality gap, we also design a cross-modal alignment loss and an accompanying alignment module. These additions work synergistically with Mask Grounding. With all these techniques, our comprehensive approach culminates in MagNet (Mask-grounded Network), an architecture that significantly outperforms prior art on three key benchmarks (RefCOCO, RefCOCO+ and G-Ref), demonstrating our method's effectiveness in addressing current limitations of RIS algorithms. Our code and pre-trained weights will be released.

Paper Structure (16 sections, 4 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Importance of Fine-grained Visual Grounding for RIS. Most RIS algorithms lack well-grounded text features. As a result, they struggle in difficult cases such as those illustrated in (a) and (b). Red masks are predictions of LAVT, one of the recent SOTA RIS methods. Yellow dotted boxes are the ground truths.
  • Figure 2: (a) Current SOTA RIS methods mainly focus on designing and improving multi-modal alignment modules and/or alignment losses. These methods generally 1) do not have explicit training supervision for fine-grained visual grounding and 2) use sentence-level language features or image/pixel-level image features for alignment. As a result, their language features lack precise visual-textual object correspondence. (b) Our proposed Mask Grounding remedies this problem by explicitly teaching our model to learn fine-grained correspondence between masked word tokens and their matching visual objects through an auxiliary alignment task.
  • Figure 3: Overview of Mask Grounding. This task enriches fine-grained visual grounding in language features by guiding the model to learn detailed textual-visual associations. To perform this task, we first use an MLP-based Mask Encoder to encode the center coordinates of segmentation masks. Then, we randomly mask textual tokens in the language inputs before extracting their features. Finally, we pass the encoded language, image and mask features to a Transformer-based Masked Token Predictor to perform masked token prediction.
  • Figure 4: Cross-modal Alignment Module. This module enables bidirectional language-image interaction and addresses granularity mismatches between language and image features, thereby enhancing segmentation accuracy for RIS. $\text{X-MHA}$ denotes bi-directional cross-modal multi-head attention. $\textbf{P}_i$ and $\textbf{P}_{i+1}$ denote input and output image features, whereas $\textbf{O}_i$ and $\textbf{O}_{i+1}$ denote input and output language features. Up denotes upsampling.
  • Figure 5: Visualization of MagNet's predictions. Compared to LAVT, one of the state-of-the-art methods, our method performs much better in various complex scenarios, suggesting a strong capability to reason about complex visual-object relationships.
  • ...and 2 more figures
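
The input preparation described in the Figure 3 caption (encoding the center coordinates of a segmentation mask, and randomly masking textual tokens) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `mask_center` and `mask_tokens`, the `[MASK]` placeholder, the masking probability of 0.15, and the normalization of coordinates by image width and height are all hypothetical choices that do not come from the source. The MLP-based Mask Encoder and the Transformer-based Masked Token Predictor are omitted.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder; the paper's tokenizer may use another symbol


def mask_center(mask):
    """Return the (x, y) centroid of a binary mask, divided by width and height.

    `mask` is a list of rows, each a list of 0/1 values. The normalization is an
    assumption made for this sketch.
    """
    height, width = len(mask), len(mask[0])
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) for v in row if v]
    if not xs:
        raise ValueError("mask has no foreground pixels")
    return (sum(xs) / len(xs) / width, sum(ys) / len(ys) / height)


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Randomly replace tokens with MASK_TOKEN.

    Returns the masked token list and a dict mapping each masked position to
    its original token, which would serve as the prediction target.
    """
    rng = rng or random.Random()
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


if __name__ == "__main__":
    segmentation = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0],
    ]
    expression = "the man in the red shirt".split()
    masked, targets = mask_tokens(expression, mask_prob=0.3, rng=random.Random(0))
    print(mask_center(segmentation), masked, targets)
```

In a full pipeline, the center coordinates would feed the mask encoder, and the predictor would be trained to recover the entries of `targets` from the masked language features together with the image and mask features.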