RISAM: Referring Image Segmentation via Mutual-Aware Attention Features

Mengxi Zhang, Yiming Liu, Xiangjun Yin, Huanjing Yue, Jingyu Yang

TL;DR

This work addresses referring image segmentation by bridging large vision-language foundation models and RIS. It introduces RISAM, a cross-modal architecture that uses mutual-aware attention with a Vision-Guided and a Language-Guided branch, plus a Mutual-Aware Mask Decoder and a multi-modal query token to enforce language-consistent masks. A feature enhancement module and a parameter-efficient fine-tuning strategy enable transferring knowledge from SAM while preserving encoder generalization. Empirical results on RefCOCO, RefCOCO+, G-Ref, PhraseCut, and gRefCOCO demonstrate state-of-the-art performance, strong generalization, and effective multi-object RIS capabilities, underscoring the practical value of integrating SAM into RIS via targeted cross-modal design.

Abstract

Referring image segmentation (RIS) aims to segment a particular region based on a language expression prompt. Existing methods incorporate linguistic features into visual features and obtain multi-modal features for mask decoding. However, these methods may segment the visually salient entity instead of the correct referring region, as the multi-modal features are dominated by the abundant visual context. In this paper, we propose RISAM, a referring image segmentation method that leverages the Segment Anything Model (SAM) and introduces a mutual-aware attention mechanism to enhance the cross-modal fusion via two parallel branches. Specifically, our mutual-aware attention mechanism consists of Vision-Guided Attention and Language-Guided Attention, which bidirectionally model the relationship between visual and linguistic features. Correspondingly, we design a Mask Decoder to enable explicit linguistic guidance for more consistent segmentation with the language expression. To this end, a multi-modal query token is proposed to integrate linguistic information and interact with visual information simultaneously. Extensive experiments on three benchmark datasets show that our method outperforms the state-of-the-art RIS methods. Our code will be publicly available.

Paper Structure

This paper contains 29 sections, 12 equations, 10 figures, and 11 tables.

Figures (10)

  • Figure 1: Illustration of Vision-Guided Attention (a) and Language-Guided Attention (b). For Vision-Guided Attention, we list the three most informative words for the image region marked by a red pentangle. For Language-Guided Attention, the regions most relevant to each word, e.g., 'cap', 'pants', and 'racket', are shown in different colors. Previous RIS methods only use Vision-Guided Attention to fuse visual and linguistic features; none of them introduces Language-Guided Attention to generate vision-aware linguistic features and use them in the mask decoder.
  • Figure 2: Segmentation masks generated by our method (c) and other methods, including directly using SAM (d), CRIS (e), and ReLA (f). Directly using SAM means training the SAM decoder only.
  • Figure 3: Overview of RISAM. For an input image, the image encoder extracts shallow/middle/deep visual features ($F_{v_1},F_{v_2},F_{v_3}$). For the language expression, the text encoder generates linguistic features ($F_l$). These features are then passed to the Feature Enhancement module, which produces enhanced visual features ($F_{v}$). Subsequently, Mutual-Aware Attention (MA) blocks take the enhanced visual features and the linguistic features as inputs and produce mutual-aware attention features. Finally, the Mutual-Aware Mask Decoder uses a multi-modal query token and the mutual-aware attention features to produce the final segmentation mask. (Pos. Enc. denotes the position encodings of the linguistic features.)
  • Figure 4: The architecture of the Mutual-Aware Attention block. The left part (a) is the Vision-Guided Attention branch, and the right part (b) is the Language-Guided Attention branch. $A$ and $A^{\top}$ denote the attention weights. $\otimes$ denotes matrix multiplication.
  • Figure 5: The architecture of the Mutual-Aware Mask Decoder. The decoder receives vision-aware linguistic features and language-aware visual features as inputs, where the former act as extra linguistic guidance. Additionally, we introduce a multi-modal query token that aggregates linguistic information and interacts with visual features, which helps produce a high-quality segmentation mask.
  • ...and 5 more figures
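
The two branches described in the Figure 1 and Figure 4 captions can be illustrated with a small NumPy sketch. This is not the authors' implementation: the projection matrices `W_v` and `W_l`, the scaling by the square root of the projected dimension, and the row-wise softmax are assumptions chosen to mirror standard cross-attention, and the function and variable names are hypothetical. The only parts taken from the captions are that a single attention map $A$ is shared by both branches (as $A$ and $A^{\top}$) and that the outputs are language-aware visual features and vision-aware linguistic features.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mutual_aware_attention(F_v, F_l, W_v, W_l):
    """Illustrative mutual-aware attention (assumed form, not the paper's code).

    F_v: (N_v, C) visual features, one row per image region.
    F_l: (N_l, C) linguistic features, one row per word.
    W_v, W_l: (C, d) hypothetical projections into a shared space.
    """
    Z_v = F_v @ W_v  # (N_v, d) projected visual features
    Z_l = F_l @ W_l  # (N_l, d) projected linguistic features

    # Shared region-word affinity map A; scaling is an assumption.
    A = Z_v @ Z_l.T / np.sqrt(Z_v.shape[1])  # (N_v, N_l)

    # Vision-Guided branch: each region weighs the words (rows of A).
    lang_aware_visual = softmax(A, axis=1) @ Z_l  # (N_v, d)

    # Language-Guided branch: each word weighs the regions (rows of A^T).
    vision_aware_ling = softmax(A.T, axis=1) @ Z_v  # (N_l, d)

    return lang_aware_visual, vision_aware_ling, A


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    F_v = rng.normal(size=(6, 8))  # 6 regions, 8 channels
    F_l = rng.normal(size=(3, 8))  # 3 words, 8 channels
    W_v = rng.normal(size=(8, 4))
    W_l = rng.normal(size=(8, 4))
    lav, val, A = mutual_aware_attention(F_v, F_l, W_v, W_l)
    print(lav.shape, val.shape, A.shape)
```

With 6 regions and 3 words projected to 4 dimensions, the outputs have shapes (6, 4), (3, 4), and (6, 3). Reusing one affinity map for both directions keeps the two branches consistent: the same region-word scores decide which words inform a region and which regions inform a word.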