SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation

Zhenjie Mao, Yuhuan Yang, Chaofan Ma, Dongsheng Jiang, Jiangchao Yao, Ya Zhang, Yanfeng Wang

TL;DR

SaFiRe is a novel framework that mimics the human two-phase cognitive process--first forming a global understanding, then refining it through detail-oriented inspection--and outperforms state-of-the-art baselines on both standard and proposed datasets.

Abstract

Referring Image Segmentation (RIS) aims to segment the target object in an image given a natural language expression. While recent methods leverage pre-trained vision backbones and larger training corpora to achieve impressive results, they predominantly focus on simple expressions--short, clear noun phrases like "red car" or "left girl". This simplification often reduces RIS to a keyword/concept matching problem, limiting the model's ability to handle referential ambiguity in expressions. In this work, we identify two challenging real-world scenarios: object-distracting expressions, which involve multiple entities with contextual cues, and category-implicit expressions, where the object class is not explicitly stated. To address these challenges, we propose a novel framework, SaFiRe, which mimics the human two-phase cognitive process--first forming a global understanding, then refining it through detail-oriented inspection. This is naturally supported by Mamba's scan-then-update property, which aligns with our phased design and enables efficient multi-cycle refinement with linear complexity. We further introduce aRefCOCO, a new benchmark designed to evaluate RIS models under ambiguous referring expressions. Extensive experiments on both standard and proposed datasets demonstrate the superiority of SaFiRe over state-of-the-art baselines.
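To make the two-phase idea concrete, the following is a minimal toy sketch, not the authors' implementation. It replaces Mamba with a plain exponential-moving-average recurrence, and every name in it (`scan`, `saccade`, `fixation`, `sf_layers`, `decay`, `window`, `cycles`) is a hypothetical chosen for illustration. The saccade step reads the text and then scans all visual tokens once; the fixation step re-reads the text before refining each local window of visual tokens; the two steps are repeated for several cycles.

```python
from typing import List, Tuple

Vector = List[float]


def scan(tokens: List[Vector], state: Vector, decay: float = 0.9) -> Tuple[List[Vector], Vector]:
    """Toy linear recurrence h <- decay*h + (1-decay)*x (a stand-in, NOT Mamba).

    One pass over the tokens, so the cost is linear in len(tokens).
    """
    outputs = []
    for x in tokens:
        state = [decay * h + (1.0 - decay) * xi for h, xi in zip(state, x)]
        outputs.append(state)
    return outputs, state


def saccade(visual: List[Vector], text: List[Vector], dim: int) -> List[Vector]:
    """Global phase (assumed form): read the text, then scan every visual token."""
    _, h = scan(text, [0.0] * dim)
    out, _ = scan(visual, h)
    return out


def fixation(visual: List[Vector], text: List[Vector], dim: int, window: int) -> List[Vector]:
    """Local phase (assumed form): re-read the text before refining each window."""
    refined: List[Vector] = []
    for start in range(0, len(visual), window):
        _, h = scan(text, [0.0] * dim)
        out, _ = scan(visual[start:start + window], h)
        refined.extend(out)
    return refined


def sf_layers(visual: List[Vector], text: List[Vector], dim: int, window: int, cycles: int) -> List[Vector]:
    """Reiterate saccade followed by fixation for a fixed number of cycles."""
    for _ in range(cycles):
        visual = saccade(visual, text, dim)
        visual = fixation(visual, text, dim, window)
    return visual


# Hypothetical toy inputs: 8 visual tokens and 3 text tokens, each 2-dimensional.
visual_tokens = [[1.0, 0.0] for _ in range(8)]
text_tokens = [[0.0, 1.0] for _ in range(3)]
refined_tokens = sf_layers(visual_tokens, text_tokens, dim=2, window=4, cycles=2)
```

In this sketch each phase is a single loop over tokens, so for a fixed text length and window size the cost of one cycle grows linearly with the number of visual tokens. The real model's selective state-space updates, feature dimensions, and fusion details are not reproduced here.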

Paper Structure

This paper contains 27 sections, 13 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Referential Ambiguity: One Object, Divergent Attention. Attention maps under three types of referring expressions, all targeting the same object. (a) A simple expression yields accurate focus. (b) An object-distracting expression misguides attention to irrelevant regions. (c) A category-implicit expression leads to dispersed attention. This highlights the challenge that referential ambiguity poses for "keyword/concept matching" methods and motivates our saccade-fixation framework.
  • Figure 2: Overview of the Architecture. Each SFLayer consists of a Saccade operation and a Fixation operation. The Saccade operation corresponds to the phase of global semantic understanding. It enables the model to rapidly scan both visual and textual inputs, establishing a coarse-level alignment between the two modalities. The Fixation operation mirrors the cross-modal refinement phase. It allows the model to attend to specific local visual regions while re-examining the textual input, facilitating the extraction of fine-grained, task-relevant information.
  • Figure 3: Performance Comparison of Different Fixation Window Sizes.
  • Figure 3: Layer-Wise Visual Feature Maps. Left→right corresponds to shallow→deep layers. The full model shows balanced activation. Without Fixation, local detail is missing; without Saccade, global focus weakens. The differing activation patterns reflect their complementary roles.
  • Figure 4: Visualization Results for aRefCOCO. Compared to the other two methods, ours is more capable of comprehending ambiguous referring descriptions.
  • ...and 4 more figures