Spatial Semantic Recurrent Mining for Referring Image Segmentation

Jiaxing Yang, Lihe Zhang, Jiayu Sun, Huchuan Lu

TL;DR

The paper tackles Referring Image Segmentation (RIS) by introducing Spatial Semantic Recurrent Mining (S2RM), a cross-modality fusion framework that distributes language features, co-parses semantics across row and column slices, and balances the parsed semantics. It is complemented by the Cross-scale Abstract Semantic Guided Decoder (CASG), which fuses multiscale vision features with language cues to refine the referent mask. The approach is built on Swin Transformer visual features and BERT language encoding, and it is reported to perform favorably against state-of-the-art methods on four challenging RIS benchmarks, with ablations supporting the contribution of each component. The work offers a comparatively low-cost strategy for cross-modal reasoning in RIS, with practical implications for human-robot interaction and image understanding.

Abstract

Referring Image Segmentation (RIS) requires language and appearance semantics to understand each other well, and the need becomes especially acute in hard cases. To this end, existing works tend to resort to various trans-representing mechanisms that directly feed language semantics forward along the main RGB branch. This, however, leaves the referent distribution weakly mined in space and contaminates the channels with non-referent semantics. In this paper, we propose Spatial Semantic Recurrent Mining (S$^{2}$RM) to achieve high-quality cross-modality fusion. It follows a three-step working strategy: distributing the language feature, spatial semantic recurrent coparsing, and parsed-semantic balancing. During fusion, S$^{2}$RM first generates a weakly constrained yet distribution-aware language feature, then bundles the features of each row and column from rotated features of one modality context to recurrently correlate the relevant semantics contained in the feature from the other modality context, and finally uses self-distilled weights to weigh the contributions of the different parsed semantics. Via coparsing, S$^{2}$RM transports information from the near and remote slice layers of the generator context to the current slice layer of the parsed context, which lets it model global relationships in a bidirectional and structured way. In addition, we propose a Cross-scale Abstract Semantic Guided Decoder (CASG) that emphasizes the foreground of the referent and integrates features of different granularity at a comparatively low cost. Extensive experimental results on four challenging datasets show that the proposed method performs favorably against other state-of-the-art algorithms.
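
To make the three-step strategy described in the abstract more concrete, here is a minimal NumPy sketch of its data flow. It is not the authors' implementation and rests on several assumptions: the function names (`distribute_language`, `coparse`, `balance`) are hypothetical, the language feature is distributed over space with plain dot-product attention, each row and column slice is bundled by mean pooling rather than by the recurrent mechanism over rotated features that the paper describes, and the self-distilled weights are replaced by a softmax over the parsed maps.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def distribute_language(vision, words):
    """Step 1 (assumed form): spread word features over the spatial grid.

    vision: (H, W, C) visual feature map, words: (N, C) word features.
    Returns a spatial language feature of shape (H, W, C).
    """
    c = vision.shape[-1]
    attn = softmax(vision @ words.T / np.sqrt(c), axis=-1)  # (H, W, N)
    return attn @ words                                     # (H, W, C)


def coparse(generator, parsed):
    """Step 2 (simplified): row/column slices of one context parse the other.

    Each row and each column of `generator` is bundled into one C-dim
    descriptor by mean pooling (a stand-in for the recurrent bundling),
    then correlated with every position of `parsed`.
    Returns maps of shape (H, W, H + W).
    """
    h, w, c = generator.shape
    row_slices = generator.mean(axis=1)                         # (H, C)
    col_slices = generator.mean(axis=0)                         # (W, C)
    slices = np.concatenate([row_slices, col_slices], axis=0)   # (H+W, C)
    return np.einsum("hwc,kc->hwk", parsed, slices) / np.sqrt(c)


def balance(maps):
    """Step 3 (assumed form): weigh parsed maps by softmax weights."""
    weights = softmax(maps, axis=-1)
    return (weights * maps).sum(axis=-1)                        # (H, W)


rng = np.random.default_rng(0)
vision = rng.standard_normal((6, 8, 16))   # toy visual feature map
words = rng.standard_normal((5, 16))       # toy word features

t_dist = distribute_language(vision, words)
maps_l2v = coparse(t_dist, vision)   # language slices parse vision
maps_v2l = coparse(vision, t_dist)   # vision slices parse language
fused = balance(maps_l2v) + balance(maps_v2l)
print(t_dist.shape, maps_l2v.shape, fused.shape)
```

Calling `coparse` in both directions mirrors the bidirectional modeling mentioned in the abstract: every position of the parsed context receives a response from all row and column slices of the generator context, both near and remote.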

Paper Structure (16 sections, 15 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Architecture overview. On top of Swin and BERT, the proposed S2RM is installed to mine global distribution information in a bidirectional and structured way. The middle green parts (the second step of S2RM) visualize how the generator context $\mathcal{T}^{\rm dist}$ is used to generate content-adaptive slices that correlate the parsed context $\mathcal{V}_{4}$ into column-wise and row-wise maps.
  • Figure 2: Transformations in the Spatial Semantic Recurrent Coparsing step of S2RM, where slice layers from one modality parse semantics from the other modality as four groups of maps.
  • Figure 3: Details of the $i$-th stage of CASG. The sentence-level language feature and the semantic-rich feature from previous decoding stages help the pure vision feature recover effective details of the referent.
  • Figure 4: Visualization results of the proposed techniques on samples selected from the validation set of RefCOCO.
  • Figure 5: Visualization of the proposed method. The samples are selected from the validation set of RefCOCO.
  • ...and 1 more figure