ReMamber: Referring Image Segmentation with Mamba Twister

Yuhuan Yang, Chaofan Ma, Jiangchao Yao, Zhun Zhong, Ya Zhang, Yanfeng Wang

TL;DR

This work tackles the costly cross-modal attention challenge in referring image segmentation by introducing ReMamber, a Mamba-based RIS architecture that uses Mamba Twister blocks to fuse visual and textual information with linear-time complexity. The method explicitly models vision-language interactions, forms a hybrid feature cube via global and local cross-modal cues, and then twists the cube through channel and spatial scans to enhance multi-modal fusion. It demonstrates competitive performance across RefCOCO, RefCOCO+, and G-Ref benchmarks and provides extensive analyses comparing fusion designs, visualizations of attention maps, and ablations of the key components. The approach offers a scalable, efficient pathway for RIS and broader multi-modal understanding, with code release to facilitate adoption and further research.

Abstract

Referring Image Segmentation (RIS) leveraging transformers has achieved great success on the interpretation of complex visual-language tasks. However, the quadratic computation cost makes it resource-consuming in capturing long-range visual-language dependencies. Fortunately, Mamba addresses this with efficient linear complexity in processing. However, directly applying Mamba to multi-modal interactions presents challenges, primarily due to inadequate channel interactions for the effective fusion of multi-modal data. In this paper, we propose ReMamber, a novel RIS architecture that integrates the power of Mamba with a multi-modal Mamba Twister block. The Mamba Twister explicitly models image-text interaction, and fuses textual and visual features through its unique channel and spatial twisting mechanism. We achieve competitive results on three challenging benchmarks with a simple and efficient architecture. Moreover, we conduct thorough analyses of ReMamber and discuss other fusion designs using Mamba. These provide valuable perspectives for future research. The code has been released at: https://github.com/yyh-rain-song/ReMamber.
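To make the quadratic-versus-linear contrast in the abstract concrete, here is a minimal NumPy sketch. It is not ReMamber's or Mamba's implementation: `linear_scan` and `attention_scores` are hypothetical helper names, and the fixed scalars `a`, `b`, `c` are a simplifying assumption (Mamba makes its state-space parameters input-dependent). The point is only that a recurrent scan visits each token once, while pairwise attention scores form an L x L matrix.

```python
import numpy as np

def linear_scan(x, a=0.9, b=1.0, c=1.0):
    """Toy diagonal state-space recurrence over x of shape (L, D).

    h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t
    A single pass over the sequence, so the cost grows linearly with L.
    Fixed scalars a, b, c are an assumption made for illustration only.
    """
    h = np.zeros(x.shape[1])
    y = np.empty_like(x, dtype=float)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]   # update the hidden state with the current token
        y[t] = c * h           # read out the state
    return y

def attention_scores(x):
    """Pairwise dot-product scores: an (L, L) matrix, quadratic in L."""
    return x @ x.T

x = np.ones((4, 2))                 # L = 4 tokens, D = 2 channels
y = linear_scan(x, a=0.5)           # touches each token once
scores = attention_scores(x)        # touches every token pair
print(y.shape, scores.shape)
```

Doubling the sequence length doubles the work in `linear_scan` but quadruples the number of entries in `scores`, which is the cost the abstract refers to.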

Paper Structure (31 sections, 7 equations, 8 figures, 4 tables)


Figures (8)

  • Figure 1: We propose ReMamber, a novel referring segmentation architecture with Mamba Twister. It consists of several Mamba Twister blocks. Each block contains several visual state space (VSS) layers and a Twisting layer. The Twisting layer first calculates the interaction between image and text, and then forms a hybrid feature cube. Finally, it "twists" the feature cube using the Channel Scan and Spatial Scan along each dimension.
  • Figure 2: Overall architecture of ReMamber. The basic building block of ReMamber is the Mamba Twister block, which consists of several visual state space (VSS) layers and a Twisting layer. The Twisting layer first constructs a hybrid feature cube from text, image, and multi-modal features via channel concatenation. Then, it "twists" the cube by Channel Scan and Spatial Scan. We extract intermediate features after each Mamba Twister block and feed them into a flexible decoder for final segmentation.
  • Figure 3: Other multi-modal fusion designs. (a) In-context Conditioning prepends text tokens to the image tokens. (b) Attention-based Conditioning uses a cross-attention mechanism for modality fusion. (c) Norm Adaptation learns a scale and bias for the model's normalization layers.
  • Figure 4: Comparison of cross-attention maps (top) and our local interaction maps (bottom). Although both methods predict the target correctly, the cross-attention maps do not show the correct image-text correlation, whereas ours capture this relationship accurately, indicating that Mamba Twister gradually fuses the two modalities.
  • Figure 5: Data distribution after Channel Scan and Spatial Scan. Image data are shown in red and text data in blue. The Channel Scan tends to pull the different modalities toward the distribution of the textual side. The Spatial Scan reintegrates the previously aligned modalities, distributing them in a manner that reflects their combined influence.
  • ...and 3 more figures
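
The Twisting layer described in the captions for Figures 1 and 2 can be sketched roughly as follows. This is an illustrative NumPy toy under stated assumptions, not the released ReMamber code: `toy_scan` is a plain decaying recurrence standing in for a real Mamba scan, and the way the global cue (mean-pooled text) and local cue (softmax-weighted word features per pixel) are computed is assumed for illustration. Only the overall flow (build a hybrid cube by channel concatenation, then scan along the channel axis and along the spatial axis) follows the captions.

```python
import numpy as np

def toy_scan(seq, decay=0.5):
    """Stand-in for a Mamba scan (assumption): recurrence along axis 0."""
    out = np.empty_like(seq, dtype=float)
    h = np.zeros(seq.shape[1:])
    for t in range(seq.shape[0]):
        h = decay * h + seq[t]
        out[t] = h
    return out

def twist(image, text):
    """image: (H, W, C) visual features; text: (T, C) word features."""
    H, W, C = image.shape
    # Global cue (assumed form): pooled sentence vector broadcast to every pixel.
    global_feat = np.broadcast_to(text.mean(axis=0), (H, W, C))
    # Local cue (assumed form): per-pixel softmax over words, then a
    # weighted sum of word features.
    sim = image.reshape(-1, C) @ text.T              # (H*W, T)
    sim = sim - sim.max(axis=1, keepdims=True)       # numerical stability
    weights = np.exp(sim)
    weights /= weights.sum(axis=1, keepdims=True)
    local_feat = (weights @ text).reshape(H, W, C)
    # Hybrid feature cube via channel concatenation: (H, W, 3C).
    cube = np.concatenate([image, global_feat, local_feat], axis=-1)
    # Channel Scan: treat the channel axis as the sequence.
    seq_c = cube.reshape(H * W, 3 * C).T             # (3C, H*W)
    cube = toy_scan(seq_c).T.reshape(H, W, 3 * C)
    # Spatial Scan: treat the flattened pixel positions as the sequence.
    seq_s = cube.reshape(H * W, 3 * C)               # (H*W, 3C)
    return toy_scan(seq_s).reshape(H, W, 3 * C)

rng = np.random.default_rng(0)
image = rng.normal(size=(2, 3, 4))   # H=2, W=3, C=4
text = rng.normal(size=(5, 4))       # T=5 words
out = twist(image, text)
print(out.shape)
```

The sketch returns the full scanned cube; any projection back to the original channel width, as well as the VSS layers and the decoder, is omitted here.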