ReMamber: Referring Image Segmentation with Mamba Twister
Yuhuan Yang, Chaofan Ma, Jiangchao Yao, Zhun Zhong, Ya Zhang, Yanfeng Wang
TL;DR
This work tackles the cost of cross-modal attention in referring image segmentation (RIS) by introducing ReMamber, a Mamba-based RIS architecture whose Mamba Twister blocks fuse visual and textual information with linear-time complexity. The method explicitly models vision-language interactions: it forms a hybrid feature cube from global and local cross-modal cues, then twists the cube through channel and spatial scans to enhance multi-modal fusion. It demonstrates competitive performance on the RefCOCO, RefCOCO+, and G-Ref benchmarks and provides extensive analyses, including comparisons of fusion designs, visualizations of attention maps, and ablations of the key components. The approach offers a scalable, efficient pathway for RIS and broader multi-modal understanding, and the code is released to facilitate adoption and further research.
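To make the "channel and spatial scans" idea concrete, the sketch below shows only the two traversal orders over a toy feature cube. It is an illustration under stated assumptions, not the authors' implementation: the cube is a plain nested list indexed as `cube[position][channel]`, and the helper names `make_cube`, `channel_scan_order`, and `spatial_scan_order` are hypothetical. No state-space model is applied here; the point is only which axis a sequence model would walk along in each scan.

```python
def make_cube(num_positions, num_channels):
    # Toy cube: the value p * 10 + c encodes spatial position p and channel c.
    # (Hypothetical layout, assumed for illustration only.)
    return [[p * 10 + c for c in range(num_channels)]
            for p in range(num_positions)]


def channel_scan_order(cube):
    # Walk along the channel axis: visit every channel of one position
    # before moving to the next position.
    return [value for row in cube for value in row]


def spatial_scan_order(cube):
    # Walk along the spatial axis: visit every position of one channel
    # before moving to the next channel.
    num_positions = len(cube)
    num_channels = len(cube[0])
    return [cube[p][c] for c in range(num_channels) for p in range(num_positions)]


cube = make_cube(num_positions=2, num_channels=3)
print(channel_scan_order(cube))
print(spatial_scan_order(cube))
```

Both orders visit every element exactly once, so each scan stays linear in the size of the cube; they differ only in which neighbours end up adjacent in the sequence.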
Abstract
Referring Image Segmentation (RIS) leveraging transformers has achieved great success in interpreting complex visual-language tasks. However, their quadratic computation cost makes it resource-consuming to capture long-range visual-language dependencies. Mamba addresses this with efficient linear-complexity processing. Directly applying Mamba to multi-modal interactions is nevertheless challenging, primarily because its channel interactions are inadequate for effective fusion of multi-modal data. In this paper, we propose ReMamber, a novel RIS architecture that integrates the power of Mamba with a multi-modal Mamba Twister block. The Mamba Twister explicitly models image-text interaction and fuses textual and visual features through its unique channel and spatial twisting mechanism. We achieve competitive results on three challenging benchmarks with a simple and efficient architecture. Moreover, we conduct thorough analyses of ReMamber and discuss other fusion designs using Mamba, which provide valuable perspectives for future research. The code has been released at: https://github.com/yyh-rain-song/ReMamber.
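The contrast between quadratic and linear cost can be illustrated with a minimal sketch. It assumes a scalar linear recurrence with fixed, hand-picked coefficients `a`, `b`, `c` (hypothetical values); Mamba itself makes its parameters input-dependent and works on vectors, so this is a simplification and not the ReMamber code. The function names `linear_scan` and `pairwise_interactions` are likewise illustrative.

```python
def linear_scan(xs, a=0.9, b=1.0, c=1.0):
    # Simplified state-space recurrence (assumed scalar, fixed coefficients):
    #   h_t = a * h_{t-1} + b * x_t
    #   y_t = c * h_t
    # One update per token, so the cost grows linearly with len(xs).
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def pairwise_interactions(n):
    # Full attention compares every token with every token: n * n scores.
    return n * n


tokens = [1.0, 0.0, 0.0]
print(linear_scan(tokens, a=0.5))          # three state updates for three tokens
print(pairwise_interactions(len(tokens)))  # nine pairwise scores for three tokens
```

Doubling the sequence length doubles the number of recurrence steps but quadruples the number of pairwise scores, which is the gap the abstract refers to.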
