DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy

Ming Dai; Wenxuan Cheng; Jiang-jiang Liu; Sen Yang; Wenxiao Cai; Yanpeng Sun; Wankou Yang

TL;DR

DeRIS addresses referring image segmentation (RIS) by decoupling perception and cognition into two specialized branches connected through loopback synergy, which iteratively refines both perceptual localization and multimodal understanding. It identifies cognition as the primary bottleneck in RIS and remedies it with progressive cross-modal query exchanges and a non-referent sample conversion (NSC) augmentation that expands non-referent training pairs. The framework achieves state-of-the-art results on RefCOCO/+/g and gRefCOCO, generalizing well to non-referent and multi-referent scenarios while maintaining efficiency. The approach offers practical benefits for accurate and robust image-text grounded segmentation in diverse real-world settings.

Abstract

Referring Image Segmentation (RIS) is a challenging task that aims to segment objects in an image based on natural language expressions. While prior studies have predominantly concentrated on improving vision-language interactions and achieving fine-grained localization, the fundamental bottlenecks in existing RIS frameworks have not been systematically analyzed. To bridge this gap, we propose DeRIS, a novel framework that decomposes RIS into two key components: perception and cognition. This modular decomposition facilitates a systematic analysis of the primary bottlenecks impeding RIS performance. Our findings reveal that the predominant limitation lies not in perceptual deficiencies, but in the insufficient multi-modal cognitive capacity of current models. To mitigate this, we propose a Loopback Synergy mechanism, which enhances the synergy between the perception and cognition modules, thereby enabling precise segmentation while simultaneously making image-text comprehension more robust. Additionally, we analyze and introduce a simple non-referent sample conversion data augmentation to address the long-tail distribution issue related to target existence judgement in general scenarios. Notably, DeRIS demonstrates inherent adaptability to both non-referent and multi-referent scenarios without requiring specialized architectural modifications, enhancing its general applicability. The code and models are available at https://github.com/Dmmm1997/DeRIS.
