DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy

Ming Dai, Wenxuan Cheng, Jiang-jiang Liu, Sen Yang, Wenxiao Cai, Yanpeng Sun, Wankou Yang

TL;DR

DeRIS addresses referring image segmentation (RIS) by decoupling perception and cognition into two specialized branches connected through loopback synergy, which iteratively refines both perceptual localization and multimodal understanding. It identifies cognition as the primary bottleneck in RIS and remedies it with progressive cross-modal query exchanges and a non-referent sample conversion (NSC) augmentation that expands non-referent training pairs. The framework achieves state-of-the-art results on RefCOCO/+/g and gRefCOCO, and generalizes to non-referent and multi-referent scenarios while maintaining efficiency. The approach offers practical benefits for accurate and robust image-text grounded segmentation in diverse real-world settings.

Abstract

Referring Image Segmentation (RIS) is a challenging task that aims to segment objects in an image based on natural language expressions. While prior studies have predominantly concentrated on improving vision-language interactions and achieving fine-grained localization, the fundamental bottlenecks in existing RIS frameworks have not been systematically analyzed. To bridge this gap, we propose DeRIS, a novel framework that decomposes RIS into two key components: perception and cognition. This modular decomposition facilitates a systematic analysis of the primary bottlenecks impeding RIS performance. Our findings reveal that the predominant limitation lies not in perceptual deficiencies, but in the insufficient multi-modal cognitive capacity of current models. To mitigate this, we propose a Loopback Synergy mechanism, which enhances the synergy between the perception and cognition modules, thereby enabling precise segmentation while simultaneously improving robust image-text comprehension. Additionally, we analyze and introduce a simple non-referent sample conversion data augmentation to address the long-tail distribution issue related to target existence judgement in general scenarios. Notably, DeRIS demonstrates inherent adaptability to both non-referent and multi-referent scenarios without requiring specialized architectural modifications, enhancing its general applicability. The code and models are available at https://github.com/Dmmm1997/DeRIS.
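
The abstract does not spell out how non-referent sample conversion works, so the following is only a minimal sketch of one plausible reading: with some probability, a training sample's expression is swapped for one taken from a different image, and the target is set to empty. The `Sample` type, the function names, and the swap strategy itself are assumptions made for illustration and are not taken from the DeRIS code.

```python
import random
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Sample:
    """Hypothetical training sample; field names are illustrative only."""
    image_id: str
    expression: str
    mask: Optional[List[List[int]]]  # None stands for an empty target
    has_referent: bool = True


def to_non_referent(sample: Sample, pool: List[Sample], rng: random.Random) -> Sample:
    """Assumed conversion: borrow an expression that belongs to another image."""
    donors = [s for s in pool if s.image_id != sample.image_id and s.has_referent]
    if not donors:
        return sample  # nothing to borrow from, keep the sample unchanged
    donor = rng.choice(donors)
    return replace(sample, expression=donor.expression, mask=None, has_referent=False)


def augment(pool: List[Sample], p: float, rng: random.Random) -> List[Sample]:
    """Convert each sample with probability p; the rest pass through untouched."""
    return [to_non_referent(s, pool, rng) if rng.random() < p else s for s in pool]
```

A borrowed expression can occasionally describe something that does appear in the new image, so a real pipeline would likely need a filter (for example on object categories) that this sketch omits.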

Paper Structure

This paper contains 24 sections, 8 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Architectures with different emphases. (a) The perception-centric structure emphasizes fine-grained hierarchical features. (b) The cognition-centric structure comprehensively understands the image-text context. (c) Our DeRIS integrates the advantages of both.
  • Figure 2: Comparison of architectural paradigms: (a) Perception-centric models rely on hierarchical encoders (e.g., ResNet, YOLOv3, Swin) to preserve fine-grained spatial features. (b) Cognition-centric models leverage vision-language pre-trained models to enhance multi-modal representation and alignment, where V+L refers to two-stream models (e.g., CLIP, RegionCLIP, DeCLIP) and V-L denotes one-stream models (e.g., BEiT-3, ViLT). (c) The proposed DeRIS framework, which integrates robust cognition and fine-grained perception capabilities.
  • Figure 3: Visualization of strong perception with weak cognition.
  • Figure 4: Overview of the proposed DeRIS framework. The RIS task is decoupled into perception and cognition branches, with a loopback synergy mechanism facilitating iterative information exchange. This design enhances synergy between the two branches, enabling a dynamic and progressive understanding of both perceptual targets and multi-modal semantics.
  • Figure 5: The architecture of the loopback synergy. Each round of interaction consists of a cognition layer and a perception layer. The perception layer provides object queries $Q_p$ and the decoded mask $M_p$ to the cognition layer. The cognition layer interacts with these, and produces cognition queries $Q_c$ and referent scores $S_r$.
  • ...and 1 more figure
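
The data flow described in the Figure 5 caption can be sketched as a toy loop. This is an illustration under stated assumptions, not the authors' implementation: the attention, the mask-pooling step inside the cognition layer, the scoring rule, and the feeding of $Q_c$ back into the next perception layer are all guesses at what "loopback" means here, and every function name is hypothetical. Pixel and patch features are assumed to share one grid so that a mask can index both.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    # Numerically stable for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def attend(queries, memory):
    """Toy dot-product attention with a residual add; memory is both keys and values."""
    dim = len(memory[0])
    out = []
    for q in queries:
        w = softmax([dot(q, m) / math.sqrt(dim) for m in memory])
        ctx = [sum(wi * m[j] for wi, m in zip(w, memory)) for j in range(dim)]
        out.append([qi + ci for qi, ci in zip(q, ctx)])
    return out


def perception_layer(queries, pixel_feats):
    """Refine object queries on pixel features and decode one mask per query."""
    q_p = attend(queries, pixel_feats)
    m_p = [[dot(q, px) for px in pixel_feats] for q in q_p]  # mask logits
    return q_p, m_p


def cognition_layer(q_p, m_p, patch_feats, text_feats):
    """Assumed form: mask-pool patches, attend to image-text tokens, score each query."""
    dim = len(patch_feats[0])
    pooled = []
    for q, logits in zip(q_p, m_p):
        w = softmax(logits)  # soft region weights derived from the mask
        region = [sum(wi * p[j] for wi, p in zip(w, patch_feats)) for j in range(dim)]
        pooled.append([qi + ri for qi, ri in zip(q, region)])
    q_c = attend(pooled, patch_feats + text_feats)
    text_mean = [sum(t[j] for t in text_feats) / len(text_feats) for j in range(dim)]
    s_r = [sigmoid(dot(q, text_mean) / math.sqrt(dim)) for q in q_c]  # referent scores
    return q_c, s_r


def loopback(q_init, pixel_feats, patch_feats, text_feats, rounds=3):
    """Alternate perception and cognition; Q_c seeds the next round (assumed)."""
    queries = q_init
    for _ in range(rounds):
        q_p, m_p = perception_layer(queries, pixel_feats)
        q_c, s_r = cognition_layer(q_p, m_p, patch_feats, text_feats)
        queries = q_c
    return m_p, s_r


def rand_feats(n, dim, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
masks, scores = loopback(rand_feats(2, 4, rng), rand_feats(6, 4, rng),
                         rand_feats(6, 4, rng), rand_feats(3, 4, rng))
```

Because every query carries its own referent score, thresholding the scores can keep zero, one, or several masks. That is one way a design of this kind could cover non-referent and multi-referent inputs without a separate head.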