SSNet: Saliency Prior and State Space Model-based Network for Salient Object Detection in RGB-D Images

Gargi Panda, Soumitra Kundu, Saumik Bhattacharya, Aurobinda Routray

TL;DR

This work tackles RGB-D salient object detection by modeling cross-modal global dependencies with a state space model (SSM)-based decoder, introducing a cross-modal selective scan mechanism (CM-S6) for inter-modal coupling. It combines adaptive contrast enhancement (ACE) of depth maps with a saliency enhancement module (SEM) that uses three saliency priors to refine feature representations. The proposed multi-modal multi-scale decoder module (M2DM), built from self-modality and cross-modality global feature blocks (SGFB/CGFB), models long-range context within and across modalities with linear complexity. Experiments on seven benchmarks show state-of-the-art performance, with ablations confirming the contribution of CM-S6, ACE, SEM, SMDB, and CMDB. The approach offers efficient cross-modal saliency reasoning and robustness to low-quality depth maps.

Abstract

Salient object detection (SOD) in RGB-D images is an essential task in computer vision, enabling applications in scene understanding, robotics, and augmented reality. However, existing methods struggle to capture global dependency across modalities, lack comprehensive saliency priors from both RGB and depth data, and are ineffective in handling low-quality depth maps. To address these challenges, we propose SSNet, a saliency-prior and state space model (SSM)-based network for the RGB-D SOD task. Unlike existing convolution- or transformer-based approaches, SSNet introduces an SSM-based multi-modal multi-scale decoder module to efficiently capture both intra- and inter-modal global dependency with linear complexity. Specifically, we propose a cross-modal selective scan SSM (CM-S6) mechanism, which effectively captures global dependency between different modalities. Furthermore, we introduce a saliency enhancement module (SEM) that integrates three saliency priors with deep features to refine feature representation and improve the localization of salient objects. To further address the issue of low-quality depth maps, we propose an adaptive contrast enhancement technique that dynamically refines depth maps, making them more suitable for the RGB-D SOD task. Extensive quantitative and qualitative experiments on seven benchmark datasets demonstrate that SSNet outperforms state-of-the-art methods.
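The abstract does not give the CM-S6 equations, so the following is only a minimal sketch of the generic selective scan recurrence used by S6-style SSMs, written to show why the cost grows linearly with sequence length. The cross-modal coupling shown here (scan parameters derived from depth tokens, applied to an RGB token sequence) is an assumption made for illustration and is not taken from the paper; `selective_scan`, `toy_params`, and the toy values are hypothetical names and data.

```python
import math


def selective_scan(x, delta, B, C, A):
    """Diagonal SSM recurrence over a 1-D token sequence.

    x     : list of L floats, the input sequence
    delta : list of L positive step sizes (one per token)
    B, C  : lists of L vectors of length N (token-dependent projections)
    A     : list of N negative floats (diagonal state matrix)
    Returns a list of L outputs. Cost is O(L * N): one pass over the tokens.
    """
    n = len(A)
    h = [0.0] * n  # hidden state carried across tokens
    y = []
    for t in range(len(x)):
        for i in range(n):
            a_bar = math.exp(delta[t] * A[i])  # zero-order-hold discretization of A
            b_bar = delta[t] * B[t][i]         # simplified (Euler) discretization of B
            h[i] = a_bar * h[i] + b_bar * x[t]
        y.append(sum(C[t][i] * h[i] for i in range(n)))
    return y


def toy_params(tokens, n_state=2):
    """Hypothetical stand-in for learned projections: builds (delta, B, C)
    from a token sequence. A real model would use learned linear layers."""
    delta = [math.log1p(math.exp(v)) for v in tokens]  # softplus keeps delta > 0
    B = [[v] * n_state for v in tokens]
    C = [[1.0] * n_state for _ in tokens]
    return delta, B, C


# Assumed cross-modal use: parameters come from depth tokens,
# while the scanned sequence comes from RGB tokens.
rgb_tokens = [0.5, -0.2, 0.8, 0.1]
depth_tokens = [0.3, 0.9, -0.4, 0.6]
fused = selective_scan(rgb_tokens, *toy_params(depth_tokens), A=[-1.0, -2.0])
```

Each token is visited once and the state has fixed size, which is where the linear complexity comes from; an attention layer would instead compare every token pair.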

Paper Structure

This paper contains 18 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Overall framework and architecture of SSNet.
  • Figure 2: Structure of Saliency Enhancement Module (SEM).
  • Figure 3: Structure of Multi-modal Multi-scale Decoder Module (M2DM) and Reconstruction Module (RM).
  • Figure 4: Architecture of Self-modality Global Feature Block (SGFB) and Cross-modality Global Feature Block (CGFB).
  • Figure 5: Visual comparisons of SSNet to SOTA methods.
  • ...and 3 more figures