Dual Mutual Learning Network with Global-local Awareness for RGB-D Salient Object Detection

Kang Yi, Haoran Tang, Yumeng Li, Jing Xu, Jun Zhang

TL;DR

This paper proposes GL-DMNet, a novel dual mutual learning network with global-local awareness. It combines a position mutual fusion module and a channel mutual fusion module to exploit the interdependencies among different modalities in the spatial and channel dimensions.

Abstract

RGB-D salient object detection (SOD), which aims to highlight the prominent regions of a scene by jointly modeling RGB and depth information, is a challenging pixel-level prediction task. Recently, the dual-attention mechanism has been widely applied to this task because of its ability to strengthen the detection process. However, most existing methods directly fuse attentional cross-modality features under a manual-mandatory fusion paradigm without considering the inherent discrepancy between RGB and depth, which may reduce performance. Moreover, the long-range dependencies derived from global and local information make it difficult to design a unified, efficient fusion strategy. Hence, in this paper, we propose GL-DMNet, a novel dual mutual learning network with global-local awareness. Specifically, we present a position mutual fusion module and a channel mutual fusion module to exploit the interdependencies among different modalities in the spatial and channel dimensions. In addition, we adopt an efficient decoder based on cascade transformer-infused reconstruction to jointly integrate multi-level fusion features. Extensive experiments on six benchmark datasets demonstrate that the proposed GL-DMNet outperforms 24 RGB-D SOD methods, achieving an average improvement of ~3% across four evaluation metrics compared to the second-best model (S3Net). Code and results are available at https://github.com/kingkung2016/GL-DMNet.
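
To make the idea of cross-modal interdependencies in the spatial and channel dimensions more concrete, the following is a minimal NumPy sketch of generic dual-attention-style fusion between an RGB feature map and a depth feature map. It is not the authors' PMF or CMF module: the function names (`position_cross_attention`, `channel_cross_attention`, `mutual_fusion`), the absence of learned projections, the scaling factors, and the final summation are all assumptions made for illustration. The reference implementation is in the repository linked above.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def position_cross_attention(query_feat, context_feat):
    """Spatial (position) attention: each position of query_feat attends
    to all positions of context_feat. Hypothetical helper, no learned weights."""
    c, h, w = query_feat.shape
    q = query_feat.reshape(c, h * w)      # (C, N)
    k = context_feat.reshape(c, h * w)    # (C, N)
    # (N, N) affinity between positions; each row sums to 1.
    affinity = softmax(q.T @ k / np.sqrt(c), axis=-1)
    # Position i receives a weighted sum of context columns.
    attended = k @ affinity.T             # (C, N)
    return query_feat + attended.reshape(c, h, w)  # residual connection


def channel_cross_attention(query_feat, context_feat):
    """Channel attention: each channel of query_feat attends to all
    channels of context_feat. Hypothetical helper, no learned weights."""
    c, h, w = query_feat.shape
    q = query_feat.reshape(c, h * w)      # (C, N)
    k = context_feat.reshape(c, h * w)    # (C, N)
    # (C, C) affinity between channels; each row sums to 1.
    affinity = softmax(q @ k.T / np.sqrt(h * w), axis=-1)
    attended = affinity @ k               # (C, N)
    return query_feat + attended.reshape(c, h, w)  # residual connection


def mutual_fusion(rgb, depth):
    """Assumed fusion rule: apply both attentions in both directions
    (RGB -> depth and depth -> RGB) and sum the four results."""
    return (
        position_cross_attention(rgb, depth)
        + position_cross_attention(depth, rgb)
        + channel_cross_attention(rgb, depth)
        + channel_cross_attention(depth, rgb)
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb = rng.standard_normal((4, 3, 5))    # (channels, height, width)
    depth = rng.standard_normal((4, 3, 5))
    print(mutual_fusion(rgb, depth).shape)
```

The position branch builds an N x N affinity over spatial locations, while the channel branch builds a C x C affinity over feature channels. Running each branch in both directions is one simple way to let the two modalities inform each other rather than treating depth as a fixed auxiliary input.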

Paper Structure (19 sections, 17 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 1: The results of our GL-DMNet and other representative methods, including CATNet [CATNet], HiDANet [wu2023hidanet], and TriTransNet [liu2021tritransnet].
  • Figure 2: Comparison between (a) the FPN framework, (b) the dense decode network, (c) the group transformer network, (d) the visual transformer FPN, (e) the triplet transformer embedding network, and (f) our transformer-infused reconstruction network.
  • Figure 3: Detailed framework of the proposed GL-DMNet. We adopt the ResNet-50 network to extract features from the RGB and depth inputs separately. Position mutual fusion (PMF) and channel mutual fusion (CMF) are then used to fuse the multi-modal features. The fused features of all stages are decoded by the cascade transformer-infused reconstruction network. A saliency head [fan2021rethinking] is added to generate the final predicted feature maps.
  • Figure 4: Details of the position mutual fusion (PMF) module and the channel mutual fusion (CMF) module.
  • Figure 5: Visual comparisons of the proposed GL-DMNet and other state-of-the-art RGB-D SOD methods, including MIRV [li2024mutual], HINet [bi2023cross], DLMNet [yang2022depth], CCAFNet [zhou2022ccafnet], DENet [xu2022weakly], MoADNet [jin2022moadnet], MMNet [gao2022unified], CMINet [yi2022cross], and DCF [sun2021deep]. Our approach obtains competitive performance in a variety of challenging scenarios.
  • ...and 4 more figures