Depth Awakens: A Depth-perceptual Attention Fusion Network for RGB-D Camouflaged Object Detection

Xinran Liu, Lin Qi, Yuxuan Song, Qi Wen

TL;DR

This work tackles camouflaged object detection by leveraging depth maps as direct inputs to reveal 3D cues absent in 2D RGB images. It introduces DAF-Net, featuring a three-branch encoder, a Depth-weighted Cross-attention Fusion module, and a lightweight Feature Aggregation Decoder to adaptively fuse RGB and depth information. Through MiDaS-based depth estimation and extensive experiments on CAMO, COD10K, and NC4K, the approach achieves state-of-the-art COD performance and demonstrates the value of depth cues in challenging camouflage scenarios. The study bridges single-image depth estimation (SIDE) and COD, showing depth information can be effectively exploited despite potential noise, and opens avenues for broader multimodal fusion in COD and related tasks.

Abstract

Camouflaged object detection (COD) remains a persistent challenge: the task is to accurately identify objects that blend seamlessly into their surroundings. However, most existing COD models overlook the fact that visual systems operate within a genuine 3D environment. The scene depth inherent in a single 2D image provides rich spatial cues that can assist in the detection of camouflaged objects. Therefore, we propose a novel depth-perceptual attention fusion network that leverages the depth map as an auxiliary input to enhance the network's ability to perceive 3D information, which is typically challenging for the human eye to discern from 2D images. The network uses a trident-branch encoder to extract chromatic information, depth information, and their interactions. Recognizing that certain regions of a depth map may not effectively highlight the camouflaged object, we introduce a depth-weighted cross-attention fusion module to dynamically adjust the fusion weights on depth and RGB feature maps. To keep the model simple without compromising effectiveness, we design a straightforward feature aggregation decoder that adaptively fuses the enhanced aggregated features. Experiments demonstrate the significant superiority of our proposed method over other state-of-the-art methods, which further validates the contribution of depth information in camouflaged object detection. The code will be available at https://github.com/xinran-liu00/DAF-Net.
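
To make the idea of dynamically weighting depth against RGB features more concrete, the following is a minimal NumPy sketch. It is not the DCF module from the paper, whose exact layers are not given here. It assumes a generic design in which RGB positions act as queries, depth positions act as keys and values, and a per-pixel sigmoid gate decides how much attended depth signal is added back to the RGB features. The function name `depth_weighted_cross_attention` and the gate projection `w_gate` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def depth_weighted_cross_attention(rgb_feat, depth_feat, w_gate):
    """Fuse RGB and depth feature maps of shape (C, H, W).

    Hypothetical illustration, not the paper's DCF module:
    RGB positions query the depth features, and a per-pixel gate in (0, 1)
    scales how much of the attended depth signal reaches the output.
    """
    c, h, w = rgb_feat.shape
    q = rgb_feat.reshape(c, h * w).T    # (N, C) queries from RGB
    k = depth_feat.reshape(c, h * w).T  # (N, C) keys from depth
    v = k                               # values reuse the depth features

    # Scaled dot-product cross-attention from RGB positions to depth positions.
    attn = softmax(q @ k.T / np.sqrt(c), axis=-1)  # (N, N)
    attended = attn @ v                            # (N, C)

    # Assumed gating: sigmoid of a linear projection of the depth features.
    gate = 1.0 / (1.0 + np.exp(-(k @ w_gate)))     # (N, 1)

    # Residual fusion: unreliable depth regions (small gate) contribute little.
    fused = q + gate * attended                    # (N, C)
    return fused.T.reshape(c, h, w), gate.reshape(h, w)


# Usage with random stand-in features (8 channels on a 4x4 grid).
rng = np.random.default_rng(0)
rgb = rng.standard_normal((8, 4, 4))
depth = rng.standard_normal((8, 4, 4))
w_gate = rng.standard_normal((8, 1))
fused, gate = depth_weighted_cross_attention(rgb, depth, w_gate)
print(fused.shape, gate.shape)  # (8, 4, 4) (4, 4)
```

In this sketch the gate is the only place where depth reliability enters: where it is near zero the output falls back to the RGB features, which mirrors the stated motivation that some depth regions do not highlight the camouflaged object.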

Paper Structure

This paper contains 22 sections, 7 equations, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: Visual examples of different methods. (a) RGB images. (b) Ground truths. (c) Depth maps. (d) Our results. (e)-(f) Prediction maps produced by SINet-V2 [fan2021concealed] and LSR+ [lv2023towards], respectively.
  • Figure 2: Overview of our proposed Depth-perceptual Attention Fusion Network (DAF-Net). The proposed network consists of our designed Depth-weighted Cross-attention Fusion (DCF) module and Feature Aggregation Decoder (FAD). DCF aims to fuse valuable depth cues while fully suppressing redundant information and background noise. The decoder aims to adaptively fuse the enhanced aggregated features without increasing the complexity of the model. RMFE is from [zhu2022can].
  • Figure 3: Visualization of depth maps generated by different advanced depth map estimation methods. Compared with DPT [ranftl2021vision] and AdelaiDepth [yin2022towards], the depth maps produced by MiDaS [birkl2023midas] highlight the camouflaged objects best.
  • Figure 4: The details of Depth-weighted Cross-attention Fusion module (DCF).
  • Figure 5: Qualitative comparison with eight state-of-the-art COD methods.
  • ...and 2 more figures