Modality-Specific Hierarchical Enhancement for RGB-D Camouflaged Object Detection

Yuzhen Niu, Yangqing Wang, Ri Cheng, Fusheng Li, Rongshen Wang, Zhichen Yang

Abstract

Camouflaged object detection (COD) is challenging due to the high similarity between targets and their background, and recent methods address it by exploiting the complementary texture and geometry cues of RGB-D data. However, existing RGB-D COD methods still underutilize modality-specific cues, which limits fusion quality. We attribute this to RGB and depth features being fused directly after backbone extraction, without modality-specific enhancement. To address this limitation, we propose MHENet, an RGB-D COD framework that performs modality-specific hierarchical enhancement and adaptive fusion of RGB and depth features. Specifically, we introduce a Texture Hierarchical Enhancement Module (THEM) that amplifies subtle texture variations by extracting high-frequency information, and a Geometry Hierarchical Enhancement Module (GHEM) that enhances geometric structures via learnable gradient extraction, both while preserving cross-scale semantic consistency. Finally, an Adaptive Dynamic Fusion Module (ADFM) adaptively fuses the enhanced texture and geometry features with spatially varying weights. Experiments on four benchmarks demonstrate that MHENet outperforms 16 state-of-the-art methods both qualitatively and quantitatively. Code is available at https://github.com/afdsgh/MHENet.

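The abstract names three mechanisms: high-frequency extraction for texture (THEM), learnable gradient extraction for geometry (GHEM), and fusion with spatially varying weights (ADFM). The sketch below illustrates one plausible reading of each in PyTorch; every internal choice here (the average-pool high-pass, the Sobel-initialized depthwise convolution, the per-pixel softmax) is our assumption for illustration, not the authors' implementation, which is available at the repository linked above.

```python
# Minimal sketch of the three ideas named in the abstract.
# All module internals are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextureEnhance(nn.Module):
    """THEM-style idea: amplify high-frequency texture cues.

    High frequencies are approximated as the residual between the
    feature map and a low-pass (average-pooled) copy of itself.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, rgb_feat: torch.Tensor) -> torch.Tensor:
        low = F.avg_pool2d(rgb_feat, 3, stride=1, padding=1)  # low-pass copy
        high = rgb_feat - low                                 # high-frequency residual
        return rgb_feat + self.refine(high)                   # amplified texture


class GeometryEnhance(nn.Module):
    """GHEM-style idea: learnable gradient extraction.

    A depthwise conv is initialized with Sobel kernels and left
    trainable, so the gradient operator can adapt during training.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.grad = nn.Conv2d(channels, 2 * channels, 3, padding=1,
                              groups=channels, bias=False)
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        sobel_y = sobel_x.t()
        kernels = torch.stack([sobel_x, sobel_y]).repeat(channels, 1, 1)
        self.grad.weight.data = kernels.unsqueeze(1)          # shape (2C, 1, 3, 3)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, depth_feat: torch.Tensor) -> torch.Tensor:
        return depth_feat + self.fuse(self.grad(depth_feat))  # strengthened structure


class AdaptiveFusion(nn.Module):
    """ADFM-style idea: fuse with spatially varying weights.

    A per-pixel softmax decides, at every location, how much to
    trust the texture branch versus the geometry branch.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.weigher = nn.Conv2d(2 * channels, 2, 3, padding=1)

    def forward(self, tex: torch.Tensor, geo: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.weigher(torch.cat([tex, geo], dim=1)), dim=1)
        return w[:, :1] * tex + w[:, 1:] * geo                # per-pixel blend


if __name__ == "__main__":
    x = torch.randn(1, 64, 44, 44)   # stand-in backbone features
    fused = AdaptiveFusion(64)(TextureEnhance(64)(x), GeometryEnhance(64)(x))
    print(fused.shape)               # torch.Size([1, 64, 44, 44])
```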

Paper Structure

This paper contains 23 sections, 16 equations, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: RGB features activate strongly on the texture-rich limbs but weakly on the camouflaged bat, whereas depth complements the bat, making the two modalities complementary. Texture enhancement enriches the texture details of the limbs (red boxes), and geometry enhancement strengthens the bat and its boundary activations (green boxes), enabling the fusion to better combine these complementary cues.
  • Figure 2: The overall architecture of the proposed MHENet, which consists of three key components: the Texture Hierarchical Enhancement Module (THEM), the Geometry Hierarchical Enhancement Module (GHEM), and the Adaptive Dynamic Fusion Module (ADFM).
  • Figure 3: Overview of the Adaptive Dynamic Fusion Module.
  • Figure 4: Visual comparisons between our method and recent COD methods on different types of samples. More comparisons are provided in the supplementary material. Best viewed zoomed in for details.
  • Figure S1: Failure cases and potential extensions of MHENet under occlusion, ambiguous boundaries, and noisy depth scenarios.
  • ...and 2 more figures