CSDNet: Detect Salient Object in Depth-Thermal via A Lightweight Cross Shallow and Deep Perception Network

Xiaotong Yu, Ruihan Xie, Zhihe Zhao, Chang-Wen Chen

TL;DR

CSDNet tackles the inefficiency and noise inherent in multi-modality perceptual systems by exploiting low-coherence depth-thermal data for salient object detection. It introduces a cross shallow and deep perception framework comprising CFARSP for shallow prescreening, ICAN for deep semantic coherence, and SAMAEP to map depth-thermal features into a generalized feature space via SAM guidance. On the VDT-2048 dataset, CSDNet achieves competitive or superior performance relative to RGB-D and RGB-T methods and approaches RGB-D-T baselines while delivering substantial efficiency gains (≈5.97× faster and ≈0.0036× FLOPs). These results demonstrate effective integration of depth-thermal information with reduced computational burden, making the approach well-suited for edge devices and privacy-conscious mobile robotics under challenging lighting conditions.

Abstract

While we enjoy the richness and informativeness of multimodal data, it also introduces interference and redundancy of information. To achieve optimal domain interpretation with limited resources, we propose CSDNet, a lightweight Cross Shallow and Deep Perception Network designed to integrate two modalities with less coherence, thereby discarding redundant information or even an entire modality. We implement CSDNet for the Salient Object Detection (SOD) task in robotic perception. The proposed method capitalises on spatial information prescreening and implicit coherence navigation across shallow and deep layers of the depth-thermal (D-T) modality, prioritising integration over fusion to maximise scene interpretation. To further refine the descriptive capabilities of the encoder for the less-known D-T modalities, we also propose SAMAEP to guide an effective feature mapping to the generalised feature space. Our approach is tested on the VDT-2048 dataset. Leveraging the D-T modality, it outperforms SOTA methods using RGB-T or RGB-D modalities for the first time, and achieves performance comparable to the RGB-D-T triple-modality benchmark method while running 5.97 times faster and requiring only 0.0036 times the FLOPs. This demonstrates that the proposed CSDNet effectively integrates the information from the D-T modality. The code will be released upon acceptance.

Paper Structure (15 sections, 9 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: (a) The t-SNE representations of different modalities: (left) depth and thermal are highlighted; (right) the RGB modality is highlighted. (b) The visualised results of existing methods on the D-T modality; the RGB-dominated models show less capability in interpreting D-T data.
  • Figure 2: The overview of the proposed network CSDNet
  • Figure 3: The schematic of the CFAR Saliency Prescreening Module
  • Figure 4: The schematic of the SAM-assisted depth encoder pre-training framework
  • Figure 5: Visual comparison on the VDT-2048 dataset
  • ...and 1 more figures