Salient Object Detection in RGB-D Videos

Ao Mou, Yukang Lu, Jiahao He, Dingyao Min, Keren Fu, Qijun Zhao

TL;DR

This work targets RGB-D video salient object detection by introducing RDVS, a dataset with 57 sequences (4,087 frames) and gaze-guided per-frame saliency masks using realistic depth, alongside DCTNet+, a trimodal network with RGB as the main modality and depth and optical flow (OF) as auxiliaries. The model leverages a Multi-modal Attention Module (MAM) for cross-modal long-range dependencies, a Refinement Fusion Module (RFM) for noise-robust fusion via a Universal Interaction Module (UIM), and Holistic Multi-modal Attentive Paths (HMAPs) to refine low-level features before fusion. The authors provide extensive ablations showing the value of depth, main-modality choice, and the proposed modules, and demonstrate state-of-the-art performance on both RDVS and conventional RGB-D/VSOD benchmarks, with realistic depth offering clear benefits over synthetic depth. The RDVS dataset and codebase aim to accelerate RGB-D VSOD research and stimulate progress in related tasks such as video segmentation and depth-aware perception in real-world scenarios.

Abstract

Given the widespread adoption of depth-sensing acquisition devices, RGB-D videos and related data/media have gained considerable traction in various aspects of daily life. Consequently, conducting salient object detection (SOD) in RGB-D videos presents a highly promising and evolving avenue. Despite the potential of this area, SOD in RGB-D videos remains somewhat under-explored, with RGB-D SOD and video SOD (VSOD) traditionally studied in isolation. To explore this emerging field, this paper makes two primary contributions: the dataset and the model. On one front, we construct the RDVS dataset, a new RGB-D VSOD dataset with realistic depth, characterized by its diversity of scenes and rigorous frame-by-frame annotations. We validate the dataset through comprehensive attribute and object-oriented analyses, and provide training and testing splits. Moreover, we introduce DCTNet+, a three-stream network tailored for RGB-D VSOD, which emphasizes the RGB modality and treats depth and optical flow as auxiliary modalities. In pursuit of effective feature enhancement, refinement, and fusion for precise final prediction, we propose two modules: the multi-modal attention module (MAM) and the refinement fusion module (RFM). To enhance interaction and fusion within RFM, we design a universal interaction module (UIM) and then integrate holistic multi-modal attentive paths (HMAPs) for refining multi-modal low-level features before reaching RFMs. Comprehensive experiments, conducted on pseudo RGB-D video datasets alongside our RDVS, highlight the superiority of DCTNet+ over 17 VSOD models and 14 RGB-D SOD models. Ablation experiments were performed on both pseudo and realistic RGB-D video datasets to demonstrate the advantages of individual modules as well as the necessity of introducing realistic depth. Our code, together with the RDVS dataset, will be available at https://github.com/kerenfu/RDVS/.
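
As a rough illustration of what "RGB as the main modality with depth and optical flow as auxiliaries" can mean in practice, the sketch below lets each RGB feature position attend over all positions of an auxiliary modality and then adds the gathered context back to the RGB features. This is not the DCTNet+ implementation: the function names `cross_modal_attention` and `fuse_trimodal`, the use of plain scaled dot-product attention without learned projections, and the simple additive fusion are all assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(main, aux):
    """Scaled dot-product attention from a main modality onto an auxiliary one.

    main, aux: lists of N feature vectors (each a list of C floats).
    Every main position acts as a query over all aux positions, so the
    result can capture long-range cross-modal dependencies.
    Returns N vectors of length C aggregated from aux.
    """
    dim = len(main[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in main:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in aux]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, aux)) for c in range(dim)])
    return out


def fuse_trimodal(rgb, depth, flow):
    """Hypothetical fusion: RGB features plus context gathered from depth and flow."""
    from_depth = cross_modal_attention(rgb, depth)
    from_flow = cross_modal_attention(rgb, flow)
    return [
        [r + d + f for r, d, f in zip(r_vec, d_vec, f_vec)]
        for r_vec, d_vec, f_vec in zip(rgb, from_depth, from_flow)
    ]


# Toy features: 2 spatial positions, 2 channels per modality.
rgb = [[1.0, 0.0], [0.0, 1.0]]
depth = [[0.5, 0.5], [0.5, 0.5]]
flow = [[0.0, 0.0], [0.0, 0.0]]
fused = fuse_trimodal(rgb, depth, flow)
```

In this toy setup the RGB stream always contributes directly, while depth and flow only contribute through attention weights computed from RGB queries. That asymmetry is one simple way to keep an uninformative auxiliary input (here, the all-zero flow) from overriding the main modality.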

Paper Structure (32 sections, 9 equations, 10 figures, 12 tables)

This paper contains 32 sections, 9 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: We target the RGB-D VSOD task, which can be deemed an extension of the prevalent RGB-D SOD and VSOD tasks.
  • Figure 2: Illustrative frames (with depth in the bottom-right) from RDVS with fixations (red dots, top row) and the corresponding continuous saliency maps (overlaid on the RGB frames, bottom row).
  • Figure 3: Attribute-based analyses of RDVS with comparison to DAVIS (left), and the pairwise dependencies across different attributes (right).
  • Figure 4: Scene/object categories of RDVS.
  • Figure 5: Center bias of RDVS and existing VSOD datasets.
  • ...and 5 more figures