Bifurcated backbone strategy for RGB-D salient object detection

Yingjie Zhai, Deng-Ping Fan, Jufeng Yang, Ali Borji, Ling Shao, Junwei Han, Liang Wang

TL;DR

This work tackles RGB-D salient object detection (SOD) by addressing multi-level feature fusion and depth integration. It introduces the Bifurcated Backbone Strategy Network (BBS-Net), which splits cross-modal features into teacher and student streams and applies cascaded refinement: an initial saliency map $S_1$ is produced first, followed by a final map $S_2$. A Depth-Enhanced Module (DEM) and a Depth Adapter Module (DAM) improve depth-RGB compatibility and efficiency. Training uses a dual-stage loss $L = \alpha \ell_{ce}(S_1,G) + (1-\alpha) \ell_{ce}(S_2,G)$ with $\alpha = 0.5$. Experiments on eight challenging RGB-D SOD datasets show significant improvements over 18 SOTA models, an efficient variant reduces parameters by about 50%, and cross-dataset analyses demonstrate stronger generalization. The authors also release their code for public use.
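The dual-stage loss above can be illustrated with a minimal sketch. It assumes that $\ell_{ce}$ denotes per-pixel binary cross-entropy averaged over the map and that $G$ is the binary ground-truth mask; the function names and the toy values are hypothetical and are not taken from the authors' released code.

```python
import math


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and a binary mask.

    Assumption: this stands in for the l_ce term; both inputs are flattened maps.
    """
    total = 0.0
    for p, g in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(g * math.log(p) + (1.0 - g) * math.log(1.0 - p))
    return total / len(pred)


def dual_stage_loss(s1, s2, gt, alpha=0.5):
    """L = alpha * l_ce(S1, G) + (1 - alpha) * l_ce(S2, G)."""
    return alpha * binary_cross_entropy(s1, gt) + (1.0 - alpha) * binary_cross_entropy(s2, gt)


# Toy flattened 2x2 maps (hypothetical values): S2 is closer to the mask than S1.
gt = [1.0, 0.0, 1.0, 0.0]
s1 = [0.6, 0.4, 0.7, 0.3]
s2 = [0.9, 0.1, 0.8, 0.2]
loss = dual_stage_loss(s1, s2, gt)
```

With $\alpha = 0.5$ both maps are weighted equally, so the initial map $S_1$ is supervised as strongly as the final map $S_2$.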

Abstract

Multi-level feature fusion is a fundamental topic in computer vision. It has been exploited to detect, segment and classify objects at various scales. When multi-level features meet multi-modal cues, the optimal feature aggregation and multi-modal learning strategy become a hot potato. In this paper, we leverage the inherent multi-modal and multi-level nature of RGB-D salient object detection to devise a novel cascaded refinement network. In particular, first, we propose to regroup the multi-level features into teacher and student features using a bifurcated backbone strategy (BBS). Second, we introduce a depth-enhanced module (DEM) to excavate informative depth cues from the channel and spatial views. Then, RGB and depth modalities are fused in a complementary way. Our architecture, named Bifurcated Backbone Strategy Network (BBS-Net), is simple, efficient, and backbone-independent. Extensive experiments show that BBS-Net significantly outperforms eighteen SOTA models on eight challenging datasets under five evaluation measures, demonstrating the superiority of our approach ($\sim 4\%$ improvement in S-measure vs. the top-ranked model, DMRA, ICCV 2019). In addition, we provide a comprehensive analysis on the generalization ability of different RGB-D datasets and provide a powerful training set for future research.

Paper Structure

This paper contains 20 sections, 14 equations, 11 figures, 13 tables.

Figures (11)

  • Figure 1: Saliency maps of state-of-the-art (SOTA) CNN-based methods (i.e., DMRA [piao2019DMRA], CPFP [zhao2019CPFP], TANet [chen2019TANet], PCF [chen2018PCF] and Ours) and methods based on handcrafted features (i.e., SE [guo2016SE] and LBE [feng2016LBE]). Our method generates higher-quality saliency maps and suppresses background distractors in challenging scenarios (top: complex background; bottom: depth with noise).
  • Figure 2: (a) Existing multi-level feature aggregation methods for RGB-D SOD [chen2018PCF, zhao2019CPFP, piao2019DMRA, chen2019TANet, zhu2019PDNet, wang2019AFNet, LIU2019SSRC]. (b) In this paper, we adopt a bifurcated backbone strategy (BBS) to split the multi-level features into student and teacher features. The initial saliency map $S_1$ is utilized to refine the student features to effectively suppress distractors. Then, the refined features are passed to another cascaded decoder to generate the final saliency map $S_2$.
  • Figure 3: Architecture of our BBS-Net. Feature Extraction: 'Conv1'$\sim$'Conv5' denote different layers from ResNet-50 [He2016resnet]. Multi-level features ($f_1^d\sim f_5^d$) from the depth branch are enhanced by the DEM and are then fused with features (i.e., $f_1^{rgb}\sim f_5^{rgb}$) from the RGB branch. Stage 1: cross-modal teacher features ($f_3^{cm}\sim f_5^{cm}$) are first aggregated by the cascaded decoder (a) to produce the initial saliency map $S_1$. Stage 2: student features ($f_1^{cm}\sim f_3^{cm}$) are then refined by the initial saliency map $S_1$ and integrated by another cascaded decoder to predict the final saliency map $S_2$. See the proposed-method section for details.
  • Figure 4: Architecture of the depth adapter module (DAM).
  • Figure 5: PR curves of the proposed model and 18 SOTA algorithms over six datasets. Dots on the curves mark the precision and recall values at the maximum F-measure.
  • ...and 6 more figures
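
The two-stage flow described in the captions of Figures 2 and 3 can be sketched as a toy example. This is an illustration under strong assumptions, not the authors' network: the five cross-modal features are reduced to same-size flattened maps with values in [0, 1], the cascaded decoder is replaced by a hypothetical `toy_decoder` that simply averages its inputs, and the refinement by $S_1$ is assumed to be an element-wise multiplication.

```python
def toy_decoder(features):
    """Hypothetical stand-in for a cascaded decoder: element-wise mean, clipped to [0, 1]."""
    n = len(features)
    return [min(1.0, max(0.0, sum(vals) / n)) for vals in zip(*features)]


def bbs_forward(f_cm):
    """Two-stage flow over five cross-modal feature maps f1..f5 (toy, flattened)."""
    teacher = f_cm[2:5]  # f3, f4, f5 -> stage 1
    student = f_cm[0:3]  # f1, f2, f3 -> stage 2

    # Stage 1: aggregate teacher features into the initial saliency map S1.
    s1 = toy_decoder(teacher)

    # Stage 2: refine student features with S1 (assumed element-wise product),
    # then aggregate them into the final saliency map S2.
    refined = [[v * w for v, w in zip(f, s1)] for f in student]
    s2 = toy_decoder(refined)
    return s1, s2


# Hypothetical activations over four positions: positions 0 and 3 belong to the
# object, position 1 is a distractor that only the low-level features respond to.
f_cm = [
    [0.9, 0.8, 0.1, 0.9],  # f1
    [0.8, 0.7, 0.2, 0.8],  # f2
    [0.9, 0.1, 0.1, 0.9],  # f3
    [1.0, 0.0, 0.0, 0.9],  # f4
    [0.9, 0.1, 0.0, 1.0],  # f5
]
s1, s2 = bbs_forward(f_cm)
```

In this toy setting the distractor at position 1 is strong in the unrefined student features but weak in `s2`, because `s1` is low there; that is the suppression effect the Figure 2 caption attributes to the refinement step.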