Bifurcated backbone strategy for RGB-D salient object detection
Yingjie Zhai, Deng-Ping Fan, Jufeng Yang, Ali Borji, Ling Shao, Junwei Han, Liang Wang
TL;DR
This work tackles RGB-D salient object detection by addressing two problems: multi-level feature fusion and depth integration. It introduces the Bifurcated Backbone Strategy Network (BBS-Net), which regroups cross-modal features into teacher and student streams and applies cascaded refinement, producing an initial saliency map $S_1$ that is refined into a final map $S_2$. A Depth-Enhanced Module (DEM) and a Depth Adapter Module (DAM) improve depth-RGB compatibility and efficiency. Training uses a dual-stage loss $L = \alpha \ell_{ce}(S_1,G) + (1-\alpha) \ell_{ce}(S_2,G)$ with $\alpha = 0.5$, where $G$ is the ground-truth map. Experiments on eight challenging RGB-D SOD datasets show significant improvements over 18 SOTA models, and an efficient variant reduces parameters by about 50%. Cross-dataset analyses demonstrate stronger generalization, and the authors release their code publicly.
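To make the dual-stage loss concrete, the following is a minimal sketch of $L = \alpha \ell_{ce}(S_1,G) + (1-\alpha) \ell_{ce}(S_2,G)$. It assumes that $\ell_{ce}$ is a mean per-pixel binary cross-entropy, that $G$ is the ground-truth map, and that the maps are flat lists of probabilities. The function names and toy values are hypothetical and are not taken from the authors' released code.

```python
import math


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean per-pixel binary cross-entropy (assumed form of l_ce).

    pred:   flat list of predicted saliency probabilities in [0, 1]
    target: flat list of ground-truth labels (0.0 or 1.0)
    """
    total = 0.0
    for p, g in zip(pred, target):
        # Clamp to avoid log(0) on saturated predictions.
        p = min(max(p, eps), 1.0 - eps)
        total += -(g * math.log(p) + (1.0 - g) * math.log(1.0 - p))
    return total / len(pred)


def dual_stage_loss(s1, s2, gt, alpha=0.5):
    """Weighted sum of the losses on the initial map S1 and the final map S2."""
    return (alpha * binary_cross_entropy(s1, gt)
            + (1.0 - alpha) * binary_cross_entropy(s2, gt))


# Toy 4-pixel example: S2 is closer to the ground truth than S1.
gt = [1.0, 0.0, 1.0, 0.0]
s1 = [0.6, 0.4, 0.7, 0.3]
s2 = [0.9, 0.1, 0.95, 0.05]
loss = dual_stage_loss(s1, s2, gt)
print(f"dual-stage loss: {loss:.4f}")
```

With $\alpha = 0.5$ both maps are supervised equally, so the initial map receives a direct training signal instead of being learned only through the refined output.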
Abstract
Multi-level feature fusion is a fundamental topic in computer vision. It has been exploited to detect, segment, and classify objects at various scales. When multi-level features meet multi-modal cues, finding the optimal feature aggregation and multi-modal learning strategy becomes a difficult open problem. In this paper, we leverage the inherent multi-modal and multi-level nature of RGB-D salient object detection to devise a novel cascaded refinement network. First, we propose to regroup the multi-level features into teacher and student features using a bifurcated backbone strategy (BBS). Second, we introduce a depth-enhanced module (DEM) to excavate informative depth cues from the channel and spatial views. The RGB and depth modalities are then fused in a complementary way. Our architecture, named Bifurcated Backbone Strategy Network (BBS-Net), is simple, efficient, and backbone-independent. Extensive experiments show that BBS-Net significantly outperforms eighteen SOTA models on eight challenging datasets under five evaluation measures, demonstrating the superiority of our approach ($\sim 4\%$ improvement in S-measure vs. the top-ranked model, DMRA-iccv2019). In addition, we provide a comprehensive analysis of the generalization ability of different RGB-D datasets and provide a powerful training set for future research.
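The idea of mining depth cues from the channel and spatial views before fusing them with RGB features can be illustrated with a heavily simplified sketch. Everything here is an assumption made for illustration: the channel weights come from a global max per channel, the spatial weights from a per-pixel max across channels, no learned layers are included, and fusion is plain element-wise addition. The function names are hypothetical, and this is not the authors' DEM implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feat):
    """One weight per channel from its global max response (no learned layers)."""
    return [sigmoid(max(max(row) for row in ch)) for ch in feat]


def spatial_attention(feat):
    """One weight per pixel from the max across channels (no learned layers)."""
    h, w = len(feat[0]), len(feat[0][0])
    return [[sigmoid(max(ch[i][j] for ch in feat)) for j in range(w)]
            for i in range(h)]


def enhance_depth(depth_feat):
    """Reweight a C x H x W depth feature by channel, then by spatial position."""
    cw = channel_attention(depth_feat)
    scaled = [[[v * cw[c] for v in row] for row in ch]
              for c, ch in enumerate(depth_feat)]
    sw = spatial_attention(scaled)
    return [[[ch[i][j] * sw[i][j] for j in range(len(ch[0]))]
             for i in range(len(ch))] for ch in scaled]


def fuse(rgb_feat, depth_feat):
    """Assumed fusion: element-wise sum of RGB and enhanced depth features."""
    enhanced = enhance_depth(depth_feat)
    return [[[r + d for r, d in zip(r_row, d_row)]
             for r_row, d_row in zip(r_ch, d_ch)]
            for r_ch, d_ch in zip(rgb_feat, enhanced)]


# Toy features with 2 channels and 2 x 2 spatial resolution.
rgb = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [0.5, 0.5]]]
depth = [[[0.0, 1.0], [2.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]]
fused = fuse(rgb, depth)
```

Because both attention weights lie strictly between 0 and 1, the depth features are only ever attenuated in this sketch, so a depth map with no response leaves the RGB features unchanged.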
