Learning Adaptive Fusion Bank for Multi-modal Salient Object Detection

Kunpeng Wang; Zhengzheng Tu; Chenglong Li; Cheng Zhang; Bin Luo

TL;DR

This work introduces the Learning Adaptive Fusion Bank (LAFB) to handle the multiple challenges of multi-modal salient object detection (MSOD) simultaneously. It decomposes fusion into five specialized schemes, one each for center bias, scale variation, image clutter, low illumination, and thermal crossover or depth ambiguity, and combines them through an adaptive ensemble module (AEM) to form an adaptive fusion bank (AFB) that fuses RGB with depth or thermal features in the encoder. An indirect interactive guidance module (IIGM) further integrates high-level semantics with low-level details to segment hollow objects accurately. Extensive experiments on RGBD and RGBT datasets demonstrate state-of-the-art performance, ablations confirm the contributions of the AFB, AEM, and IIGM, and the bank can be extended with further fusion schemes to cover additional challenges.

Abstract

Multi-modal salient object detection (MSOD) aims to boost saliency detection performance by integrating visible sources with depth or thermal infrared ones. Existing methods generally design different fusion schemes to handle certain issues or challenges. Although these fusion schemes are effective at addressing specific issues or challenges, they may struggle to handle multiple complex challenges simultaneously. To solve this problem, we propose a novel adaptive fusion bank that makes full use of the complementary benefits of a set of basic fusion schemes to handle different challenges simultaneously for robust MSOD. We focus on five major challenges in MSOD, namely center bias, scale variation, image clutter, low illumination, and thermal crossover or depth ambiguity. The proposed fusion bank consists of five representative fusion schemes, each designed around the characteristics of one challenge. The bank is scalable, and more fusion schemes could be incorporated into it to cover more challenges. To adaptively select the appropriate fusion scheme for a multi-modal input, we introduce an adaptive ensemble module that, together with the fusion schemes, forms the adaptive fusion bank, which is embedded into hierarchical layers for sufficient fusion of the different source data. Moreover, we design an indirect interactive guidance module to accurately detect salient hollow objects via the skip integration of high-level semantic information and low-level spatial details. Extensive experiments on three RGBT datasets and seven RGBD datasets demonstrate that the proposed method achieves outstanding performance compared with state-of-the-art methods. The code and results are available at https://github.com/Angknpng/LAFB.
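The idea of adaptively combining the outputs of several fusion schemes can be illustrated with a minimal sketch. It is not the authors' implementation (that is in the repository linked above); it assumes, purely for illustration, that the ensemble is a softmax-weighted sum of the five scheme outputs. The names `SCHEMES`, `softmax`, and `adaptive_ensemble`, the flat-list feature representation, and the per-scheme scores are all hypothetical.

```python
import math
from typing import Dict, List

# Hypothetical short names for the five challenge-specific schemes:
# center bias, scale variation, image clutter, low illumination,
# thermal crossover / depth ambiguity.
SCHEMES = ["cb", "sv", "ic", "li", "td"]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_ensemble(
    scheme_outputs: Dict[str, List[float]],
    scores: Dict[str, float],
) -> List[float]:
    """Fuse per-scheme features with softmax weights (an assumed gating rule).

    scheme_outputs maps a scheme name to a flattened feature vector;
    scores maps the same names to an unnormalized relevance score that
    a learned module would predict from the input.
    """
    names = list(scheme_outputs)
    weights = softmax([scores[name] for name in names])
    length = len(scheme_outputs[names[0]])
    fused = [0.0] * length
    for name, weight in zip(names, weights):
        feature = scheme_outputs[name]
        if len(feature) != length:
            raise ValueError(f"feature length mismatch for scheme {name!r}")
        for index, value in enumerate(feature):
            fused[index] += weight * value
    return fused


if __name__ == "__main__":
    # Toy features: scheme k outputs a constant vector of value k.
    outputs = {name: [float(k)] * 3 for k, name in enumerate(SCHEMES)}
    # A strongly favored low-illumination score pulls the result toward "li".
    scores = {name: 0.0 for name in SCHEMES}
    scores["li"] = 10.0
    print(adaptive_ensemble(outputs, scores))
```

With equal scores the sketch reduces to a plain average of the scheme outputs, while a dominant score makes the fused feature follow a single scheme, which is the selection behavior the abstract describes in general terms.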

Paper Structure (15 sections, 15 equations, 10 figures, 10 tables)

Introduction
Related Work
RGBT Salient Object Detection
RGBD Salient Object Detection
The Proposed Method
Overview
Adaptive Fusion Bank
Indirect Interactive Guidance Module
Loss Function
Experiment
Experimental Setup
Comparison with the State-of-the-arts
Ablation Study
Failure Cases and Analyses
Conclusion

Figures (10)

Figure 1: $\textbf{Top}$: Comparison of our LAFB against a state-of-the-art RGBT SOD method, e.g., CGFNet wang2021cgfnet. $\textbf{Bottom}$: Comparison of our LAFB against a state-of-the-art RGBD SOD method, e.g., SPNet zhou2021specificity. The results show that our LAFB can simultaneously handle both modality-shared challenges (i.e., center bias, scale variation, and image clutter) and modality-specific challenges (i.e., low illumination, thermal crossover or depth ambiguity), as well as accurately segment hollow objects. Each compared method can tackle only certain challenges. For example, CGFNet handles the center bias challenge well but fails on the low illumination and thermal crossover challenges, and SPNet handles the scale variation challenge well but fails on the image clutter and depth ambiguity challenges. Neither of them can cope with hollow objects.
Figure 2: Overview of our proposed learning adaptive fusion bank (LAFB). We first send the extracted multi-modal features (i.e., $f_i^r$ and $f_i^{t/d}$) into the adaptive fusion bank (AFB), which contains five specific fusion schemes (i.e., $Fu{s^{cb}}$, $Fu{s^{sv}}$, $Fu{s^{ic}}$, $Fu{s^{li}}$ and $Fu{s^{td}}$) to generate multi-level features (i.e., ${FB_i}$) for corresponding challenges. Then, the multi-level features are fed into the indirect interactive guidance module (IIGM) to integrate high-level and low-level features smoothly. After that, the generated features (i.e., $I_i$) are fed into the RFB module to increase the receptive field of features. Finally, multi-level saliency maps (i.e., ${S_i}$) are inferred in a top-down manner in the decoder, and ${S_2}$ is taken as the final saliency map.
Figure 3: Architectures of the fusion schemes in the adaptive fusion bank. $f_i^A$ represents the concatenation of multi-modal features (i.e., $f_i^{t/d}$ and $f_i^r$) extracted by the backbone. $f_i^{cb}$, $f_i^{sv}$, $f_i^{ic}$, $f_i^{li}$ and $f_i^{td}$ are generated by the corresponding fusion schemes for different challenges.
Figure 4: Visualization of feature maps generated by the different fusion schemes and the adaptive ensemble module in the ${5^{th}}$ adaptive fusion bank. From the ${1^{st}}$ row to the ${4^{th}}$ row, the challenges are low illumination, thermal crossover, depth ambiguity and image clutter, respectively. The saliency maps indicate that our method is able to deal with multiple challenges simultaneously, while the compared methods (i.e., ECFFNet zhou2021ecffnet and EBFSP huang2021employing) fail on some challenges, such as those in the first and third rows.
Figure 5: Weights assigned to different fusion schemes when training on different challenge data, which are all obtained based on the challenge annotation of the training set of VT5000. The x-axis and y-axis represent epochs and weights, respectively. The values of columns in different colors indicate the weight assigned to different fusion schemes during training.
...and 5 more figures
