Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection

Hao Tang, Zechao Li, Dong Zhang, Shengfeng He, Jinhui Tang

TL;DR

This work addresses RGB-T Salient Object Detection (SOD) under challenging, noisy conditions by introducing ConTriNet, a confluent triple-flow network that divides the task into modality-specific mining and modality-complementary integration. A modality-shared union encoder with a Modality-induced Feature Modulator (MFM) processes both RGB and thermal data. Two modality-specific flows and one modality-complementary flow then decode the features: the separated flows use a Residual Atrous Spatial Pyramid Module (RASPM) to capture multi-scale context, and the modality-complementary flow uses a Modality-aware Dynamic Aggregation Module (MDAM) to aggregate saliency cues from the two modality-specific flows. A flow-cooperative fusion strategy yields a high-quality, full-resolution saliency map, and extensive experiments on public benchmarks plus the new VT-IMAG dataset demonstrate state-of-the-art performance and strong robustness to unknown challenging scenarios. VT-IMAG and the ConTriNet architecture offer a practical path toward robust RGB-T SOD in real-world applications, with potential extensions to other RGB-X modalities.

Abstract

RGB-Thermal Salient Object Detection aims to pinpoint prominent objects within aligned pairs of visible and thermal infrared images. Traditional encoder-decoder architectures, while designed for cross-modality feature interactions, may not have adequately considered the robustness against noise originating from defective modalities. Inspired by hierarchical human visual systems, we propose ConTriNet, a robust Confluent Triple-Flow Network employing a Divide-and-Conquer strategy. Specifically, ConTriNet comprises three flows: two modality-specific flows explore cues from RGB and Thermal modalities, and a third modality-complementary flow integrates cues from both modalities. ConTriNet presents several notable advantages. It incorporates a Modality-induced Feature Modulator in the modality-shared union encoder to minimize inter-modality discrepancies and mitigate the impact of defective samples. Additionally, a foundational Residual Atrous Spatial Pyramid Module in the separated flows enlarges the receptive field, allowing for the capture of multi-scale contextual information. Furthermore, a Modality-aware Dynamic Aggregation Module in the modality-complementary flow dynamically aggregates saliency-related cues from both modality-specific flows. Leveraging the proposed parallel triple-flow framework, we further refine saliency maps derived from different flows through a flow-cooperative fusion strategy, yielding a high-quality, full-resolution saliency map for the final prediction. To evaluate the robustness and stability of our approach, we collect a comprehensive RGB-T SOD benchmark, VT-IMAG, covering various real-world challenging scenarios. Extensive experiments on public benchmarks and our VT-IMAG dataset demonstrate that ConTriNet consistently outperforms state-of-the-art competitors in both common and challenging scenarios.
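The abstract states that atrous (dilated) convolutions in the RASPM enlarge the receptive field. As a rough illustration of why, the sketch below computes the span covered by a dilated kernel using the standard relation k_eff = k + (k - 1)(d - 1). The dilation rates and the helper names are assumptions chosen for illustration; this summary does not give the rates the paper actually uses.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span covered along one axis by a single dilated (atrous) kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(layers):
    """Receptive field of stride-1 conv layers applied one after another.

    `layers` is a list of (kernel_size, dilation) pairs.
    """
    field = 1
    for kernel_size, dilation in layers:
        # Each stride-1 layer widens the field by (effective kernel size - 1).
        field += effective_kernel_size(kernel_size, dilation) - 1
    return field


# Hypothetical dilation rates for parallel 3x3 branches (not taken from the paper).
ASSUMED_DILATIONS = (1, 2, 4, 8)
branch_spans = {d: effective_kernel_size(3, d) for d in ASSUMED_DILATIONS}
```

With these assumed rates, the parallel branches see neighborhoods of different widths while each keeps the parameter count of a 3x3 kernel, which is the general idea behind a spatial pyramid of atrous convolutions.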

Paper Structure

This paper contains 29 sections, 10 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: Comparison of established RGB-T SOD network architectures, (a) single-flow and (b) dual-flow, with the proposed triple-flow architecture in (c). The triple-flow paradigm adopts a divide-and-conquer strategy that explores modality-specific cues in depth while effectively fusing modality-complementary information, and thus handles various challenging scenarios well (see (d)). MIA [liang2022multi] and MIDD [MIDD_21] are representative methods of (a) and (b), respectively.
  • Figure 2: Overview of the proposed Confluent Triple-Flow Network (ConTriNet), which adopts an efficient "Divide-and-Conquer" strategy. ConTriNet comprises three main flows: a modality-complementary flow that predicts a modality-complementary saliency map, and two modality-specific flows that predict RGB- and Thermal-specific saliency maps, respectively. The two modality-specific flows share the parameters of the union encoder, and the overall framework can be trained end-to-end.
  • Figure 3: Structure diagram of the Modality-induced Feature Modulator (MFM).
  • Figure 4: Visualization of the feature evolution in the Modality-induced Feature Modulator (MFM).
  • Figure 5: Structure diagram of the Residual Atrous Spatial Pyramid Module (RASPM).
  • ...and 7 more figures
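
The data flow described in the Figure 2 caption (one parameter-shared encoder, two modality-specific flows, one modality-complementary flow) can be sketched in a few lines of plain Python. This is a toy under loud assumptions, not the authors' implementation: every function below is a hypothetical stand-in, the softmax weighting only mimics the idea of dynamic aggregation, and the final plain average is an assumed placeholder for the flow-cooperative fusion, whose details are not given in this summary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shared_encoder(pixels, weight=0.5, bias=0.1):
    """Toy stand-in for the parameter-shared union encoder (affine map + ReLU)."""
    return [max(0.0, weight * p + bias) for p in pixels]


def specific_flow(features):
    """Toy modality-specific head: one saliency score in (0, 1) per position."""
    return [sigmoid(f) for f in features]


def complementary_flow(rgb_feat, thermal_feat):
    """Toy dynamic aggregation: per-position softmax weights over both modalities."""
    fused = []
    for r, t in zip(rgb_feat, thermal_feat):
        exp_r, exp_t = math.exp(r), math.exp(t)
        w_rgb = exp_r / (exp_r + exp_t)
        fused.append(w_rgb * r + (1.0 - w_rgb) * t)
    return [sigmoid(f) for f in fused]


def predict(rgb, thermal):
    # The same encoder (same weight and bias) processes both modalities.
    rgb_feat = shared_encoder(rgb)
    thermal_feat = shared_encoder(thermal)
    maps = (
        specific_flow(rgb_feat),
        specific_flow(thermal_feat),
        complementary_flow(rgb_feat, thermal_feat),
    )
    # Assumed fusion: plain average of the three per-flow saliency maps.
    return [sum(values) / 3.0 for values in zip(*maps)]


saliency = predict([0.2, 0.9, 0.0], [0.8, 0.1, 0.0])
```

Calling the same `shared_encoder` on both inputs is what "shares parameters" amounts to in code; in a real network the three flows would be learned decoders trained end-to-end rather than fixed functions.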