Part-Whole Relational Fusion Towards Multi-Modal Scene Understanding

Yi Liu, Chengxin Li, Shoukun Xu, Jungong Han

TL;DR

This framework treats multi-modal fusion as part-whole relational fusion, and routes multiple individual part-level modalities to a fused whole-level modality using the part-whole relational routing ability of Capsule Networks (CapsNets).

Abstract

Multi-modal fusion plays a vital role in multi-modal scene understanding. Most existing methods focus on cross-modal fusion involving two modalities, often overlooking more complex multi-modal fusion, which is essential for real-world applications such as autonomous driving, where visible, depth, event, LiDAR, and other modalities are used. Moreover, the few existing attempts at multi-modal fusion, e.g., simple concatenation, cross-modal attention, and token selection, cannot fully exploit the intrinsic shared and specific details of multiple modalities. To tackle this challenge, we propose a Part-Whole Relational Fusion (PWRF) framework. For the first time, this framework treats multi-modal fusion as part-whole relational fusion. It routes multiple individual part-level modalities to a fused whole-level modality using the part-whole relational routing ability of Capsule Networks (CapsNets). Through this part-whole routing, our PWRF generates modal-shared semantics from the whole-level modal capsules and modal-specific semantics from the routing coefficients. These modal-shared and modal-specific details can then be applied to multi-modal scene understanding tasks, which in this paper include synthetic multi-modal segmentation and visible-depth-thermal salient object detection. Experiments on several datasets demonstrate the superiority of the proposed PWRF framework for multi-modal scene understanding. The source code has been released at https://github.com/liuyi1989/PWRF.
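
To make the part-whole routing idea in the abstract concrete, the following is a minimal NumPy sketch of generic routing-by-agreement between capsules. It is not the released PWRF code: it assumes the standard dynamic routing scheme for CapsNets, and the function names (`squash`, `route_parts_to_whole`), the one-capsule-per-modality layout, and all shapes are hypothetical choices made for illustration. The whole-level capsules stand in for the modal-shared output, and the per-modality coupling coefficients stand in for the signal from which modal-specific semantics would be derived.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Capsule non-linearity: keeps direction, maps vector length into [0, 1)."""
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def route_parts_to_whole(part_caps, weights, n_iters=3):
    """Generic routing-by-agreement (an assumption, not the paper's exact routing).

    part_caps: (M, D_in)           one part-level capsule per modality (hypothetical)
    weights:   (M, K, D_in, D_out) transform from part capsule i to whole capsule j
    returns:   whole capsules (K, D_out) and coupling coefficients (M, K)
    """
    # Each part capsule casts a vote for every whole capsule.
    votes = np.einsum("md,mkde->mke", part_caps, weights)
    logits = np.zeros(votes.shape[:2])
    for _ in range(n_iters):
        # Each part distributes its contribution over the whole capsules.
        coupling = softmax(logits, axis=1)
        # Whole capsules are squashed, coupling-weighted sums of the votes.
        whole = squash(np.einsum("mk,mke->ke", coupling, votes))
        # Votes that agree with a whole capsule get a larger logit next round.
        logits = logits + np.einsum("mke,ke->mk", votes, whole)
    return whole, coupling

# Toy usage: 4 modalities (e.g., visible, depth, event, LiDAR), 2 whole capsules.
rng = np.random.default_rng(0)
M, K, D_in, D_out = 4, 2, 8, 16
parts = rng.normal(size=(M, D_in))
W = rng.normal(scale=0.1, size=(M, K, D_in, D_out))
whole, coupling = route_parts_to_whole(parts, W)
print(whole.shape, coupling.shape)
```

In this sketch each row of `coupling` sums to 1, so it can be read as how strongly one modality agrees with each fused whole-level capsule. How PWRF actually splits the routing coefficients into modal-specific semantics is described in the paper and is not reproduced here.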

Paper Structure

This paper contains 39 sections, 37 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: Comparison of different multi-modal fusion methods. (a) Multi-modal fusion via concatenation. (b) Multi-modal fusion through parallel cross-attention that attends to the primary modality. (c) Multi-modal fusion via a selection mechanism. (d) Our multi-modal fusion via part-whole relational routing, which generates modal-shared and modal-specific details.
  • Figure 2: Visualization for the split of routing coefficients.
  • Figure 3: SMM semantic segmentation framework based on PWRF. There are four stages with features and outputs at different scales. Note that we use the same stage 1 as in [zhang2023delivering] because of the heavy computation of DCR. In stages 2-4, our PWRF models modal-shared and modal-specific details of different auxiliary modalities, which are further integrated with the primary RGB modality. The outputs of the four stages are fed to the SegFormer head [xie2021segformer] for semantic segmentation.
  • Figure 4: VDT salient object detection framework based on PWRF. Swin-Transformer [liu2021swin] is used to learn the backbone features of the three modalities, which are then fed into our PWRF to obtain modal-shared and modal-specific semantics. After that, a stacking adjacent-scale attention decoder is designed to integrate modal-shared/specific semantics at different scales. The predictions of these two sub-decoders are combined to produce the final saliency map.
  • Figure 5: Adjacent-scale Attention Module, which is composed of three components: adjacent-scale integration, dual-branch attention, and selective aggregation.
  • ...and 4 more figures