
CrossWeaver: Cross-modal Weaving for Arbitrary-Modality Semantic Segmentation

Zelin Zhang, Kedi Li, Huiqi Liang, Tao Zhang, Chuanzhi Xu

Abstract

Multimodal semantic segmentation has shown great potential in leveraging complementary information across diverse sensing modalities. However, existing approaches often rely on carefully designed fusion strategies that either employ modality-specific adaptations or depend on loosely coupled interactions, limiting flexibility and weakening cross-modal coordination. Moreover, these methods often struggle to balance efficient information exchange with preserving the unique characteristics of each modality across different modality combinations. To address these challenges, we propose CrossWeaver, a simple yet effective multimodal fusion framework for arbitrary-modality semantic segmentation. Its core is a Modality Interaction Block (MIB), which enables selective and reliability-aware cross-modal interaction within the encoder, while a lightweight Seam-Aligned Fusion (SAF) module further aggregates the enhanced features. Extensive experiments on multiple multimodal semantic segmentation benchmarks demonstrate that our framework achieves state-of-the-art performance with minimal additional parameters and strong generalization to unseen modality combinations.

Paper Structure

This paper contains 28 sections, 16 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Performance Comparison across Different Methods. (a) Results on the MCubeS [liang2022mcubes] Dataset. (b) Results on the DeLiVER [zhang2023cmnext] Dataset.
  • Figure 2: Overall framework of CrossWeaver, consisting of a shared hierarchical encoder and two plug-and-play modules: (a) Overall Architecture, (b) Modality Interaction Block (MIB) for Reliability-Aware Cross-Modal Encoding, and (c) Seam-Aligned Fusion (SAF) for Boundary Preserving Feature Fusion.
  • Figure 3: Qualitative Visualization of MIB and SAF in CrossWeaver on the MCubeS Dataset. From left to right are the input image, the feature response after MIB, the feature response after SAF, and the final prediction.
  • Figure 4: Visualization of CrossWeaver (MiT-B0 Backbone) on the MCubeS [liang2022mcubes] Dataset.
  • Figure 5: Additional visualizations on MCubeS under missing-modality conditions.
  • ...and 2 more figures