SegRGB-X: General RGB-X Semantic Segmentation Model

Jiong Liu, Yingjie Xu, Xingcheng Zhou, Rui Song, Walter Zimmer, Alois Knoll, Hu Cao

Abstract

Semantic segmentation across arbitrary sensor modalities faces significant challenges due to diverse sensor characteristics, and conventional modality-specific configurations result in redundant development effort. We address these challenges by introducing a universal arbitrary-modal semantic segmentation framework that unifies segmentation across multiple modalities. Our approach features three key innovations: (1) the Modality-aware CLIP (MA-CLIP), which provides modality-specific scene understanding guidance through LoRA fine-tuning; (2) Modality-aligned Embeddings for capturing fine-grained features; and (3) the Domain-specific Refinement Module (DSRM) for dynamic feature adjustment. Evaluated on five diverse datasets with different complementary modalities (event, thermal, depth, polarization, and light field), our model surpasses specialized multi-modal methods and achieves state-of-the-art performance with an mIoU of 65.03%. The code will be released upon acceptance.
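
As a rough illustration of the LoRA fine-tuning mentioned above, the sketch below wraps a frozen linear projection (standing in for a CLIP projection layer) with a trainable low-rank update. The class name, rank, and scaling are illustrative assumptions, not the paper's MA-CLIP implementation.

```python
# Minimal LoRA sketch: a frozen nn.Linear plus a trainable low-rank update.
# All names and hyperparameters here are hypothetical.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Computes base(x) + (alpha / rank) * B(A(x)) with the base layer frozen."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)   # down-projection A
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)  # up-projection B
        nn.init.zeros_(self.lora_b.weight)        # zero init: no change before training
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


# Usage: wrap a projection (standing in for a frozen CLIP layer) so that only
# the low-rank adapters are updated during modality-specific fine-tuning.
proj = nn.Linear(512, 512)
adapted = LoRALinear(proj, rank=8)
out = adapted(torch.randn(4, 512))
print(out.shape)  # torch.Size([4, 512])
```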

Paper Structure

This paper contains 34 sections, 4 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: Performance comparison between our method and CMX [cmx], CMNeXt [cmnext], GeminiFusion [gemini], and StitchFusion [li2024stitchfusion] on five multi-modal semantic segmentation datasets: DeLiVER [cmnext], MFNet [mfnet], NYUDepthV2 [nyu], ZJU RGB-P [rgbp], and UrbanLF [urbanlf]. Our general model, SegRGB-X, achieves the best overall performance.
  • Figure 2: Overall framework of our SegRGB-X model. At each stage, feature embeddings from the MA-CLIP are incorporated into the modality-aligned embedding to enhance feature representations. The input embeddings are processed using shared-weight Transformer blocks [cmnext_80], enabling consistent and efficient feature extraction across modalities. The extracted features are progressively fused through the FRM and FFM modules, as introduced in [cmnext]. In the final stage, a Domain-Specific Refinement Module (DSRM) is employed to further refine the modality-specific features. Lastly, the segmentation head processes the fused features to generate the final predictions.
  • Figure 3: Structure of our MA-CLIP. MA-CLIP enhances the standard CLIP architecture [clip] by incorporating modality-aware cross-modal learning between textual and visual representations.
  • Figure 4: Domain-Specific Refinement Module (DSRM). It contains two identical DSRM blocks with shared weights, each processing a different modality pair ($F_4^r$, $S^r$) and ($F_4^m$, $S^m$) to produce enhanced features ($F^r_s$ and $F^m_s$); a hedged weight-sharing sketch follows this figure list.
  • Figure 5: t-SNE visualization of modality embeddings from MA-CLIP. Each cluster corresponds to a specific modality, showing clear separation and highlighting the model’s strong ability to extract distinctive modality-specific representations.
  • ...and 6 more figures
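
The weight sharing described in the Figure 4 caption can be illustrated with a short sketch: a single DSRM block is instantiated once and applied to both modality pairs. Only this reuse of one set of weights across the ($F_4^r$, $S^r$) and ($F_4^m$, $S^m$) pairs follows the caption; the gated-fusion body below is an assumption made purely for illustration.

```python
# Hypothetical DSRM-style block; the internal structure is an assumption,
# only the weight sharing across the two modality pairs follows the caption.
import torch
import torch.nn as nn


class DSRMBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, kernel_size=1), nn.Sigmoid())

    def forward(self, feat: torch.Tensor, guide: torch.Tensor) -> torch.Tensor:
        fused = self.fuse(torch.cat([feat, guide], dim=1))  # mix feature with its guidance map
        return feat + self.gate(fused) * fused              # gated residual refinement


block = DSRMBlock(channels=64)                    # one set of weights ...
f4_r, s_r = torch.randn(2, 64, 16, 16), torch.randn(2, 64, 16, 16)
f4_m, s_m = torch.randn(2, 64, 16, 16), torch.randn(2, 64, 16, 16)
f_r_s = block(f4_r, s_r)                          # ... applied to the RGB pair
f_m_s = block(f4_m, s_m)                          # ... and reused for the X-modality pair
```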