Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities

Xu Zheng, Yuanhuiyi Lyu, Lin Wang

TL;DR

The paper tackles modality-incomplete semantic segmentation by learning a modality-agnostic representation guided by multi-modal vision-language models (MVLMs). It introduces a language-guided semantic correlation distillation (LSCD) module, which distills inter- and intra-modal semantic knowledge from MVLMs, and a modality-agnostic feature fusion (MFF) module, which fuses multi-modal features into a unified representation, enabling robust segmentation from any combination of modalities. On DELIVER and MCubeS, Any2Seg achieves state-of-the-art results in the multi-modal setting and large gains in modality-incomplete scenarios, demonstrating strong robustness to sensor failures and adverse conditions. The work notes limitations in pixel-level MVLM binding and in sparse-input performance, and points to pixel-wise MVLMs and further module enhancements as future directions.

Abstract

The image modality is imperfect: it often fails under certain conditions, e.g., at night or during fast motion. This significantly limits the robustness and versatility of existing multi-modal (i.e., Image+X) semantic segmentation methods when a modality is absent or fails, as often occurs in real-world applications. Inspired by the open-world learning capability of multi-modal vision-language models (MVLMs), we explore a new direction: learning a modality-agnostic representation via knowledge distillation (KD) from MVLMs. To this end, we propose Any2Seg, a novel framework that achieves robust segmentation from any combination of modalities under any visual conditions. Specifically, we first introduce a novel language-guided semantic correlation distillation (LSCD) module to transfer both inter-modal and intra-modal semantic knowledge in the embedding space from MVLMs, e.g., LanguageBind. This enables us to minimize the modality gap and alleviate semantic ambiguity, so that any modalities can be combined under any visual conditions. Then, we introduce a modality-agnostic feature fusion (MFF) module that reweights the multi-modal features based on the inter-modal correlation and selects fine-grained features. In this way, Any2Seg yields an optimal modality-agnostic representation. Extensive experiments on two benchmarks with four modalities demonstrate that Any2Seg achieves state-of-the-art performance under the multi-modal setting (+3.54 mIoU) and excels in the challenging modality-incomplete setting (+19.79 mIoU).
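The two ideas named in the abstract, matching correlation structure to a teacher and reweighting modalities by their inter-modal correlation, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the use of cosine similarity, the mean-squared-error loss, and the softmax weighting are all assumptions chosen for illustration, and random arrays stand in for real MVLM embeddings.

```python
import numpy as np

def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of x (shape: n x d)."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

def correlation_distillation_loss(student, teacher):
    """Assumed loss: MSE between student and teacher correlation matrices.

    Only the n x n correlation structure is compared, so the student and
    teacher embedding dimensions are allowed to differ.
    """
    diff = cosine_similarity_matrix(student) - cosine_similarity_matrix(teacher)
    return float(np.mean(diff ** 2))

def reweight_and_fuse(features):
    """Assumed fusion: softmax weights from each modality's mean correlation."""
    corr = cosine_similarity_matrix(features)   # (m, m) inter-modal correlation
    scores = corr.mean(axis=1)                  # agreement with the other modalities
    weights = np.exp(scores - scores.max())     # numerically stable softmax
    weights /= weights.sum()
    return weights, weights @ features          # weights: (m,), fused: (d,)

# Hypothetical data: one row per modality (e.g., RGB, depth, LiDAR, event).
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 16))  # stand-in for MVLM embeddings
student = rng.normal(size=(4, 8))   # stand-in for segmentation-backbone features

loss = correlation_distillation_loss(student, teacher)
weights, fused = reweight_and_fuse(student)
```

Comparing correlation matrices rather than raw embeddings is one common way to distill relational knowledge when the teacher and student feature spaces have different sizes; whether LSCD uses exactly this form is not stated in the text above.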

Paper Structure (14 sections, 8 equations, 5 figures, 8 tables)

This paper contains 14 sections, 8 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: (a) Our Any2Seg aims to learn a modality-agnostic representation for robust segmentation from any modalities guided by multi-modal VLMs; (b) Performance comparison between CMNeXt (Zhang et al., 2023) and our Any2Seg under the multi-modal semantic segmentation (MSS) and modality-incomplete semantic segmentation (MISS) settings.
  • Figure 2: The overall framework of Any2Seg, incorporating the LSCD and MFF modules to learn an optimal modality-agnostic representation for robust segmentation.
  • Figure 3: We use the Modality-agnostic Feature Fusion Module to obtain modality-agnostic features for multi-modal segmentation.
  • Figure 4: System-level modality-incomplete semantic segmentation (MISS) validation results on the DELIVER dataset (D: depth; L: LiDAR; E: event).
  • Figure 5: Visualization of multi-modal features under different conditions on DELIVER: (a) sun with over-exposure, and (b) night with under-exposure.
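
As a rough illustration of what a modality-incomplete evaluation involves, the sketch below enumerates every non-empty combination of four modalities and computes mIoU from a confusion matrix. The modality names follow the abbreviations in Figure 4 plus RGB; the helper names and the toy confusion matrix are hypothetical, and the exact evaluation protocol used in the paper is not described in this summary.

```python
from itertools import combinations

import numpy as np

MODALITIES = ["RGB", "Depth", "LiDAR", "Event"]

def modality_subsets(modalities):
    """All non-empty combinations of the given modalities."""
    return [
        combo
        for size in range(1, len(modalities) + 1)
        for combo in combinations(modalities, size)
    ]

def mean_iou(confusion):
    """mIoU from a square confusion matrix (rows: ground truth, cols: prediction).

    Classes that appear in neither the ground truth nor the predictions
    are skipped.
    """
    confusion = np.asarray(confusion, dtype=float)
    true_positive = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - true_positive
    valid = union > 0
    return float(np.mean(true_positive[valid] / union[valid]))

subsets = modality_subsets(MODALITIES)
print(len(subsets))  # 15 combinations for four modalities

# Toy two-class confusion matrix (hypothetical numbers, not from the paper).
toy_confusion = [[2, 0],
                 [1, 1]]
toy_miou = mean_iou(toy_confusion)
```

In a setup like this, a model would be evaluated once per subset, with the missing modalities withheld at test time, and the per-subset mIoU values could then be averaged or reported individually.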