Fourier Prompt Tuning for Modality-Incomplete Scene Segmentation

Ruiping Liu; Jiaming Zhang; Kunyu Peng; Yufan Chen; Ke Cao; Junwei Zheng; M. Saquib Sarfraz; Kailun Yang; Rainer Stiefelhagen


TL;DR

This work tackles Modality-Incomplete Scene Segmentation (MISS) by introducing a Missing-aware Modal Switch (MMS), a compact bitwise mechanism that randomizes modality availability during training and thereby improves robustness without excessive data or parameter costs. To enable robust adaptation under MISS with few tunable parameters, the authors propose Fourier Prompt Tuning (FPT), which injects global spectral information through FFT-enhanced prompts that interact with feature tokens via cross-attention and are integrated into a parameter-efficient backbone. Experiments on DeLiVER and Cityscapes show that FPT with MMS achieves gains of up to $5.84\%$ mIoU under missing modalities and consistently outperforms strong baselines in both complete and incomplete modality settings, while tuning only about $1.1\%$ of the parameters. The approach advances practical, robust multi-modal perception for autonomous systems, and the code is publicly released for reproducibility.

Abstract

Integrating information from multiple modalities enhances the robustness of scene perception systems in autonomous vehicles, providing a more comprehensive and reliable sensory framework. However, modality incompleteness in multi-modal segmentation remains under-explored. In this work, we establish a task called Modality-Incomplete Scene Segmentation (MISS), which encompasses both system-level modality absence and sensor-level modality errors. To avoid reliance on a predominant modality in multi-modal fusion, we introduce a Missing-aware Modal Switch (MMS) strategy to proactively manage missing modalities during training. Its bit-level batch-wise sampling enhances the model's performance in both complete and incomplete testing scenarios. Furthermore, we introduce the Fourier Prompt Tuning (FPT) method to incorporate representative spectral information into a limited number of learnable prompts that maintain robustness against all MISS scenarios. FPT achieves effects akin to fine-tuning but with fewer tunable parameters (1.1%). Extensive experiments demonstrate the efficacy of our proposed approach, showing an improvement of 5.84% mIoU over the prior state-of-the-art parameter-efficient methods under missing modalities. The source code is publicly available at https://github.com/RuipingL/MISS.
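The bit-level batch-wise sampling mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the bit layout, the names `MODALITIES`, `DENSE_BITS`, `valid_switches`, `sample_switch`, and `apply_switch`, and the choice to replace a missing modality with zeros are all assumptions made for illustration. The only constraint taken from the source is the one in the Figure 5 caption, namely that at least one dense modality (RGB or Depth) is retained during training.

```python
import random

# Hypothetical bit layout: bit i corresponds to MODALITIES[i].
MODALITIES = ("rgb", "depth", "lidar", "event")
DENSE_BITS = 0b0011  # rgb and depth are treated as the dense modalities


def valid_switches():
    """Return all bitmasks that keep at least one dense modality."""
    return [m for m in range(1 << len(MODALITIES)) if m & DENSE_BITS]


def sample_switch(rng):
    """Draw one switch for a whole batch (batch-wise, not per sample)."""
    return rng.choice(valid_switches())


def apply_switch(batch, switch):
    """Replace each switched-off modality with zeros (an assumed placeholder)."""
    out = {}
    for i, name in enumerate(MODALITIES):
        present = (switch >> i) & 1
        out[name] = batch[name] if present else [0.0] * len(batch[name])
    return out


rng = random.Random(0)
batch = {name: [1.0, 2.0, 3.0] for name in MODALITIES}  # toy stand-in for tensors
switch = sample_switch(rng)
masked = apply_switch(batch, switch)
```

With four modalities there are 16 possible bitmasks, of which 4 drop both dense modalities, so this sketch samples from the remaining 12. Encoding availability as a single integer keeps the switch compact and makes the dense-modality constraint a one-line bitwise test.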


Paper Structure (14 sections, 9 equations, 8 figures, 6 tables)


Introduction
Related Work
Multi-Modal Semantic Segmentation
Missing Modality
Parameter-Efficient Learning
Methodology
Missing-aware Modal Switch
Fourier Prompt Tuning
Experiments
Datasets
Implementation Details
Comparison with the State of the Art
Ablation Studies
Conclusions

Figures (8)

Figure 1: Modality-Incomplete Semantic Segmentation (MISS) aims to cover (a) modality-incomplete scenarios, e.g., in intelligent vehicles. (b) Predominant modality missing leads to severe performance degradation in models trained on complete data.
Figure 2: Individual prompts [lee2023missing_prompt]
Figure 3: Fourier Prompt
Figure 5: Missing-aware Modal Switch (MMS) method to manage the absence of dense (e.g., RGB and Depth) and/or sparse (e.g., LiDAR and Event) modalities. Because the task is dense prediction, at least one dense modality is retained during training, while modalities are complete during validation and incomplete during testing. The overline on a modality, e.g., $\overline{\mathrm{R}}$, means that it is missing.
Figure 6: Fourier Prompt Tuning module. Through the Fast Fourier Transform (FFT) and the interaction with spatial tokens (e.g., RGB-Depth), the resulting prompt, although it has limited tunable parameters, contains both spectral and spatial information to robustify the frozen model in the modality-incomplete context.
...and 3 more figures
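
The Figure 6 caption describes a prompt that combines spectral information from an FFT with spatial information obtained by interacting with feature tokens. A minimal NumPy sketch of one plausible reading follows. It is not the authors' module: the function name `fourier_prompt`, the use of the real part of a 2D FFT, the unparameterized single-head cross-attention, the additive combination of the two branches, and the final concatenation are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fourier_prompt(prompts, tokens):
    """Combine a spectral and a spatial view of the prompts.

    prompts: (P, D) array standing in for the learnable prompts.
    tokens:  (N, D) array standing in for frozen backbone features.
    """
    # Spectral branch (assumed): 2D FFT over the prompt and channel axes,
    # keeping only the real part so the result stays real-valued.
    spectral = np.fft.fft2(prompts).real

    # Spatial branch (assumed): prompts act as queries, tokens as keys and
    # values, in a single-head cross-attention without learned projections.
    d = prompts.shape[-1]
    attn = softmax(prompts @ tokens.T / np.sqrt(d), axis=-1)  # (P, N)
    spatial = attn @ tokens  # (P, D)

    # Additive combination is an assumption of this sketch.
    return spectral + spatial


rng = np.random.default_rng(0)
prompts = rng.normal(size=(4, 8))   # 4 prompts of width 8
tokens = rng.normal(size=(16, 8))   # 16 feature tokens of width 8
enhanced = fourier_prompt(prompts, tokens)

# Prepend the enhanced prompts to the token sequence for the next frozen layer.
augmented = np.concatenate([enhanced, tokens], axis=0)
```

In this sketch only `prompts` would be trainable, which is consistent with the small tunable-parameter budget the abstract reports; the FFT itself adds no parameters.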
