Exploring Modality Guidance to Enhance VFM-based Feature Fusion for UDA in 3D Semantic Segmentation

Johannes Spoecklberger, Wei Lin, Pedro Hermosilla, Sivan Doveh, Horst Possegger, M. Jehanzeb Mirza

TL;DR

This work addresses unsupervised domain adaptation (UDA) for LiDAR-based 3D semantic segmentation by leveraging Vision Foundation Models (VFMs) for cross-modal information and introducing an adaptive fusion refinement network. The method uses a three-stream architecture (2D VFM, 3D backbone, and fusion branch) in which the fusion module is guided by environmental conditions to bias it toward the more reliable modality, using predictive alignment losses across modalities. Empirically, it delivers strong gains across four 3D UDA benchmarks (+6.5 mIoU on average over the previous state of the art), demonstrates cross-VFM generalization, and provides ablations validating the design, while acknowledging limitations related to fixed modality priors and proposing per-point reliability estimation as future work. Overall, the results highlight the practical potential of adaptive, VFM-supported cross-modal fusion for robust 3D perception under diverse environmental and sensor conditions.

Abstract

Vision Foundation Models (VFMs) have become a de facto choice for many downstream vision tasks, like image classification, image segmentation, and object localization. However, they can also provide significant utility for downstream 3D tasks that can leverage the cross-modal information (e.g., from paired image data). In our work, we further explore the utility of VFMs for adapting from a labeled source to unlabeled target data for the task of LiDAR-based 3D semantic segmentation. Our method consumes paired 2D-3D (image and point cloud) data and relies on the robust (cross-domain) features from a VFM to train a 3D backbone on a mix of labeled source and unlabeled target data. At the heart of our method lies a fusion network that is guided by both the image and point cloud streams, with their relative contributions adjusted based on the target domain. We extensively compare our proposed methodology with different state-of-the-art methods in several settings and achieve strong performance gains. For example, we achieve an average improvement of 6.5 mIoU (over all tasks) compared with the previous state of the art.

Paper Structure

This paper contains 21 sections, 6 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: (a) Cross-modal learning with frozen 2D (VFM) backbone features using a learned fusion representation. Fusion networks can lead to suboptimal feature utilization and unwanted modality bias on the target domain. Therefore, we propose regularizing the fusion by the most effective modality in a certain environment (e.g., based on lighting conditions). (b) mIoU comparison of xMUDA with different fusion variants and Ours on NuScenes: USA $\to$ Singapore.
  • Figure 2: Our architecture for cross-modal learning consists of a Vision Foundation Model (VFM) as the 2D encoder and a 3D SparseConvNet as the 3D encoder. We use multiple main task heads for semantic segmentation and mimicry task heads for cross-modal alignment. Besides the 2D branch (green) and 3D branch (blue), we employ a fusion branch (purple) where we concatenate the 2D and 3D features as the fusion feature. The network is trained with the supervised loss $\mathcal{L}_\text{seg}$ and the self-supervised cross-modal learning losses $\mathcal{L}_\text{align}$ (the fusion branch guiding the 3D branch) and $\mathcal{L}_\text{guide}$ (the fusion branch guided by the 2D or 3D branch).
  • Figure 3: Qualitative comparison of our method on an example from each dataset. We show the softmax average of our fusion and 3D heads. Boxes mark locations of interest with zoom-in below. Multiple traffic participants are not detected by xMUDA-VFM-Fuse. VK $\to$ SK: a car is incorrectly identified as nature. A2D2 $\to$ SK: two persons are missed. USA $\to$ Sing.: a bus is wrongly identified as a manmade structure. Our method correctly identifies these traffic participants, likely due to our stronger reliance on the well-generalizing VFM features. Day $\to$ Night: xMUDA-VFM-Fuse detects false-positive vehicles, a potential sign of overreliance on visual features in low-light conditions, which can be avoided with our proposed fusion regularization.
  • Figure 4: Comparison of current SOTA VFMs on USA $\to$ Sing. and VK $\to$ SK. We report the mIoU % for our main heads including the VFM head utilized for the fusion regularization.
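
The loss structure named in the Figure 2 caption ($\mathcal{L}_\text{seg}$, $\mathcal{L}_\text{align}$, $\mathcal{L}_\text{guide}$) can be illustrated for a single point with a minimal, framework-free sketch. Several things here are assumptions and not taken from the paper: the use of KL divergence for the cross-modal terms (as in xMUDA-style mimicry losses), the application of the supervised loss to all three main heads, the unit loss weights, and every function and parameter name (`cross_modal_losses`, `guide`, `w_align`, `w_guide`). It is meant only to show which branch acts as the target in each term.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) for two discrete class distributions."""
    return sum(t * math.log((t + eps) / (p + eps)) for t, p in zip(target, pred))

def cross_entropy(probs, label, eps=1e-12):
    """Negative log-likelihood of the ground-truth class."""
    return -math.log(probs[label] + eps)

def cross_modal_losses(logits_2d, logits_3d, logits_fuse, logits_3d_mimic,
                       label=None, guide="2d", w_align=1.0, w_guide=1.0):
    """Per-point loss sketch (hypothetical names and weights).

    label: class index for labeled source points, None for target points.
    guide: which modality regularizes the fusion branch ("2d" or "3d");
           assumed here to be a fixed, domain-level choice.
    """
    if guide not in ("2d", "3d"):
        raise ValueError("guide must be '2d' or '3d'")
    p2d, p3d = softmax(logits_2d), softmax(logits_3d)
    pf, p3d_mimic = softmax(logits_fuse), softmax(logits_3d_mimic)

    # L_seg: supervised loss, only available on labeled source points.
    # Assumption: applied to the 2D, 3D and fusion main heads.
    l_seg = 0.0
    if label is not None:
        l_seg = sum(cross_entropy(p, label) for p in (p2d, p3d, pf))

    # L_align: the fusion prediction is the target, the 3D mimicry head follows.
    # In an autodiff framework the target would be detached from the graph.
    l_align = kl_divergence(pf, p3d_mimic)

    # L_guide: the fusion branch is pulled toward the chosen modality.
    teacher = p2d if guide == "2d" else p3d
    l_guide = kl_divergence(teacher, pf)

    total = l_seg + w_align * l_align + w_guide * l_guide
    return {"seg": l_seg, "align": l_align, "guide": l_guide, "total": total}
```

For example, calling `cross_modal_losses(..., label=None, guide="3d")` on a target-domain point drops the supervised term and regularizes the fusion prediction toward the 3D head, which mirrors the idea of favoring the more reliable modality in a given environment (e.g., LiDAR at night).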