VLMFusionOcc3D: VLM Assisted Multi-Modal 3D Semantic Occupancy Prediction

A. Enes Doruk, Hasan F. Ates

TL;DR

This paper introduces VLMFusionOcc3D, a robust multimodal framework for dense 3D semantic occupancy prediction in autonomous driving, and proposes Instance-driven VLM Attention, which utilizes gated cross-attention and LoRA-adapted CLIP embeddings to inject high-level semantic and geographic priors directly into the 3D voxels.

Abstract

This paper introduces VLMFusionOcc3D, a robust multimodal framework for dense 3D semantic occupancy prediction in autonomous driving. Current voxel-based occupancy models often suffer from semantic ambiguity in sparse geometric grids and from performance degradation under adverse weather conditions. To address these challenges, we leverage the rich linguistic priors of Vision-Language Models (VLMs) to anchor ambiguous voxel features to stable semantic concepts. Our framework begins with a dual-branch feature extraction pipeline that projects multi-view images and LiDAR point clouds into a unified voxel space. We propose Instance-driven VLM Attention (InstVLM), which uses gated cross-attention and LoRA-adapted CLIP embeddings to inject high-level semantic and geographic priors directly into the 3D voxels. Furthermore, we introduce Weather-Aware Adaptive Fusion (WeathFusion), a dynamic gating mechanism that uses vehicle metadata and weather-conditioned prompts to re-weight sensor contributions based on real-time environmental reliability. To ensure structural consistency, a Depth-Aware Geometric Alignment (DAGA) loss aligns dense camera-derived geometry with sparse, spatially accurate LiDAR returns. Extensive experiments on the nuScenes and SemanticKITTI datasets demonstrate that our plug-and-play modules consistently enhance the performance of state-of-the-art voxel-based baselines. Notably, our approach achieves significant improvements in challenging weather scenarios, offering a scalable and robust solution for complex urban navigation.
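
To make the gated cross-attention idea in the abstract concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the function name `gated_cross_attention`, the scalar `tanh` gate, and the omission of learned query/key/value projections are all assumptions made for illustration. The text embeddings stand in for the output of the LoRA-adapted CLIP encoder, which is not modeled here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def gated_cross_attention(voxels, text_embeds, gate_param):
    """Hypothetical sketch: voxel features attend over text embeddings.

    voxels:      list of D-dim voxel features (used directly as queries)
    text_embeds: list of D-dim text embeddings (used as keys and values)
    gate_param:  scalar; tanh(gate_param) scales the injected semantics
                 (assumed gate form, not confirmed by the source)
    """
    dim = len(voxels[0])
    scale = 1.0 / math.sqrt(dim)
    gate = math.tanh(gate_param)
    fused = []
    for q in voxels:
        # Attention weights of this voxel over all text embeddings.
        weights = softmax([dot(q, k) * scale for k in text_embeds])
        # Weighted sum of the text embeddings (the semantic context).
        context = [
            sum(w * v[i] for w, v in zip(weights, text_embeds))
            for i in range(dim)
        ]
        # Gated residual: the gate controls how much context is injected.
        fused.append([qi + gate * ci for qi, ci in zip(q, context)])
    return fused


voxels = [[1.0, 0.0], [0.0, 2.0]]
prompts = [[0.5, 0.5], [1.0, -1.0]]  # placeholder text embeddings
print(gated_cross_attention(voxels, prompts, gate_param=0.5))
```

With `gate_param=0.0` the gate is exactly zero and the voxel features pass through unchanged, which is one common motivation for gating when adding a new module to an existing baseline.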

Paper Structure

This paper contains 19 sections, 9 equations, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: Comparison of feature fusion architectures. Standard approaches include Addition Fusion for element-wise summation, Concatenation Fusion for channel-wise stacking, and 3D Convolution Fusion for learned spatial refinement. Our proposed WeathFusion introduces a language-aware gating mechanism. It utilizes a frozen CLIP encoder to process environmental descriptions, modulating input features through learned gating weights before final summation.
  • Figure 2: Overall architecture of VLMFusionOcc3D. Our pipeline integrates multi-view camera images and LiDAR point clouds through a dual-branch architecture. High-level semantic priors are injected via Instance-driven VLM Attention (InstVLM), followed by dynamic modality integration using the Weather-Aware Adaptive Fusion (WeathFusion) module. The final dense occupancy grid is optimized using the Depth-Aware Geometric Alignment (DAGA) loss.
  • Figure 3: Architecture of the Instance-driven VLM Attention (InstVLM) module. The module employs a gated cross-attention mechanism to anchor 3D voxel features to continuous text embeddings from a LoRA-adapted CLIP encoder. The gating mechanism ensures that semantic information is selectively fused into spatial voxels.
  • Figure 4: Examples of structured instance prompts. These prompts encapsulate category-specific information and geographic context to provide the VLM with rich environmental priors. During inference, a recursive strategy is employed to maintain temporal semantic stability.
  • Figure 5: Detailed schematic of the Weather-Aware Adaptive Fusion (WeathFusion) module. The gating head processes weather context prompts to compute dynamic reliability weights for each modality. This process enables the framework to robustly transition between sensors in response to environmental degradation, such as LiDAR scattering or low-light camera noise.
  • ...and 2 more figures
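
The fusion variants contrasted in the Figure 1 caption can be sketched in a few lines of pure Python. This is a toy illustration under stated assumptions, not the paper's code: the function names, the two-row `gate_weights` matrix, and the softmax normalization of the gates are hypothetical, and `weather_embed` is a placeholder for the frozen CLIP text embedding of a weather description. The 3D convolution variant is omitted for brevity.

```python
import math


def add_fusion(cam, lidar):
    """Addition Fusion: element-wise sum of the two feature vectors."""
    return [c + l for c, l in zip(cam, lidar)]


def concat_fusion(cam, lidar):
    """Concatenation Fusion: channel-wise stacking of the two vectors."""
    return cam + lidar


def weather_gated_fusion(cam, lidar, weather_embed, gate_weights):
    """Hypothetical language-aware gating in the spirit of WeathFusion.

    weather_embed: placeholder for a text embedding of the weather prompt
    gate_weights:  two weight vectors (camera row, LiDAR row) of an
                   assumed linear gating head
    Returns the fused features and the (camera, lidar) reliability weights.
    """
    # One logit per modality from the assumed linear gating head.
    logits = [sum(w * e for w, e in zip(row, weather_embed))
              for row in gate_weights]
    # Softmax so the two weights sum to 1 (an assumption; independent
    # sigmoid gates would be an equally plausible reading of the caption).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    w_cam, w_lidar = exps[0] / total, exps[1] / total
    # Modulate each modality, then sum.
    fused = [w_cam * c + w_lidar * l for c, l in zip(cam, lidar)]
    return fused, (w_cam, w_lidar)


cam, lidar = [1.0, 2.0], [3.0, 4.0]
rain_embed = [1.0, 0.0]                  # placeholder prompt embedding
head = [[2.0, 0.0], [-2.0, 0.0]]         # toy gating-head weights
print(add_fusion(cam, lidar))
print(concat_fusion(cam, lidar))
print(weather_gated_fusion(cam, lidar, rain_embed, head))
```

Unlike addition or concatenation, the gated variant changes the relative contribution of each sensor as the weather embedding changes, which is the behavior the Figure 5 caption attributes to the gating head.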