$M^2$-Occ: Resilient 3D Semantic Occupancy Prediction for Autonomous Driving with Incomplete Camera Inputs

Kaixin Lin, Kunyu Peng, Di Wen, Yufan Chen, Ruiping Liu, Kailun Yang

TL;DR

This work introduces $M^2$-Occ, a framework designed to preserve geometric structure and semantic coherence when camera views are missing, together with a systematic missing-view evaluation protocol that covers both deterministic single-view failures and stochastic multi-view dropout.

Abstract

Semantic occupancy prediction enables dense 3D geometric and semantic understanding for autonomous driving. However, existing camera-based approaches implicitly assume complete surround-view observations, an assumption that rarely holds in real-world deployment due to occlusion, hardware malfunction, or communication failures. We study semantic occupancy prediction under incomplete multi-camera inputs and introduce $M^2$-Occ, a framework designed to preserve geometric structure and semantic coherence when views are missing. $M^2$-Occ addresses two complementary challenges. First, a Multi-view Masked Reconstruction (MMR) module leverages the spatial overlap among neighboring cameras to recover missing-view representations directly in the feature space. Second, a Feature Memory Module (FMM) introduces a learnable memory bank that stores class-level semantic prototypes. By retrieving and integrating these global priors, the FMM refines ambiguous voxel features, ensuring semantic consistency even when observational evidence is incomplete. We introduce a systematic missing-view evaluation protocol on the nuScenes-based SurroundOcc benchmark, encompassing both deterministic single-view failures and stochastic multi-view dropout scenarios. Under the safety-critical missing back-view setting, $M^2$-Occ improves the IoU by 4.93%. As the number of missing cameras increases, the robustness gap further widens; for instance, under the setting with five missing views, our method boosts the IoU by 5.01%. These gains are achieved without compromising full-view performance. The source code will be publicly released at https://github.com/qixi7up/M2-Occ.
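
The two missing-view settings named in the abstract can be illustrated with a minimal sketch. It assumes the six standard nuScenes camera channels, uniform random selection of dropped cameras, and a `None` placeholder for an unavailable frame; the function names (`single_view_failure`, `multi_view_dropout`, `apply_mask`) are hypothetical and not taken from the released code, whose sampling scheme may differ.

```python
import random

# The six nuScenes surround-view camera channels.
CAMERAS = [
    "CAM_FRONT", "CAM_FRONT_RIGHT", "CAM_BACK_RIGHT",
    "CAM_BACK", "CAM_BACK_LEFT", "CAM_FRONT_LEFT",
]


def single_view_failure(missing):
    """Deterministic setting: exactly one named camera is unavailable."""
    if missing not in CAMERAS:
        raise ValueError(f"unknown camera: {missing}")
    return {cam: cam != missing for cam in CAMERAS}


def multi_view_dropout(num_missing, seed=None):
    """Stochastic setting: drop `num_missing` cameras.

    Assumption: cameras are chosen uniformly at random without replacement.
    """
    if not 0 <= num_missing < len(CAMERAS):
        raise ValueError("at least one camera must remain available")
    rng = random.Random(seed)
    dropped = set(rng.sample(CAMERAS, num_missing))
    return {cam: cam not in dropped for cam in CAMERAS}


def apply_mask(images, available):
    """Replace frames of unavailable cameras with None (hypothetical convention)."""
    return {cam: (img if available[cam] else None) for cam, img in images.items()}
```

For example, `single_view_failure("CAM_BACK")` corresponds to the missing back-view setting, and `multi_view_dropout(5)` to the setting with five missing views, where only one camera remains.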

Paper Structure

This paper contains 25 sections, 9 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Existing methods rely on complete camera inputs and suffer from geometric gaps when a sensor fails, e.g., a missing FRONT view. Our $M^2$-Occ maintains perceptual integrity by hallucinating missing features from adjacent overlaps and stabilizing semantics via global memory, achieving superior performance, especially when multiple views are missing.
  • Figure 2: An overview of the proposed $M^2$-Occ framework. Multi-view images are first processed by a shared backbone to extract 2D features. To handle missing or corrupted views, the Multi-view Masked Reconstruction (MMR) module leverages spatial overlaps from adjacent cameras to reconstruct the lost features. These features are then lifted into a unified 3D volume. Finally, the Feature Memory Module (FMM) refines the 3D voxel representations by retrieving high-level global semantic prototypes, ensuring structural and semantic consistency before generating the dense 3D occupancy prediction.
  • Figure 3: An overview of Multi-view Masked Reconstruction (MMR). The MMR module extracts overlapping boundary features from adjacent unmasked views and concatenates them with a central learnable mask token. A lightweight transformer decoder then processes this structural prior to reconstruct the missing view's representations, preserving spatial continuity.
  • Figure 4: A comparison between the single-proto strategy and the multi-proto strategy. While the single-proto approach maintains one global centroid per semantic class, the multi-proto strategy captures intra-class variance by learning multiple sub-prototypes and dynamically retrieving them based on feature similarity.
  • Figure 5: Visualizations on the nuScenes validation set (Caesar et al., 2020). Our method achieves promising results in the reconstruction of the missing $F$ view across various scenarios, even under weak lighting conditions.
  • ...and 1 more figure
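
The multi-proto retrieval described in the Figure 4 caption can be sketched as follows. This is only an illustration under stated assumptions: cosine similarity as the matching score, a softmax with a `temperature` parameter over all sub-prototypes, and a residual update of the voxel features. None of these details are confirmed by the text above, and `retrieve_prototypes` is a hypothetical name.

```python
import numpy as np


def retrieve_prototypes(voxel_feats, memory, temperature=0.1):
    """Refine voxel features with similarity-weighted prototypes.

    voxel_feats: (N, D) array of voxel features.
    memory:      (C, K, D) array holding K sub-prototypes for each of C classes
                 (K = 1 corresponds to the single-proto strategy).
    Returns (refined, weights) with shapes (N, D) and (N, C * K).
    """
    C, K, D = memory.shape
    bank = memory.reshape(C * K, D)

    # Cosine similarity between every voxel and every sub-prototype (assumed score).
    v = voxel_feats / (np.linalg.norm(voxel_feats, axis=1, keepdims=True) + 1e-8)
    p = bank / (np.linalg.norm(bank, axis=1, keepdims=True) + 1e-8)
    sim = v @ p.T  # (N, C * K)

    # Softmax over prototypes, shifted by the row maximum for numerical stability.
    logits = sim / temperature
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)

    # Residual update (assumed): add the retrieved prior to the original feature.
    refined = voxel_feats + weights @ bank
    return refined, weights
```

With `K > 1`, a voxel whose feature resembles one particular sub-prototype draws most of its prior from that sub-prototype, which is one way to capture the intra-class variance the caption mentions.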