SA-Occ: Satellite-Assisted 3D Occupancy Prediction in Real World

Chen Chen, Zhirui Wang, Taowei Sheng, Yi Jiang, Yundu Li, Peirui Cheng, Luning Zhang, Kaiqiang Chen, Yanfeng Hu, Xue Yang, Xian Sun

TL;DR

This work tackles the limitations of street-view-only 3D occupancy prediction by integrating satellite imagery aligned via GPS/IMU poses. The authors introduce SA-Occ, featuring a Satellite BEV branch with 3D-Proj Guidance, a Street BEV branch with Uniform Sampling Alignment, and a Dynamic-Decoupling Fusion module to handle temporal asynchrony and dynamic objects. They also curate the Occ3D-NuScenes Extension Dataset to enable real-time satellite-street cross-view evaluation. Empirically, SA-Occ achieves a new state-of-the-art $mIoU$ of $39.05\%$ for single-frame input with only a small latency increase ($6.93$ ms), demonstrating the practical value of satellite-aware cross-view perception for autonomous driving.
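
To make the GPS/IMU alignment step concrete, the following is a minimal sketch of how a point on the ego-centred BEV grid could be mapped to a pixel in a north-up satellite tile. The function name, the georeferencing convention (top-left tile origin in metres, rows growing southwards), and the yaw convention are all assumptions made for illustration; the summary above does not specify how SA-Occ performs this lookup.

```python
import math

def bev_to_satellite_pixel(x_fwd, y_left, ego_east, ego_north, yaw_rad,
                           tile_east0, tile_north0, metres_per_pixel):
    """Map an ego-frame BEV point to (col, row) in a north-up satellite tile.

    Assumed conventions (hypothetical, not taken from the paper):
      - yaw_rad is the vehicle heading, counter-clockwise from east;
      - (tile_east0, tile_north0) is the tile's top-left corner in metres;
      - columns grow eastwards, rows grow southwards.
    """
    cos_y, sin_y = math.cos(yaw_rad), math.sin(yaw_rad)
    # Rotate the ego-frame offset by the heading, then translate by the GPS position.
    east = ego_east + x_fwd * cos_y - y_left * sin_y
    north = ego_north + x_fwd * sin_y + y_left * cos_y
    # Convert metric map coordinates to fractional pixel coordinates.
    col = (east - tile_east0) / metres_per_pixel
    row = (tile_north0 - north) / metres_per_pixel
    return col, row
```

Sampling the satellite tile at these pixel locations for every BEV cell would yield a satellite crop that shares the ego-centred grid of the street-view features, which is the property the cross-view fusion relies on.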

Abstract

Existing vision-based 3D occupancy prediction methods are inherently limited in accuracy due to their exclusive reliance on street-view imagery, neglecting the potential benefits of incorporating satellite views. We propose SA-Occ, the first Satellite-Assisted 3D occupancy prediction model, which leverages GPS & IMU to integrate historical yet readily available satellite imagery into real-time applications, effectively mitigating the limitations of ego-vehicle perception, including occlusions and degraded performance in distant regions. To address the core challenges of cross-view perception, we propose: 1) Dynamic-Decoupling Fusion, which resolves inconsistencies in dynamic regions caused by the temporal asynchrony between satellite and street views; 2) 3D-Proj Guidance, a module that enhances 3D feature extraction from inherently 2D satellite imagery; and 3) Uniform Sampling Alignment, which aligns the sampling density between street and satellite views. Evaluated on Occ3D-nuScenes, SA-Occ achieves state-of-the-art performance, especially among single-frame methods, with a 39.05% mIoU (a 6.97% improvement), while incurring only 6.93 ms of additional latency per frame. Our code and newly curated dataset are available at https://github.com/chenchen235/SA-Occ.
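
The sampling-density mismatch that Uniform Sampling Alignment targets can be illustrated with a toy flat-ground pinhole model. The sketch below is not the authors' module: the camera parameters, function names, and the flat-ground assumption are all hypothetical. It counts how many image rows land in each range bin on the ground (forward projection), then shows the alternative of projecting one predefined point per bin back into the image (3D-to-2D sampling).

```python
def rows_per_range_bin(focal_px, cam_height_m, horizon_row, img_rows, bin_edges_m):
    """Count image rows whose ground intersection falls inside each range bin."""
    counts = [0] * (len(bin_edges_m) - 1)
    for v in range(img_rows):
        dv = (v + 0.5) - horizon_row  # pixel-centre offset below the horizon
        if dv <= 0:
            continue  # rays at or above the horizon never reach the ground
        distance = focal_px * cam_height_m / dv  # flat-ground range of this row
        for i in range(len(counts)):
            if bin_edges_m[i] <= distance < bin_edges_m[i + 1]:
                counts[i] += 1
                break
    return counts

def project_uniform_points(focal_px, cam_height_m, horizon_row, distances_m):
    """Project evenly spaced ground points to image rows (3D-to-2D direction)."""
    return [horizon_row + focal_px * cam_height_m / d for d in distances_m]

# Hypothetical camera: 1000 px focal length, 1.5 m mounting height, 900-row image.
edges = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0]
forward_counts = rows_per_range_bin(1000.0, 1.5, 450.0, 900, edges)
uniform_rows = project_uniform_points(1000.0, 1.5, 450.0, [5.0, 15.0, 25.0, 35.0, 45.0])
```

Under these assumptions `forward_counts` shrinks rapidly with distance, while `uniform_rows` contains exactly one sampling location per bin regardless of range. That is the intuition behind supplementing forward projection with uniformly placed 3D points so that street-view features match the even spatial density of the satellite view.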

Paper Structure

This paper contains 17 sections, 12 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Street views provide real-time observations but are significantly affected by occlusions from both static (blue regions occluded by the wall and door) and dynamic objects (yellow regions occluded by the vehicles). Additionally, perspective projections lead to sparse observations in distant regions. Integrating satellite imagery enhances perception, particularly in occluded areas and distant regions (orange boxes). However, a key challenge in fusing satellite and street views is the inconsistency of dynamic objects due to the temporal gap between observations (red boxes: absence of the dynamic vehicle in satellite view).
  • Figure 2: Comparison of BEV feature acquisition methods: satellite (a: natural alignment with BEV space) vs. street views (b: misalignment due to the dense-near, sparse-far characteristic), with our supplement (c: preemptive alignment via predefined points).
  • Figure 3: SA-Occ enhances perception with the satellite view via GPS & IMU. It extends the street-view branch with a Uniform Sampling Alignment (Uni-SA) module and creates a satellite BEV branch containing a U-shape Feature Extractor with a 3D-Proj Guidance module and soft gating. The Dynamic-Decoupling Fusion module follows; it mitigates satellite interference in dynamic regions by obtaining supervised dynamic-region attention from the street view, and it enhances interactions between dynamic and static regions via dynamic-encoding spatial attention.
  • Figure 4: Qualitative comparison of BEV features at different stages. (a) The LSS-based 2D-to-3D forward-proj view transformation generates dense features in nearby regions but suffers from sparsity in distant regions due to its radial projection pattern. By adding 3D-to-2D uniform sampling that aligns with the satellite view, we supplement features in distant regions, maximize street feature density, and thereby optimize the interaction and fusion of satellite and street-view features across both nearby and distant regions. (b) Even with maximized feature density, the street-view perspective remains limited by occlusions, which are effectively supplemented by the satellite viewpoint.
  • Figure 5: Qualitative Comparison of SA-Occ with Baselines. SA-Occ shows clearer boundaries and stronger robustness for both dynamic targets and static regions, especially in occluded areas.
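
The idea in the Figure 3 caption of suppressing satellite features where the scene is dynamic can be sketched as a per-cell gate. This is only an illustrative simplification under stated assumptions: the function name, the scalar-per-cell features, and the additive form `street + (1 - p_dynamic) * satellite` are hypothetical and stand in for the learned attention described in the caption.

```python
def fuse_bev(street, satellite, dynamic_prob):
    """Fuse two BEV grids so satellite features contribute only in static cells.

    street, satellite: 2D lists of per-cell feature values on the same grid.
    dynamic_prob: 2D list in [0, 1], the street-view estimate that a cell is
    dynamic (assumed to come from a supervised head, as the caption suggests).
    """
    fused = []
    for s_row, t_row, p_row in zip(street, satellite, dynamic_prob):
        # A cell judged fully dynamic (p = 1) ignores the outdated satellite
        # feature; a fully static cell (p = 0) receives it unattenuated.
        fused.append([s + (1.0 - p) * t for s, t, p in zip(s_row, t_row, p_row)])
    return fused

# Toy 1x3 grid: static cell, uncertain cell, cell occupied by a moving vehicle.
fused = fuse_bev([[1.0, 2.0, 3.0]], [[10.0, 10.0, 10.0]], [[0.0, 0.5, 1.0]])
```

In this toy grid the third cell keeps only its street-view value, mirroring the caption's goal of preventing a vehicle that is absent from the older satellite image from being overwritten by stale road-surface features.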