SGFormer: Satellite-Ground Fusion for 3D Semantic Scene Completion

Xiyue Guo, Jiarui Hu, Junjie Hu, Hujun Bao, Guofeng Zhang

TL;DR

The paper tackles occlusion-limited 3D SSC by introducing SGFormer, a satellite-ground cooperative framework that encodes ground and satellite images in parallel branches and unifies them in a common domain. It introduces a ground-view guided satellite correction strategy, a deformable self-/cross-attention-based feature transformation, and an adaptive fusion module to balance contributions from both views, enabling robust occupancy and semantic predictions. Through experiments on SemanticKITTI and SSCBench-KITTI-360, SGFormer achieves state-of-the-art performance among camera-based methods and competitive results with LiDAR-based approaches, validating the value of integrating satellite imagery for global scene context and dynamic detail. The work demonstrates that low-cost satellite data can significantly alleviate occlusion-induced ambiguities and improve SSC, with practical implications for autonomous driving and remote sensing applications, backed by publicly available code.

Abstract

Recently, camera-based solutions have been extensively explored for semantic scene completion (SSC). Despite their success in visible areas, existing methods struggle to capture complete scene semantics due to frequent visual occlusions. To address this limitation, this paper presents the first satellite-ground cooperative SSC framework, i.e., SGFormer, exploring the potential of satellite-ground image pairs in the SSC task. Specifically, we propose a dual-branch architecture that encodes orthogonal satellite and ground views in parallel, unifying them into a common domain. Additionally, we design a ground-view guidance strategy that corrects satellite image biases during feature encoding, addressing misalignment between satellite and ground views. Moreover, we develop an adaptive weighting strategy that balances contributions from satellite and ground views. Experiments demonstrate that SGFormer outperforms the state of the art on the SemanticKITTI and SSCBench-KITTI-360 datasets. Our code is available at https://github.com/gxytcrc/SGFormer.

Paper Structure

This paper contains 22 sections, 8 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: SGFormer, which adopts satellite-ground cooperative fusion, can achieve state-of-the-art performance in scene completion and semantic prediction. Benefiting from informative satellite images and a well-designed dual-branch pipeline, SGFormer can effectively improve semantic prediction accuracy and solve the long-standing visual occlusion bottleneck suffered by purely ground-view methods.
  • Figure 2: Overview of SGFormer. Overall, SGFormer feeds a satellite-ground image pair into similar backbone networks in different branches to extract multi-level feature maps respectively (Left Part). Then, leveraging deformable attention, it transforms satellite and ground features into volume and BEV spaces (Middle Part) for subsequent feature fusion and decoding (Right Part). Specifically, in the ground branch, we use a depth estimator to produce voxel proposals for targeted querying on non-empty feature volumes. In the satellite branch, we fuse vertically squeezed ground-view features into BEV queries to warm up satellite features (Red Line). Before final fusion, encoded features from both branches are enhanced through 2D/3D convolution networks. Our proposed fusion module, which is detailed in Figure 3, is able to adaptively fuse satellite and ground features, followed by a seg head to output semantic reconstruction results.
  • Figure 3: Fusion Module. Our fusion module (Top) mainly consists of two parts: the weight module and the probability network. This weight module (Bottom Left) can facilitate mutual perception between the BEV feature $\mathbf{F'}_s^{BEV}$ and voxel feature $\mathbf{F'}_g^{3D}$ in both 2D and 3D spaces to produce channel-wise weight vectors. This probability network (Bottom Right) takes the weighted average 3D voxel feature as input to infer the approximate occupancy probability per voxel to achieve better geometry reconstruction.
  • Figure 4: Scene Semantic Completion on SemanticKITTI. Left: we show satellite-ground image pairs and indicate the vehicle's travel direction as a yellow arrow. Right: we qualitatively compare scene semantic completion results from SGFormer and other baselines, where SGFormer can produce more complete and accurate semantic reconstruction relying on the satellite-ground fusion.
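
The fusion idea in the Figure 3 caption (channel-wise weights that balance the satellite BEV feature against the ground voxel feature, followed by a per-voxel occupancy probability) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `adaptive_fuse`, the global-average pooling, the single linear layers `w_proj` and `p_proj`, and the choice to lift the BEV feature by repeating it along the height axis are all assumptions made for illustration. The caption only states that the weight module produces channel-wise weight vectors and that the probability network takes the weighted average voxel feature as input.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, used here to squash logits into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def adaptive_fuse(f_sat_bev, f_gnd_vox, w_proj, p_proj):
    """Hypothetical sketch of channel-wise adaptive fusion.

    f_sat_bev: (C, X, Y)    satellite-branch BEV feature
    f_gnd_vox: (C, X, Y, Z) ground-branch voxel feature
    w_proj:    (C, 2C)      assumed linear layer giving weight logits
    p_proj:    (C,)         assumed linear occupancy head
    """
    C, X, Y, Z = f_gnd_vox.shape

    # Assumption: lift the BEV feature to 3D by repeating it along height.
    f_sat_vox = np.broadcast_to(f_sat_bev[..., None], (C, X, Y, Z))

    # Assumption: summarise each branch with a global average per channel.
    d_sat = f_sat_bev.mean(axis=(1, 2))      # (C,)
    d_gnd = f_gnd_vox.mean(axis=(1, 2, 3))   # (C,)

    # Channel-wise weights in (0, 1), computed from both descriptors.
    w = sigmoid(w_proj @ np.concatenate([d_sat, d_gnd]))  # (C,)
    w4 = w[:, None, None, None]

    # Weighted average of the two branches, one weight per channel.
    fused = w4 * f_sat_vox + (1.0 - w4) * f_gnd_vox       # (C, X, Y, Z)

    # Approximate occupancy probability per voxel from the fused feature.
    occ = sigmoid(np.einsum("c,cxyz->xyz", p_proj, fused))  # (X, Y, Z)
    return fused, w, occ


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C, X, Y, Z = 4, 8, 8, 2
    fused, w, occ = adaptive_fuse(
        rng.normal(size=(C, X, Y)),
        rng.normal(size=(C, X, Y, Z)),
        rng.normal(size=(C, 2 * C)),
        rng.normal(size=C),
    )
    print(fused.shape, w.shape, occ.shape)
```

Because the weights sum to one per channel, each fused channel stays a convex combination of the two branches. A channel where the satellite view is more informative can therefore lean on it without suppressing the ground view elsewhere.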