SG-BEV: Satellite-Guided BEV Fusion for Cross-View Semantic Segmentation

Junyan Ye; Qiyan Luo; Jinhua Yu; Huaping Zhong; Zhimeng Zheng; Conghui He; Weijia Li

TL;DR

This work tackles cross-view semantic segmentation of fine-grained building attributes by fusing satellite and street-view imagery through a BEV-based mapping. The proposed SG-BEV framework introduces a Satellite-Guided Reprojection (SGR) module that counters the uneven feature distribution of conventional BEV projection and maps street-view facade features into the top-down satellite space, complemented by a learnable cross-view fusion mechanism. On four cross-view datasets, the method improves average mIoU by 10.13% over satellite-based methods and by 5.21% over existing cross-view methods, supporting its effectiveness on attributes such as land use and floor count. The resulting multi-perspective building understanding has practical implications for urban planning and monitoring. In the SGR module, an offset $\Delta$ computed from the estimated depth and the satellite-derived building footprint redistributes the projected street-view points so that facade features cover the building area evenly.

Abstract

This paper addresses fine-grained building attribute segmentation in a cross-view scenario, i.e., using paired satellite and street-view images. The main challenge lies in overcoming the significant perspective differences between street views and satellite views. In this work, we introduce SG-BEV, a novel approach to satellite-guided BEV fusion for cross-view semantic segmentation. To overcome the limitations of existing cross-view projection methods in capturing complete building facade features, we incorporate a Bird's Eye View (BEV) method to establish a spatially explicit mapping of street-view features. Moreover, we leverage the advantages of multiple perspectives by introducing a novel satellite-guided reprojection module, which mitigates the uneven feature distribution associated with traditional BEV methods. Our method demonstrates significant improvements on four cross-view datasets collected from multiple cities, including New York, San Francisco, and Boston. Averaged across these datasets, our method improves mIoU by 10.13% and 5.21% compared with the state-of-the-art satellite-based and cross-view methods, respectively. The code and datasets of this work will be released at https://github.com/yejy53/SG-BEV.
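Since the headline results are reported as mIoU, a minimal sketch of how that metric is commonly computed may help. The function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration; the evaluation protocol actually used in the paper (for example, how background is handled) is not stated here.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes (illustrative helper).

    pred, target: integer label maps of identical shape.
    Classes that appear in neither map are skipped (an assumption).
    """
    pred = np.asarray(pred)
    target = np.asarray(target)
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps: no contribution
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

# Toy 2x2 label maps with two classes.
pred = [[0, 1], [1, 1]]
target = [[0, 1], [0, 1]]
score = mean_iou(pred, target, num_classes=2)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.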

Paper Structure (28 sections, 11 equations, 14 figures, 13 tables)

Introduction
Related work
Semantic Segmentation of Ground Objects
Cross-View Projection Methods
Bird's Eye View methods
Methods
Satellite Feature Extraction
Street-View to BEV Conversion
Cross-View Feature Fusion
Experiments
Datasets
Experimental Settings
Performance Comparison
Ablation study
Conclusion
...and 13 more sections

Figures (14)

Figure 1: Illustration of cross-view semantic segmentation of fine-grained building attributes. (a) Satellite imagery lacks information on building facades, making it difficult to distinguish detailed building attributes. (b) Existing cross-view transformation methods suffer from incomplete feature capture and uneven feature distribution. (c) Our method integrates satellite and street-view features to precisely segment building attributes and floor numbers.
Figure 2: Overview of our proposed SG-BEV framework. In the Satellite Feature Extraction branch, we extract features from the input satellite imagery and simultaneously output building footprint segmentation results for further processing. In the Street-View to BEV Conversion branch, we map street-view features to BEV space using estimated depth information combined with the building footprints. In the Cross-View Feature Fusion module, we align and fuse satellite features with BEV features to achieve fine-grained segmentation of building attributes.
Figure 3: Illustration of the Satellite-Guided Reprojection Module. We use satellite features to generate building footprint information and then calculate $\alpha$. Based on the depth information $d$ and $\alpha$, we calculate the magnitude of the offset $\Delta$, which adjusts the initial point cloud so that it is uniformly distributed within the building area, and we discard points that fall outside the building's footprint.
Figure 4: Comparison of SG-BEV (Ours) and Satellite-Based Methods for Fine-Grained Segmentation. The first two rows show results on OmniCity for the land use (first row) and floor level (second row) segmentation tasks. The third row presents land use predictions on Vigor. The street-view panoramas, from left to right, correspond to a 360-degree clockwise rotation starting from the north direction in the satellite imagery.
Figure 5: Comparison of SG-BEV (Ours) and Other Cross-View Methods for Fine-Grained Segmentation. The first two rows show results on Brooklyn for the land use (first row) and floor level (second row) segmentation tasks. The bottom row shows land use segmentation predictions on Boston. The street-view panoramas, from left to right, correspond to a 360-degree clockwise rotation starting from the north direction in the satellite imagery.
...and 9 more figures
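
The reprojection idea described in the Figure 3 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function `reproject_to_bev` is hypothetical, the offset `delta` is taken as a precomputed input because the formula relating $\Delta$ to $d$ and $\alpha$ is not given in this summary, `depth` is treated as horizontal distance from the camera, the camera is placed at the centre of the satellite tile, and panorama columns are mapped to azimuth clockwise from north as in the Figure 4 and 5 captions.

```python
import numpy as np

def reproject_to_bev(depth, delta, footprint, cell_size=1.0):
    """Map panorama pixels to BEV grid cells and keep those on a building.

    depth:     (H, W) horizontal distance per street-view pixel (assumed).
    delta:     (H, W) offset along each viewing ray; positive values move a
               point away from the camera. Assumed precomputed; the paper's
               formula for it is not reproduced here.
    footprint: (R, C) boolean building mask in the top-down frame, north up,
               with the camera at the grid centre (assumed).
    Returns integer row/col indices and a boolean keep mask, all (H, W).
    """
    H, W = depth.shape
    n_rows, n_cols = footprint.shape
    # Column centre -> azimuth, measured clockwise from north.
    azimuth = 2.0 * np.pi * (np.arange(W) + 0.5) / W
    azimuth = np.broadcast_to(azimuth, (H, W))
    radius = depth + delta
    east = radius * np.sin(azimuth)
    north = radius * np.cos(azimuth)
    cols = np.floor(n_cols / 2.0 + east / cell_size).astype(int)
    rows = np.floor(n_rows / 2.0 - north / cell_size).astype(int)
    in_bounds = (rows >= 0) & (rows < n_rows) & (cols >= 0) & (cols < n_cols)
    # Discard points outside the grid or outside the building footprint.
    keep = in_bounds.copy()
    keep[in_bounds] = footprint[rows[in_bounds], cols[in_bounds]]
    return rows, cols, keep

# Toy 2x4 panorama looking NE, SE, SW, NW on an 8x8 grid.
depth = np.full((2, 4), 3.0)
footprint = np.zeros((8, 8), dtype=bool)
footprint[1, 6] = True  # building cell to the north-east
footprint[6, 1] = True  # building cell to the south-west
rows, cols, keep = reproject_to_bev(depth, np.zeros((2, 4)), footprint)
# A large offset pushes every point off the grid, so nothing is kept.
_, _, keep_far = reproject_to_bev(depth, np.full((2, 4), 3.0), footprint)
```

Taking `delta` as an input keeps the sketch agnostic about how the offset is derived; only the projection and the footprint-based discard step are shown.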
