Sky2Ground: A Benchmark for Site Modeling under Varying Altitude

Zengyan Wang, Sirshapan Mitra, Rajat Modi, Grace Lim, Yogesh Rawat

Abstract

We introduce Sky2Ground, a three-view dataset designed for varying-altitude camera localization, correspondence learning, and reconstruction. The dataset combines structured synthetic imagery with real, in-the-wild images, providing both controlled multi-view geometry and realistic scene noise. Each of the 51 sites contains thousands of satellite, aerial, and ground images spanning wide altitude ranges and nearly orthogonal viewing angles, enabling rigorous evaluation across global-to-local contexts. We benchmark state-of-the-art pose estimation models, including MASt3R, DUSt3R, Map Anything, and VGGT, and observe that adding satellite imagery often degrades performance, which highlights the difficulty of large altitude variations. We also examine reconstruction methods and the challenges introduced by sparse geometric overlap, varying perspectives, and real imagery, which often adds noise and reduces rendering quality. To address some of these challenges, we propose SkyNet, a model that enhances cross-view consistency when incorporating satellite imagery and uses a curriculum-based training strategy to progressively incorporate more satellite views. SkyNet significantly strengthens multi-view alignment and outperforms existing methods by 9.6% on RRA@5 and 18.1% on RTA@5 in absolute terms. Together, Sky2Ground and SkyNet establish a comprehensive testbed and baseline for advancing large-scale, multi-altitude 3D perception and generalizable camera localization. Code and models will be released publicly for future research.
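The abstract reports results as RRA@5 and RTA@5 without defining them here. The sketch below assumes the definition commonly used in prior pose-estimation benchmarks: the fraction of image pairs whose relative rotation error (RRA) or relative translation direction error (RTA) falls below a threshold of τ degrees. The function names, the world-to-camera pose convention, and the pairing over all image pairs are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np

def rotation_angle_deg(R):
    """Geodesic angle in degrees of a 3x3 rotation matrix."""
    cos = (np.trace(R) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def direction_angle_deg(a, b, eps=1e-12):
    """Angle in degrees between two translation directions (scale is ignored)."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def relative_pose(R_i, t_i, R_j, t_j):
    """Relative pose from camera i to camera j.

    Assumes world-to-camera poses, i.e. x_cam = R @ x_world + t.
    """
    R_rel = R_j @ R_i.T
    t_rel = t_j - R_rel @ t_i
    return R_rel, t_rel

def rra_rta(pred, gt, tau=5.0):
    """Return (RRA@tau, RTA@tau) as fractions over all unordered image pairs.

    pred, gt: lists of (R, t) tuples, one per image, in the same order.
    """
    n = len(gt)
    rot_ok, trans_ok, pairs = 0, 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            R_p, t_p = relative_pose(*pred[i], *pred[j])
            R_g, t_g = relative_pose(*gt[i], *gt[j])
            # Rotation error: angle of the residual rotation between
            # the predicted and ground-truth relative rotations.
            rot_ok += int(rotation_angle_deg(R_p @ R_g.T) < tau)
            # Translation error: angle between relative translation directions.
            trans_ok += int(direction_angle_deg(t_p, t_g) < tau)
            pairs += 1
    return rot_ok / pairs, trans_ok / pairs
```

Under this definition, an absolute gain of 9.6% on RRA@5 would mean that 9.6 percentage points more image pairs have a relative rotation error below 5 degrees.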

Paper Structure (12 sections, 1 equation, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Cross-view examples from the Sky2Ground dataset. Satellite, aerial, and ground-level images for a variety of urban scenes in Sky2Ground, where each column corresponds to a unique site. These examples highlight strong viewpoint and appearance variations across modalities, revealing the challenges of cross-view matching and multi-scale scene understanding. Real images additionally introduce diverse lighting conditions, weather effects, and natural scene noise, further emphasizing the complexity of real-world cross-view perception.
  • Figure 2: Overview of the Sky2Ground dataset. The middle trajectory illustrates camera poses from one of our collected sites. Dots indicate ground-truth camera positions for synthetic images, while red frustums represent the estimated camera poses for real images. The surrounding images show example satellite, aerial, and ground views; the real images exhibit more diverse illumination conditions and realistic noise. The top-right map depicts the geographic distribution of all collected landmarks, highlighting the dataset’s global coverage.
  • Figure 3: Benchmark splits and modality setups. (a) Image counts per split for the synthetic splits CR (Core), D1 (Dense 1), D2 (Dense 2), D3 (Dense 3), and D4 (Dense 4), across ground, aerial, and satellite views. (b) View combinations used in each benchmark setup: Ground (G), Ground+Aerial (GA), Ground+Satellite (GS), and Ground+Aerial+Satellite (GAS).
  • Figure 4: Comparison of RRA@5 and RTA@5 metrics for four methods (DUSt3R, MASt3R, Map Anything, and VGGT).
  • Figure 5: Comparison of models across view combinations.
  • ...and 3 more figures