SatDreamer360: Multiview-Consistent Generation of Ground-Level Scenes from Satellite Imagery

Xianghui Ze, Beiyi Zhu, Zhenbo Song, Jianfeng Lu, Yujiao Shi

TL;DR

SatDreamer360 generates continuous, multiview ground-level panoramas from a single satellite image along a predefined trajectory. It combines a tri-plane scene representation with ray-guided cross-view feature conditioning and an epipolar-constrained attention mechanism to produce geometry-aware, temporally coherent outputs, and it is validated on the new VIGOR++ dataset. Key contributions include the unified diffusion-based framework, the ray-based attention for view-specific feature retrieval, the panoramic inter-frame alignment strategy, and the large-scale VIGOR++ benchmark. The approach advances practical cross-view synthesis for simulation, autonomous navigation, and digital twin applications by improving satellite-to-ground alignment and multiview consistency across diverse urban and rural scenes.

Abstract

Generating multiview-consistent $360^\circ$ ground-level scenes from satellite imagery is a challenging task with broad applications in simulation, autonomous navigation, and digital twin cities. Existing approaches primarily focus on synthesizing individual ground-view panoramas, often relying on auxiliary inputs like height maps or handcrafted projections, and struggle to produce multiview-consistent sequences. In this paper, we propose SatDreamer360, a framework that generates geometrically consistent multiview ground-level panoramas from a single satellite image, given a predefined pose trajectory. To address the large viewpoint discrepancy between ground and satellite images, we adopt a triplane representation to encode scene features and design a ray-based pixel attention mechanism that retrieves view-specific features from the triplane. To maintain multi-frame consistency, we introduce a panoramic epipolar-constrained attention module that aligns features across frames based on known relative poses. To support the evaluation, we introduce VIGOR++, a large-scale dataset for generating multiview ground panoramas from a satellite image, built by augmenting the original VIGOR dataset with more ground-view images and their pose annotations. Experiments show that SatDreamer360 outperforms existing methods in both satellite-to-ground alignment and multiview consistency.

Paper Structure

This paper contains 21 sections, 16 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Given a satellite image and a sequence of query poses (colored stars), our goal is to synthesize coherent panoramic views along the trajectory. The proposed SatDreamer360 generates more realistic and geometrically consistent ground-level scenes compared to state-of-the-art methods, faithfully capturing spatial layouts and structural continuity across diverse environments.
  • Figure 2: Overview of the proposed SatDreamer360 framework. Given a single satellite image and a target trajectory, our model synthesizes continuous ground-level panoramas along the path. A Ray-Based Pixel Attention mechanism retrieves view-specific features through cross-view geometric reasoning, guided by a tri-plane representation of the scene. An Epipolar-Constrained Attention module aligns features across frames using relative camera poses.
  • Figure 3: Overview of the VIGOR++ dataset. (a) The map of Seattle, USA, serves as an example of the ten cities in the dataset. The red boxes and blue boxes represent the districts for the training set and test set, respectively. (b) shows a road map. Dots and stars along the road represent locations of ground images and satellite images. Two of them, marked with the red star and green star, are shown in (c). (d) shows the continuous ground sequence within one satellite image.
  • Figure 4: Correspondence between image pixel coordinates and camera ray angles.
  • Figure 5: Using the same satellite image as a condition, different trajectories are input to generate corresponding images. From top to bottom: results for trajectory 1, results for trajectory 2, and results for trajectory 2 after updating the triplane with images generated from trajectory 1. Updating the triplane ensures that newly generated results are related to prior sequences.
  • ...and 2 more figures
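
The pixel-to-ray correspondence named in the Figure 4 caption can be illustrated with a minimal sketch. It assumes the panoramas are stored in the standard equirectangular layout; the paper's own axis convention is not given in this summary, so the function names `pixel_to_ray` and `ray_to_pixel` and the choice of x forward, y left, z up are hypothetical and for illustration only.

```python
import math


def pixel_to_ray(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a unit ray direction.

    Assumed convention (not taken from the paper): azimuth grows with u
    over [-pi, pi), elevation falls with v from +pi/2 (top row) to -pi/2
    (bottom row), and the camera frame is x forward, y left, z up.
    """
    azimuth = (u + 0.5) / width * 2.0 * math.pi - math.pi
    elevation = math.pi / 2.0 - (v + 0.5) / height * math.pi
    x = math.cos(elevation) * math.cos(azimuth)
    y = math.cos(elevation) * math.sin(azimuth)
    z = math.sin(elevation)
    return (x, y, z)


def ray_to_pixel(direction, width, height):
    """Inverse mapping: project a ray direction back to pixel coordinates."""
    x, y, z = direction
    norm = math.sqrt(x * x + y * y + z * z)
    azimuth = math.atan2(y, x)
    # Clamp to guard against floating-point values just outside [-1, 1].
    elevation = math.asin(max(-1.0, min(1.0, z / norm)))
    u = (azimuth + math.pi) / (2.0 * math.pi) * width - 0.5
    v = (math.pi / 2.0 - elevation) / math.pi * height - 0.5
    return (u, v)


if __name__ == "__main__":
    ray = pixel_to_ray(100, 200, 1024, 512)
    print(ray, ray_to_pixel(ray, 1024, 512))
```

Under this convention every pixel corresponds to exactly one viewing direction on the unit sphere. That property is what allows per-pixel rays to be cast into a scene representation, or related across frames through known relative poses.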