UrbanCraft: Urban View Extrapolation via Hierarchical Sem-Geometric Priors
Tianhang Wang, Fan Lu, Sanqing Qu, Guo Yu, Shihang Du, Ya Wu, Yuan Huang, Guang Chen
TL;DR
UrbanCraft tackles Extrapolated View Synthesis (EVS) in urban scenes by introducing hierarchical sem-geometric priors that guide diffusion-based view synthesis beyond the training camera distribution. The approach combines UrbanCraft2D, a ControlNet-enhanced diffusion model conditioned on scene priors, with Hierarchical Semantic-Geometric-Guided Variational Score Distillation (HSG-VSD), which enforces consistency with the observed views, and it starts from a 3D Gaussian Splatting initialization. Experimental results on KITTI-360 and nuScenes show state-of-the-art EVS performance, improved geometry-texture fidelity, and robust extrapolation under challenging camera poses. The framework also supports instance-level control and editable synthesis, and the implementation details and ablation studies examine its contributions and limitations. Overall, UrbanCraft advances practical large-scale urban reconstruction by enabling accurate, controllable extrapolated view rendering.
Abstract
Existing neural-rendering-based urban scene reconstruction methods mainly focus on the Interpolated View Synthesis (IVS) setting, which synthesizes views close to the training camera trajectory. However, IVS cannot guarantee on-par performance for novel views outside the training camera distribution (e.g., looking left, right, or downwards), which limits the generalizability of urban reconstruction applications. Previous methods have addressed this via image diffusion, but they fail to handle text-ambiguous or large unseen view angles because text-only diffusion offers only coarse-grained control. In this paper, we design UrbanCraft, which tackles the Extrapolated View Synthesis (EVS) problem using hierarchical sem-geometric representations as additional priors. Specifically, we leverage the partially observable scene to reconstruct coarse semantic and geometric primitives, establishing a coarse scene-level prior with an occupancy grid as the base representation. Additionally, we incorporate fine instance-level priors from 3D bounding boxes to enhance object-level details and spatial relationships. Building on this, we propose Hierarchical Semantic-Geometric-Guided Variational Score Distillation (HSG-VSD), which integrates semantic and geometric constraints from the pretrained UrbanCraft2D into the score distillation sampling process, forcing the distribution to be consistent with the observable scene. Qualitative and quantitative comparisons demonstrate the effectiveness of our method on the EVS problem.
