UrbanCraft: Urban View Extrapolation via Hierarchical Sem-Geometric Priors

Tianhang Wang, Fan Lu, Sanqing Qu, Guo Yu, Shihang Du, Ya Wu, Yuan Huang, Guang Chen

TL;DR

UrbanCraft tackles Extrapolated View Synthesis (EVS) in urban scenes by introducing hierarchical sem-geometric priors that guide diffusion-based view synthesis beyond the training camera distribution. The approach combines UrbanCraft2D, a ControlNet-enhanced diffusion model with scene priors, and Hierarchical Semantic-Geometric-Guided Variational Score Distillation (HSG-VSD) to enforce consistency with observed views, paired with a 3D Gaussian Splatting initialization. Experimental results on KITTI-360 and NuScenes show state-of-the-art EVS performance, improved geometry-texture fidelity, and robust extrapolation under challenging camera poses. The framework also supports instance-level control and editable synthesis, with detailed implementation and ablation studies validating the contributions and limitations. Overall, UrbanCraft advances practical large-scale urban reconstruction by enabling accurate, controllable extrapolated view rendering.

Abstract

Existing neural rendering-based urban scene reconstruction methods mainly focus on the Interpolated View Synthesis (IVS) setting, which synthesizes views close to the training camera trajectory. However, IVS cannot guarantee on-par performance for novel views outside the training camera distribution (e.g., looking left, right, or downwards), which limits the generalizability of urban reconstruction applications. Previous methods have addressed this via image diffusion, but they fail to handle text-ambiguous or large unseen view angles because text-only diffusion offers only coarse-grained control. In this paper, we design UrbanCraft, which tackles the Extrapolated View Synthesis (EVS) problem using hierarchical sem-geometric representations as additional priors. Specifically, we leverage the partially observable scene to reconstruct coarse semantic and geometric primitives, establishing a coarse scene-level prior through an occupancy grid as the base representation. Additionally, we incorporate fine instance-level priors from 3D bounding boxes to enhance object-level details and spatial relationships. Building on this, we propose Hierarchical Semantic-Geometric-Guided Variational Score Distillation (HSG-VSD), which integrates semantic and geometric constraints from the pretrained UrbanCraft2D into the score distillation sampling process, forcing the distribution to be consistent with the observable scene. Qualitative and quantitative comparisons demonstrate the effectiveness of our method on the EVS problem.
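
The score distillation idea mentioned in the abstract can be illustrated with a minimal sketch. The code below is not the paper's HSG-VSD implementation; it only shows the general shape of a VSD-style gradient, in which a rendered view is noised and the difference between two noise predictions is used as the update direction. The function name `vsd_style_gradient`, the toy predictors, the way the control signal is passed to both predictors, and the weighting `1 - alpha_bar_t` are all assumptions made for illustration.

```python
import numpy as np


def vsd_style_gradient(rendered, control, eps_pretrained, eps_finetuned,
                       alphas_cumprod, t, rng):
    """Toy VSD-style gradient with respect to a rendered image.

    rendered        : array, the image rendered from the 3D representation
    control         : array, a conditioning signal (assumed here to stand in
                      for scene-level / instance-level priors)
    eps_pretrained  : callable (x_t, t, control) -> predicted noise
    eps_finetuned   : callable (x_t, t, control) -> predicted noise
    alphas_cumprod  : 1D array of cumulative noise-schedule products
    t               : integer timestep index into alphas_cumprod
    rng             : numpy random Generator
    """
    alpha_bar_t = alphas_cumprod[t]
    noise = rng.standard_normal(rendered.shape)
    # Forward diffusion: mix the rendered image with Gaussian noise.
    x_t = np.sqrt(alpha_bar_t) * rendered + np.sqrt(1.0 - alpha_bar_t) * noise
    # Timestep weighting (an assumed, commonly used choice).
    weight = 1.0 - alpha_bar_t
    # Update direction: difference of the two conditioned noise predictions.
    return weight * (eps_pretrained(x_t, t, control)
                     - eps_finetuned(x_t, t, control))


# Hypothetical stand-ins for the two noise predictors.
def toy_pretrained(x_t, t, control):
    return x_t + control


def toy_finetuned(x_t, t, control):
    return x_t


schedule = np.linspace(0.999, 0.01, 1000)
image = np.zeros((4, 4, 3))
prior = np.ones((4, 4, 3))
grad = vsd_style_gradient(image, prior, toy_pretrained, toy_finetuned,
                          schedule, t=500, rng=np.random.default_rng(0))
```

In a real pipeline this gradient would be backpropagated through the renderer into the 3D representation's parameters; the sketch stops at the image-space gradient to stay self-contained.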

Paper Structure

This paper contains 31 sections, 3 equations, 17 figures, 1 table.

Figures (17)

  • Figure 1: UrbanCraft for urban view extrapolation. (a) Illustration of the Extrapolated View Synthesis (EVS) problem in urban scenes reconstructed with forward-facing cameras. Unlike traditional test cameras that resemble training camera poses, we assess view synthesis using cameras that are remote from the training camera distribution. (b) Qualitative comparisons on EVS to baselines.
  • Figure 2: Overview of UrbanCraft. We introduce UrbanCraft, a method that repairs unseen extrapolated views with hierarchical sem-geometric priors. Our framework contains three stages: (a) pretraining of a 2D diffusion model, named UrbanCraft2D, which includes a Stable Diffusion model $\epsilon_{p}$ and a corresponding ControlNet $\psi(\cdot)$; (b) distillation of UrbanCraft2D via the proposed HSG-VSD, which enforces that the optimization process stays consistent with the observable scene; and (c) initialization of the urban 3D representation.
  • Figure 3: UrbanCraft2D: Diffusion pre-training process. Specifically, we utilize the voxel renderer [liu2020neural] to generate the scene-level control for the corresponding GT images, and regularize the instance-level control with box2camera coordinate-based rotation maps.
  • Figure 4: Illustration of the effectiveness of the proposed Hierarchical Sem-Geometric Priors for UrbanCraft2D. Note that: i) the scene-level-only control signal can guide UrbanCraft2D to generate urban scenes under reasonable distribution, and ii) adding the extra instance-level control signal enables more precise optimization of vehicles' spatial relationships.
  • Figure 5: Qualitative comparison on KITTI-360 [liao2022kitti] for extrapolated view synthesis under different difficulty levels (Easy, Middle and Hard) across three settings. EVS-D and EVS-LR refer to extrapolated views facing downwards and left/right, respectively, while EVS-LR-D represents a combination of both. Our method effectively reconstructs foreground and background regions, preserving object structures and scene consistency. We also report training images for reference that maximally cover the view space of EVS from another location for comparison. Notably, our proposed UrbanCraft outperforms the baselines regarding geometry and visual sanity.
  • ...and 12 more figures