Scene Grounding In the Wild

Tamir Cohen, Leo Segre, Shay Shomer-Chai, Shai Avidan, Hadar Averbuch-Elor

Abstract

Reconstructing accurate 3D models of large-scale real-world scenes from unstructured, in-the-wild imagery remains a core challenge in computer vision, especially when the input views have little or no overlap. In such cases, existing reconstruction pipelines often produce multiple disconnected partial reconstructions or erroneously merge non-overlapping regions into overlapping geometry. In this work, we propose a framework that grounds each partial reconstruction to a complete reference model of the scene, enabling globally consistent alignment even in the absence of visual overlap. We obtain reference models from dense, geospatially accurate pseudo-synthetic renderings derived from Google Earth Studio. These renderings provide full scene coverage but differ substantially in appearance from real-world photographs. Our key insight is that, despite this significant domain gap, both domains share the same underlying scene semantics. We represent the reference model using 3D Gaussian Splatting, augmenting each Gaussian with semantic features, and formulate alignment as an inverse feature-based optimization scheme that estimates a global 6DoF pose and scale while keeping the reference model fixed. Furthermore, we introduce the WikiEarth dataset, which registers existing partial 3D reconstructions with pseudo-synthetic reference models. We demonstrate that our approach consistently improves global alignment when initialized with various classical and learning-based pipelines, while mitigating failure modes of state-of-the-art end-to-end models. All code and data will be released.
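
At its core, the method treats grounding as gradient-based optimization over a similarity transform: the 3DGS reference model stays frozen, and only a global rotation, translation, and scale are updated by backpropagating a semantic feature loss. The paper computes this loss between features rendered from the reference and features of the input photographs; the minimal sketch below substitutes a simpler, self-contained stand-in, aligning a partial point cloud with per-point semantic features to a fixed reference cloud via the same 6DoF+scale parameterization. All names and data here (ref_xyz, src_feat, the feature dimension, the learning rate) are hypothetical placeholders, not the paper's implementation.

```python
import torch

def quat_to_rotmat(q: torch.Tensor) -> torch.Tensor:
    """Convert a (possibly unnormalized) quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    q = q / q.norm()
    w, x, y, z = q
    return torch.stack([
        torch.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)]),
        torch.stack([2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)]),
        torch.stack([2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]),
    ])

# Hypothetical stand-in data: a fixed reference model as points with semantic
# features, and a partial reconstruction that should be grounded in its frame.
ref_xyz  = torch.randn(5000, 3)    # reference geometry (kept fixed)
ref_feat = torch.randn(5000, 64)   # per-point semantic features (kept fixed)
src_xyz  = torch.randn(800, 3)     # partial reconstruction
src_feat = torch.randn(800, 64)    # its semantic features

# 6DoF + scale parameters, initialized near identity (in practice, from SfM).
quat      = torch.tensor([1., 0., 0., 0.], requires_grad=True)
trans     = torch.zeros(3, requires_grad=True)
log_scale = torch.zeros(1, requires_grad=True)

opt = torch.optim.Adam([quat, trans, log_scale], lr=1e-2)
for step in range(500):
    opt.zero_grad()
    R = quat_to_rotmat(quat)
    warped = log_scale.exp() * src_xyz @ R.T + trans  # apply Sim(3) to the partial model

    # Semantic correspondence: match each source point to the reference point
    # with the most similar feature, then penalize the geometric residual.
    sim = torch.nn.functional.normalize(src_feat, dim=1) @ \
          torch.nn.functional.normalize(ref_feat, dim=1).T
    nn_idx = sim.argmax(dim=1)
    loss = (warped - ref_xyz[nn_idx]).pow(2).sum(dim=1).mean()
    loss.backward()
    opt.step()  # only the transform is updated; the reference stays fixed
```

Parameterizing scale as log_scale.exp() keeps the optimized scale positive, and initializing the transform from an SfM estimate, as in Figure 2, would replace the identity initialization above. Unlike this point-based stand-in, the paper's loss is re-rendered from the 3DGS model at every step.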

Paper Structure

This paper contains 27 sections, 2 equations, 16 figures, and 10 tables.

Figures (16)

  • Figure 1: Given a partial 3D reconstruction produced by running structure from motion on Internet images capturing large-scale landmarks, such as the front or the rear façade of the Milan Cathedral depicted above, we present a technique for grounding this reconstruction in a complete 3D reference model of the scene. Reference models are constructed from pseudo-synthetic renderings extracted from Google Earth Studio. As illustrated above, our approach allows for merging partial, disjoint 3D reconstructions into a unified model.
  • Figure 2: Scene Grounding via Semantic Feature-based Robust Optimization. Given a 3DGS reference model ${\cal M}$ (left) and a set of Internet images ${\cal I}$ (right), we propose an inverse optimization scheme that predicts a global 6DoF+scale alignment $T$ while keeping the parameters of ${\cal M}$ fixed. We obtain an initial transformation $T$ (in red) using a traditional SfM technique. During optimization, we calculate a semantic feature loss ${L_{sem}}$ and backpropagate it to update $T$ (converging to the rendered view in green after $N$ steps).
  • Figure 3: Challenges of aligning Internet photos to the reference model. Visualization of input Internet images (first and third columns) and views rendered from the reference model at the ground-truth locations (second and fourth columns). As illustrated above, high $L_{\text{sem}}$ values (bottom row) often indicate outlier images, which our approach overcomes via a robust optimization scheme, as further detailed in the paper's section on robust optimization (a minimal reweighting sketch follows this figure list).
  • Figure 4: The WikiEarth Benchmark. Reconstruction of four landmarks from WikiEarth. The blue frustums depict the views rendered from Google Earth Studio, and the red frustums depict the images from WikiScenes.
  • Figure 5: Qualitative Comparison. A visualization of the alignment results for our method compared to the three baselines. Each image shows the ground truth in the bottom half and the view rendered from the reference model $\mathcal{M}$ after alignment in the top half. As demonstrated, our inverse optimization-based approach predicts precise transformations, even in the presence of challenging, inaccurate initializations.
  • ...and 11 more figures
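
As Figure 3 notes, images with unusually high $L_{\text{sem}}$ tend to be outliers that the robust optimization must downweight. The exact robust estimator is not specified in this excerpt, so the sketch below illustrates one plausible choice, a Cauchy-style IRLS reweighting of per-image semantic losses; the function name, the constant c, and the toy loss values are hypothetical.

```python
import torch

def robust_semantic_loss(per_image_loss: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """Aggregate per-image semantic losses with Cauchy-style IRLS weights.

    Images with unusually high L_sem (likely outliers, cf. Figure 3) receive
    small weights and barely influence the pose/scale update. The weights are
    detached so gradients flow only through the losses themselves.
    """
    w = 1.0 / (1.0 + (per_image_loss.detach() / c) ** 2)
    return (w * per_image_loss).sum() / w.sum()

# Toy example: eight images, two with outlier-level semantic loss.
l_sem = torch.tensor([0.2, 0.3, 0.25, 0.4, 3.0, 0.35, 0.28, 2.5], requires_grad=True)
total = robust_semantic_loss(l_sem, c=0.5)
total.backward()
print(l_sem.grad)  # the two outlier entries receive much smaller gradients
```

In this scheme the gradient reaching the transform from image $i$ is scaled by $w_i / \sum_j w_j$, so a handful of mismatched or occluded photographs cannot dominate the 6DoF+scale update.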