Extend3D: Town-Scale 3D Generation

Seungwoo Yoon, Jinmo Kim, Jaesik Park

Abstract

In this paper, we propose Extend3D, a training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the limitation of the fixed-size latent space in object-centric models when representing wide scenes, we extend the latent space in the $x$ and $y$ directions. We then divide the extended latent space into overlapping patches, apply the object-centric 3D generative model to each patch, and couple the patches at every denoising step. Since patch-wise 3D generation with image conditioning requires strict spatial alignment between image and latent patches, we initialize the scene with a point-cloud prior from a monocular depth estimator and iteratively refine occluded regions through SDEdit. We find that treating the incompleteness of the 3D structure as noise during refinement enables 3D completion, a mechanism we term under-noising. Furthermore, to address the sub-optimality of object-centric models for sub-scene generation, we optimize the extended latent during denoising so that the denoising trajectories remain consistent with the sub-scene dynamics. To this end, we introduce 3D-aware optimization objectives that improve geometric structure and texture fidelity. We demonstrate that our method outperforms prior methods, as evidenced by human preference studies and quantitative experiments.
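
As a concrete illustration of the coupling step described above, the following is a minimal, hypothetical PyTorch sketch of overlapping patch-wise denoising: the extended latent is split with a sliding window, the fixed-size object-centric denoiser predicts a velocity for each patch, and overlapping predictions are averaged back into a single extended vector field. All names (`patchwise_velocity`, the `cond` patch-conditioning lookup, the patch and stride sizes) are illustrative assumptions, not the paper's actual interface.

```python
import torch

def patchwise_velocity(model, latent, cond, t, patch=64, stride=32):
    """Hypothetical sketch of overlapping patch-wise flow.

    latent: (C, X, Y, Z) extended latent; X and Y exceed the model's
            native size, so the denoiser cannot process it in one call.
    cond:   per-patch image conditioning, keyed by the patch origin.
    """
    C, X, Y, Z = latent.shape
    merged = torch.zeros_like(latent)
    weight = torch.zeros_like(latent[:1])          # overlap counts
    for x0 in range(0, X - patch + 1, stride):
        for y0 in range(0, Y - patch + 1, stride):
            lp = latent[:, x0:x0 + patch, y0:y0 + patch, :]
            # One call per latent patch: the model only ever sees its
            # native fixed-size latent, so no retraining is required.
            v = model(lp, cond[(x0, y0)], t)
            merged[:, x0:x0 + patch, y0:y0 + patch, :] += v
            weight[:, x0:x0 + patch, y0:y0 + patch, :] += 1.0
    # Averaging overlapping predictions couples neighbouring patches
    # into one consistent extended vector field at this time step.
    return merged / weight.clamp(min=1.0)
```

Simple averaging in the overlap regions is one plausible merge rule; the paper's exact coupling weights may differ.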

Figures (21)

  • Figure 1: The result of Extend3D. We generated a large-scale 3D scene from an image of Vatican City captured from Google Earth [google-earth].
  • Figure 2: The overall pipeline of our Extend3D. Extend3D consists of two stages: sparse structure generation and structured latent generation. In the denoising part of both stages, an overlapping patch-wise flow is used (Sec. "Overlapping Patch-wise Flow"; Fig. 3). In sparse structure generation, iterative SDEdit is used to initialize the structure (Sec. "Initialize with Prior"). The vector fields in both stages are optimized with priors (Sec. "Optimize with Prior").
  • Figure 3: Overlapping patch-wise flow. The extended latent is divided into latent patches with a sliding window. We then obtain the patch vector field for each latent patch and merge them into a single extended vector field, thereby coupling the patches.
  • Figure 4: Motivation of under-noising. The blue arrows represent actual noising or denoising, while the purple arrow illustrates how the model is presumed to perceive the sample (see the sketch after this figure list).
  • Figure 5: Qualitative results of our Extend3D. Our 3D scene generation result (with $a=b=2$) is compared against state-of-the-art 3D generative models. While previous methods may fail to represent the image faithfully or may lose scene details, our method effectively expresses the image condition in 3D. The input image is generated using Flux.1 [dev] [flux2024]. We provide additional results in Sec. "More Results".
  • ...and 16 more figures
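
The under-noising idea behind Figure 4 can be sketched in a few lines. The hypothetical code below assumes a rectified-flow denoiser with the convention $x_t = (1-t)\,x_0 + t\,\epsilon$: an incomplete structure is noised only to level `t_start`, but the denoiser is queried at a slightly higher timestep, so the missing geometry is attributed to noise and gets completed rather than preserved. The function name, the offset `dt_under`, and the step schedule are assumptions for illustration, not the paper's implementation.

```python
import torch

def refine_with_under_noising(model, x_incomplete, cond,
                              n_steps=25, t_start=0.6, dt_under=0.1):
    """Hypothetical SDEdit-style refinement with under-noising."""
    t = t_start
    eps = torch.randn_like(x_incomplete)
    # Rectified-flow noising: x_t = (1 - t) * x0 + t * eps.
    x = (1.0 - t) * x_incomplete + t * eps
    for i in range(n_steps):
        # Query the model at a timestep *above* the true noise level,
        # so the structure missing from x_incomplete is itself treated
        # as residual noise to be denoised away ("under-noising").
        t_seen = min(t + dt_under, 1.0)
        v = model(x, cond, t_seen)      # predicted velocity (eps - x0)
        step = t / (n_steps - i)        # uniform Euler steps down to t = 0
        x = x - step * v
        t = t - step
    return x
```

In the pipeline, such SDEdit-style passes would be applied iteratively to refine occluded regions, per the abstract.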