Self-Evolving 3D Scene Generation from a Single Image

Kaizhi Zheng, Yue Fan, Jing Gu, Zishuo Xu, Xuehai He, Xin Eric Wang

TL;DR

EvoScene introduces a training-free, self-evolving pipeline that progressively converts a single image into a complete textured 3D scene by cycling geometry and appearance across three stages. It combines depth-based priors and 3D diffusion for mesh completion with depth-conditioned video diffusion for photorealistic textures, iteratively expanding scene coverage. Through iterative refinement and depth conditioning, EvoScene achieves better geometry, layout coherence, and texture fidelity than state-of-the-art baselines, as measured by human evaluation, GPT-4o evaluation, and automatic metrics. The approach demonstrates practical potential for automated 3D content creation from minimal input, without requiring additional training data or fine-tuning of models.

Abstract

Generating high-quality, textured 3D scenes from a single image remains a fundamental challenge in vision and graphics. Recent image-to-3D generators recover reasonable geometry from single views, but their object-centric training limits generalization to complex, large-scale scenes with faithful structure and texture. We present EvoScene, a self-evolving, training-free framework that progressively reconstructs complete 3D scenes from single images. The key idea is combining the complementary strengths of existing models: geometric reasoning from 3D generation models and visual knowledge from video generation models. Through three iterative stages--Spatial Prior Initialization, Visual-guided 3D Scene Mesh Generation, and Spatial-guided Novel View Generation--EvoScene alternates between 2D and 3D domains, gradually improving both structure and appearance. Experiments on diverse scenes demonstrate that EvoScene achieves superior geometric stability, view-consistent textures, and unseen-region completion compared to strong baselines, producing ready-to-use 3D meshes for practical applications.

Paper Structure

This paper contains 31 sections, 5 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Self-evolving 3D scene generation from a single image. Given a single input photograph, EvoScene progressively builds a complete 3D scene through a virtuous cycle where geometry and appearance mutually refine each other. Our method synthesizes photorealistic novel views and produces high-quality 3D representations (Gaussians and meshes) with substantially improved angular coverage and completeness. Project page: https://eric-ai-lab.github.io/evoscene.github.io/
  • Figure 2: Overview of EvoScene. Our self-evolving framework consists of three coupled stages that form a virtuous cycle: (A) Spatial Prior Initialization extracts depth from 2D observations and back-projects to 3D point clouds, providing geometric constraints. (B) Visual-guided 3D Scene Mesh Generation uses a 3D diffusion model with test-time rendering optimization to complete the mesh guided by point cloud priors. (C) Spatial-guided Novel View Generation renders depth from the mesh to guide depth-conditioned video diffusion, synthesizing photorealistic multi-view images. These new views are converted back to point clouds (via depth re-estimation and multi-view filtering) and fed into the next iteration. This cycle progressively refines geometry and appearance to produce a complete 3D scene from a single input image.
  • Figure 3: Qualitative comparisons across diverse scene types. From left to right: reference images, EvoScene (ours), Trellis, Hunyuan3D-2.1, and TripoSG. Scenes include residential neighborhoods (rows 1-2), urban architecture (row 3), and historical landmarks (row 4). EvoScene produces complete geometry with preserved architectural details and photorealistic textures, while baselines exhibit fragmentation, flattened representations, or severe geometric distortions.
  • Figure 4: Ablation: Impact of iterative refinement. Comparison between single-iteration (Mesh 0) and multi-iteration (Mesh T) reconstruction. The single-iteration mesh exhibits geometric errors in the highlighted region (distorted architectural structures), which are corrected after iterative refinement through accumulated multi-view observations and geometry-appearance co-evolution.
  • Figure 5: Ablation: Impact of depth-conditioned video guidance. Comparison between a pose-only world model (FlashWorld [li2025flashworld]) and our depth-conditioned video generation. The pose-only baseline generates visually plausible frames but produces geometrically inconsistent 3D meshes with severe distortions and broken surfaces (right). Our depth conditioning provides geometric scaffolding that ensures multi-view consistency and stable 3D reconstruction.
  • ...and 2 more figures
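
The Figure 2 caption describes stage (A) as extracting depth from 2D observations and back-projecting it to 3D point clouds. A minimal sketch of that back-projection step is given below. It assumes a standard pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), which this summary does not specify. The function name `backproject_depth` and the convention of skipping non-positive depths are illustrative choices, not taken from the EvoScene code.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_depth(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Lift a depth map to camera-frame 3D points (pinhole model, assumed).

    depth[v][u] is the z-distance along the optical axis for pixel (u, v).
    Non-positive depths are treated as invalid and skipped (an assumption).
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:
                continue  # no valid depth estimate at this pixel
            # Invert the pinhole projection u = fx * x / z + cx (same for v).
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map with one invalid pixel; intrinsics are made-up values.
toy_depth = [[2.0, 2.0], [0.0, 4.0]]
toy_points = backproject_depth(toy_depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

In the cycle described by the caption, points produced this way would serve as the geometric prior for mesh completion, and the same operation would be repeated on the re-estimated depth of newly generated views before multi-view filtering.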