ZeroScene: A Zero-Shot Framework for 3D Scene Generation from a Single Image and Controllable Texture Editing

Xiang Tang; Ruotong Li; Xiaopeng Fan

TL;DR

ZeroScene addresses the challenge of turning a single image into a coherent multi-object 3D scene with editable textures. It decouples foreground and background generation, uses segmentation, inpainting, and pseudo-stereo depth to create per-object point clouds, and jointly optimizes 3D and 2D layout cues to align the scene with the input image. Texture editing is achieved through a geometry-conditioned diffusion process with mask-guided progressive view synthesis and a back-projection step, plus PBR material estimation for realistic rendering. The framework yields explicit triangle meshes with improved geometric and texture quality compared with state-of-the-art baselines, enabling rapid asset generation for applications such as digital twins and virtual environments.

Abstract

In the field of 3D content generation, single-image scene reconstruction methods still struggle to simultaneously ensure the quality of individual assets and the coherence of the overall scene in complex environments, while texture editing techniques often fail to maintain both local continuity and multi-view consistency. In this paper, we propose ZeroScene, a novel system that leverages the prior knowledge of large vision models to accomplish both single-image-to-3D scene reconstruction and texture editing in a zero-shot manner. ZeroScene extracts object-level 2D segmentation and depth information from the input image to infer spatial relationships within the scene. It then jointly optimizes 3D and 2D projection losses of the point cloud to update object poses for precise scene alignment, ultimately constructing a coherent and complete 3D scene that encompasses both foreground and background. Moreover, ZeroScene supports texture editing of objects in the scene. By imposing constraints on the diffusion model and introducing a mask-guided progressive image generation strategy, we effectively maintain texture consistency across multiple viewpoints and further enhance the realism of rendered results through Physically Based Rendering (PBR) material estimation. Experimental results demonstrate that our framework not only ensures the geometric and appearance accuracy of generated assets, but also faithfully reconstructs scene layouts and produces highly detailed textures that closely align with text prompts.
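
To make the idea of jointly optimizing 3D and 2D projection losses more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a pinhole camera, restricts the object pose to a uniform scale plus a translation (no rotation), and uses a symmetric nearest-neighbour (Chamfer-style) distance for both terms. The function names, the weight `w2d`, and the intrinsics `focal`, `cx`, `cy` are all hypothetical and chosen only for illustration.

```python
import numpy as np

def project(points, focal, cx, cy):
    """Pinhole projection of (N, 3) camera-space points to (N, 2) pixel coordinates."""
    z = points[:, 2:3]
    return focal * points[:, :2] / z + np.array([cx, cy])

def chamfer(a, b):
    """Symmetric mean nearest-neighbour distance between point sets a (N, D) and b (M, D)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def joint_loss(obj_points, scale, translation, target_points,
               focal, cx, cy, w2d=1.0):
    """Illustrative joint objective: 3D point cloud term plus weighted 2D projection term.

    Assumption: the pose is only a uniform scale and a translation.
    """
    posed = scale * obj_points + translation
    loss_3d = chamfer(posed, target_points)
    loss_2d = chamfer(project(posed, focal, cx, cy),
                      project(target_points, focal, cx, cy))
    return loss_3d + w2d * loss_2d

# Toy data: a canonical object and a "depth-derived" target cloud at a known pose.
rng = np.random.default_rng(0)
obj = rng.uniform(-0.5, 0.5, size=(200, 3))
true_scale, true_t = 1.5, np.array([0.2, -0.1, 3.0])
target = true_scale * obj + true_t

# The loss vanishes at the true pose and grows when the pose is perturbed.
loss_true = joint_loss(obj, true_scale, true_t, target, 500.0, 320.0, 240.0)
loss_off = joint_loss(obj, 1.2, true_t + np.array([0.3, 0.0, 0.5]),
                      target, 500.0, 320.0, 240.0)
print(loss_true, loss_off)
```

In this toy setup the 2D term is measured in pixels and the 3D term in scene units, so in practice the weight `w2d` would need tuning; a gradient-based optimizer over the pose parameters would replace the manual comparison shown here.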
