ZeroScene: A Zero-Shot Framework for 3D Scene Generation from a Single Image and Controllable Texture Editing

Xiang Tang, Ruotong Li, Xiaopeng Fan

TL;DR

ZeroScene addresses the challenge of turning a single image into a coherent multi-object 3D scene with editable textures. It decouples foreground and background generation, uses segmentation, inpainting, and pseudo-stereo depth to create per-object point clouds, and jointly optimizes 3D and 2D layout cues to align the generated objects with the input image. Texture editing is achieved through a geometry-conditioned diffusion process with mask-guided progressive view synthesis and a back-projection step, plus PBR material estimation for realistic rendering. The framework yields explicit triangle meshes with improved geometric and texture quality compared with state-of-the-art baselines, enabling rapid asset generation for applications like digital twins and virtual environments.
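
As a rough illustration of how a segmentation mask and a depth map can yield a per-object point cloud, the sketch below lifts masked pixels into camera space. It assumes a simple pinhole camera; the function name `unproject_instance` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical and not taken from the paper, whose depth comes from a pseudo-stereo estimate.

```python
def unproject_instance(depth, mask, fx, fy, cx, cy):
    """Lift masked depth pixels to camera-space 3D points (pinhole model, assumed)."""
    points = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, inside) in enumerate(zip(depth_row, mask_row)):
            # Keep only pixels that belong to the instance and have valid depth.
            if inside and z > 0:
                x = (u - cx) * z / fx
                y = (v - cy) * z / fy
                points.append((x, y, z))
    return points


# Toy 2x3 depth map and instance mask (illustrative values only).
depth = [[2.0, 2.0, 0.0],
         [2.0, 4.0, 4.0]]
mask = [[1, 0, 1],
        [0, 1, 1]]
cloud = unproject_instance(depth, mask, fx=2.0, fy=2.0, cx=1.0, cy=0.5)
print(cloud)
```

Pixels outside the mask, or with missing depth, contribute no points, so each instance mask produces its own cloud.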

Abstract

In the field of 3D content generation, single-image scene reconstruction methods still struggle to simultaneously ensure the quality of individual assets and the coherence of the overall scene in complex environments, while texture editing techniques often fail to maintain both local continuity and multi-view consistency. In this paper, we propose ZeroScene, a novel system that leverages the prior knowledge of large vision models to accomplish both single-image-to-3D scene reconstruction and texture editing in a zero-shot manner. ZeroScene extracts object-level 2D segmentation and depth information from the input image to infer spatial relationships within the scene. It then jointly optimizes 3D and 2D projection losses of the point cloud to update object poses for precise scene alignment, ultimately constructing a coherent and complete 3D scene that encompasses both foreground and background. Moreover, ZeroScene supports texture editing of objects in the scene. By imposing constraints on the diffusion model and introducing a mask-guided progressive image generation strategy, we effectively maintain texture consistency across multiple viewpoints and further enhance the realism of rendered results through Physically Based Rendering (PBR) material estimation. Experimental results demonstrate that our framework not only ensures the geometric and appearance accuracy of generated assets, but also faithfully reconstructs scene layouts and produces highly detailed textures that closely align with text prompts.
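
The joint 3D and 2D objective can be pictured with a minimal sketch. Everything below is an assumption made for illustration rather than the paper's formulation: the pose is reduced to a uniform scale plus a translation (rotation omitted), both losses are symmetric Chamfer distances, the camera is a pinhole with focal length `f`, and gradients come from finite differences instead of automatic differentiation. All function names are hypothetical.

```python
def transform(points, scale, t):
    """Apply a uniform scale and translation to 3D points."""
    return [(scale * x + t[0], scale * y + t[1], scale * z + t[2])
            for x, y, z in points]


def project(points, f=1.0):
    """Pinhole projection onto the image plane (points must have z > 0)."""
    return [(f * x / z, f * y / z) for x, y, z in points]


def chamfer(a, b):
    """Symmetric Chamfer distance (mean squared nearest-neighbour distance)."""
    def one_way(src, dst):
        return sum(min(sum((p - q) ** 2 for p, q in zip(s, d)) for d in dst)
                   for s in src) / len(src)
    return one_way(a, b) + one_way(b, a)


def layout_loss(params, model_pts, target_pts, w2d=1.0):
    """3D Chamfer term plus a weighted Chamfer term in 2D projection space."""
    moved = transform(model_pts, params[0], params[1:])
    loss_3d = chamfer(moved, target_pts)
    loss_2d = chamfer(project(moved), project(target_pts))
    return loss_3d + w2d * loss_2d


def optimize_layout(model_pts, target_pts, init, steps=300, lr=0.05, eps=1e-4):
    """Gradient descent on [scale, tx, ty, tz] with central finite differences."""
    params = list(init)
    for _ in range(steps):
        grad = []
        for i in range(len(params)):
            hi, lo = params[:], params[:]
            hi[i] += eps
            lo[i] -= eps
            grad.append((layout_loss(hi, model_pts, target_pts)
                         - layout_loss(lo, model_pts, target_pts)) / (2 * eps))
        params = [p - lr * g for p, g in zip(params, grad)]
    return params


# Toy example: the "generated model" is a unit cube, and the "instance point
# cloud" is the same cube scaled by 1.5 and moved in front of the camera.
cube = [(x, y, z) for x in (-0.5, 0.5) for y in (-0.5, 0.5) for z in (-0.5, 0.5)]
target = transform(cube, 1.5, (0.3, -0.2, 4.0))
init = [1.2, 0.0, 0.0, 3.8]
fitted = optimize_layout(cube, target, init)
print([round(p, 3) for p in fitted])
```

In this toy setting the 3D term fixes depth and scale, while the 2D term penalizes misalignment as seen from the input view; with partial or noisy instance point clouds, the relative weight `w2d` would matter far more than it does here.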

Paper Structure

This paper contains 14 sections, 4 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Overview of 3D Scene Generation. We decouple the foreground and background of a given image. The assembly of foreground objects is achieved through three steps: instance segmentation and generation, scene point cloud extraction, and layout optimization. For the background environment, we fit planes from point clouds with color information. Finally, the foreground and background are integrated to construct a complete 3D scene that is multi-view consistent and spatially coherent.
  • Figure 2: Layout optimization process. Taking the bear doll as an example, the point cloud $\mathcal{M}_i$ of the generated model is depicted in blue, i.e., the object to be optimized, while the extracted instance point cloud $\mathcal{PC}_i$ is depicted in green, indicating the target. We visualize the optimization process in both 3D space and 2D projection space, integrating dual spatial information to achieve superior layout parameters.
  • Figure 3: Overview of Texture Editing. We utilize generated images for texture synthesis to enable editing. Given a mesh, we render its geometry-aware conditions, which are then injected into a diffusion model along with a text prompt. After obtaining a single-view image aligned with the geometric structure, a mask-guided progressive image generation strategy is employed to synthesize a sequence of RGB images with multi-view consistency. The resulting image set is preprocessed with lighting elimination and super-resolution, after which the texture is synthesized via a back projection module. Finally, PBR material estimation is incorporated to enhance rendering realism.
  • Figure 4: Mask-guided progressive image generation strategy. "Project" refers to the process of projecting all currently known regions into the latent space of the next viewpoint. "DM" denotes a pre-trained text-to-image diffusion model [team2024kolors] equipped with ControlNet [zhang2023adding].
  • Figure 5: Back projection module. The symbol $\oplus$ denotes weighted fusion. We integrate the confidence maps from multiple views with the corresponding color values of valid pixels through weighted fusion. By back projecting multi-view RGB images onto the UV space using the 3D model's UV unwrapping coordinates, we synthesize the albedo map.
  • ...and 9 more figures
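
The weighted fusion described in the Figure 5 caption can be sketched as a confidence-weighted average per texel. This is a minimal illustration under stated assumptions: each view is given as a list of `(texel_index, rgb, confidence)` samples that have already been mapped to UV space, and the way confidence is computed (for example, from the viewing angle) is left open. The name `fuse_views` and the data layout are hypothetical.

```python
def fuse_views(views, num_texels):
    """Confidence-weighted average of per-view colors for each UV texel."""
    color_sum = [[0.0, 0.0, 0.0] for _ in range(num_texels)]
    weight_sum = [0.0] * num_texels
    for samples in views:
        for texel, rgb, conf in samples:
            if conf <= 0:
                continue  # skip invalid pixels (assumed to carry zero confidence)
            for c in range(3):
                color_sum[texel][c] += conf * rgb[c]
            weight_sum[texel] += conf
    # Texels that no view observed stay None so they can be filled in later.
    return [tuple(a / w for a in acc) if w > 0 else None
            for acc, w in zip(color_sum, weight_sum)]


# Two toy views: texel 0 is seen by both, texel 1 by one view, texel 2 by none.
views = [
    [(0, (1.0, 0.0, 0.0), 0.75)],
    [(0, (0.0, 0.0, 1.0), 0.25), (1, (0.5, 0.5, 0.5), 1.0)],
]
albedo = fuse_views(views, num_texels=3)
print(albedo)
```

Texel 0 ends up dominated by the higher-confidence view, which is the intended effect of weighting by confidence rather than averaging views uniformly.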