Wonder3D++: Cross-domain Diffusion for High-fidelity 3D Generation from a Single Image

Yuxiao Yang, Xiao-Xiao Long, Zhiyang Dou, Cheng Lin, Yuan Liu, Qingsong Yan, Yuexin Ma, Haoqian Wang, Zhiqiang Wu, Wei Yin

TL;DR

Wonder3D++ tackles single-image 3D reconstruction by jointly modeling multi-view normals and colors through a cross-domain diffusion framework, enabling consistent 2D-to-3D mappings. It introduces a domain switcher, cross-domain attention, and a camera type switcher to couple geometry and texture across views, followed by a cascaded mesh extraction pipeline that initializes, coarsely reconstructs, and iteratively refines a textured mesh. The approach achieves high geometric and textural fidelity with strong generalization and efficiency, outperforming SDS-based and multi-view-based baselines on standard benchmarks and in-the-wild images. This framework offers a practical, zero-shot-capable route from a single image to high-quality 3D assets, with potential for text-driven 3D generation and broader adoption in 3D content creation workflows.

Abstract

In this work, we introduce Wonder3D++, a novel method for efficiently generating high-fidelity textured meshes from single-view images. Recent methods based on Score Distillation Sampling (SDS) have shown the potential to recover 3D geometry from 2D diffusion priors, but they typically suffer from time-consuming per-shape optimization and inconsistent geometry. In contrast, certain works directly produce 3D information via fast network inference, but their results are often of low quality and lack geometric details. To holistically improve the quality, consistency, and efficiency of single-view reconstruction tasks, we propose a cross-domain diffusion model that generates multi-view normal maps and the corresponding color images. To ensure the consistency of generation, we employ a multi-view cross-domain attention mechanism that facilitates information exchange across views and modalities. Lastly, we introduce a cascaded 3D mesh extraction algorithm that derives high-quality surfaces from the multi-view 2D representations in only about 3 minutes in a coarse-to-fine manner. Our extensive evaluations demonstrate that our method achieves high-quality reconstruction results, robust generalization, and good efficiency compared to prior works. Code is available at https://github.com/xxlong0/Wonder3D/tree/Wonder3D_Plus.

Paper Structure

This paper contains 48 sections, 13 equations, 19 figures, 5 tables.

Figures (19)

  • Figure 1: Wonder3D++ reconstructs highly-detailed textured meshes from a single-view image in only 3 minutes. Wonder3D++ first generates consistent multi-view normal maps with corresponding color images via a cross-domain diffusion model and then leverages a cascaded 3D mesh extraction method to achieve fast and high-quality reconstruction.
  • Figure 2: Overview of Wonder3D++. Given a single image, Wonder3D++ takes the input image feature produced by a VAE encoder, the image embedding produced by the CLIP model (Radford et al., 2021), the camera parameters of multiple views, a camera type switcher, and a domain switcher as conditioning to generate consistent multi-view normal maps and color images. Subsequently, Wonder3D++ applies a cascaded 3D mesh reconstruction algorithm, which uses a coarse-to-fine strategy consisting of geometric initialization, coarse mesh reconstruction, and iterative mesh refinement to produce a 3D mesh with high-quality geometry and high-fidelity textures.
  • Figure 3: Structure of the multi-view cross-domain transformer block, where the multi-view attention layer and the cross-domain attention layer facilitate information exchange across different views and domains, respectively.
  • Figure 4: Illustration of our geometric initialization strategy. We employ the Poisson reconstruction method (Kazhdan et al., 2006) for geometric initialization using normal maps from the front and back views, while estimated depth maps are used to detect and correct potential initialization errors.
  • Figure 5: Qualitative comparisons with baseline models on synthesized multi-view color images.
  • ...and 14 more figures