Optimized View and Geometry Distillation from Multi-view Diffuser

Youjia Zhang, Zikai Song, Junqing Yu, Yawei Luo, Wei Yang

TL;DR

This work tackles the challenge of generating consistent multi-view imagery and accurate 3D geometry from a single image using diffusion models. It introduces Unbiased Score Distillation (USD) to remove bias in unconditional noise predictions from multi-view diffusers, and pairs this with a NeRF-based view-consistency prior and a two-stage DreamBooth specialization to produce high-quality, pose-flexible renderings. Geometry is recovered via NeuS from denoised views, yielding faithful surfaces and textures with competitive performance against state-of-the-art methods that rely on large-scale data. The approach demonstrates strong multi-view consistency, robust novel-view synthesis, and effective text-to-3D translation, highlighting practical potential for flexible 3D generation from single views.

Abstract

Generating multi-view images from a single input view using image-conditioned diffusion models is a recent advancement and has shown considerable potential. However, issues such as the lack of consistency in synthesized views and over-smoothing in extracted geometry persist. Previous methods integrate multi-view consistency modules or impose additional supervision to enhance view consistency, at the cost of reduced flexibility in camera positioning and less versatile view synthesis. In this study, we consider the radiance field optimized during geometry extraction as a more rigid consistency prior, compared to the volume and ray aggregation used in previous works. We further identify and rectify a critical bias in the traditional radiance field optimization process through score distillation from a multi-view diffuser. We introduce an Unbiased Score Distillation (USD) that utilizes unconditional noise predictions from a 2D diffusion model, greatly refining the radiance field fidelity. We use the rendered views from the optimized radiance field as the basis and develop a two-step specialization process for a 2D diffusion model, making it adept at object-specific denoising and at generating high-quality multi-view images. Finally, we recover faithful geometry and texture directly from the refined multi-view images. Empirical evaluations demonstrate that our optimized geometry and view distillation technique generates results comparable to those of state-of-the-art models trained on extensive datasets, all while maintaining freedom in camera positioning. Please see our project page at https://youjiazhang.github.io/USD/.

Paper Structure (20 sections, 10 equations, 17 figures, 6 tables)

This paper contains 20 sections, 10 equations, 17 figures, 6 tables.

Figures (17)

  • Figure 1: The unconditional noise predicted by the Zero-1-to-3 model tends to be biased. As a demonstration, we use the '$\textit{Mario}$' image as a toy example and add various levels of noise to the image (larger $\mathbf{T}$ means more noise has been added). We use the predicted unconditional noise to recover the original image from the noisy input and find that the results of Zero-1-to-3 deviate greatly from the input image even for a very small amount of noise. The right sub-figure shows the averaged difference between the predicted noise and the added noise.
  • Figure 2: The overall pipeline of our approach. We first use our Unbiased Score Distillation to extract an optimized underlying radiance field. We then use the NeRF as our consistency prior, i.e., the generated views should be consistent with the NeRF renderings. We propose a two-stage specialization scheme to obtain a DreamBooth model specialized for the target object. We then denoise the NeRF renderings to obtain high-quality views and subsequently use the NeuS technique to recover the geometry. Without training on large-scale data, our optimized scheme generates results comparable to the SOTA works, and sometimes better, particularly for irregular camera poses.
  • Figure 3: The qualitative comparisons with baseline models on multi-view color images. Our approach generates consistent multi-view images while preserving the image details.
  • Figure 4: Training and inference process of Zero-1-to-3. Training for predicting unconditional noise involves setting the $c_I$ conditions to 0 at regular intervals.
  • Figure 5: Qualitative comparisons of our method with baseline approaches, namely Wonder3D, SyncDreamer, and One-2-3-45, on the GSO dataset, focusing on the quality of the reconstructed textured meshes.
  • ...and 12 more figures
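
The probe described in the Figure 1 caption (add noise at level T, predict the noise, recover the image, compare predicted and added noise) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes the standard DDPM forward process with a linear beta schedule, and the helper names (`make_alpha_bar`, `make_toy_predictor`, `probe_bias`) are hypothetical. The toy predictor stands in for a real diffusion model; it returns the exact noise plus a constant offset, so it only shows how a fixed bias in the predicted noise propagates into the recovered image.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(x0, eps, a_bar_t):
    """Forward diffusion: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    return np.sqrt(a_bar_t) * x0 + np.sqrt(1.0 - a_bar_t) * eps


def recover_x0(x_t, eps_pred, a_bar_t):
    """One-step estimate of x0 obtained by inverting the forward process."""
    return (x_t - np.sqrt(1.0 - a_bar_t) * eps_pred) / np.sqrt(a_bar_t)


def make_toy_predictor(x0, alpha_bar, bias=0.0):
    """Hypothetical stand-in for a noise-prediction network.

    It knows x0, so it can compute the exact noise and then add a constant
    bias. A real model would be called with the same (x_t, t) arguments.
    """
    def predict_noise(x_t, t):
        a = alpha_bar[t]
        true_eps = (x_t - np.sqrt(a) * x0) / np.sqrt(1.0 - a)
        return true_eps + bias
    return predict_noise


def probe_bias(x0, predict_noise, alpha_bar, timesteps, seed=0):
    """For each timestep, report the mean noise difference and the
    mean absolute error of the recovered image."""
    rng = np.random.default_rng(seed)
    results = {}
    for t in timesteps:
        eps = rng.standard_normal(x0.shape)
        x_t = add_noise(x0, eps, alpha_bar[t])
        eps_pred = predict_noise(x_t, t)
        x0_hat = recover_x0(x_t, eps_pred, alpha_bar[t])
        results[t] = {
            "noise_diff": float(np.mean(eps_pred - eps)),
            "recon_err": float(np.mean(np.abs(x0_hat - x0))),
        }
    return results


if __name__ == "__main__":
    alpha_bar = make_alpha_bar()
    image = np.linspace(-1.0, 1.0, 48).reshape(4, 4, 3)  # placeholder "image"
    predictor = make_toy_predictor(image, alpha_bar, bias=0.1)
    for t, stats in probe_bias(image, predictor, alpha_bar, [10, 200, 800]).items():
        print(t, stats)
```

In this toy setting, a constant bias b in the predicted noise shifts the recovered image by sqrt((1 - a_bar_t) / a_bar_t) * b, so the same bias does more damage at larger T. To probe an actual model, one would replace `make_toy_predictor` with a wrapper around its unconditional noise prediction.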