3D-Adapter: Geometry-Consistent Multi-View Diffusion for High-Quality 3D Generation

Hansheng Chen, Bokui Shen, Yulin Liu, Ruoxi Shi, Linqi Zhou, Connor Z. Lin, Jiayuan Gu, Hao Su, Gordon Wetzstein, Leonidas Guibas

TL;DR

3D-Adapter introduces a geometry-aware plug-in that injects 3D priors into pretrained image diffusion models via 3D feedback augmentation, addressing local 2D-3D misalignment without rearchitecting the base model. It offers two variants: a fast GRM-based feed-forward version and a flexible optimization-based 3D backprojection approach, both leveraging intermediate 3D reconstructions rendered as RGBD views to guide denoising through added feedback paths. The method improves 3D geometry consistency across text-to-3D, image-to-3D, text-to-texture, and text-to-avatar tasks, outperforming prior I/O sync and two-stage baselines in metrics like MDD, CLIP, and texture realism, while maintaining competitive visual quality. The framework is adaptable to multiple base models and reconstruction methods, highlighting its broad applicability and potential for future efficiency gains and broader 3D content creation use cases.

Abstract

Multi-view image diffusion models have significantly advanced open-domain 3D object generation. However, most existing models rely on 2D network architectures that lack inherent 3D biases, resulting in compromised geometric consistency. To address this challenge, we introduce 3D-Adapter, a plug-in module designed to infuse 3D geometry awareness into pretrained image diffusion models. Central to our approach is the idea of 3D feedback augmentation: for each denoising step in the sampling loop, 3D-Adapter decodes intermediate multi-view features into a coherent 3D representation, then re-encodes the rendered RGBD views to augment the pretrained base model through feature addition. We study two variants of 3D-Adapter: a fast feed-forward version based on Gaussian splatting and a versatile training-free version utilizing neural fields and meshes. Our extensive experiments demonstrate that 3D-Adapter not only greatly enhances the geometry quality of text-to-multi-view models such as Instant3D and Zero123++, but also enables high-quality 3D generation using the plain text-to-image Stable Diffusion. Furthermore, we showcase the broad application potential of 3D-Adapter by presenting high-quality results in text-to-3D, image-to-3D, text-to-texture, and text-to-avatar tasks.
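
The feedback loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every function below (base_features, decode_views, reconstruct_3d, render_rgbd, encode_feedback) is a hypothetical stand-in built from simple NumPy arithmetic, and the "3D representation" is just a per-pixel average over views. The sketch only shows the control flow of decoding intermediate views, fusing them into one shared representation, rendering RGBD, and adding the re-encoded feedback to the base features at every step.

```python
import numpy as np

def base_features(x_t, t):
    # Hypothetical stand-in for the pretrained base model's intermediate features.
    return 0.9 * x_t

def decode_views(features):
    # Hypothetical decoder from features to intermediate multi-view images.
    return features.copy()

def reconstruct_3d(views):
    # Toy "3D representation" (assumption): one image shared by all views,
    # so anything rendered from it is consistent by construction.
    return views.mean(axis=0)

def render_rgbd(rep, num_views):
    # Toy renderer: repeat the shared RGB per view and append a constant depth channel.
    rgb = np.repeat(rep[None], num_views, axis=0)
    depth = np.ones(rgb.shape[:-1] + (1,))
    return np.concatenate([rgb, depth], axis=-1)

def encode_feedback(rgbd, features, weight=0.5):
    # Toy feedback encoder: a residual that pulls features toward the rendered RGB.
    return weight * (rgbd[..., :3] - features)

def sample(x_T, num_steps=10, use_adapter=True):
    x = x_T
    for t in range(num_steps, 0, -1):
        feats = base_features(x, t)
        if use_adapter:
            views = decode_views(feats)                   # intermediate multi-view output
            rep = reconstruct_3d(views)                   # fuse views into one representation
            rgbd = render_rgbd(rep, num_views=x.shape[0])
            feats = feats + encode_feedback(rgbd, feats)  # feature addition
        x = decode_views(feats)
    return x

def cross_view_spread(views):
    # Mean per-pixel standard deviation across the view axis (lower = more consistent).
    return float(views.std(axis=0).mean())

if __name__ == "__main__":
    x_T = np.random.default_rng(0).standard_normal((4, 8, 8, 3))
    print(cross_view_spread(sample(x_T, use_adapter=True)))
    print(cross_view_spread(sample(x_T, use_adapter=False)))
```

In this toy setting the feedback path shrinks the disagreement between views faster than the base path alone. It only mirrors the intuition behind 3D feedback augmentation and says nothing about the behavior of the real models.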

Paper Structure

This paper contains 48 sections, 14 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Comparison between the results generated by different architectures. Texture refinement is enabled for text-to-3D, image-to-3D, and text-to-avatar.
  • Figure 2: Comparison between different architectures. For brevity, we omit the condition encoders (e.g., text encoders), the rendered alpha channel, and the noisy RGB input for the ControlNet. For LDMs (e.g., Stable Diffusion), VAE encoders and decoders are required, and * denotes RGB latents.
  • Figure 3: Comparison on text-to-3D generation. Both 3D-Adapter and I/O sync fix the broken geometry and floaters present in the two-stage method, but I/O sync suffers from blurriness.
  • Figure 4: Comparison of mesh-based image-to-3D methods on the GSO test set.
  • Figure 5: Comparison on text-to-texture generation.
  • ...and 10 more figures