3D-Adapter: Geometry-Consistent Multi-View Diffusion for High-Quality 3D Generation

Hansheng Chen; Bokui Shen; Yulin Liu; Ruoxi Shi; Linqi Zhou; Connor Z. Lin; Jiayuan Gu; Hao Su; Gordon Wetzstein; Leonidas Guibas

TL;DR

3D-Adapter is a geometry-aware plug-in that injects 3D priors into pretrained image diffusion models through 3D feedback augmentation, addressing local 2D-3D misalignment without rearchitecting the base model. It comes in two variants: a fast GRM-based feed-forward version and a flexible optimization-based 3D backprojection version. Both render an intermediate 3D reconstruction as RGBD views and feed those views back into the denoiser through added feedback paths. The method improves 3D geometric consistency across text-to-3D, image-to-3D, text-to-texture, and text-to-avatar tasks, and it outperforms prior I/O sync and two-stage baselines on metrics such as MDD, CLIP, and texture realism while maintaining competitive visual quality. The framework adapts to multiple base models and reconstruction methods, which suggests broad applicability to 3D content creation and leaves room for future efficiency gains.

Abstract

Multi-view image diffusion models have significantly advanced open-domain 3D object generation. However, most existing models rely on 2D network architectures that lack inherent 3D biases, resulting in compromised geometric consistency. To address this challenge, we introduce 3D-Adapter, a plug-in module designed to infuse 3D geometry awareness into pretrained image diffusion models. Central to our approach is the idea of 3D feedback augmentation: for each denoising step in the sampling loop, 3D-Adapter decodes intermediate multi-view features into a coherent 3D representation, then re-encodes the rendered RGBD views to augment the pretrained base model through feature addition. We study two variants of 3D-Adapter: a fast feed-forward version based on Gaussian splatting and a versatile training-free version utilizing neural fields and meshes. Our extensive experiments demonstrate that 3D-Adapter not only greatly enhances the geometry quality of text-to-multi-view models such as Instant3D and Zero123++, but also enables high-quality 3D generation using the plain text-to-image Stable Diffusion. Furthermore, we showcase the broad application potential of 3D-Adapter by presenting high-quality results in text-to-3D, image-to-3D, text-to-texture, and text-to-avatar tasks.
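The 3D feedback augmentation loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every function name below (`toy_denoiser`, `toy_reconstruct`, `toy_render`, `toy_encode`, `sample_views`, `cross_view_spread`) is hypothetical. Each view is reduced to a flat feature vector, the "3D representation" is simply the mean over views, and "rendering" copies that shared representation back to every view. The sketch only shows the control flow: denoise, reconstruct, render, re-encode, add features.

```python
import numpy as np


def toy_denoiser(x, feedback=None):
    """Hypothetical stand-in for the pretrained multi-view denoiser.

    Processes each view independently (no 3D bias). If feedback features
    are supplied, they are combined with the output by plain addition.
    """
    x0_hat = 0.9 * x
    if feedback is not None:
        x0_hat = x0_hat + feedback
    return x0_hat


def toy_reconstruct(views):
    """Toy '3D reconstruction': one shared vector fitted to all views."""
    return views.mean(axis=0)


def toy_render(shared, num_views):
    """Toy 'rendering': every view observes the same shared representation."""
    return np.tile(shared, (num_views, 1))


def toy_encode(rendered, x0_hat, gain):
    """Toy feedback encoder: features that pull each view toward its render."""
    return gain * (rendered - x0_hat)


def cross_view_spread(views):
    """Mean per-feature standard deviation across views (0 = fully consistent)."""
    return float(views.std(axis=0).mean())


def sample_views(num_views=4, dim=8, steps=10, step_size=0.5,
                 feedback_gain=0.5, seed=0):
    """Run the toy sampling loop with one feedback pass per denoising step."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(num_views, dim))  # start from independent noise
    for _ in range(steps):
        x0_hat = toy_denoiser(x)                       # intermediate estimate
        shared = toy_reconstruct(x0_hat)               # decode to a 3D proxy
        rendered = toy_render(shared, num_views)       # render consistent views
        feedback = toy_encode(rendered, x0_hat, feedback_gain)
        x0_aug = toy_denoiser(x, feedback)             # augmented estimate
        x = x + step_size * (x0_aug - x)               # move toward the estimate
    return x


if __name__ == "__main__":
    with_fb = sample_views(feedback_gain=0.5)
    without_fb = sample_views(feedback_gain=0.0)
    print("spread with feedback:   ", cross_view_spread(with_fb))
    print("spread without feedback:", cross_view_spread(without_fb))
```

In this toy setting the per-view denoiser alone never reduces disagreement between views relative to their scale, while the feedback term shrinks it at every step. A real system would replace the mean with a Gaussian-splatting or neural-field reconstruction and the copy with an RGBD renderer, as the abstract describes.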
