Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion
Yuanxun Lu, Jingyang Zhang, Shiwei Li, Tian Fang, David McKinnon, Yanghai Tsin, Long Quan, Xun Cao, Yao Yao
TL;DR
This work introduces Direct2.5, a fast and diverse text-to-3D generation framework built on a multi-view 2.5D diffusion model fine-tuned from a pre-trained 2D diffusion model. It generates multi-view normal maps, fuses them into a coherent mesh via differentiable rasterization, and then synthesizes textures with a normal-conditioned diffusion model, all in a single pass without SDS optimization. Key innovations include cross-view attention to enforce multi-view consistency, explicit geometry fusion using space carving and differentiable rendering, and texture synthesis conditioned on 2.5D geometry, which together produce high-fidelity 3D content in about 10 seconds. Extensive experiments show strong generalization to unseen prompts, diverse outputs, and competitive quality against SDS-based methods while dramatically reducing generation time.
Abstract
Recent advances in generative AI have unveiled significant potential for the creation of 3D content. However, current methods either apply a pre-trained 2D diffusion model with the time-consuming score distillation sampling (SDS), or use a direct 3D diffusion model trained on limited 3D data, which loses generation diversity. In this work, we approach the problem by employing a multi-view 2.5D diffusion model fine-tuned from a pre-trained 2D diffusion model. The multi-view 2.5D diffusion directly models the structural distribution of 3D data while still maintaining the strong generalization ability of the original 2D diffusion model, filling the gap between 2D diffusion-based and direct 3D diffusion-based methods for 3D content generation. During inference, multi-view normal maps are generated using the 2.5D diffusion, and a novel differentiable rasterization scheme is introduced to fuse the nearly consistent multi-view normal maps into a consistent 3D model. We further design a normal-conditioned multi-view image generation module for fast appearance generation given the 3D geometry. Our method is a one-pass diffusion process and does not require any SDS optimization as post-processing. We demonstrate through extensive experiments that our direct 2.5D generation with the specially designed fusion scheme can achieve diverse, mode-seeking-free, and high-fidelity 3D content generation in only 10 seconds. Project page: https://nju-3dv.github.io/projects/direct25.
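
To make the space carving idea mentioned in the TL;DR concrete, the following is a minimal sketch of a generic visual-hull computation, not the fusion scheme used in this work. It assumes a toy setup that does not come from the paper: three orthographic, axis-aligned views of a small voxel grid, with binary silhouette masks standing in for the foreground regions of generated normal maps. The names `carve` and `disk_mask` are hypothetical helpers introduced only for illustration.

```python
from itertools import product


def carve(n, silhouettes):
    """Return the voxels (i, j, k) of an n^3 grid that fall inside every silhouette.

    `silhouettes` maps a viewing axis (0, 1 or 2) to an n x n boolean mask seen
    along that axis under an orthographic projection (an assumed toy setup).
    """
    kept = set()
    for voxel in product(range(n), repeat=3):
        inside_all = True
        for axis, mask in silhouettes.items():
            # Dropping the coordinate along the viewing axis gives the pixel.
            u, v = [c for a, c in enumerate(voxel) if a != axis]
            if not mask[u][v]:
                inside_all = False  # one view sees background: carve the voxel
                break
        if inside_all:
            kept.add(voxel)
    return kept


def disk_mask(n, radius):
    """Hypothetical silhouette: a filled disk centered in an n x n image."""
    c = (n - 1) / 2.0
    return [[(u - c) ** 2 + (v - c) ** 2 <= radius ** 2 for v in range(n)]
            for u in range(n)]


n = 9
mask = disk_mask(n, 3.0)
hull = carve(n, {0: mask, 1: mask, 2: mask})
```

The carved set `hull` is only a coarse bound on the shape, since silhouettes carry no information about concavities. In the pipeline described above, such a coarse shape would still need to be refined against the generated normal maps, which the abstract attributes to the differentiable rasterization step.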
