Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion

Yuanxun Lu; Jingyang Zhang; Shiwei Li; Tian Fang; David McKinnon; Yanghai Tsin; Long Quan; Xun Cao; Yao Yao

TL;DR

This work introduces Direct2.5, a fast and diverse text-to-3D generation framework built on a multi-view 2.5D diffusion model fine-tuned from a pre-trained 2D diffusion model. It generates multi-view normal maps, fuses them into a coherent mesh via differentiable rasterization, and then synthesizes textures with a normal-conditioned diffusion model, all in a single pass without SDS optimization. Key innovations include cross-view attention to enforce multi-view consistency, explicit geometry fusion using space carving and differentiable rendering, and texture synthesis conditioned on 2.5D geometry, which together yield high-fidelity 3D content in about 10 seconds. Extensive experiments show strong generalization to unseen prompts, diverse outputs, and quality competitive with SDS-based methods at a dramatically lower generation time.
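To make the cross-view attention idea concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes the common formulation in which the tokens of all views of one object are concatenated into a single sequence before self-attention, so every view can attend to every other view. The function name `cross_view_attention`, the single-head layout, and the projection matrices `w_q`, `w_k`, `w_v` are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(tokens, w_q, w_k, w_v, num_views):
    """Single-head self-attention shared across views (illustrative sketch).

    tokens: (batch * num_views, n_tokens, dim), with the views of each
            object stored contiguously along the first axis.
    """
    bv, n, d = tokens.shape
    b = bv // num_views
    # Merge the tokens of all views of one object into one long sequence.
    x = tokens.reshape(b, num_views * n, d)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    out = softmax(scores, axis=-1) @ v
    # Split the sequence back into per-view token grids.
    return out.reshape(bv, n, -1)

rng = np.random.default_rng(0)
num_views, n_tokens, dim = 4, 6, 8
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))
x = rng.standard_normal((2 * num_views, n_tokens, dim))
y = cross_view_attention(x, w_q, w_k, w_v, num_views)
```

Because attention runs over the merged sequence, perturbing the tokens of one view changes the output of the other views of the same object, while other objects in the batch are unaffected.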

Abstract

Recent advances in generative AI have unveiled significant potential for the creation of 3D content. However, current methods either apply a pre-trained 2D diffusion model with the time-consuming score distillation sampling (SDS), or use a direct 3D diffusion model trained on limited 3D data, which loses generation diversity. In this work, we approach the problem by employing a multi-view 2.5D diffusion model fine-tuned from a pre-trained 2D diffusion model. The multi-view 2.5D diffusion directly models the structural distribution of 3D data while still maintaining the strong generalization ability of the original 2D diffusion model, filling the gap between 2D diffusion-based and direct 3D diffusion-based methods for 3D content generation. During inference, multi-view normal maps are generated using the 2.5D diffusion, and a novel differentiable rasterization scheme is introduced to fuse the almost consistent multi-view normal maps into a consistent 3D model. We further design a normal-conditioned multi-view image generation module for fast appearance generation given the 3D geometry. Our method is a one-pass diffusion process and does not require any SDS optimization as post-processing. We demonstrate through extensive experiments that our direct 2.5D generation with the specially designed fusion scheme can achieve diverse, mode-seeking-free, and high-fidelity 3D content generation in only 10 seconds. Project page: https://nju-3dv.github.io/projects/direct25.
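The Figure 2 caption notes that mesh optimization starts from a space carving initialization. As a rough illustration of that idea only, the NumPy sketch below carves a voxel grid from two orthogonal silhouettes. It assumes orthographic, axis-aligned cameras and binary foreground masks (for example, thresholded from the generated normal maps); the function name `carve_visual_hull` and the grid layout are hypothetical and not taken from the paper.

```python
import numpy as np

def carve_visual_hull(front_mask, side_mask):
    """Keep a voxel only if it projects inside both silhouettes.

    front_mask: (R, R) bool, indexed [y, x], seen along the z axis.
    side_mask:  (R, R) bool, indexed [y, z], seen along the x axis.
    Returns an occupancy grid of shape (R, R, R), indexed [y, x, z].
    Assumes orthographic projection, so each pixel carves a straight
    axis-aligned column of voxels.
    """
    front = np.asarray(front_mask, dtype=bool)
    side = np.asarray(side_mask, dtype=bool)
    # Broadcast each silhouette along its viewing axis and intersect.
    return front[:, :, None] & side[:, None, :]

R = 8
front = np.zeros((R, R), dtype=bool)
side = np.zeros((R, R), dtype=bool)
front[2:6, 2:6] = True   # square silhouette in the front view
side[2:6, 3:7] = True    # shifted square in the side view
occupancy = carve_visual_hull(front, side)
```

Under these assumptions the result is a coarse visual hull, which a surface extraction step could turn into an initial mesh before the normal maps are used to refine fine detail.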

Paper Structure (30 sections, 2 equations, 13 figures, 3 tables, 1 algorithm)

Introduction
Related Work
3D Generation by Score Distillation
Direct 3D Diffusion
Multi-view Diffusion
Method
Diffusion Models and 2.5D Adaptation
Cross-view Attention
Explicit Multi-view 2.5D Fusion
Texture Synthesis
Implementation Details
Dataset Preparation
Training Setup
Experiments
Text-to-3D contents generation
...and 15 more sections

Figures (13)

Figure 1: Overview of our text-to-3D content generation system. The generation is a two-stage process, first generating geometry and then appearance. Specifically, the system is composed of the following steps: 1) a single denoising process to simultaneously generate 4 normal maps; 2) fast mesh optimization by differentiable rasterization; 3) a single denoising process to generate 4 images conditioned on rendered normal maps; 4) texture construction from multi-view images. The whole generation process takes only 10 seconds.
Figure 1: Visualization of more than 200 optimization steps.
Figure 2: Illustration of explicit geometry optimization. (a) shows the generated normal images given the prompt "a DSLR photo of a pirate collie dog, high resolution". (b) shows the mesh produced by space carving initialization in the front and side views. (c), (d), and (e) present the intermediate optimization states at 50, 100, and 200 steps, respectively. As shown, 200 steps are enough to reconstruct fine details such as the skin folds of the dog's face and the thin tail.
Figure 2: Demonstration of the iterative updating. (a) shows the single-pass generated multi-view RGB images given the prompt "a freshly baked loaf of sourdough bread on a cutting board". (b) shows the rendered results of the single-pass generated model. As seen, the top area remains uncolored. (c) shows the generated inpainting mask under the new view, where the white areas denote regions that are invisible and need to be inpainted. (d) shows the inpainted result under the new view given the previously rendered results and the visibility mask. (e) shows the final generated mesh under the top view and two side-top views. The previously uncolored areas have now been inpainted with reasonable and coherent colors.
Figure 3: A gallery of our text-to-3D generation results. Given text prompts as input, our method outputs a high-quality textured triangle mesh in only 10 seconds. Note that the prompts are not from the training set. Best viewed zoomed in.
...and 8 more figures
