Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion

Xinyang Li; Zhangyu Lai; Linning Xu; Jianfei Guo; Liujuan Cao; Shengchuan Zhang; Bo Dai; Rongrong Ji

TL;DR

Dual3D targets fast, 3D-consistent text-to-3D generation with a dual-mode multi-view latent diffusion model: a 2D mode denoises the multi-view latents directly, while a 3D mode produces a tri-plane neural surface for rendering-based denoising. It fine-tunes a pre-trained 2D LDM to reduce training cost, uses a dual-mode toggling inference strategy to balance speed against 3D consistency, and adds an efficient texture refinement stage to enhance realism. The method reports state-of-the-art performance with significantly reduced generation time, producing high-quality 3D assets in roughly $1$ minute on a single GPU, and is suitable for scalable, compositional 3D content creation. This makes rapid, text-driven asset generation with consistent geometry and textures practical for game, AR/VR, and visualization pipelines.

Abstract

We present Dual3D, a novel text-to-3D generation framework that generates high-quality 3D assets from text in only $1$ minute. The key component is a dual-mode multi-view latent diffusion model. Given the noisy multi-view latents, the 2D mode can efficiently denoise them with a single latent denoising network, while the 3D mode can generate a tri-plane neural surface for consistent rendering-based denoising. Most modules for both modes are tuned from a pre-trained text-to-image latent diffusion model to avoid the expensive cost of training from scratch. To overcome the high rendering cost during inference, we propose a dual-mode toggling inference strategy that uses the 3D mode for only $1/10$ of the denoising steps, successfully generating a 3D asset in just $10$ seconds without sacrificing quality. The texture of the 3D asset can be further enhanced by our efficient texture refinement process in a short time. Extensive experiments demonstrate that our method delivers state-of-the-art performance while significantly reducing generation time. Our project page is available at https://dual3d.github.io
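The dual-mode toggling idea from the abstract can be illustrated with a small scheduling sketch. It is not the authors' implementation: the abstract only states that the 3D mode is used for $1/10$ of the denoising steps, so running it on every tenth step is an assumption, and the names `toggle_schedule`, `relative_cost`, `period`, `cost_2d`, and `cost_3d` are hypothetical, as are the cost values.

```python
from typing import List


def toggle_schedule(num_steps: int, period: int = 10) -> List[str]:
    """Return the denoising mode ("2d" or "3d") for each step.

    Assumption: the 3D mode runs on every `period`-th step, so it covers
    1/period of the steps. The source only states the 1/10 ratio, not
    where the 3D steps are placed.
    """
    if num_steps <= 0 or period <= 0:
        raise ValueError("num_steps and period must be positive")
    return ["3d" if (i + 1) % period == 0 else "2d" for i in range(num_steps)]


def relative_cost(schedule: List[str], cost_2d: float, cost_3d: float) -> float:
    """Cost of a schedule relative to using the 3D mode at every step.

    `cost_2d` and `cost_3d` are hypothetical per-step costs in arbitrary units.
    """
    total = sum(cost_3d if mode == "3d" else cost_2d for mode in schedule)
    return total / (cost_3d * len(schedule))


schedule = toggle_schedule(num_steps=50, period=10)
print(schedule.count("3d"), "of", len(schedule), "steps use the 3D mode")
```

With this placement and a step count divisible by `period`, the final step falls in the 3D mode, so a neural surface is available when denoising ends. Calling `relative_cost` with illustrative per-step costs shows how the share of rendering-based steps drives total inference cost.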

Paper Structure (26 sections, 8 equations, 12 figures, 3 tables)

Introduction
Related Works
Preliminary
Method
Dual-mode Multi-view Latent Diffusion Model
Dual-mode Toggling Inference
Efficient Texture Refinement
Experiments
Settings
Quantitative Results
Qualitative Results
Diverse and Fine-grained Generation
Ablation Study
Limitations
Conclusion
...and 11 more sections

Figures (12)

Figure 1: The Framework of Dual3D. Firstly, we fine-tune a pre-trained 2D LDM into a dual-mode multi-view LDM. Subsequently, we employ a dual-mode toggling inference strategy to choose different denoising modes during inference to balance the inference speed and 3D consistency. Finally, the mesh extracted from the neural surface is further optimized via our efficient texture refinement process, enhancing the photo-realism and details of the asset.
Figure 2: Two compositional 3D scenes rendered by Blender, where all visible assets are generated by our method with only texts as inputs. The text prompts for some assets are indicated by arrows. Please refer to our project page for the tour videos.
Figure 3: The architecture of dual-mode multi-view LDM. The noisy multi-view latents and three learnable tri-plane latents are fed into the 2D latent denoising network $Z_\theta$ in parallel, where all self-attention blocks are replaced by cross-view self-attention blocks. A tiny transformer is used to enhance the connections between the multi-view features and the tri-plane features. The denoised tri-plane latents are decoded into higher resolution with the 2D latent decoder $D$ and rendered to images with volume rendering of the tri-plane surface. Two main objectives, $\mathcal{L}_{\text{2d}}$ and $\mathcal{L}_{\text{3d}}$, are used to optimize the model.
Figure 4: User study.
Figure 5: Qualitative comparison.
...and 7 more figures
