PI3D: Efficient Text-to-3D Generation with Pseudo-Image Diffusion

Ying-Tian Liu, Yuan-Chen Guo, Guan Luo, Heyi Sun, Wei Yin, Song-Hai Zhang

TL;DR

PI3D tackles the data scarcity challenge in text-to-3D generation by representing a 3D shape as a triplane, viewing that triplane as a set of six pseudo-images, and adapting a pre-trained text-to-image diffusion model to output these pseudo-images. It first fits a triplane to each 3D object with depth-aware supervision, then trains a pseudo-image diffusion model on both 3D and 2D data to generate coarse 3D samples quickly. These samples are subsequently refined with a lightweight SDS-based process guided by 2D diffusion models. The approach produces high-quality, 3D-consistent results in minutes and outperforms existing 3D diffusion and 2D-lifting methods on text alignment and generation speed, with mixed 2D-3D training and careful CFG tuning contributing to robustness. The work shows that pseudo-images can transfer the generative knowledge of 2D diffusion models to 3D, making text-to-3D generation more scalable and efficient.
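
The triplane-to-pseudo-image idea above can be illustrated with a minimal NumPy sketch. It is not the authors' code. It assumes each of the three planes carries 6 feature channels, so that a plane splits evenly into two 3-channel "RGB-like" images; the function names, the nearest-neighbour lookup, and the summation of per-plane features are all hypothetical choices made for illustration.

```python
import numpy as np


def triplane_to_pseudo_images(triplane):
    """Split a (3, 6, H, W) triplane into six (3, H, W) pseudo-images.

    Assumption: 6 channels per plane, so images 2p and 2p+1 hold
    channels 0-2 and 3-5 of plane p.
    """
    planes, channels, h, w = triplane.shape
    assert planes == 3 and channels == 6
    return triplane.reshape(6, 3, h, w)


def pseudo_images_to_triplane(images):
    """Inverse of triplane_to_pseudo_images."""
    n, c, h, w = images.shape
    assert n == 6 and c == 3
    return images.reshape(3, 6, h, w)


def query_triplane(triplane, point):
    """Look up a feature vector for a 3D point in [-1, 1]^3.

    The point is projected onto the xy, xz and yz planes; the nearest
    texel is read from each plane and the three features are summed
    (an illustrative aggregation, not necessarily the paper's).
    """
    _, _, h, w = triplane.shape
    x, y, z = point
    feats = []
    for plane, (u, v) in zip(triplane, [(x, y), (x, z), (y, z)]):
        col = int(round((u + 1) / 2 * (w - 1)))
        row = int(round((v + 1) / 2 * (h - 1)))
        feats.append(plane[:, row, col])
    return np.sum(feats, axis=0)
```

Because the conversion is a pure reshape, it is lossless: a diffusion model that outputs six 3-channel images is, under this assumption, outputting a complete triplane.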

Abstract

Diffusion models trained on large-scale text-image datasets have demonstrated a strong capability of controllable high-quality image generation from arbitrary text prompts. However, the generation quality and generalization ability of 3D diffusion models are hindered by the scarcity of high-quality and large-scale 3D datasets. In this paper, we present PI3D, a framework that fully leverages the pre-trained text-to-image diffusion models' ability to generate high-quality 3D shapes from text prompts in minutes. The core idea is to connect the 2D and 3D domains by representing a 3D shape as a set of Pseudo RGB Images. We fine-tune an existing text-to-image diffusion model to produce such pseudo-images using a small number of text-3D pairs. Surprisingly, we find that it can already generate meaningful and consistent 3D shapes given complex text descriptions. We further take the generated shapes as the starting point for a lightweight iterative refinement using score distillation sampling to achieve high-quality generation under a low budget. PI3D generates a single 3D shape from text in only 3 minutes and the quality is validated to outperform existing 3D generative models by a large margin.
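
For readers unfamiliar with the refinement step, the following is a minimal sketch of one score distillation sampling (SDS) gradient with classifier-free guidance (CFG), written in NumPy. It follows the commonly used formulation (noise a rendered image, predict the noise with a 2D diffusion model, use the weighted residual as the gradient) rather than PI3D's actual implementation. The `predict_noise` callable, the weighting `1 - alpha_bar_t`, and all parameter names are assumptions introduced here.

```python
import numpy as np


def cfg_noise(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction towards the text-conditioned one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


def sds_gradient(rendered, predict_noise, alpha_bar_t, guidance_scale, rng):
    """Gradient of the SDS objective with respect to a rendered image.

    rendered:      array holding one rendered view of the 3D shape
    predict_noise: hypothetical callable, noisy image ->
                   (eps_uncond, eps_cond), standing in for a 2D
                   diffusion model evaluated at the chosen timestep
    alpha_bar_t:   cumulative noise-schedule coefficient in (0, 1)
    """
    eps = rng.standard_normal(rendered.shape)
    # Forward diffusion: mix the rendering with Gaussian noise.
    noisy = np.sqrt(alpha_bar_t) * rendered + np.sqrt(1.0 - alpha_bar_t) * eps
    eps_uncond, eps_cond = predict_noise(noisy)
    eps_hat = cfg_noise(eps_uncond, eps_cond, guidance_scale)
    # Assumed weighting w(t) = 1 - alpha_bar_t; other choices exist.
    weight = 1.0 - alpha_bar_t
    return weight * (eps_hat - eps)
```

In a full pipeline this gradient would be backpropagated through a differentiable renderer into the 3D representation. Starting from an already plausible shape is what allows the refinement to use a small iteration budget.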

Paper Structure

This paper contains 21 sections, 11 equations, 9 figures, and 2 tables.

Figures (9)

  • Figure 1: Our method, PI3D, is able to generate the pseudo-images of a 3D shape in seconds from text prompts and further refine it in 3 minutes to achieve better quality.
  • Figure 2: We show the 3 orthogonal rendering results and the triplane representation as 6 pseudo-images for the same scene. We can observe semantic congruence between them, such as the contours of different parts.
  • Figure 3: We fit the triplane representation for each object with the supervision of the rendered RGB images, binary mask, and object depth. We also adopt axis-aligned masks on the triplane to encourage the geometry to approach the correct surface faster.
  • Figure 4: Examples of 3D models generated by Point·E, Shap·E, and our method conditioned on various text prompts. 3D models are presented by 3 rendered views. We follow Point·E to extract the surface for the point cloud rendering. The prompts are listed below each sample.
  • Figure 5: Examples generated by DreamFusion-IF and our method. PI3D can generate 3D-consistent models much faster.
  • ...and 4 more figures
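
The supervision described in the Figure 3 caption (rendered RGB images, binary mask, and object depth) can be sketched as a combined fitting loss. This is a hedged illustration in NumPy, not the paper's loss: the specific error terms (squared error for color and mask, absolute error for depth on foreground pixels), the weights, and the function name are assumptions, and the axis-aligned triplane masks mentioned in the caption are not modeled.

```python
import numpy as np


def triplane_fitting_loss(pred_rgb, pred_alpha, pred_depth,
                          gt_rgb, gt_mask, gt_depth,
                          w_rgb=1.0, w_mask=1.0, w_depth=1.0):
    """Combine color, mask and depth supervision for one rendered view.

    pred_rgb / gt_rgb:     (H, W, 3) colors in [0, 1]
    pred_alpha / gt_mask:  (H, W) opacity and binary foreground mask
    pred_depth / gt_depth: (H, W) depth values
    The weights w_* are hypothetical hyperparameters.
    """
    rgb_loss = np.mean((pred_rgb - gt_rgb) ** 2)
    mask_loss = np.mean((pred_alpha - gt_mask) ** 2)
    # Depth is only meaningful where the object is visible.
    foreground = gt_mask > 0.5
    if foreground.any():
        depth_loss = np.mean(np.abs(pred_depth[foreground] - gt_depth[foreground]))
    else:
        depth_loss = 0.0
    return w_rgb * rgb_loss + w_mask * mask_loss + w_depth * depth_loss
```

Restricting the depth term to foreground pixels avoids penalizing arbitrary background depth values, which is one plausible reason depth supervision helps the geometry converge to the correct surface.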