Progressive Rendering Distillation: Adapting Stable Diffusion for Instant Text-to-Mesh Generation without 3D Data

Zhiyuan Ma, Xinyue Liang, Rongyuan Wu, Xiangyu Zhu, Zhen Lei, Lei Zhang

TL;DR

This work tackles the generation of high-fidelity textured 3D meshes from text prompts without 3D ground-truth data. It introduces Progressive Rendering Distillation (PRD), a training scheme that needs no 3D data: multi-view diffusion teachers guide an SD-based native 3D generator, which then runs inference in four steps. A key contribution is Parameter-Efficient Triplane Adaptation (PETA), which adds only 2.5% trainable parameters to SD so that it outputs geometry and texture Triplanes. The resulting model produces a 3D mesh in about 1.2 seconds, and prompt fidelity and generalization improve as the training corpus scales. Compared with previous text-to-3D methods, the approach is reported to be both faster and higher in quality, and it opens a path to scaling 3D generation to open-ended prompts without costly 3D data.

Abstract

It is highly desirable to obtain a model that can generate high-quality 3D meshes from text prompts in just seconds. While recent attempts have adapted pre-trained text-to-image diffusion models, such as Stable Diffusion (SD), into generators of 3D representations (e.g., Triplane), they often suffer from poor quality due to the lack of sufficient high-quality 3D training data. To overcome this data shortage, we propose a novel training scheme, termed Progressive Rendering Distillation (PRD), which eliminates the need for 3D ground truths by distilling multi-view diffusion models and adapting SD into a native 3D generator. In each training iteration, PRD uses the U-Net to progressively denoise the latent from random noise for a few steps, and at each step it decodes the denoised latent into a 3D output. Multi-view diffusion models, including MVDream and RichDreamer, are used jointly with SD to distill text-consistent textures and geometries into the 3D outputs through score distillation. Since PRD supports training without 3D ground truths, we can easily scale up the training data and improve generation quality for challenging text prompts with creative concepts. Meanwhile, PRD accelerates inference by enabling the generation model to run in just a few steps. With PRD, we train a Triplane generator, namely TriplaneTurbo, which adds only $2.5\%$ trainable parameters to adapt SD for Triplane generation. TriplaneTurbo outperforms previous text-to-3D generators in both efficiency and quality. Specifically, it can produce high-quality 3D meshes in 1.2 seconds and generalizes well to challenging text input. The code is available at https://github.com/theEricMa/TriplaneTurbo.
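
The training loop described in the abstract (denoise from random noise for a few steps, decode each denoised latent into a 3D output, and update the generator with a score-distillation signal from a teacher) can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration and is not taken from the TriplaneTurbo code: the linear `student`, the identity `decode_and_render`, the point-mass teacher centered on `TARGET`, the noise levels in `TIMESTEPS`, and the re-noising rule between steps. It only shows the shape of the procedure, with no 3D ground truth anywhere in the loop.

```python
import numpy as np

D = 8                                  # toy size of the latent and of the "3D output"
TIMESTEPS = np.linspace(0.9, 0.3, 4)   # four denoising steps; noise levels are assumed
TARGET = np.linspace(-1.0, 1.0, D)     # what the toy teacher treats as text-consistent


def student(z_t, w, b):
    """Hypothetical stand-in for the adapted U-Net: noisy latent -> clean latent."""
    return w * z_t + b


def decode_and_render(z0):
    """Stand-in for decoding z0 into a 3D output and rendering it (identity here)."""
    return z0


def sds_grad(x, t, rng):
    """Score-distillation gradient w.r.t. the rendering x: teacher noise minus true noise."""
    alpha, sigma = np.sqrt(1.0 - t), np.sqrt(t)
    eps = rng.standard_normal(x.shape)
    x_t = alpha * x + sigma * eps                  # add noise to the rendering
    eps_teacher = (x_t - alpha * TARGET) / sigma   # toy teacher whose data is exactly TARGET
    return eps_teacher - eps


def renoise(z0, t_next, rng):
    """Assumed rule for moving to the next, less noisy step."""
    return np.sqrt(1.0 - t_next) * z0 + np.sqrt(t_next) * rng.standard_normal(z0.shape)


def train(iters=2000, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    w, b = 1.0, np.zeros(D)
    for _ in range(iters):
        z_t = rng.standard_normal(D)               # start from pure noise, no 3D data
        for i, t in enumerate(TIMESTEPS):
            z0 = student(z_t, w, b)
            g = sds_grad(decode_and_render(z0), t, rng)
            # Manual backprop through the identity renderer and the linear student.
            w -= lr * float(g @ z_t)
            b -= lr * g
            if i + 1 < len(TIMESTEPS):
                z_t = renoise(z0, TIMESTEPS[i + 1], rng)
    return w, b


def generate(w, b, seed=1):
    """Four-step inference with the trained toy student."""
    rng = np.random.default_rng(seed)
    z_t = rng.standard_normal(D)
    for i, t in enumerate(TIMESTEPS):
        z0 = student(z_t, w, b)
        if i + 1 < len(TIMESTEPS):
            z_t = renoise(z0, TIMESTEPS[i + 1], rng)
    return decode_and_render(z0)


w, b = train()
sample = generate(w, b)
```

In this toy setting the teacher has a single mode, so the student learns to ignore the input noise and `sample` lands on `TARGET`. A real teacher such as a multi-view diffusion model would instead supply a text-conditioned noise prediction on rendered views.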

Paper Structure

This paper contains 17 sections, 2 equations, 16 figures, 4 tables, 1 algorithm.

Figures (16)

  • Figure 1: Our method adapts Stable Diffusion (Rombach et al., 2022) to generate high-fidelity textured meshes in 1.2 seconds.
  • Figure 2: Comparison between (a) traditional SD adaptation and (b) our proposed progressive rendering distillation (PRD) for native 3D generation. The traditional approach requires ground-truth 3D representations $\theta$ and their latents $\boldsymbol{z}_0$ for each 3D sample in order to learn to generate $\boldsymbol{z}_0$. Our proposed PRD scheme progressively denoises latents $\boldsymbol{z}_t$, initialized from random noise, into $\boldsymbol{z}_0$, which are then decoded to $\theta$. Multi-view diffusion models serve as teachers for distillation, which eliminates the need for 3D data during adaptation and overcomes data scarcity.
  • Figure 3: Illustration of TriplaneTurbo: an SD-adapted native 3D generator trained with our PRD scheme. Our model generates six feature planes, comprising a geometry Triplane $\theta_{\mathrm{geo}}$ and a texture Triplane $\theta_{\mathrm{tex}}$, in 4 steps. We introduce Parameter-Efficient Triplane Adaptation (PETA), which requires only $2.5\%$ additional parameters for adaptation. The parameter arrangement is illustrated in the figure.
  • Figure 4: Qualitative comparison of text-to-mesh generation results by competing methods. Please refer to the comparison section for details.
  • Figure 5: More results of our model trained with expanded corpus.
  • ...and 11 more figures
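
The Figure 3 caption describes the generator output as six feature planes split into a geometry Triplane and a texture Triplane. As a rough illustration of how a Triplane is commonly queried, the sketch below projects a 3D point onto three axis-aligned planes, samples each bilinearly, and sums the features. The plane ordering (XY, XZ, YZ), the summation, the feature size, and the function names are assumptions for illustration only; the text above does not specify how TriplaneTurbo decodes its planes.

```python
import numpy as np


def bilinear_sample(plane, u, v):
    """Sample a (C, H, W) feature plane at continuous coordinates u, v in [-1, 1]."""
    _, h, w = plane.shape
    x = (u + 1.0) * 0.5 * (w - 1)                  # map [-1, 1] to pixel coordinates
    y = (v + 1.0) * 0.5 * (h - 1)
    x0 = int(np.clip(np.floor(x), 0, w - 2))       # keep the 2x2 neighbourhood in bounds
    y0 = int(np.clip(np.floor(y), 0, h - 2))
    fx, fy = x - x0, y - y0
    top = (1 - fx) * plane[:, y0, x0] + fx * plane[:, y0, x0 + 1]
    bottom = (1 - fx) * plane[:, y0 + 1, x0] + fx * plane[:, y0 + 1, x0 + 1]
    return (1 - fy) * top + fy * bottom


def query_triplane(planes, point):
    """Sum features from the XY, XZ and YZ planes at a 3D point in [-1, 1]^3."""
    x, y, z = point
    return (bilinear_sample(planes[0], x, y)
            + bilinear_sample(planes[1], x, z)
            + bilinear_sample(planes[2], y, z))


# Six planes with 4 channels each, split into geometry and texture Triplanes.
rng = np.random.default_rng(0)
six_planes = rng.standard_normal((6, 4, 16, 16))
theta_geo, theta_tex = six_planes[:3], six_planes[3:]

p = np.array([0.1, -0.3, 0.5])
geo_feat = query_triplane(theta_geo, p)            # would feed a geometry head
tex_feat = query_triplane(theta_tex, p)            # would feed a color head
```

In a full pipeline, small decoders would typically map such per-point features to geometry and color values before mesh extraction; those decoders are omitted here.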