TV-3DG: Mastering Text-to-3D Customized Generation with Visual Prompt

Jiahui Yang, Donglin Di, Baorui Ma, Xun Yang, Yongjia Ma, Wenzhang Sun, Wei Chen, Jianxun Cui, Zhou Xue, Meng Wang, Yebin Liu

TL;DR

A novel algorithm, Classifier Score Matching (CSM), is proposed, which removes the difference term in SDS and uses a deterministic noise addition process to reduce noise during optimization, effectively overcoming the low-quality limitations of SDS in the authors' customized generation framework.

Abstract

In recent years, advancements in generative models have significantly expanded the capabilities of text-to-3D generation. Many approaches rely on Score Distillation Sampling (SDS) technology. However, SDS struggles to accommodate multi-condition inputs, such as text and visual prompts, in customized generation tasks. To explore the core reasons, we decompose SDS into a difference term and a classifier-free guidance term. Our analysis identifies the core issue as arising from the difference term and the random noise addition during the optimization process, both contributing to deviations from the target mode during distillation. To address this, we propose a novel algorithm, Classifier Score Matching (CSM), which removes the difference term in SDS and uses a deterministic noise addition process to reduce noise during optimization, effectively overcoming the low-quality limitations of SDS in our customized generation framework. Based on CSM, we integrate visual prompt information with an attention fusion mechanism and sampling guidance techniques, forming the Visual Prompt CSM (VPCSM) algorithm. Furthermore, we introduce a Semantic-Geometry Calibration (SGC) module to enhance quality through improved textual information integration. We present our approach as TV-3DG, with extensive experiments demonstrating its capability to achieve stable, high-quality, customized 3D generation. Project page: https://yjhboy.github.io/TV-3DG
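
The decomposition mentioned in the abstract can be sketched as follows. This is a reading under stated assumptions, not the paper's exact formulation: it assumes the common classifier-free guidance parameterization with scale $\lambda$, a timestep weighting $\omega(t)$, sampled noise ${\bm{\epsilon}}$, and a rendered image ${\bm{x}}_0$ produced from 3D parameters $\theta$. The paper's own notation, weighting, and choice of which network output appears in the difference term may differ.

```latex
\begin{align}
\hat{\bm{\epsilon}}_\psi({\bm{x}}_t,t,y)
  &= {\bm{\epsilon}}_\psi({\bm{x}}_t,t,\emptyset)
   + \lambda\,\bigl[{\bm{\epsilon}}_\psi({\bm{x}}_t,t,y)
   - {\bm{\epsilon}}_\psi({\bm{x}}_t,t,\emptyset)\bigr] \\
\nabla_\theta \mathcal{L}_{\text{SDS}}
  &= \mathbb{E}_{t,{\bm{\epsilon}}}\Bigl[\omega(t)\,
     \bigl(\hat{\bm{\epsilon}}_\psi({\bm{x}}_t,t,y) - {\bm{\epsilon}}\bigr)
     \frac{\partial {\bm{x}}_0}{\partial \theta}\Bigr] \\
  &= \mathbb{E}_{t,{\bm{\epsilon}}}\Bigl[\omega(t)\,\bigl(
     \underbrace{{\bm{\epsilon}}_\psi({\bm{x}}_t,t,\emptyset) - {\bm{\epsilon}}}_{\text{difference term}}
   + \underbrace{\lambda\,\bigl[{\bm{\epsilon}}_\psi({\bm{x}}_t,t,y)
   - {\bm{\epsilon}}_\psi({\bm{x}}_t,t,\emptyset)\bigr]}_{\text{classifier-free guidance term}}
     \bigr)\frac{\partial {\bm{x}}_0}{\partial \theta}\Bigr] \\
\nabla_\theta \mathcal{L}_{\text{CSM}}
  &\approx \mathbb{E}_{t}\Bigl[\omega(t)\,\lambda\,
     \bigl[{\bm{\epsilon}}_\psi({\bm{x}}_t,t,y)
   - {\bm{\epsilon}}_\psi({\bm{x}}_t,t,\emptyset)\bigr]
     \frac{\partial {\bm{x}}_0}{\partial \theta}\Bigr],
  \quad {\bm{x}}_t \text{ obtained deterministically from } {\bm{x}}_0
\end{align}
```

Under this reading, dropping the difference term removes the only place where the sampled noise ${\bm{\epsilon}}$ enters the update directly, and the deterministic noising step removes its remaining influence through ${\bm{x}}_t$.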

Paper Structure

This paper contains 18 sections, 19 equations, 16 figures, 2 tables, 1 algorithm.

Figures (16)

  • Figure 1: An overarching understanding of our TV-3DG system. Our customized generation framework achieves high-quality and intricate stylized generation through the use of visual prompts.
  • Figure 2: In-depth analysis of the SDS loss gradient in customized generation. We use randomly initialized noise as the image. At the 2D level, we experiment with different combinations of terms in the SDS loss. The left column shows results using the complete SDS loss, the middle column retains only the term with the classifier-free guidance (CFG; Ho and Salimans, 2022) coefficient, and the right column retains only the term without the CFG coefficient, namely the difference term. We present the results guided by an arbitrary visual prompt in the lower section (method described in the VPCSM subsection). The prompt is "A photograph of an astronaut riding a horse."
  • Figure 3: Evaluation of the Classifier Score Matching (CSM) loss. We conduct experiments with our CSM loss at the 2D level. We present the optimization process of ${\bm{x}}_0$ and observe that CSM achieves clearer image details than the SDS loss at the same timestep. When a visual prompt with significantly different semantics is introduced, CSM effectively preserves clear geometric structures and captures enhanced style and texture information.
  • Figure 4: Illustration of Classifier Score Matching (CSM). We aim to utilize a pre-trained text-to-image model ${\bm{\epsilon}}_\psi$ to perform score matching on the 2D level. An image is rendered from $\theta$ for a specific viewpoint, which is then subjected to noise addition through DDIM inversion. The denoising Unet subsequently estimates the noise. In our framework, it is necessary to estimate two outputs of the Unet: ${\bm{\epsilon}}_\psi({\bm{x}}_t,t,y)$ and ${\bm{\epsilon}}_\psi({\bm{x}}_t,t,\emptyset)$, with a classifier-free guidance scale $\lambda$. Finally, optimization is performed using our proposed CSM.
  • Figure 5: Overview of our proposed TV-3DG. Our framework integrates several advanced modules: a Visual Prompt Classifier Score Matching (VPCSM) module that incorporates visual prompt guidance along with Classifier-Free Guidance and Perturbed Attention Guidance techniques for aligning texture and style; and a Semantic-Geometry Calibration (SGC) module designed to enhance semantic and geometric fidelity. Our input includes a text prompt and a visual prompt. When the visual prompt aligns with the textual description, the framework generates high-quality optimized outputs (i.e., texture alignment), as indicated by the purple arrow. Conversely, when the visual prompt and text description are inconsistent, TV-3DG learns relevant stylistic and appearance elements from the visual information (i.e., style alignment) while retaining the main subject depicted in the text, as indicated by the green arrow.
  • ...and 11 more figures
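
The pipeline summarized in the Figure 4 caption (deterministic noising via DDIM inversion, two noise estimates, a guidance scale $\lambda$) can be illustrated with a toy sketch in pure Python. Everything here is an assumption made for illustration: `toy_eps` is a hypothetical stand-in for the pretrained noise predictor ${\bm{\epsilon}}_\psi$, the noise schedule values are arbitrary, the inversion uses the unconditional prediction, and the update directions follow the standard CFG convention rather than the paper's exact equations.

```python
import math
import random


def toy_eps(x_t, alpha_bar, cond):
    """Hypothetical stand-in for eps_psi(x_t, t, y).

    cond=None plays the role of the empty prompt. A real system would
    call a pretrained diffusion U-Net here instead.
    """
    target = 1.0 if cond is not None else 0.0
    return [math.tanh(v - math.sqrt(alpha_bar) * target) for v in x_t]


def add_random_noise(x0, alpha_bar, rng):
    """SDS-style noising: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps, eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + s * e for x, e in zip(x0, eps)], eps


def ddim_invert(x0, alpha_bars):
    """Deterministic noising: step x0 through decreasing alpha_bar values.

    Each step reuses the predicted noise to estimate x0 and re-noise it
    (assumption: unconditional prediction is used for inversion).
    """
    x, a_prev = list(x0), 1.0
    for a_next in alpha_bars:
        eps = toy_eps(x, a_prev, None)
        x0_hat = [(v - math.sqrt(1.0 - a_prev) * e) / math.sqrt(a_prev)
                  for v, e in zip(x, eps)]
        x = [math.sqrt(a_next) * h + math.sqrt(1.0 - a_next) * e
             for h, e in zip(x0_hat, eps)]
        a_prev = a_next
    return x


def sds_direction(x0, alpha_bar, y, lam, rng):
    """Return (difference term, CFG term, their sum) for one random draw."""
    x_t, eps = add_random_noise(x0, alpha_bar, rng)
    e_c = toy_eps(x_t, alpha_bar, y)
    e_u = toy_eps(x_t, alpha_bar, None)
    difference = [u - e for u, e in zip(e_u, eps)]
    cfg = [lam * (c - u) for c, u in zip(e_c, e_u)]
    total = [d + g for d, g in zip(difference, cfg)]
    return difference, cfg, total


def csm_direction(x0, alpha_bars, y, lam):
    """Keep only the CFG term, evaluated at a deterministically noised x_t."""
    x_t = ddim_invert(x0, alpha_bars)
    a_t = alpha_bars[-1]
    e_c = toy_eps(x_t, a_t, y)
    e_u = toy_eps(x_t, a_t, None)
    return [lam * (c - u) for c, u in zip(e_c, e_u)]


x0 = [0.2, -0.4, 0.9]            # toy "rendered image"
schedule = [0.9, 0.7, 0.5]       # illustrative alpha_bar values
prompt = "A photograph of an astronaut riding a horse."
print("CSM:", csm_direction(x0, schedule, prompt, lam=7.5))
print("SDS:", sds_direction(x0, 0.5, prompt, 7.5, random.Random(0))[2])
```

In this sketch, repeated calls to `csm_direction` on the same input return the same direction, whereas `sds_direction` changes with every noise draw. That is the variance-reduction idea the abstract attributes to deterministic noise addition, shown here only on a toy predictor.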