Text-driven 3D Human Generation via Contrastive Preference Optimization

Pengfei Zhou, Xukun Shen, Yong Hu

TL;DR

The paper tackles the problem of generating faithful 3D human models from long, complex text prompts by identifying diffusion-prior semantic entanglement in SDS. It introduces a contrastive preference framework comprising a Preference Optimization Module (POM) and a Negative Preference Optimization Module (NPOM), guided by LLM-driven long prompts and dynamic negation prompts, built on SMPL-X and 3D Gaussian Splatting. An adaptive Least Common Multiple (LCM)-based weighting (LCMW) fuses multiple preference models to enhance fine-grained semantic alignment, while dynamic negation mitigates reward hacking via static and dynamic prompts. Experiments demonstrate state-of-the-art texture realism and semantic alignment for complex prompts, highlighting practical impact for VR/AR assets, digital entertainment, and avatar personalization.

Abstract

Recent advances in Score Distillation Sampling (SDS) have improved 3D human generation from textual descriptions. However, existing methods still face challenges in accurately aligning 3D models with long and complex textual inputs. To address this challenge, we propose a novel framework that introduces contrastive preferences, where human-level preference models, guided by both positive and negative prompts, assist SDS for improved alignment. Specifically, we design a preference optimization module that integrates multiple models to comprehensively capture the full range of textual features. Furthermore, we introduce a negation preference module to mitigate over-optimization of irrelevant details by leveraging static-dynamic negation prompts, effectively preventing "reward hacking". Extensive experiments demonstrate that our method achieves state-of-the-art results, significantly enhancing texture realism and visual alignment with textual descriptions, particularly for long and complex inputs.
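
As a rough illustration of the contrastive preference idea in the abstract, the sketch below scores a rendered view with several preference models, rewarding agreement with the positive prompt and penalizing agreement with the negative prompts. It is only a sketch under stated assumptions: the function name `contrastive_preference`, the `neg_scale` parameter, the averaging over negative prompts, and the uniform default weights are all hypothetical and are not taken from the paper, whose adaptive weighting scheme is not specified here. The scorers are stand-in callables for models such as ImageReward or PickScore.

```python
from typing import Callable, Optional, Sequence

# Hypothetical scorer interface: (image, prompt) -> preference score.
# Real preference models (e.g. ImageReward, PickScore) would be wrapped
# to match this signature; that wrapping is not shown here.
Scorer = Callable[[object, str], float]


def contrastive_preference(
    image: object,
    positive_prompt: str,
    negative_prompts: Sequence[str],
    scorers: Sequence[Scorer],
    weights: Optional[Sequence[float]] = None,
    neg_scale: float = 1.0,
) -> float:
    """Weighted sum over scorers of (positive score - scaled mean negative score)."""
    if weights is None:
        # Assumption: uniform weights; the paper describes an adaptive scheme instead.
        weights = [1.0 / len(scorers)] * len(scorers)
    total = 0.0
    for weight, scorer in zip(weights, scorers):
        pos = scorer(image, positive_prompt)
        if negative_prompts:
            neg = sum(scorer(image, p) for p in negative_prompts) / len(negative_prompts)
        else:
            neg = 0.0
        total += weight * (pos - neg_scale * neg)
    return total
```

In an SDS-style loop, a differentiable version of this quantity would be maximized alongside the distillation loss. The plain-float version above only shows how the positive and negative terms might be combined.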

Paper Structure

This paper contains 17 sections, 10 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Comparison of Long-Text 3D Human Generation Methods. We compare the state-of-the-art methods, HumanGaussian and HumanNorm, for generating 3D humans from complex long-text descriptions. The results show that these methods struggle to achieve full semantic alignment, with elements such as the green vest and brown shirt being partially occluded or incorrectly altered by other colors. This misalignment stems from the semantic entanglement effect induced by diffusion priors, which disrupts fine-grained attribute mapping and leads to incomplete text-to-3D correspondence in long-text-conditioned generation.
  • Figure 2: Overview of the Proposed Framework. We first leverage an LLM to generate complex long-text prompts along with dynamic negative prompts, which guide the optimization process and enhance the quality of 3D human generation. Next, we employ the 3DGS representation to synthesize accurate 3D human models from the generated textual descriptions. The process begins with densely sampling Gaussian functions on the SMPL-X human mesh, serving as the initialization for model generation. To improve semantic alignment between the long-text descriptions and the 3D human models, we introduce a preference optimization module, which integrates preference models such as ImageReward and PickScore. These models capture fine-grained semantic nuances across different text segments, enhancing overall alignment accuracy. Additionally, we propose a negation preference optimization module, which incorporates both static and dynamic negative prompts to mitigate over-optimization in preference models. This mechanism prevents excessive emphasis on certain attributes while ensuring balanced semantic representation. The entire framework is optimized through SDS, where preference models refine the SDS process to achieve more precise alignment between long-text descriptions and 3D human representations.
  • Figure 3: PickScore demonstrates a stronger capability in capturing categorical semantics, accurately generating a bucket hat, whereas ImageReward fails to distinguish the category precisely, resulting in a baseball cap. In contrast, ImageReward excels in attribute-level semantic understanding, correctly identifying the pattern's color, while PickScore misinterprets it as the hoodie's color, leading to an incorrect generation.
  • Figure 4: Qualitative comparisons with the SOTA Text-to-3D human methods. Compared to existing text-driven 3D human generation methods, such as DreamWaltz, TADA, X-Oscar, HumanNorm, and HumanGaussian, our approach demonstrates significant improvements. The text prompts used for comparison are listed at the top.
  • Figure 5: Qualitative comparisons with the SOTA Text-to-3D methods.
  • ...and 4 more figures
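
The Figure 2 caption mentions initializing Gaussians by densely sampling the SMPL-X mesh surface. The sketch below shows one common way such surface sampling can be done on a generic triangle mesh: pick triangles with probability proportional to area, then draw uniform barycentric coordinates. This is an assumption for illustration, not the paper's confirmed procedure; the function names and the toy mesh are hypothetical, and loading an actual SMPL-X mesh is not shown.

```python
import math
import random


def triangle_area(a, b, c):
    """Area of a 3D triangle from half the norm of the edge cross product."""
    ux, uy, uz = (b[i] - a[i] for i in range(3))
    vx, vy, vz = (c[i] - a[i] for i in range(3))
    cx, cy, cz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    return 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)


def sample_points_on_mesh(vertices, faces, n_points, seed=0):
    """Draw n_points approximately uniformly over the surface of a triangle mesh."""
    rng = random.Random(seed)
    areas = [triangle_area(*(vertices[i] for i in face)) for face in faces]
    # Larger triangles are chosen more often, so density is uniform per unit area.
    chosen = rng.choices(range(len(faces)), weights=areas, k=n_points)
    points = []
    for face_index in chosen:
        a, b, c = (vertices[i] for i in faces[face_index])
        u, v = rng.random(), rng.random()
        if u + v > 1.0:
            # Reflect samples that fall outside the triangle back inside it.
            u, v = 1.0 - u, 1.0 - v
        w = 1.0 - u - v
        points.append(tuple(w * a[k] + u * b[k] + v * c[k] for k in range(3)))
    return points
```

Each sampled point could then serve as the center of one Gaussian. Initial scales, colors, and opacities would have to be chosen separately.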