DreamPartGen: Semantically Grounded Part-Level 3D Generation via Collaborative Latent Denoising

Tianjiao Yu, Xinzhuo Li, Muntasir Wahed, Jerry Xiong, Yifan Shen, Ying Shen, Ismini Lourentzou

Abstract

Understanding and generating 3D objects as compositions of meaningful parts is fundamental to human perception and reasoning. However, most text-to-3D methods overlook the semantic and functional structure of parts. While recent part-aware approaches introduce decomposition, they remain largely geometry-focused, lacking semantic grounding and failing to model how parts align with textual descriptions or their inter-part relations. We propose DreamPartGen, a framework for semantically grounded, part-aware text-to-3D generation. DreamPartGen introduces Duplex Part Latents (DPLs) that jointly model each part's geometry and appearance, and Relational Semantic Latents (RSLs) that capture inter-part dependencies derived from language. A synchronized co-denoising process enforces mutual geometric and semantic consistency, enabling coherent, interpretable, and text-aligned 3D synthesis. Across multiple benchmarks, DreamPartGen delivers state-of-the-art performance in geometric fidelity and text-shape alignment.

Paper Structure (21 sections, 6 equations, 12 figures, 11 tables)

Figures (12)

  • Figure 1: DreamPartGen connects part-level geometry and appearance with language-driven relational semantics, providing precise control over how parts are modified, arranged, and contextualized. This unified representation enables a wide range of downstream applications, including fine-grained part editing, articulated object generation, and mini-scene synthesis.
  • Figure 2: DreamPartGen overview. DreamPartGen performs text-guided 3D generation by jointly denoising geometry, appearance, and relational semantics. Each object is decomposed into parts represented as Duplex Part Latents (DPLs) from 3D and 2D encoders, while Relational Semantic Latents (RSLs) encode text-derived details and global structure. Through intra-part (geometry–appearance alignment) and inter-part (relational planning via language) synchronization, DreamPartGen co-denoises DPLs and RSLs, enabling semantically grounded reconstruction of coherent part-aware 3D objects.
  • Figure 3: PartRel3D dataset overview of structured functional and spatial triplets for fine-grained inter-part semantic supervision.
  • Figure 4: Qualitative comparison on part-level 3D generation. Across diverse object categories, DreamPartGen yields the most faithful decompositions, preserving clear part boundaries, correct topology, and consistent spatial alignment. Baselines frequently exhibit assembly failures such as missing or detached parts (e.g., wing/head), spatial drift of small components (e.g., wheels/mechanical parts floating off the chassis), and unstable attachments that create surface tearing or holes around high-contact regions (neck, torso, shoulders, limb joints).
  • Figure 5: Ablation on the co-denoising process with local semantic tokens $\mathbf{S}^{\text{loc}}$. The condition-only baseline ($- \mathbf{S}^{\text{loc}}$) yields coarse geometry and weak semantic coherence between parts.
  • ...and 7 more figures