From One to More: Contextual Part Latents for 3D Generation

Shaocong Dong, Lihe Ding, Xiao Chen, Yaokun Li, Yuxin Wang, Yucheng Wang, Qi Wang, Jaehyeok Kim, Chenjian Gao, Zhanpeng Huang, Zibin Wang, Tianfan Xue, Dan Xu

TL;DR

CoPart addresses the challenge of designing high-quality, controllable 3D objects with multiple independent parts by introducing contextual part latents: each part has a geometric token and an image token, and these are learned via synchronized diffusion with mutual guidance between parts and modalities. A global guidance branch and part-level bounding box conditioning provide cross-part coherence and explicit local controllability, while PartVerse supplies a large, semi-automatically constructed dataset of 91k parts from 12k objects to enable scalable training. The framework supports part-based editing, articulated generation, and mini-scene composition, with better detail in small parts and better generalization than holistic 3D generators. Overall, CoPart pairs a scalable dataset with a diffusion-based, cross-part paradigm that aligns geometry and appearance while enabling precise part-level control.

Abstract

Recent advances in 3D generation have transitioned from multi-view 2D rendering approaches to 3D-native latent diffusion frameworks that exploit geometric priors in ground truth data. Despite progress, three key limitations persist: (1) single-latent representations fail to capture complex multi-part geometries, causing detail degradation; (2) holistic latent coding neglects the part independence and interrelationships that are critical for compositional design; (3) global conditioning mechanisms lack fine-grained controllability. Inspired by human 3D design workflows, we propose CoPart, a part-aware diffusion framework that decomposes 3D objects into contextual part latents for coherent multi-part generation. This paradigm offers three advantages: (i) it reduces encoding complexity through part decomposition; (ii) it enables explicit modeling of part relationships; (iii) it supports part-level conditioning. We further develop a mutual guidance strategy to fine-tune pre-trained diffusion models for joint part latent denoising, ensuring geometric coherence while retaining foundation model priors. To enable large-scale training, we construct PartVerse, a novel 3D part dataset derived from Objaverse through automated mesh segmentation and human-verified annotations. Extensive experiments demonstrate CoPart's superior capabilities in part-level editing, articulated object generation, and scene composition with unprecedented controllability.
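
The joint denoising of part latents depends on parts exchanging information, which the Figure 2 caption attributes to Cross-Part Attention. The following is a minimal sketch of that idea, not the authors' implementation: it assumes single-head attention, a residual update, and a `(parts, tokens, dim)` tensor layout, and the function name `cross_part_attention` and the random projection matrices are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_part_attention(part_tokens, w_q, w_k, w_v):
    """Let every token attend to the tokens of all parts (assumed layout).

    part_tokens: array of shape (P, N, D) = (parts, tokens per part, dim).
    Returns the updated tokens (P, N, D) and the attention map (P*N, P*N).
    """
    P, N, D = part_tokens.shape
    flat = part_tokens.reshape(P * N, D)      # pool the tokens of all parts
    q, k, v = flat @ w_q, flat @ w_k, flat @ w_v
    weights = softmax(q @ k.T / np.sqrt(D))   # each row sums to 1
    out = flat + weights @ v                  # residual update (assumption)
    return out.reshape(P, N, D), weights

rng = np.random.default_rng(0)
P, N, D = 3, 4, 8                             # toy sizes, illustration only
tokens = rng.normal(size=(P, N, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
updated, attn = cross_part_attention(tokens, w_q, w_k, w_v)
```

Flattening the part axis before attention is the simplest way to let a token in one part see tokens in every other part; a real denoiser would likely use multi-head attention inside each transformer block.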

Paper Structure

This paper contains 22 sections, 13 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: CoPart achieves high-quality part-based 3D generation and supports various applications.
  • Figure 2: The framework of CoPart operates as follows: Gaussian noise is added to part image and geometric tokens extracted from the VAE, which are then fed into 3D and 2D denoisers. Mutual guidance (a) is introduced to facilitate information exchange between the 3D and 2D modalities (via Cross-Modality Attention) as well as between different parts (via Cross-Part Attention). Additionally, (b) the 3D bounding boxes are treated as cube meshes, and the extracted box tokens are injected into the 3D denoiser through cross-attention. Simultaneously, the boxes are rendered into 2D images and injected into the 2D denoiser via ControlNet.
  • Figure 3: Comparison with state-of-the-art 3D generators. CoPart can generate detailed and independent 3D parts.
  • Figure 4: Comparison with part-based generator SALAD.
  • Figure 5: Qualitative results of part editing and mini scene generation.
  • ...and 3 more figures
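
The Figure 2 caption states that 3D bounding boxes are treated as cube meshes before box tokens are extracted. As a rough illustration of that first step only, the sketch below converts an axis-aligned box into a triangle mesh. It assumes the box is given by its minimum and maximum corners; the names `box_to_cube_mesh` and `signed_volume` are hypothetical, and the later tokenization and cross-attention injection are not shown.

```python
import numpy as np

def box_to_cube_mesh(box_min, box_max):
    """Return (vertices (8, 3), faces (12, 3)) for an axis-aligned box."""
    lo = np.asarray(box_min, dtype=float)
    hi = np.asarray(box_max, dtype=float)
    # Corner i takes the max coordinate on axis a when bit a of i is set.
    vertices = np.array(
        [[hi[a] if (i >> a) & 1 else lo[a] for a in range(3)] for i in range(8)]
    )
    # Two triangles per face, wound so that the normals point outward.
    faces = np.array([
        [0, 2, 1], [1, 2, 3],   # z = min
        [4, 5, 6], [5, 7, 6],   # z = max
        [0, 1, 4], [1, 5, 4],   # y = min
        [2, 6, 3], [3, 6, 7],   # y = max
        [0, 4, 2], [2, 4, 6],   # x = min
        [1, 3, 5], [3, 7, 5],   # x = max
    ])
    return vertices, faces

def signed_volume(vertices, faces):
    # Sum of signed tetrahedron volumes; positive for outward-facing winding.
    v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
    return float(np.einsum("ij,ij->i", v0, np.cross(v1, v2)).sum() / 6.0)

vertices, faces = box_to_cube_mesh([0, 0, 0], [1, 2, 3])
volume = signed_volume(vertices, faces)
```

The signed-volume check is a cheap way to confirm that the mesh is closed and consistently oriented: for the box above it should match the product of the side lengths.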