Chirpy3D: Creative Fine-grained 3D Object Fabrication via Part Sampling

Kam Woh Ng, Jing Yang, Jia Wei Sii, Jiankang Deng, Chee Seng Chan, Yi-Zhe Song, Tao Xiang, Xiatian Zhu

TL;DR

Chirpy3D tackles zero-shot, fine-grained 3D object generation from unposed 2D images of seen species by learning a hierarchical, part-aware latent space and conditioning a multi-view diffusion backbone. The method decomposes objects into anchor parts, regularizes part latents with a Gaussian prior, aligns parts across species via shared position embeddings, and maps them to textual tokens to guide diffusion-based multi-view generation; a self-supervised feature consistency loss further enforces cross-view coherence. Key contributions include unsupervised part discovery, distributional part modeling enabling novel part recombination, and a three-mode creative generation pipeline implemented with Score Distillation Sampling (SDS) on NeRF or 3DGS backbones. Empirical results on the CUB-200-2011 bird dataset show higher fidelity and diversity than competitive baselines, confirming the approach’s ability to synthesize entirely new, species-specific 3D objects without part-level supervision or 3D data, with practical impact for gaming and design. The framework generalizes beyond birds to other deformable categories, though limitations in multi-view consistency and partial disentanglement remain avenues for future refinement.

Abstract

We present Chirpy3D, a novel approach for fine-grained 3D object generation, tackling the challenging task of synthesizing creative 3D objects in a zero-shot setting, with access only to unposed 2D images of seen categories. Without structured supervision -- such as camera poses, 3D part annotations, or object-specific labels -- the model must infer plausible 3D structures, capture fine-grained details, and generalize to novel objects using only category-level labels from seen categories. To address this, Chirpy3D introduces a multi-view diffusion model that decomposes training objects into anchor parts in an unsupervised manner, representing the latent space of both seen and unseen parts as continuous distributions. This allows smooth interpolation and flexible recombination of parts to generate entirely new objects with species-specific details. A self-supervised feature consistency loss further ensures structural and semantic coherence. The result is the first system capable of generating entirely novel 3D objects with species-specific fine-grained details through flexible part sampling and composition. Our experiments demonstrate that Chirpy3D surpasses existing methods in generating creative 3D objects with higher quality and fine-grained details. Code will be released at https://github.com/kamwoh/chirpy3d.
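
As a rough illustration of what part sampling and composition could look like at the latent level, the NumPy sketch below treats each object as a stack of part latents and shows three operations the abstract alludes to: recombining parts from seen categories, drawing new parts from a continuous distribution, and interpolating between parts. The array shapes, the function names, the constants `NUM_PARTS` and `LATENT_DIM`, and the use of a standard normal for sampling are assumptions made for illustration; none of them is taken from the released code.

```python
import numpy as np

NUM_PARTS = 4    # hypothetical number of parts per object
LATENT_DIM = 8   # hypothetical dimensionality of each part latent


def select_seen_parts(latents, species_per_part):
    """Build one object by taking part i from species species_per_part[i].

    latents has shape (num_species, num_parts, latent_dim).
    """
    part_idx = np.arange(latents.shape[1])
    return latents[np.asarray(species_per_part), part_idx]


def sample_novel_parts(rng, num_parts=NUM_PARTS, dim=LATENT_DIM):
    """Draw unseen part latents from a standard normal (assumed prior)."""
    return rng.standard_normal((num_parts, dim))


def interpolate_parts(a, b, t):
    """Linearly blend two sets of part latents; t=0 gives a, t=1 gives b."""
    return (1.0 - t) * a + t * b


rng = np.random.default_rng(0)
seen = rng.standard_normal((3, NUM_PARTS, LATENT_DIM))  # stand-in for learned latents

mixed = select_seen_parts(seen, [0, 2, 1, 0])      # parts taken from species 0, 2, 1, 0
novel = sample_novel_parts(rng)                    # entirely new parts
blend = interpolate_parts(seen[0], seen[1], 0.5)   # halfway between two species
```

In this sketch each result is a `(NUM_PARTS, LATENT_DIM)` array; in the described system such part latents would then be projected into textual embeddings that condition the multi-view diffusion model.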

Paper Structure (28 sections, 10 equations, 23 figures, 7 tables, 1 algorithm)

Figures (23)

  • Figure 1: Novel, creative species created by our Chirpy3D. Feel free to name them!
  • Figure 2: Overview of Chirpy3D. Chirpy3D takes (a) a set of unposed 2D images from multiple fine-grained species (e.g., birds) and (b) learns to decompose each object into a set of underlying parts (e.g., head, wings, torso, legs, tail) within a hierarchical part latent space -- species embedding $\mathbf{s}$ captures global species characteristics while part-level embedding $\mathbf{p}$ captures fine-grained part variations. (c) A regularized part latent space ensures smooth interpolation and novel part synthesis via a standard Gaussian prior, enabling creative generation through flexible part recombination. (d) Part-specific position embeddings ($PE$) are shared across all categories, enabling cross-species part alignment. (e) Part embeddings are projected via a learnable function $g$ into part-aware textual embeddings to condition the multi-view diffusion model (e.g., MVDream shi2023mvdream) to generate multi-view images. (f) A self-supervised feature consistency loss is applied to enforce structural and semantic coherence across views, improving the realism and alignment of both seen and unseen parts. (g) During inference, Chirpy3D supports both reconstruction and creative generation -- either directly using learned part latent codes or sampling/interpolating within the part latent space -- which guides 3D representation learning (e.g., NeRF or 3DGS) via SDS optimization.
  • Figure 3: As we do not have images of unseen part latents, we use real natural images as our proxy. We extract cross-attention feature maps $F$ of two noised latents, then minimize the discrepancy between the two feature maps. This will encourage the model to compute similar feature maps for any given part latents, which indirectly stabilizes the denoising process for unseen latents.
  • Figure 4: (a) Seen part selection generation. Unseen part synthesis via (b) novel sampling and (c) interpolation.
  • Figure 5: Multi-view subject-driven generation on two species: blue jay and white pelican. Both PartCraft and Chirpy3D achieve comparable subject fidelity, whereas Textual Inversion falls short.
  • ...and 18 more figures
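
The feature consistency idea in the Figure 3 caption can be sketched in a few lines. The caption does not say how the two noised latents are produced or which discrepancy is minimized, so this toy version assumes two independent noise draws on the same image latent and a mean squared error between the resulting maps. `toy_feature_map` is a hypothetical stand-in for the diffusion model's cross-attention feature maps $F$, not the actual network.

```python
import numpy as np


def add_noise(x0, alpha_bar, rng):
    """Standard DDPM-style forward noising of a clean latent x0."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps


def toy_feature_map(noised_latent, part_latents):
    """Hypothetical stand-in for cross-attention maps.

    noised_latent: (positions, dim), part_latents: (parts, dim).
    Returns a (positions, parts) softmax over parts for each position.
    """
    logits = noised_latent @ part_latents.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def feature_consistency_loss(f_a, f_b):
    """Mean squared discrepancy between two feature maps (assumed form)."""
    return float(np.mean((f_a - f_b) ** 2))


rng = np.random.default_rng(0)
x0 = rng.standard_normal((16, 8))      # toy image latent: 16 positions, 8 channels
parts = rng.standard_normal((4, 8))    # toy part latents, e.g. sampled unseen parts

f_a = toy_feature_map(add_noise(x0, 0.7, rng), parts)
f_b = toy_feature_map(add_noise(x0, 0.7, rng), parts)
loss = feature_consistency_loss(f_a, f_b)
```

Minimizing a quantity like `loss` with respect to the model parameters would push the two maps toward agreement, which is the stabilizing effect for unseen latents that the caption describes.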