C3DAG: Controlled 3D Animal Generation using 3D pose guidance

Sandeep Mishra, Oindrila Saha, Alan C. Bovik

TL;DR

C3DAG tackles anatomically accurate 3D animal generation from text and pose with a two-stage, diffusion-based pipeline. It first initializes a NeRF from a simple balloon-shaped 3D mesh built from the input 3D pose, using depth-guided Score Distillation Sampling (SDS). It then refines the NeRF with pose-guided SDS, driven by a 2D tetrapod-pose ControlNet trained on keypoints from diverse animals. The approach combines an automatic 3D shape creator with this specialized control network to produce high-fidelity, pose-consistent 3D animals across mammals, reptiles, birds, and amphibians, while running substantially faster than prior state-of-the-art methods. The result is precise, controllable 3D animal generation suitable for animation and rendering, with a web-based tool for interactive pose and shape manipulation.
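
For readers unfamiliar with SDS, the gradient both stages rely on can be written in the standard form introduced by DreamFusion. The version below is a sketch, not the paper's own equation: the extra control input $c$ (a rendered depth map in the first stage, a 2D pose map in the second) is an assumption about how the ControlNet conditioning enters the noise prediction, and the paper's exact formulation may differ.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, c,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad
x = g(\theta), \quad
x_t = \alpha_t\, x + \sigma_t\, \epsilon, \quad
\epsilon \sim \mathcal{N}(0, I)
```

Here $g(\theta)$ is the image rendered from the NeRF with parameters $\theta$, $y$ is the text prompt, $\hat{\epsilon}_\phi$ is the frozen diffusion model's noise prediction, and $w(t)$ is a timestep-dependent weight.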

Abstract

Recent advancements in text-to-3D generation have demonstrated the ability to generate high-quality 3D assets. However, these methods underperform when generating animals, often portraying inaccurate anatomy and geometry. To address this defect, we present C3DAG, a novel pose-Controlled text-to-3D Animal Generation framework that generates a high-quality 3D animal consistent with a given pose. We also introduce an automatic 3D shape creator, a web-based tool that allows dynamic pose generation and modification and that generates a 3D balloon animal from simple geometries. A NeRF is then initialized from this 3D shape using depth-controlled SDS. In the next stage, the pre-trained NeRF is fine-tuned using quadruped-pose-controlled SDS. The pipeline not only produces geometrically and anatomically consistent results, but also renders highly controlled 3D animals, unlike prior methods, which do not allow fine-grained pose control.
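
To make the "balloon animal from simple geometries" idea concrete, here is a minimal Python sketch that turns 3D keypoints into a union of capsules, expressed as a signed distance function. It is an illustration under assumptions, not the authors' shape creator: the choice of capsules, the keypoint names, the bone list, and the single shared radius are all hypothetical.

```python
import numpy as np

def capsule_sdf(p, a, b, radius):
    """Signed distance from points p (N, 3) to a capsule around segment a-b."""
    ab = b - a
    denom = float(np.dot(ab, ab))
    if denom == 0.0:
        # Degenerate bone: the capsule collapses to a sphere centred at a.
        t = np.zeros(len(p))
    else:
        # Fraction along the segment of the closest point, clamped to [0, 1].
        t = np.clip((p - a) @ ab / denom, 0.0, 1.0)
    closest = a + t[:, None] * ab
    return np.linalg.norm(p - closest, axis=1) - radius

def balloon_sdf(p, keypoints, bones, radius=0.1):
    """Union of capsules, one per bone: the minimum over per-bone distances."""
    p = np.atleast_2d(np.asarray(p, dtype=float))
    dists = [capsule_sdf(p, keypoints[i], keypoints[j], radius) for i, j in bones]
    return np.min(np.stack(dists, axis=0), axis=0)

# Hypothetical toy pose (names and coordinates are illustrative only).
keypoints = {
    "head": np.array([1.0, 0.0, 0.5]),
    "neck": np.array([0.7, 0.0, 0.3]),
    "tail_base": np.array([-0.7, 0.0, 0.3]),
    "front_paw": np.array([0.7, 0.0, -0.5]),
    "back_paw": np.array([-0.7, 0.0, -0.5]),
}
bones = [
    ("head", "neck"),
    ("neck", "tail_base"),
    ("neck", "front_paw"),
    ("tail_base", "back_paw"),
]

# Negative values are inside the balloon shape, positive values outside.
on_spine = balloon_sdf([[0.0, 0.0, 0.3]], keypoints, bones)
far_away = balloon_sdf([[0.0, 0.0, 5.0]], keypoints, bones)
```

A shape like this could be meshed or ray-marched to produce the depth maps that a depth-controlled SDS stage consumes. How the paper's tool actually builds and renders its balloon animal is not specified in this summary.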

Paper Structure

This paper contains 11 sections and 2 figures.

Figures (2)

  • Figure 1: Comparison with text-to-3D generation guided by 2D diffusion. Using the prompt shown, we qualitatively compare with the open-source version of DreamFusion (Stable Dreamfusion) and HiFA using their default settings. For both, we append ", full body" to the prompt. Both prior works suffer from clear inconsistencies, such as multiple heads or limbs. Stable Dreamfusion usually produces less detailed textures. HiFA produces high-quality textures but almost always produces anatomically incorrect animals. We provide more results in the supplementary material.
  • Figure 2: Comparison with a parametric-model-based method. Given the input image, 3DFauna fails to capture high-frequency details and does not follow the input image (see the tail), whereas our method produces a highly detailed animal from the input pose and text and closely follows the input pose control.