Deep Geometric Moments Promote Shape Consistency in Text-to-3D Generation

Utkarsh Nath, Rajeev Goel, Eun Som Jeon, Changhoon Kim, Kyle Min, Yezhou Yang, Yingzhen Yang, Pavan Turaga

TL;DR

The paper tackles the challenge of producing geometrically consistent text-to-3D assets under limited 3D data by grounding 2D lifting in a single high-fidelity 3D reference. It introduces MT3D, which combines a depth-conditioned ControlNet, LoRA conditioning on depth, and Deep Geometric Moments (DGM) on a 3D Gaussian Splatting representation to enforce accurate shape across views. A two-stage optimization (geometry refinement followed by texture refinement) improves geometric fidelity and reduces Janus artifacts; in the authors' experiments, the Janus rate is $38\%$ better than that of the next-best baseline. This geometry-informed approach makes text-to-3D generation more practical and reliable, with potential for more faithful texture transfer in future work.

Abstract

To address the data scarcity associated with 3D assets, 2D-lifting techniques such as Score Distillation Sampling (SDS) have become a widely adopted practice in text-to-3D generation pipelines. However, the diffusion models used in these techniques are prone to viewpoint bias and thus lead to geometric inconsistencies such as the Janus problem. To counter this, we introduce MT3D, a text-to-3D generative model that leverages a high-fidelity 3D object to overcome viewpoint bias and explicitly infuse geometric understanding into the generation pipeline. Firstly, we employ depth maps derived from a high-quality 3D model as control signals to guarantee that the generated 2D images preserve the fundamental shape and structure, thereby reducing the inherent viewpoint bias. Next, we utilize deep geometric moments to ensure geometric consistency in the 3D representation explicitly. By incorporating geometric details from a 3D asset, MT3D enables the creation of diverse and geometrically consistent objects, thereby improving the quality and usability of our 3D representations. Project page and code: https://moment-3d.github.io/

Paper Structure

This paper contains 23 sections, 8 equations, 17 figures, and 1 table.

Figures (17)

  • Figure 1: Illustration of 3D objects generated by our model corresponding to input text prompt (a) 'A DSLR photo of Batman' and (b) 'A high-quality photo of a corgi wearing a top hat'.
  • Figure 2: (a) An illustration of the multi-faced Janus problem. (b) An illustration of the inherent viewpoint bias in diffusion models. We present samples generated by Stable Diffusion v1.5 from the prompt 'cat'; most samples depict the cat from a front view.
  • Figure 3: Overview of MT3D. In the first stage, we optimize 3D Gaussians using a high-fidelity 3D object with depth-conditioned ControlNet and deep geometric moments (DGM). Red and green locks represent frozen and trainable weights, respectively. The second stage extends the first by not only utilizing ControlNet and DGM features but also applying additional densification and pruning.
  • Figure 4: (Left) Samples generated by the depth-conditioned ControlNet model from the prompt 'Elephant', conditioned on various viewpoints. The outputs from the ControlNet model align well with the majority of the conditioning depth maps. (Right) Illustration of features obtained from the ImageNet-pretrained deep geometric moment (DGM) model across various images. DGM features effectively capture the shape and structure.
  • Figure 5: Qualitative comparison between the proposed MT3D and state-of-the-art generators, including Magic3D [lin2023magic3d], Fantasia3D [chen2023fantasia3d], ProlificDreamer [wang2023prolificdreamer], HIFA [zhu2023hifa], and GSGEN [chen2023text].
  • ...and 12 more figures