Chasing Consistency in Text-to-3D Generation from a Single Image

Yichen Ouyang, Wenhao Chai, Jiayi Ye, Dapeng Tao, Yibing Zhan, Gaoang Wang

TL;DR

Consist3D tackles the persistent inconsistencies in single-image text-to-3D generation with a three-stage framework that learns a semantic consistency token and a geometric consistency token, then uses both to guide a diffusion-based optimization at a low classifier-free guidance (CFG) scale. The semantic token stabilizes subject semantics independent of shape priors, while the geometric token enforces cross-view geometry through warp and reconstruction losses, reducing overfitting. In the final stage, score distillation sampling (SDS) with a low CFG scale refines a 3D volume using both tokens, yielding faithful, photo-realistic results and enabling background and object editing via prompts. Across diverse datasets, Consist3D outperforms baselines in fidelity and consistency, is supported by a user study, and is robust to random seeds, while retaining its editing capabilities; background and geometry modeling are identified as directions for future work.

Abstract

Text-to-3D generation from a single-view image is a popular but challenging task in 3D vision. Although numerous methods have been proposed, existing works still suffer from inconsistency issues, including 1) semantic inconsistency, 2) geometric inconsistency, and 3) saturation inconsistency, resulting in distorted, overfitted, and over-saturated generations. In light of these issues, we present Consist3D, a three-stage framework for semantic-, geometric-, and saturation-consistent text-to-3D generation from a single image, in which the first two stages learn parameterized consistency tokens and the last stage performs optimization. Specifically, the semantic encoding stage learns a token independent of views and estimations, promoting semantic consistency and robustness. Meanwhile, the geometric encoding stage learns another token with comprehensive geometry and reconstruction constraints under novel-view estimations, reducing overfitting and encouraging geometric consistency. Finally, the optimization stage benefits from the semantic and geometric tokens, allowing a low classifier-free guidance scale and therefore preventing oversaturation. Experimental results demonstrate that Consist3D produces more consistent, faithful, and photo-realistic 3D assets than previous state-of-the-art methods. Furthermore, Consist3D allows background and object editing through text prompts.
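To make the role of the classifier-free guidance scale concrete, the following is a minimal sketch of the standard guidance rule, in which the unconditional noise prediction is extrapolated toward the conditional one. It is not taken from the paper: the function name `cfg_combine`, the toy vectors, and the two scale values are illustrative assumptions. It only shows why a large scale pushes predictions far beyond the conditional estimate, which is the behavior the abstract associates with oversaturation.

```python
from typing import List


def cfg_combine(eps_uncond: List[float], eps_cond: List[float], scale: float) -> List[float]:
    """Standard classifier-free guidance: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions (hypothetical values, not from the paper).
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [0.5, 1.0, -2.0]

# With scale 1 the result is exactly the conditional prediction.
low = cfg_combine(eps_uncond, eps_cond, scale=1.0)

# A large scale (chosen here only for illustration) extrapolates far past it.
high = cfg_combine(eps_uncond, eps_cond, scale=100.0)

print(low)
print(high)
```

Components where the two predictions agree are unchanged at any scale; components where they differ are amplified in proportion to the scale, so a method that can keep the scale low stays closer to the conditional estimate.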

Paper Structure (26 sections, 7 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Inconsistency issues. (a) Semantic inconsistency: the generated object looks like a box instead of a hat. (b) Geometric inconsistency: the generated cat's face appears in the back view, although the cat originally faces the front. (c) Saturation inconsistency: the rendering of the generated teapot is oversaturated compared with the original teapot's color.
  • Figure 2: Effectiveness of our method. (a) Single-view text-to-3D generation: each case shows a single input image and three novel views rendered from the 3D generation. (b) Background editing: the background of the generation can be edited by prompt, and an option for no background is provided as well. (c) Object editing: the object of the generation can be edited by prompt. For example, we can change the "cat" into a "rabbit" or a "lion" without changing the input image.
  • Figure 3: Pipeline. Stage I. A single-view image is input to the semantic encoding module, and a semantic token is trained with the sem loss. Stage II. The single-view image is used to estimate a point cloud, which serves as shape guidance to condition the geometric encoding module, and a geometric token is trained with the warp loss and the rec loss. Stage III. A randomly initialized 3D volume is the input, and the two previously trained tokens are used together with the tokenized text prompt as the condition; the 3D volume is optimized into a 3D model faithful to the single reference image.
  • Figure 4: Geometric encoding. We adopt ControlNet with depth guidance for the generation. The training objective consists of $\mathcal{L}_{warp}$ and $\mathcal{L}_{rec}$. $\mathcal{L}_{warp}$ is computed between two neighboring views under novel views, restricted by the warp mask, and $\mathcal{L}_{rec}$ is computed between the single input image and the generation under the reference view, restricted by the reference mask.
  • Figure 5: Score distillation sampling. A rendered image of a 3D volume is used as the input, and a depth ControlNet with a low CFG scale is used for generation. For the text condition, we combine the semantic token and the geometric token with the tokenized text, which enables background editing and object editing through prompts.
  • ...and 3 more figures
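
The Figure 4 caption describes two masked losses, $\mathcal{L}_{warp}$ and $\mathcal{L}_{rec}$, without giving their exact form here. The sketch below illustrates one plausible reading, a mean squared error restricted to the masked pixels. The choice of squared error, the helper name `masked_mse`, the flattened toy images, and the assumption that the neighboring view has already been warped into alignment are all hypothetical and not confirmed by the source.

```python
from typing import Sequence


def masked_mse(a: Sequence[float], b: Sequence[float], mask: Sequence[int]) -> float:
    """Mean squared error over the pixels where mask is 1 (0.0 if the mask is empty)."""
    num = sum(m * (x - y) ** 2 for x, y, m in zip(a, b, mask))
    den = sum(mask)
    return num / den if den else 0.0


# Toy flattened 4-pixel "images" (hypothetical values).
generated_ref_view = [0.2, 0.8, 0.5, 0.1]  # generation under the reference view
input_image = [0.0, 0.8, 0.9, 0.1]         # the single input image
reference_mask = [1, 1, 0, 0]              # pixels that count for the rec loss

warped_neighbor = [0.2, 0.6, 0.5, 0.3]     # neighboring view, assumed already warped
generated_novel = [0.2, 0.8, 0.5, 0.1]     # generation under a novel view
warp_mask = [0, 1, 1, 1]                   # pixels that count for the warp loss

l_rec = masked_mse(generated_ref_view, input_image, reference_mask)
l_warp = masked_mse(warped_neighbor, generated_novel, warp_mask)

# Equal weighting of the two terms is an assumption made for this sketch.
total = l_warp + l_rec
print(l_rec, l_warp, total)
```

The masks matter because only part of each image is comparable: the reference mask limits the reconstruction term to the object in the input image, and the warp mask limits the cross-view term to pixels that remain visible after warping.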