ESCT3D: Efficient and Selectively Controllable Text-Driven 3D Content Generation with Gaussian Splatting

Huiqi Wu, Jianbo Mei, Yingjie Huang, Yining Xu, Jingjiao You, Yilong Liu, Li Yao

TL;DR

This work proposes using GPT-4V for self-optimization, which significantly enhances the efficiency of generating satisfactory content in a single attempt. The method also achieves robust generalization, facilitating the efficient and controllable generation of high-quality 3D content.

Abstract

In recent years, significant advancements have been made in text-driven 3D content generation. However, several challenges remain. In practical applications, users often provide extremely simple text inputs while expecting high-quality 3D content. Generating optimal results from such minimal text is a difficult task due to the strong dependency of text-to-3D models on the quality of input prompts. Moreover, the generation process exhibits high variability, making it difficult to control. Consequently, multiple iterations are typically required to produce content that meets user expectations, reducing generation efficiency. To address this issue, we propose using GPT-4V for self-optimization, which significantly enhances the efficiency of generating satisfactory content in a single attempt. Furthermore, the controllability of text-to-3D generation methods has not been fully explored. Our approach enables users to not only provide textual descriptions but also specify additional conditions, such as style, edges, scribbles, poses, or combinations of multiple conditions, allowing for more precise control over the generated 3D content. Additionally, during training, we effectively integrate multi-view information, including multi-view depth, masks, features, and images, to address the common Janus problem in 3D content generation. Extensive experiments demonstrate that our method achieves robust generalization, facilitating the efficient and controllable generation of high-quality 3D content.

Paper Structure

This paper contains 18 sections, 13 equations, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Framework. The multi-modal self-optimization framework is designed to control the generation of high-quality images. Then, 3D Gaussians are employed to effectively initialize geometry and appearance using SDS loss, while incorporating multi-view depth and mask information. Afterward, we iteratively refine the texture image using various multi-view losses.
  • Figure 2: A comparison of our method with Shap-E and DreamGaussian. All three methods can generate 3D content in a short amount of time. Our method, with a time cost similar to DreamGaussian, generates more complex and richer 3D content than either Shap-E or DreamGaussian.
  • Figure 3: Comparisons on Text-to-3D methods. Among these methods, our approach takes the least time to generate 3D content while maintaining viewpoint consistency.
  • Figure 4: Comparisons on Image-to-3D methods. Compared to Zero123 and Magic123, our method generates more reasonable and consistent content in a shorter time.
  • Figure 5: Promptist vs. Ours: Prompt Optimization results. The left side represents the Promptist results, while the right side represents our results.
  • ...and 9 more figures