Controllable Text-to-3D Generation via Surface-Aligned Gaussian Splatting

Zhiqi Li, Yiming Chen, Lingzhe Zhao, Peidong Liu

TL;DR

This paper tackles controllable text-to-3D generation by integrating conditioning signals (edge, depth, normal, scribble) with a frozen multi-view diffusion model via a new MVControl network. MVControl outputs local and global embeddings informed by relative camera poses, enabling consistent multi-view guidance during optimization. The authors propose a three-stage pipeline that first generates coarse 3D Gaussians with LGM, refines geometry and texture through a hybrid 2D/3D diffusion prior with SuGaR regularization, and finally binds Gaussians to a mesh to produce a textured mesh. Experimental results show strong generalization across condition types, superior view-consistency, and high-quality textured 3D assets with improved efficiency.

Abstract

While text-to-3D and image-to-3D generation tasks have received considerable attention, one important but under-explored field between them is controllable text-to-3D generation, which is the main focus of this work. To address this task, 1) we introduce Multi-view ControlNet (MVControl), a novel neural network architecture designed to enhance existing pre-trained multi-view diffusion models by integrating additional input conditions, such as edge, depth, normal, and scribble maps. Our innovation lies in the introduction of a conditioning module that controls the base diffusion model using both local and global embeddings, which are computed from the input condition images and camera poses. Once trained, MVControl is able to offer 3D diffusion guidance for optimization-based 3D generation. In addition, 2) we propose an efficient multi-stage 3D generation pipeline that leverages the benefits of recent large reconstruction models and the score distillation algorithm. Building upon our MVControl architecture, we employ a unique hybrid diffusion guidance method to direct the optimization process. In pursuit of efficiency, we adopt 3D Gaussians as our representation instead of the commonly used implicit representations. We also pioneer the use of SuGaR, a hybrid representation that binds Gaussians to mesh triangle faces. This approach alleviates the issue of poor geometry in 3D Gaussians and enables the direct sculpting of fine-grained geometry on the mesh. Extensive experiments demonstrate that our method achieves robust generalization and enables the controllable generation of high-quality 3D content. Project page: https://lizhiqi49.github.io/MVControl/.
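
To make the idea of binding Gaussians to mesh triangle faces concrete, the following is a minimal sketch, not the authors' or SuGaR's implementation. It assumes each Gaussian center is stored as a fixed set of barycentric weights inside one triangle, so that moving a mesh vertex moves the bound Gaussians with it. The function name `bind_gaussian_centers`, the weight values, and the data layout are all hypothetical; a real surface-aligned representation also carries per-Gaussian rotation, scale, opacity, and color, which are omitted here.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def bind_gaussian_centers(
    vertices: Sequence[Vec3],
    faces: Sequence[Tuple[int, int, int]],
    barycentric: Sequence[Vec3],
) -> List[Vec3]:
    """Place one Gaussian center per (face, barycentric weight triple).

    Hypothetical helper: each center is the weighted average of the three
    triangle vertices, so it always lies in the plane of its triangle.
    """
    centers: List[Vec3] = []
    for i0, i1, i2 in faces:
        v0, v1, v2 = vertices[i0], vertices[i1], vertices[i2]
        for w0, w1, w2 in barycentric:
            # Weights are assumed to be non-negative and to sum to 1.
            center = tuple(
                w0 * a + w1 * b + w2 * c for a, b, c in zip(v0, v1, v2)
            )
            centers.append(center)
    return centers
```

Because the centers are derived from the vertices rather than stored independently, optimizing the vertex positions updates every bound Gaussian, which is one way to read the abstract's remark about sculpting geometry directly on the mesh.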

Paper Structure

This paper contains 20 sections, 4 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Given a text prompt and a condition image, our method achieves high-fidelity and efficient controllable text-to-3D generation of a Gaussian-bound mesh and a textured mesh.
  • Figure 2: Architecture of the proposed MVControl. (a) MVControl consists of a frozen multi-view diffusion model and a trainable MVControl network. (b) Our model takes all input conditions into account to control the generation process both locally and globally through a conditioning module. (c) Once MVControl is trained, we can use it as a hybrid diffusion prior for controllable text-to-3D content generation via an SDS optimization procedure.
  • Figure 3: Proposed 3D generation pipeline. The multi-stage pipeline can efficiently generate high-quality textured meshes starting from a set of coarse Gaussians generated by LGM, with the input being the multi-view images generated by our MVControl. In the second stage, we employ a 2D & 3D hybrid diffusion prior for Gaussian optimization. Finally, in the third stage, we calculate the VSD loss to refine the SuGaR representation.
  • Figure 4: Comparison with baseline 3D generation methods. Our method yields more delicate texture and generates much better meshes than the compared methods. We use different color blocks to emphasize that our method takes only the conditioning image, rather than an RGB image, as input. Corresponding textual prompts are provided in the appendix.
  • Figure 5: Comparison of multi-view image generation with and without our MVControl. MVDream generation results with and without MVControl attached, using an edge map and a normal map as the input condition, respectively.
  • ...and 3 more figures
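
The captions for Figures 2 and 3 refer to SDS optimization without stating the objective. For orientation only, the standard score distillation sampling gradient (as introduced for text-to-3D optimization in general, not this paper's exact hybrid formulation) can be written as below. The symbols are assumptions for this sketch: $\theta$ are the 3D parameters (here, the Gaussians), $x = g(\theta)$ is a rendered image, $x_t$ is its noised version at timestep $t$, $\epsilon$ is the sampled noise, $\hat{\epsilon}_\phi$ is the frozen diffusion model's noise prediction given prompt $y$, and $w(t)$ is a timestep weighting.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right]
```

In the setting described here, the noise prediction would presumably also be conditioned on the condition image and camera poses through MVControl; that extension is not written out because this section does not give its exact form.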