ControlDreamer: Blending Geometry and Style in Text-to-3D

Yeongtak Oh, Jooyoung Choi, Yongsung Kim, Minjun Park, Chaehun Shin, Sungroh Yoon

TL;DR

ControlDreamer addresses the challenge of jointly controlling geometry and style in text-to-3D generation by introducing a two-stage pipeline. It first builds coarse geometry with NeRF from a geometry prompt and then refines a textured mesh via DMTet guided by a depth-aware MV-ControlNet, trained on a large, curated multi-view text dataset. The approach yields superior qualitative and quantitative results, including directional CLIP similarity and human evaluations, compared with existing methods, and establishes a new benchmark for 3D style editing. These advances improve multi-view consistency and enable faithful, text-guided stylization of 3D assets with potential impact on 3D content creation pipelines and tools.

Abstract

Recent advancements in text-to-3D generation have significantly contributed to the automation and democratization of 3D content creation. Building upon these developments, we aim to address the limitations of current methods in blending geometries and styles in text-to-3D generation. We introduce multi-view ControlNet, a novel depth-aware multi-view diffusion model trained on generated datasets from a carefully curated text corpus. Our multi-view ControlNet is then integrated into our two-stage pipeline, ControlDreamer, enabling text-guided generation of stylized 3D models. Additionally, we present a comprehensive benchmark for 3D style editing, encompassing a broad range of subjects, including objects, animals, and characters, to further facilitate research on diverse 3D generation. Our comparative analysis reveals that this new pipeline outperforms existing text-to-3D methods as evidenced by human evaluations and CLIP score metrics. Project page: https://controldreamer.github.io
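
The summary above cites directional CLIP similarity as one of the quantitative metrics. As a rough illustration, the sketch below assumes the commonly used definition: the cosine similarity between the change in image embeddings (source render to stylized render) and the change in text embeddings (geometry prompt to style prompt). The function names and the plain-list embeddings are hypothetical placeholders; a real evaluation would obtain the embeddings from a CLIP image and text encoder, and the paper's exact protocol may differ.

```python
import math


def _sub(a, b):
    """Element-wise difference a - b of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]


def _cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def directional_clip_similarity(img_src, img_edit, txt_src, txt_tgt):
    """Assumed definition: cos(E_img(edit) - E_img(src), E_txt(tgt) - E_txt(src)).

    All four arguments are embedding vectors (hypothetical placeholders here).
    """
    image_direction = _sub(img_edit, img_src)
    text_direction = _sub(txt_tgt, txt_src)
    return _cosine(image_direction, text_direction)
```

Under this definition, a score near 1 means the rendered images moved in the same embedding direction as the prompt change, while a score near 0 means the edit is unrelated to the requested style.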

Paper Structure (38 sections, 5 equations, 17 figures, 3 tables)

Figures (17)

  • Figure 1: Comparing text-to-3D pipeline of our ControlDreamer and MVDream. On the right, MVDream’s output reveals a vulnerability to pre-training geometry biases, often producing (a) unintended results such as shields or (b) stereotypical geometries related to prompts. On the left, ControlDreamer overcomes these biases, enabling unique combinations of geometry and style, even facilitating counterfactual generation in 3D models.
  • Figure 2: Main results. Our generation process begins by generating a coarse-grained geometry, followed by creating a fine-grained stylized 3D model using a style prompt.
  • Figure 3: Illustration of ControlDreamer. (Left) Starting with a geometry prompt, we use MVDream to generate a NeRF, ensuring consistency through 3D self-attention. (Right) The NeRF is converted into a mesh via DMTet, followed by style generation through our MV-ControlNet, which integrates a trainable copy (red) and employs zero-initialized convolutions (white). MV-ControlNet is designed to understand geometry using multi-view depth.
  • Figure 4: We compare the depth-aware MV-ControlNet with the P2P (Hertz et al., 2022) approach on MVDream, and also against MV-ControlNet variants trained under edge and normal conditions. On the left, source images are displayed alongside their respective conditions. Among these, the depth-conditioned multi-view images display the most visually appealing results.
  • Figure 5: In (a), we present comparisons with previous pipelines. Hulk's geometry from Fig. 1, styled as Ironman, reveals that Magic3D and ProlificDreamer often produce texture artifacts, while Fantasia3D and MVDream are prone to color oversaturation. (b) illustrates the results under various input conditions. (c) shows the refinement process using Magic3D, while (d) highlights the superior results achieved with our ControlDreamer.
  • ...and 12 more figures
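
The Figure 3 caption mentions that MV-ControlNet combines a trainable copy with zero-initialized convolutions. The sketch below illustrates the general idea behind zero-initialized convolutions as used in ControlNet-style models: the control branch is added to the frozen branch through a convolution whose weights and bias start at zero, so the combined output initially equals the frozen output. It is a toy 1x1 convolution over plain Python lists, not the authors' implementation; all function names and the feature layout are assumptions made for illustration.

```python
def conv1x1(features, weight, bias):
    """Apply a 1x1 convolution to a list of per-pixel channel vectors.

    features: list of pixels, each a list of in_ch floats
    weight:   out_ch x in_ch matrix (list of rows)
    bias:     list of out_ch floats
    """
    return [
        [sum(w * x for w, x in zip(row, pixel)) + b for row, b in zip(weight, bias)]
        for pixel in features
    ]


def zero_conv_params(in_ch, out_ch):
    """Weights and bias of a zero-initialized 1x1 convolution."""
    weight = [[0.0] * in_ch for _ in range(out_ch)]
    bias = [0.0] * out_ch
    return weight, bias


def controlled_block(frozen_out, control_feat, weight, bias):
    """Add the control branch (passed through the 1x1 conv) to the frozen output."""
    residual = conv1x1(control_feat, weight, bias)
    return [
        [f + r for f, r in zip(f_pixel, r_pixel)]
        for f_pixel, r_pixel in zip(frozen_out, residual)
    ]


# Hypothetical toy data: 2 pixels, 2 channels each.
frozen = [[1.0, 2.0], [3.0, 4.0]]
depth_features = [[0.5, -0.5], [2.0, 1.0]]
w0, b0 = zero_conv_params(in_ch=2, out_ch=2)
initial_output = controlled_block(frozen, depth_features, w0, b0)
```

With the zero-initialized parameters, `initial_output` equals `frozen`, so the control branch does not perturb the pretrained model at the start of training; the depth condition only influences the output once training moves the weights away from zero.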