ConsDreamer: Advancing Multi-View Consistency for Zero-Shot Text-to-3D Generation

Yuan Zhou, Shilong Jin, Litao Hua, Wanjun Lv, Haoran Duan, Jungong Han

TL;DR

This work tackles the multi-face Janus problem in zero-shot text-to-3D generation by diagnosing view bias in pre-trained text-to-image (T2I) priors. It proposes ConsDreamer, which combines two components: a View Disentanglement Module (VDM) that removes prior-view content from conditional prompts and injects precise target-view cues, and a similarity-based partial order loss (L_P) that enforces cross-view consistency in the unconditional term. The approach is designed to be plug-and-play across 3D representations (e.g., NeRF, 3D Gaussian Splatting) and score-distillation paradigms, addressing both the conditional and unconditional biases that lead to view-inconsistent 3D outputs. Ablations, integration with state-of-the-art baselines, and 2D T2I applications show reduced multi-face Janus artifacts and improved inter-view semantic coherence, supported by quantitative and qualitative results and a favorable user study. The work offers a lightweight path to view-consistent zero-shot text-to-3D synthesis that requires neither extra 3D data nor heavy re-training.

Abstract

Recent advances in zero-shot text-to-3D generation have revolutionized 3D content creation by enabling direct synthesis from textual descriptions. While state-of-the-art methods leverage 3D Gaussian Splatting with score distillation to enhance multi-view rendering through pre-trained text-to-image (T2I) models, they suffer from inherent prior view biases in T2I priors. These biases lead to inconsistent 3D generation, particularly manifesting as the multi-face Janus problem, where objects exhibit conflicting features across views. To address this fundamental challenge, we propose ConsDreamer, a novel method that mitigates view bias by refining both the conditional and unconditional terms in the score distillation process: (1) a View Disentanglement Module (VDM) that eliminates viewpoint biases in conditional prompts by decoupling irrelevant view components and injecting precise view control; and (2) a similarity-based partial order loss that enforces geometric consistency in the unconditional term by aligning cosine similarities with azimuth relationships. Extensive experiments demonstrate that ConsDreamer can be seamlessly integrated into various 3D representations and score distillation paradigms, effectively mitigating the multi-face Janus problem.
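
To make the idea of "aligning cosine similarities with azimuth relationships" concrete, the following is a minimal sketch of one possible ranking penalty. It is not the paper's exact partial order loss: the hinge form, the single anchor view, the `margin` parameter, and the function names are all assumptions made for illustration. The sketch only encodes the ordering that a view closer in azimuth to the anchor should be at least as similar to it as a view farther away.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def azimuth_gap(a_deg, b_deg):
    """Smallest absolute angle between two azimuths, in [0, 180] degrees."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def partial_order_hinge(features, azimuths, anchor=0, margin=0.0):
    """Hypothetical ranking penalty (an assumption, not the paper's L_P).

    For every pair of non-anchor views (i, j) where i is closer in azimuth
    to the anchor than j, penalise cases where j is more similar to the
    anchor than i. Returns the mean hinge penalty over all ordered pairs.
    """
    others = [k for k in range(len(features)) if k != anchor]
    total, pairs = 0.0, 0
    for i in others:
        gap_i = azimuth_gap(azimuths[anchor], azimuths[i])
        for j in others:
            gap_j = azimuth_gap(azimuths[anchor], azimuths[j])
            if gap_i < gap_j:
                s_near = cosine_similarity(features[anchor], features[i])
                s_far = cosine_similarity(features[anchor], features[j])
                total += max(0.0, margin + s_far - s_near)
                pairs += 1
    return total / pairs if pairs else 0.0


# Toy features: unit vectors whose direction matches the camera azimuth,
# so similarity to the anchor decreases as the azimuth gap grows.
azimuths = [0.0, 45.0, 90.0, 180.0]
consistent = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in azimuths]
# Swap the 45-degree and 180-degree features to mimic a Janus-like inconsistency.
inconsistent = [consistent[0], consistent[3], consistent[2], consistent[1]]

loss_consistent = partial_order_hinge(consistent, azimuths)
loss_inconsistent = partial_order_hinge(inconsistent, azimuths)
```

In this toy setting the consistent ordering incurs no penalty, while the swapped features do. In practice the feature vectors would come from an image encoder applied to rendered views, a choice this sketch leaves open.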

Paper Structure

This paper contains 15 sections, 18 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Examples of the multi-face Janus problem: (a) generated by DreamFusion [poole2022dreamfusion-lucid-34], (b) by SDI [lukoianov2024score], (c) by LucidDreamer [liang2024luciddreamer-lucid], (d) by DreamScene [10.1007/978-3-031-72904-1_13-dreamscene], and (e) by GE3D [li2024text].
  • Figure 2: Examples of text-to-3D content creation using our method. We introduce ConsDreamer, a method that guarantees multi-view consistency by leveraging a View Disentanglement Module and a novel partial order loss to ensure semantic clarity across views (detailed in the Methodology section). The generative 3D results demonstrate the superiority of ConsDreamer. Please zoom in for details.
  • Figure 3: Comparison of Perp-Neg and VDM in handling prior view biases. (a) Perp-Neg works after UNet denoising in the score space by orthogonalising negative-view scores. However, frontal-view features that have already been entangled into the subject-keyword cross-attention (CA) map are absorbed into the preserved component $\textit{neg}_{\parallel}$, so prior-view faces survive and still cause Janus artefacts. (b) Our VDM instead intervenes before denoising in the prompt-embedding space: it extracts view-specific residuals from view-augmented prompts, subtracts prior-view residuals, and injects only the residual corresponding to the target azimuth. The conditioning entering the UNet is therefore already view-corrected, producing clean CA maps and substantially reducing Janus artefacts.
  • Figure 4: An overview of ConsDreamer. Our method is built upon the main flow of 3D content distilled from the T2I model. ConsDreamer introduces two key innovations: (a) VDM disentangles the keyword in the prompt to obtain the canonical view features $\delta_{\text{key}}^{\text{view}}$, which are then used for precise view control (detailed in the View Disentanglement Module section). (b) A novel partial order loss $\mathcal{L}_\text{P}$ is introduced among multi-view rendered images to endow the model with view-aware capabilities (detailed in the Partial Order Loss for Cross-View Consistency section). Together, the VDM and $\mathcal{L}_\text{P}$ enhance the clarity of view semantics and significantly mitigate the multi-face Janus problem.
  • Figure 5: Application of the VDM in 2D Generation. (a) Without the VDM, the model mainly generates front or side views of a car, while the back view is rarely generated. (b) With the VDM, prior view biases are eliminated and targeted view information is injected, allowing the model to successfully generate a car from the back view.
  • ...and 9 more figures
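
The prompt-embedding intervention described in the Figure 3 caption (extract view-specific residuals, subtract prior-view residuals, inject the target-view residual) can be illustrated with a toy sketch. Everything here is an assumption for illustration: embeddings are plain lists of floats rather than text-encoder outputs, the residual is taken as a simple difference from the keyword embedding, and `azimuth_to_view`, `view_residual`, `disentangle`, the 45-degree buckets, and the prior weights are hypothetical names and choices, not the paper's implementation.

```python
def azimuth_to_view(azimuth_deg):
    """Map an azimuth to a coarse view label (hypothetical 45-degree buckets)."""
    a = azimuth_deg % 360.0
    if a <= 45.0 or a >= 315.0:
        return "front"
    if 135.0 <= a <= 225.0:
        return "back"
    return "side"


def view_residual(view_embedding, base_embedding):
    """Residual of a view-augmented prompt embedding relative to the keyword."""
    return [v - b for v, b in zip(view_embedding, base_embedding)]


def disentangle(base_embedding, view_embeddings, prior_weights, target_azimuth):
    """Toy view correction in embedding space (an assumption, not the paper's VDM).

    1. Compute one residual per view-augmented prompt.
    2. Subtract the weighted residuals of the biased (prior) views.
    3. Add only the residual of the view matching the target azimuth.
    """
    residuals = {
        name: view_residual(emb, base_embedding)
        for name, emb in view_embeddings.items()
    }
    corrected = list(base_embedding)
    for name, weight in prior_weights.items():
        corrected = [c - weight * r for c, r in zip(corrected, residuals[name])]
    target = azimuth_to_view(target_azimuth)
    return [c + r for c, r in zip(corrected, residuals[target])]


# Toy 3-dimensional embeddings standing in for text-encoder outputs.
base = [1.0, 0.0, 0.0]                      # e.g. the subject keyword
views = {
    "front": [1.0, 0.5, 0.0],               # e.g. keyword plus "front view"
    "side": [1.0, 0.25, 0.25],
    "back": [1.0, 0.0, 0.5],
}
# Assume the prior is biased toward the front view only.
corrected = disentangle(base, views, {"front": 1.0}, target_azimuth=180.0)
```

Here the front-view direction is pushed out of the conditioning and the back-view direction is added before any denoising step would run, which mirrors the "before denoising" ordering the caption contrasts with Perp-Neg.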