ConsDreamer: Advancing Multi-View Consistency for Zero-Shot Text-to-3D Generation

Yuan Zhou; Shilong Jin; Litao Hua; Wanjun Lv; Haoran Duan; Jungong Han

TL;DR

This work tackles the multi-face Janus problem in zero-shot text-to-3D generation, tracing it to view bias in pre-trained text-to-image (T2I) priors. The proposed method, ConsDreamer, has two components: a View Disentanglement Module (VDM), which removes prior view content from conditional prompts and injects precise target-view cues, and a similarity-based partial order loss (L_P), which enforces cross-view consistency in the unconditional term. The approach is designed to be plug-and-play across 3D representations (e.g., NeRF, 3D Gaussian Splatting) and score-distillation paradigms, and it addresses both the conditional and the unconditional biases that lead to view-inconsistent 3D outputs. Ablations, integration with state-of-the-art baselines, and applications to 2D T2I generation show fewer multi-face Janus artifacts and better inter-view semantic coherence, supported by quantitative results, qualitative comparisons, and a user study. The method offers a lightweight route to view-consistent zero-shot text-to-3D synthesis that requires neither extra 3D data nor heavy re-training.

Abstract

Recent advances in zero-shot text-to-3D generation have revolutionized 3D content creation by enabling direct synthesis from textual descriptions. While state-of-the-art methods leverage 3D Gaussian Splatting with score distillation to enhance multi-view rendering through pre-trained text-to-image (T2I) models, they suffer from the view biases inherent in those T2I priors. These biases lead to inconsistent 3D generation, most visibly as the multi-face Janus problem, where objects exhibit conflicting features across views. To address this fundamental challenge, we propose ConsDreamer, a novel method that mitigates view bias by refining both the conditional and unconditional terms in the score distillation process: (1) a View Disentanglement Module (VDM) that eliminates viewpoint biases in conditional prompts by decoupling irrelevant view components and injecting precise view control; and (2) a similarity-based partial order loss that enforces geometric consistency in the unconditional term by aligning cosine similarities with azimuth relationships. Extensive experiments demonstrate that ConsDreamer can be seamlessly integrated into various 3D representations and score distillation paradigms, effectively mitigating the multi-face Janus problem.
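
To make the idea of "aligning cosine similarities with azimuth relationships" concrete, the following is a minimal sketch and not the paper's actual loss, whose exact form is not given in this abstract. It assumes a hinge-style ranking penalty: views that are closer in azimuth to an anchor view should be at least as similar to it as views that are farther away. The function names, the `anchor` and `margin` parameters, and the toy 2D features are all hypothetical; the features stand in for whatever per-view quantities the method actually compares.

```python
import numpy as np

def angular_distance(a, b):
    """Smallest absolute difference between azimuths, in degrees (range 0..180)."""
    d = np.abs(a - b) % 360.0
    return np.minimum(d, 360.0 - d)

def cosine_similarity(u, v):
    """Cosine similarity between two vectors, guarded against zero norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def partial_order_loss(features, azimuths, anchor=0, margin=0.0):
    """Hypothetical hinge-style partial order penalty (an assumption, not the paper's L_P).

    For every pair of non-anchor views (i, j) where view i is closer in azimuth
    to the anchor than view j, penalize the case where view j is more similar
    to the anchor than view i. Returns the mean penalty over all such pairs.
    """
    features = np.asarray(features, dtype=float)
    azimuths = np.asarray(azimuths, dtype=float)
    dists = angular_distance(azimuths, azimuths[anchor])
    sims = np.array([cosine_similarity(features[anchor], f) for f in features])

    penalties = []
    for i in range(len(azimuths)):
        for j in range(len(azimuths)):
            if i == anchor or j == anchor:
                continue
            if dists[i] < dists[j]:
                # Ordering violated when the farther view j is more similar.
                penalties.append(max(0.0, margin + sims[j] - sims[i]))
    return float(np.mean(penalties)) if penalties else 0.0

# Toy example: 2D unit vectors whose direction follows the azimuth, so
# similarity to the front view falls off monotonically with angular distance.
azimuths = np.array([0.0, 45.0, 90.0, 180.0])
rad = np.deg2rad(azimuths)
consistent_features = np.stack([np.cos(rad), np.sin(rad)], axis=1)

# Janus-like case: the back view (180 degrees) repeats the front-view feature.
janus_features = consistent_features.copy()
janus_features[3] = consistent_features[0]

consistent = partial_order_loss(consistent_features, azimuths)
janus = partial_order_loss(janus_features, azimuths)
print(consistent, janus)
```

In this toy setup the consistent features incur no penalty, while the Janus-like features, where the back view looks like the front, incur a positive one. A pairwise ranking form is only one way to encode a partial order; the margin and the choice of anchor are design knobs assumed here for illustration.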
