CorrespondentDream: Enhancing 3D Fidelity of Text-to-3D using Cross-View Correspondences

Seungwook Kim; Kejie Li; Xueqing Deng; Yichun Shi; Minsu Cho; Peng Wang

TL;DR

Zero-shot text-to-3D methods with diffusion priors yield realistic 2D renders but often produce geometrically inconsistent 3D shapes. CorrespondentDream introduces an annotation-free cross-view correspondence loss $\mathcal{L}_{\textrm{corr}}$ derived from diffusion U-Net features to supplement the SDS-based NeRF optimization, using adjacent view sets and a two-stage training schedule. Dense cross-view correspondences are computed from diffusion features and aligned with NeRF reprojections via a weighted Huber loss, with an alternating optimization scheme to balance 2D guidance and 3D geometry corrections. Results show notable improvements in 3D fidelity and a favorable user study outcome, demonstrating enhanced 3D geometry without requiring additional supervision or new priors.
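To make the loss described above concrete, here is a minimal sketch of a weighted Huber penalty between two sets of matched pixel locations. It is an illustration under stated assumptions, not the authors' implementation: the function names, the per-match `weights`, the `delta` threshold, and the normalization by the total weight are all hypothetical choices, since the summary only states that a weighted Huber loss aligns the diffusion-derived correspondences with the NeRF reprojections.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def huber(residual: float, delta: float = 1.0) -> float:
    """Standard Huber penalty: quadratic for |r| <= delta, linear beyond it."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)


def weighted_huber_corr_loss(
    corr_nerf: Sequence[Point],
    corr_diff: Sequence[Point],
    weights: Sequence[float],
    delta: float = 1.0,
) -> float:
    """Weighted mean Huber distance between matched pixel locations.

    corr_nerf: target-view pixels obtained by reprojection (hypothetical input).
    corr_diff: target-view pixels from feature matching, used as pseudo ground truth.
    weights:   per-match confidences; how they are derived is an assumption here.
    """
    if not (len(corr_nerf) == len(corr_diff) == len(weights)):
        raise ValueError("corr_nerf, corr_diff and weights must have equal length")
    total_weight = sum(weights)
    if total_weight <= 0.0:
        return 0.0  # no confident matches, so nothing to penalize
    total = 0.0
    for (xn, yn), (xd, yd), w in zip(corr_nerf, corr_diff, weights):
        # Penalize the x and y offsets separately, then scale by the match weight.
        total += w * (huber(xn - xd, delta) + huber(yn - yd, delta))
    return total / total_weight


if __name__ == "__main__":
    loss = weighted_huber_corr_loss(
        corr_nerf=[(0.0, 0.0), (3.0, 0.0)],
        corr_diff=[(0.5, 0.0), (0.0, 0.0)],
        weights=[1.0, 1.0],
    )
    print(loss)  # 1.3125
```

The linear tail of the Huber penalty limits the influence of grossly wrong matches, which is a plausible reason to prefer it over a squared error when the pseudo ground truth comes from feature matching.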

Abstract

Leveraging multi-view diffusion models as priors for 3D optimization has alleviated the problem of 3D consistency, e.g., the Janus face problem or the content drift problem, in zero-shot text-to-3D models. However, the 3D geometric fidelity of the output remains an unresolved issue: although the rendered 2D views are realistic, the underlying geometry may contain errors such as unreasonable concavities. In this work, we propose CorrespondentDream, an effective method that leverages annotation-free, cross-view correspondences obtained from the diffusion U-Net to provide an additional 3D prior to the NeRF optimization process. We find that these correspondences are strongly consistent with human perception, and by adopting them in our loss design, we are able to produce NeRF models with geometries that are more coherent with common sense, e.g., smoother object surfaces, yielding higher 3D fidelity. We demonstrate the efficacy of our approach through various comparative qualitative results and a user study.

Paper Structure (27 sections, 11 equations, 21 figures, 3 tables)

Introduction
Related Work
Preliminary: Text-to-3D using Diffusion
Method
Adjacent multi-view NeRF rendering
Annotation-free feature extraction
Cross-view correspondence computation
Cross-view correspondence loss
NeRF optimization
Experiment
Implementation details
Qualitative results
Comparative analysis
User study
Drawbacks and failure cases
...and 12 more sections

Figures (21)

Figure 1: Comparison between the baseline (MVDream [shi2023mvdream]) and CorrespondentDream (ours). Our method substantially alleviates the 3D geometric infidelity issue in zero-shot text-to-3D generation methods. Best viewed on a screen; zoom in for clearer visualization.
Figure 2: Rendered 2D view and 3D normal map of MVDream [shi2023mvdream]. While the rendered 2D views look realistic, the underlying 3D geometry lacks fidelity, with concavities or missing surfaces (highlighted in white squares).
Figure 3: Overview of CorrespondentDream. We employ NeRF [mildenhall2021nerf] for 3D representation, optimized alternately using the SDS loss ($\mathcal{L}_{\textrm{SDS}}$) and the cross-view correspondence loss ($\mathcal{L}_{\text{corr}}$). $\mathcal{L}_{\textrm{SDS}}$ is based on the multi-view SDS formulation of MVDream [shi2023mvdream]. To compute $\mathcal{L}_{\text{corr}}$, we render two adjacent view sets from NeRF, add identical noise to both, and input them into a frozen pre-trained multi-view diffusion model. We then extract multi-layer features from the diffusion U-Net's upsampling layers to establish correspondences ($\text{corr}_\text{diff}$) between each view pair. Using ground-truth camera parameters and NeRF-rendered depth, we reproject pixels to obtain $\text{corr}_\text{NeRF}$. By minimizing the discrepancy between $\text{corr}_\text{NeRF}$ and the pseudo ground-truth $\text{corr}_\text{diff}$, we correct the 3D infidelities in NeRF's depth.
Figure 4: Qualitative results of CorrespondentDream across various prompts. CorrespondentDream yields substantially improved 3D fidelity. The 3D infidelities of the baseline (MVDream [shi2023mvdream]) are highlighted and zoomed in with white squares. Best viewed on a screen; zoom in for clearer visualization.
Figure 5: Analysis of alternating supervision. Noticeable 3D inaccuracies are marked with white squares. Our alternating supervision approach demonstrates superior qualitative outcomes.
...and 16 more figures
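
The reprojection step described in the Figure 3 caption (camera parameters plus rendered depth giving $\text{corr}_\text{NeRF}$) can be sketched as follows. This is a minimal illustration that assumes a pinhole camera, a z-depth value per pixel, and a known rigid transform from the source camera frame to the target camera frame. The function name `reproject_pixel` and its argument layout are hypothetical; the paper's own conventions for depth and poses may differ.

```python
from typing import Sequence, Tuple

Intrinsics = Tuple[float, float, float, float]  # (fx, fy, cx, cy), assumed pinhole model


def reproject_pixel(
    u: float,
    v: float,
    depth: float,
    intrinsics: Intrinsics,
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
) -> Tuple[float, float]:
    """Map a source-view pixel to the target view using depth and relative pose.

    depth is treated as z-depth along the source camera's optical axis (assumption).
    rotation (3x3) and translation (3) take source-camera coordinates to
    target-camera coordinates (assumption).
    """
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel to a 3D point in the source camera frame.
    point = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Apply the rigid transform into the target camera frame.
    moved = [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if moved[2] <= 0.0:
        raise ValueError("point lies behind the target camera")
    # Project with the same intrinsics (both views assumed to share them).
    return (fx * moved[0] / moved[2] + cx, fy * moved[1] / moved[2] + cy)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # A small sideways camera shift moves the reprojected pixel horizontally.
    print(reproject_pixel(32.0, 32.0, 2.0, (100.0, 100.0, 32.0, 32.0), identity, [0.1, 0.0, 0.0]))
```

Because the reprojected location depends on the rendered depth, a discrepancy between it and the feature-based match is a signal that can, in principle, be pushed back into the geometry, which is the role the caption assigns to $\mathcal{L}_{\text{corr}}$.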
