CorrespondentDream: Enhancing 3D Fidelity of Text-to-3D using Cross-View Correspondences
Seungwook Kim, Kejie Li, Xueqing Deng, Yichun Shi, Minsu Cho, Peng Wang
TL;DR
Zero-shot text-to-3D methods with diffusion priors yield realistic 2D renders but often produce geometrically inconsistent 3D shapes. CorrespondentDream introduces an annotation-free cross-view correspondence loss $\mathcal{L}_{\textrm{corr}}$ derived from diffusion U-Net features to supplement the SDS-based NeRF optimization, using adjacent view sets and a two-stage training schedule. Dense cross-view correspondences are computed from diffusion features and aligned with NeRF reprojections via a weighted Huber loss, with an alternating optimization scheme to balance 2D guidance and 3D geometry corrections. Results show notable improvements in 3D fidelity and a favorable user study outcome, demonstrating enhanced 3D geometry without requiring additional supervision or new priors.
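To make the correspondence term concrete, the following is a minimal NumPy sketch of the two ingredients named above: dense matching between two views' feature maps, and a weighted Huber penalty between the feature-based matches and the reprojected pixel locations. Everything specific here is an assumption for illustration, not the authors' implementation: the function names `nearest_neighbor_matches` and `weighted_huber`, cosine-similarity nearest-neighbour matching, the use of similarity scores as weights, and the threshold `delta`.

```python
import numpy as np

def nearest_neighbor_matches(feat_a, feat_b):
    """Match every pixel of view A to its most similar pixel in view B.

    feat_a, feat_b: (H, W, C) feature maps (hypothetically, diffusion U-Net
    features of two adjacent views). Returns (H*W, 2) matched (x, y)
    coordinates in B and the (H*W,) cosine similarity of each match.
    """
    h, w, c = feat_a.shape
    a = feat_a.reshape(-1, c)
    b = feat_b.reshape(-1, c)
    # L2-normalise so the dot product is cosine similarity.
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    sim = a @ b.T                      # (H*W, H*W) similarity matrix
    idx = sim.argmax(axis=1)           # best match in B for each pixel of A
    score = sim.max(axis=1)
    ys, xs = np.divmod(idx, w)         # flat index -> (row, col)
    matches = np.stack([xs, ys], axis=1).astype(np.float64)
    return matches, score

def weighted_huber(pred, target, weights, delta=1.0):
    """Weighted Huber loss on 2D displacement magnitudes.

    pred, target: (N, 2) pixel coordinates; weights: (N,) non-negative.
    Quadratic for errors up to delta, linear beyond it.
    """
    err = np.linalg.norm(pred - target, axis=1)
    quad = 0.5 * err ** 2
    lin = delta * (err - 0.5 * delta)
    per_point = np.where(err <= delta, quad, lin)
    return float(np.sum(weights * per_point) / (np.sum(weights) + 1e-8))

# Toy usage: identical feature maps should match each pixel to itself.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 4, 16))
matches, score = nearest_neighbor_matches(feats, feats)
# Stand-in for NeRF reprojections (assumed to come from rendered depth).
reprojected = matches + 0.5
loss = weighted_huber(reprojected, matches, weights=np.clip(score, 0.0, None))
```

In an actual optimization loop the reprojected coordinates would be differentiable with respect to the NeRF parameters, so an autodiff framework would replace NumPy; the sketch only shows the shape of the computation.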
Abstract
Leveraging multi-view diffusion models as priors for 3D optimization has alleviated the problem of 3D consistency, e.g., the Janus face problem or the content drift problem, in zero-shot text-to-3D models. However, the 3D geometric fidelity of the output remains an unresolved issue: although the rendered 2D views are realistic, the underlying geometry may contain errors such as unreasonable concavities. In this work, we propose CorrespondentDream, an effective method that leverages annotation-free, cross-view correspondences obtained from the diffusion U-Net to provide an additional 3D prior to the NeRF optimization process. We find that these correspondences are strongly consistent with human perception, and by adopting them in our loss design, we are able to produce NeRF models whose geometries are more coherent with common sense, e.g., smoother object surfaces, yielding higher 3D fidelity. We demonstrate the efficacy of our approach through various comparative qualitative results and a solid user study.
