Unifying Correspondence, Pose and NeRF for Pose-Free Novel View Synthesis from Stereo Pairs
Sunghwan Hong, Jaewoo Jung, Heeseong Shin, Jiaolong Yang, Seungryong Kim, Chong Luo
TL;DR
This work tackles pose-free generalizable novel view synthesis from stereo pairs by introducing CoPoNeRF, a unified framework that jointly learns 2D correspondences, relative camera pose, and NeRF rendering from a shared representation. It combines multi-level cost volumes, self-attention feature aggregation, a cost-volume-based matching distribution, an epipolar-based attention renderer, and a triplet-consistency training objective to enable robust view synthesis under extreme viewpoint changes. Experiments on RealEstate10K and ACID show superior relative pose estimation and competitive novel-view rendering compared with state-of-the-art pose-free NeRFs, while ablations validate the mutual benefits of joint learning and the proposed training strategy. The approach offers a practical, end-to-end solution that exploits the interdependencies among correspondence, pose, and rendering to improve both geometry understanding and synthesis quality in challenging real-world settings.
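To make the "cost-volume-based matching distribution" concrete, the following is a minimal NumPy sketch of the general idea, not the authors' implementation. It builds a single-level cosine-similarity cost volume between two feature maps and turns each row into a softmax distribution over candidate matches. The function names, the single-level setup, and the temperature value are assumptions chosen for illustration; the paper's model uses multi-level cost volumes and learned aggregation, which are omitted here.

```python
import numpy as np

def cost_volume(feat_a, feat_b):
    """Cosine similarity between every pixel pair of two (H, W, C) feature maps.

    Hypothetical helper: returns an (H*W, H*W) matrix whose entry (i, j)
    scores pixel i of view A against pixel j of view B.
    """
    c = feat_a.shape[-1]
    a = feat_a.reshape(-1, c)
    b = feat_b.reshape(-1, c)
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return a @ b.T

def matching_distribution(cost, temperature=0.05):
    """Row-wise softmax over the cost volume (temperature is an assumed value)."""
    logits = cost / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def expected_correspondence(probs, h, w):
    """Soft-argmax: expected (x, y) location in view B for each pixel of view A."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
    return (probs @ grid).reshape(h, w, 2)
```

In a joint framework of this kind, such a distribution could feed both a pose estimation head and the renderer, which is one way the shared representation can benefit all three tasks.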
Abstract
This work delves into the task of pose-free novel view synthesis from stereo pairs, a challenging and pioneering task in 3D vision. Our framework integrates 2D correspondence matching, camera pose estimation, and NeRF rendering so that the three tasks reinforce one another. We achieve this by designing an architecture that uses a shared representation, which serves as a foundation for enhanced 3D geometry understanding. Capitalizing on the inherent interplay between the tasks, our unified framework is trained end-to-end with the proposed training strategy to improve overall model accuracy. Through extensive evaluations across diverse indoor and outdoor scenes from two real-world datasets, we demonstrate that our approach achieves substantial improvement over previous methodologies, especially in scenarios characterized by extreme viewpoint changes and the absence of accurate camera poses.
