Unifying Correspondence, Pose and NeRF for Pose-Free Novel View Synthesis from Stereo Pairs

Sunghwan Hong, Jaewoo Jung, Heeseong Shin, Jiaolong Yang, Seungryong Kim, Chong Luo

TL;DR

This work tackles generalizable pose-free novel view synthesis from stereo pairs by introducing CoPoNeRF, a unified framework that jointly learns 2D correspondences, relative camera pose, and NeRF rendering from a shared representation. It combines multi-level cost volumes, self-attention feature aggregation, a cost-volume-based matching distribution, an epipolar-based attention renderer, and a triplet-consistency training objective to enable robust view synthesis under extreme viewpoint changes. Experiments on RealEstate10K and ACID show superior relative pose estimation and competitive novel-view rendering compared with state-of-the-art pose-free NeRFs, while ablations validate the mutual benefits of joint learning and the proposed training strategy. The approach is an end-to-end solution that exploits the interdependencies among correspondence, pose, and rendering to improve both geometry understanding and synthesis quality in challenging real-world settings.
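
The cost volume and the matching distribution named in the summary can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `correlation_4d` and `matching_distribution`, the cosine-similarity choice, and the `temperature` value are all assumptions made for illustration, and only a single feature level is shown.

```python
import numpy as np

def correlation_4d(feat_a, feat_b, eps=1e-8):
    """Cosine similarity between every pixel pair of two (H, W, C) feature maps.

    Returns an array of shape (H_a, W_a, H_b, W_b). (Hypothetical helper.)
    """
    a = feat_a / (np.linalg.norm(feat_a, axis=-1, keepdims=True) + eps)
    b = feat_b / (np.linalg.norm(feat_b, axis=-1, keepdims=True) + eps)
    return np.einsum("ijc,klc->ijkl", a, b)

def matching_distribution(corr, temperature=0.1):
    """Softmax over all target positions for each source pixel.

    The temperature is an assumed value, not one taken from the paper.
    """
    h_a, w_a, h_b, w_b = corr.shape
    logits = corr.reshape(h_a, w_a, h_b * w_b) / temperature
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs.reshape(h_a, w_a, h_b, w_b)

# Toy feature maps standing in for one level of an image encoder.
rng = np.random.default_rng(0)
feat_a = rng.standard_normal((4, 5, 8))
feat_b = rng.standard_normal((4, 5, 8))

corr = correlation_4d(feat_a, feat_b)
probs = matching_distribution(corr)
```

Each slice `probs[i, j]` is a probability map over the second image for pixel `(i, j)` of the first; a multi-level variant would repeat this per feature resolution.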

Abstract

This work delves into the task of pose-free novel view synthesis from stereo pairs, a challenging and pioneering task in 3D vision. Our innovative framework, unlike any before, seamlessly integrates 2D correspondence matching, camera pose estimation, and NeRF rendering, fostering a synergistic enhancement of these tasks. We achieve this through designing an architecture that utilizes a shared representation, which serves as a foundation for enhanced 3D geometry understanding. Capitalizing on the inherent interplay between the tasks, our unified framework is trained end-to-end with the proposed training strategy to improve overall model accuracy. Through extensive evaluations across diverse indoor and outdoor scenes from two real-world datasets, we demonstrate that our approach achieves substantial improvement over previous methodologies, especially in scenarios characterized by extreme viewpoint changes and the absence of accurate camera poses.

Paper Structure (46 sections, 3 equations, 13 figures, 5 tables)

This paper contains 46 sections, 3 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Overview. Given an unposed pair of images, possibly under extreme viewpoint changes and with minimal overlap, our framework jointly performs three tasks -- 2D correspondence estimation, camera pose estimation, and NeRF rendering -- so that each improves the others, enabling high-quality novel view synthesis.
  • Figure 2: Overall architecture of the proposed method. For a pair of images, we extract multi-level feature maps and construct 4D correlation maps at each level, encoding pixel pair similarities. These maps are refined for flow and pose estimation, and the renderer then uses the estimated pose and refined feature maps for color and depth computation.
  • Figure 3: Visualization of epipolar lines. We use the relative camera pose to draw epipolar lines based on the points in (a). Our predictions closely follow the ground truth even under large viewpoint changes.
  • Figure 4: Visualization of correspondences and confidence. We show the 100 most confident matches between the input images; covisible regions are highlighted based on confidence scores.
  • Figure 5: Qualitative comparison on RealEstate10K.
  • ...and 8 more figures
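
Figure 3 draws epipolar lines from a relative camera pose. The standard two-view geometry behind such a plot can be sketched as follows. This is an illustrative sketch, not code from the paper: the helper names are hypothetical, a pinhole camera with known intrinsics `K` is assumed, and the pose convention is assumed to be `X2 = R @ X1 + t` (points mapped from the first camera frame to the second).

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty],
                     [tz, 0.0, -tx],
                     [-ty, tx, 0.0]])

def fundamental_from_pose(K1, K2, R, t):
    """Fundamental matrix F = K2^{-T} [t]_x R K1^{-1} for the assumed convention."""
    E = skew(t) @ R  # essential matrix
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)

def epipolar_line(F, pixel):
    """Line (a, b, c) in image 2 with a*u + b*v + c = 0 for a pixel in image 1.

    Scaled so that (a, b) has unit length, which makes the line-point product
    a distance in pixels. Assumes the line is not at infinity.
    """
    x1 = np.array([pixel[0], pixel[1], 1.0])
    line = F @ x1
    return line / np.linalg.norm(line[:2])

# Toy setup: shared intrinsics, 20 degree rotation about the y axis, small translation.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
angle = np.deg2rad(20.0)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([0.5, 0.0, 0.1])

# Project one 3D point into both views.
X1 = np.array([0.3, -0.2, 4.0])
X2 = R @ X1 + t
x1 = (K @ X1)[:2] / X1[2]
x2 = (K @ X2)[:2] / X2[2]

F = fundamental_from_pose(K, K, R, t)
line = epipolar_line(F, x1)
# Signed distance (in pixels) from the true match to the epipolar line.
residual = line @ np.array([x2[0], x2[1], 1.0])
```

With an exact pose the residual is zero up to floating-point error; with an estimated pose, the same quantity measures how far the predicted epipolar line drifts from the true correspondence.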