Explicit Correspondence Matching for Generalizable Neural Radiance Fields

Yuedong Chen, Haofei Xu, Qianyi Wu, Chuanxia Zheng, Tat-Jen Cham, Jianfei Cai

TL;DR

This paper tackles the challenge of generalizable neural radiance fields that can render novel views from very few inputs without per-scene optimization. It introduces MatchNeRF, which explicitly models cross-view feature correspondence as a geometry prior by using a Transformer-based encoder to align multi-view features and a group-wise cosine similarity computed on projected 2D features to guide the NeRF decoder. The method achieves state-of-the-art results on DTU, Real Forward-Facing, Blender, and Tanks & Temples across 2- and 3-view setups, and demonstrates robustness to reference-view selection and improved depth reconstruction. The approach is notable for its view-agnostic design and its potential to generalize across different 3D representations, offering a practical feed-forward alternative to costly cost-volume-based methods and a foundation for future extensions in occlusion handling and explicit optimization-free 3D reconstruction.

Abstract

We present a new generalizable NeRF method that is able to directly generalize to new unseen scenarios and perform novel view synthesis with as few as two source views. The key to our approach lies in the explicitly modeled correspondence matching information, so as to provide the geometry prior to the prediction of NeRF color and density for volume rendering. The explicit correspondence matching is quantified with the cosine similarity between image features sampled at the 2D projections of a 3D point on different views, which is able to provide reliable cues about the surface geometry. Unlike previous methods where image features are extracted independently for each view, we consider modeling the cross-view interactions via Transformer cross-attention, which greatly improves the feature matching quality. Our method achieves state-of-the-art results on different evaluation settings, with the experiments showing a strong correlation between our learned cosine feature similarity and volume density, demonstrating the effectiveness and superiority of our proposed method. The code and model are on our project page: https://donydchen.github.io/matchnerf
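The matching cue described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `groupwise_cosine` and `merge_pairwise`, the `num_groups` and `eps` parameters, and the toy feature vectors are all assumptions made for the example. It assumes the feature vectors have already been sampled at the 2D projections of one 3D point, splits the channels into groups, computes one cosine similarity per group for each view pair, and averages the pair-wise results element-wise.

```python
import numpy as np
from itertools import combinations


def groupwise_cosine(f_i, f_j, num_groups, eps=1e-8):
    """Cosine similarity per channel group between two sampled feature vectors.

    f_i, f_j: 1D arrays of equal length C, with C divisible by num_groups.
    Returns an array of shape (num_groups,).
    """
    g_i = f_i.reshape(num_groups, -1)
    g_j = f_j.reshape(num_groups, -1)
    dot = (g_i * g_j).sum(axis=1)
    norm = np.linalg.norm(g_i, axis=1) * np.linalg.norm(g_j, axis=1)
    # eps (an assumed stabiliser) avoids division by zero for all-zero groups.
    return dot / (norm + eps)


def merge_pairwise(features, num_groups):
    """Element-wise average of group-wise cosine similarity over all view pairs."""
    sims = [
        groupwise_cosine(features[i], features[j], num_groups)
        for i, j in combinations(range(len(features)), 2)
    ]
    return np.mean(sims, axis=0)


# Toy usage: three views, 8-channel features, 2 groups (all values hypothetical).
rng = np.random.default_rng(0)
views = [rng.normal(size=8) for _ in range(3)]
z = merge_pairwise(views, num_groups=2)
print(z.shape)
```

With two views there is a single pair and the average is a no-op; with more views the merged vector keeps the same length regardless of how many views are supplied, which is consistent with the view-agnostic design mentioned in the TL;DR.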

Paper Structure (33 sections, 6 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: Correlation between cosine feature similarity and volume density. We first extract image features via a Transformer by considering cross-view interactions. Then, we explicitly fetch the correspondence feature matching information by computing the cosine similarity between sampled features, which shows strong correlation with volume density and thus provides valuable geometric cues for density prediction.
  • Figure 2: MatchNeRF overview. Given $N$ input images, we extract the Transformer features and compute the cosine similarity in a pair-wise manner, and finally merge all pair-wise cosine similarities with element-wise average. (i) For image pair ${\bm I}_i$ and ${\bm I}_j$, we first extract downsampled convolutional features with a weight-sharing CNN. The convolutional features are then fed into a Transformer to model cross-view interactions with cross-attention (Sec. \ref{sec:feature}). (ii) To predict the color and volume density of a point on a ray for volume rendering, we project the 3D point into the 2D Transformer features ${\bm F}_i$ and ${\bm F}_j$ with the camera parameters and bilinearly sample the feature vectors ${\bm f}_i$ and ${\bm f}_j$ at the projected locations. We then compute the cosine similarity ${\bm z} = \cos({\bm f}_i, {\bm f}_j)$ between sampled features to encode the correspondence matching information (Sec. \ref{sec:matching}). (iii) ${\bm z}$ is next used with the 3D position $\bm p$ and 2D view direction $\bm d$ for color $\bm c$ and density $\sigma$ prediction. An additional ray Transformer is used to model cross-point interactions along a ray (Sec. \ref{sec:nerf_decoder}).
  • Figure 3: Qualitative results on Blender (1st row), DTU (2nd row) and RFF (3rd row). We showcase the visual results of MVSNeRF and our MatchNeRF method. The input views are the 3 viewpoints nearest to the target one, and the first input view is the reference view for MVSNeRF. Our MatchNeRF reconstructs better details ('leaves' scene of RFF) and produces fewer background artifacts ('doll' scene of DTU). The construction of the cost volume in MVSNeRF requires all other views to be warped to the reference view, which results in poor quality when some views are clearly different from the reference view ('chair' scene of Blender, elaborated in Appendix \ref{sec:app_cv_limit}). Quantitative results measured over the whole image are shown below each image, in the order PSNR, SSIM and LPIPS.
  • Figure 4: Relationship between the learned cosine similarity and volume density. Four pixels are randomly selected from the foreground of a DTU test scene ('scan63'). For each pixel, we showcase the learned cosine similarity (scalar value, predicted by the ablation model 'cosine' in Table \ref{tab:ablations}) and volume density of all sampled points along the corresponding ray. The strong correlation demonstrates that our proposed cosine similarity is able to provide valuable geometric cues for volume density prediction.
  • Figure 5: Visual results of rendered depth maps on DTU. Background regions are masked out for depth maps rendered by both methods since ground-truth values are not available for those regions. MatchNeRF reconstructs better depth with sharper borders.
  • ...and 3 more figures
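
Step (ii) of the Figure 2 caption, projecting a 3D point into a view and bilinearly sampling the feature map at the projected location, can be sketched as follows. This is a hedged illustration under stated assumptions rather than the paper's code: it assumes a pinhole camera with intrinsics `K` and world-to-camera extrinsics `R`, `t`, a feature map stored as an `(H, W, C)` array with pixel coordinates `(u, v) = (x, y)`, and border clamping for out-of-range projections. The names `project_point` and `bilinear_sample` are hypothetical.

```python
import numpy as np


def project_point(p_world, K, R, t):
    """Project a 3D world point to pixel coordinates (u, v) with a pinhole camera.

    K: (3, 3) intrinsics; R: (3, 3) and t: (3,) map world to camera coordinates.
    Assumes the point lies in front of the camera (positive depth).
    """
    p_cam = R @ p_world + t
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]


def bilinear_sample(feat, uv):
    """Bilinearly interpolate an (H, W, C) feature map at pixel location uv = (u, v).

    Coordinates outside the map are clamped to the border (an assumption).
    """
    h, w, _ = feat.shape
    u = float(np.clip(uv[0], 0, w - 1))
    v = float(np.clip(uv[1], 0, h - 1))
    x0, y0 = int(np.floor(u)), int(np.floor(v))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = u - x0, v - y0
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bottom = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bottom


# Toy usage with a hypothetical camera and a random 8-channel feature map.
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
feat = np.random.default_rng(0).normal(size=(100, 100, 8))
uv = project_point(np.array([0.1, -0.2, 2.0]), K, R, t)
f = bilinear_sample(feat, uv)
print(uv, f.shape)
```

Repeating this for each source view yields one feature vector per view for the same 3D point, which is the input the cosine similarity in the caption operates on.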