MVSNeRF: Fast Generalizable Radiance Field Reconstruction from Multi-View Stereo
Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, Hao Su
TL;DR
MVSNeRF reconstructs a neural radiance field from only three nearby input views using fast network inference, rather than per-scene optimization on densely captured images. It combines plane-swept cost volumes from multi-view stereo with physically based volume rendering to produce geometry-aware radiance fields. The network is trained on real objects in the DTU dataset and tested on three datasets, where it generalizes across scenes, including indoor scenes that differ completely from the training objects. When dense images are available, the estimated radiance field can be fine-tuned, yielding higher rendering quality and substantially less optimization time than NeRF.
Abstract
We present MVSNeRF, a novel neural rendering approach that can efficiently reconstruct neural radiance fields for view synthesis. Unlike prior works on neural radiance fields that consider per-scene optimization on densely captured images, we propose a generic deep neural network that can reconstruct radiance fields from only three nearby input views via fast network inference. Our approach leverages plane-swept cost volumes (widely used in multi-view stereo) for geometry-aware scene reasoning, and combines this with physically based volume rendering for neural radiance field reconstruction. We train our network on real objects in the DTU dataset, and test it on three different datasets to evaluate its effectiveness and generalizability. Our approach can generalize across scenes (even indoor scenes, completely different from our training scenes of objects) and generate realistic view synthesis results using only three input images, significantly outperforming concurrent works on generalizable radiance field reconstruction. Moreover, if dense images are captured, our estimated radiance field representation can be easily fine-tuned; this leads to fast per-scene reconstruction with higher rendering quality and substantially less optimization time than NeRF.
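The two ingredients named in the abstract, a plane-swept cost volume and volume rendering, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `variance_cost_volume` and `composite_ray` are hypothetical, the features are assumed to be already warped onto the reference view's depth planes, the variance across views is one common multi-view stereo cost and is an assumption here, and the compositing follows the standard NeRF quadrature, which may differ in detail from the paper's formulation.

```python
import numpy as np

def variance_cost_volume(warped_feats):
    """Build a cost volume from per-view features (illustrative assumption).

    warped_feats: array of shape (V, D, H, W, C) holding features from V views,
    assumed to be already warped onto D depth planes of the reference view.
    Returns the per-voxel variance across views, shape (D, H, W, C).
    Low variance means the views agree at that depth.
    """
    return warped_feats.var(axis=0)

def composite_ray(sigmas, colors, deltas):
    """Composite samples along one ray with the standard NeRF quadrature.

    sigmas: (N,) volume densities at the samples
    colors: (N, 3) RGB radiance at the samples
    deltas: (N,) distances between adjacent samples
    Returns the rendered RGB color and the per-sample weights.
    """
    # Opacity contributed by each sample segment.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights

if __name__ == "__main__":
    # Three views that agree everywhere give zero matching cost.
    feats = np.ones((3, 4, 2, 2, 8))
    print(variance_cost_volume(feats).max())

    # A nearly opaque first sample hides everything behind it.
    rgb, w = composite_ray(
        np.array([1e3, 1.0]),
        np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
        np.array([1.0, 1.0]),
    )
    print(rgb, w)
```

In a pipeline of the kind the abstract describes, a network would map the cost volume to per-sample densities and colors, and the compositing step would turn those samples into pixel colors that can be compared against ground-truth images.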
