NerfBaselines: Consistent and Reproducible Evaluation of Novel View Synthesis Methods

Jonas Kulhanek, Torsten Sattler

TL;DR

Novel view synthesis methods (NeRFs and 3D Gaussian Splatting) face inconsistent evaluation protocols that impede fair comparison and, in turn, measurable progress. NerfBaselines provides a standardized, reproducible evaluation framework with wrappers around official code, unified datasets, and a shared protocol, plus an online benchmark and interactive viewer. By reproducing published results and running cross-dataset analyses (e.g., Mip-NeRF 360, Blender, Tanks & Temples), the work demonstrates that small protocol shifts can invert method rankings, underscoring the need for consistent benchmarking. The framework lowers adoption barriers, enabling robust, scalable comparisons across diverse methods and datasets and advancing reliable progress in novel view synthesis.
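
To make the "wrappers around official code" idea concrete, below is a minimal sketch of the kind of uniform per-method interface such a framework could enforce. This is an illustration only: the names (NVSMethod, setup_train, train_iteration, render) are hypothetical and do not reproduce the actual NerfBaselines API.

```python
# Hypothetical sketch of a uniform method-wrapper interface; these names are
# illustrative assumptions, NOT the actual NerfBaselines API.
from typing import Iterable, Protocol

import numpy as np


class NVSMethod(Protocol):
    """The surface every wrapped novel-view-synthesis method would expose."""

    def setup_train(self,
                    images: Iterable[np.ndarray],
                    cameras: Iterable[np.ndarray]) -> None:
        """Ingest training images and camera parameters in one shared format."""
        ...

    def train_iteration(self, step: int) -> dict:
        """Run a single optimization step and return logged values (loss, ...)."""
        ...

    def render(self, camera: np.ndarray) -> np.ndarray:
        """Render a novel view as an HxWx3 uint8 image for evaluation."""
        ...
```

Because every wrapped codebase answers the same few calls, a single benchmarking loop and a single metric implementation can drive all methods, which is what makes the resulting numbers comparable.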

Abstract

Novel view synthesis is an important problem with many applications, including AR/VR, gaming, and robotic simulations. With the recent rapid development of Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) methods, it is becoming difficult to keep track of the current state of the art (SoTA) due to methods using different evaluation protocols, codebases being difficult to install and use, and methods not generalizing well to novel 3D scenes. In our experiments, we show that even tiny differences in the evaluation protocols of various methods can artificially boost the performance of these methods. This raises questions about the validity of quantitative comparisons performed in the literature. To address these questions, we propose NerfBaselines, an evaluation framework which provides consistent benchmarking tools, ensures reproducibility, and simplifies the installation and use of various methods. We validate our implementation experimentally by reproducing the numbers reported in the original papers. For improved accessibility, we release a web platform that compares commonly used methods on standard benchmarks. We strongly believe NerfBaselines is a valuable contribution to the community as it ensures that quantitative results are comparable and thus truly measure progress in the field of novel view synthesis.
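
The abstract's central claim, that tiny protocol differences shift scores, is easy to demonstrate. The self-contained sketch below evaluates the same synthetic prediction against the same ground truth and changes only the image-downscaling filter before computing PSNR. All images, filters, and parameters here are illustrative assumptions, not the evaluation protocol of any specific paper.

```python
# Demonstration that the choice of downscaling filter alone changes PSNR.
# Images are synthetic and all settings are illustrative assumptions.
import numpy as np
from PIL import Image

rng = np.random.default_rng(0)

# Synthetic full-resolution "ground truth" and a slightly noisy "prediction".
gt_full = (rng.random((512, 512, 3)) * 255).astype(np.uint8)
pred_full = np.clip(
    gt_full.astype(np.float64) + rng.normal(0.0, 8.0, gt_full.shape),
    0, 255).astype(np.uint8)


def psnr(a: np.ndarray, b: np.ndarray) -> float:
    """PSNR in dB for uint8 images: 10 * log10(255^2 / MSE)."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)


def downscale(img: np.ndarray, factor: int, resample: int) -> np.ndarray:
    """Downscale an HxWx3 uint8 image by an integer factor with a given filter."""
    h, w = img.shape[:2]
    pil = Image.fromarray(img).resize((w // factor, h // factor), resample=resample)
    return np.asarray(pil)


# Identical prediction, identical ground truth -- only the filter differs.
for name, resample in [("bilinear", Image.BILINEAR), ("lanczos", Image.LANCZOS)]:
    score = psnr(downscale(gt_full, 4, resample), downscale(pred_full, 4, resample))
    print(f"{name:8s} PSNR: {score:.2f} dB")
```

On this toy data, the two filters report different PSNR for the exact same prediction; a shared protocol removes this degree of freedom by pinning one filter (and one metric implementation) for every method.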

Paper Structure

This paper contains 18 sections, 6 figures, and 11 tables.

Figures (6)

  • Figure 1: Impact of altering the evaluation protocol. By changing how images were downscaled, Gsplat [ye2024gsplat] increased its PSNR rank by 3 places on the Mip-NeRF 360 dataset [barron2022mip360]. A toy numerical sketch of this rank-inversion effect follows this list.
  • Figure 2: Existing codebases. Integrated methods are shown in bold green.
  • Figure 3: The NerfBaselines Viewer enables interactive rendering and shows the train/test cameras and the input point cloud. The figure shows the trajectory editor used to render custom camera trajectories.
  • Figure 4: Mip-NeRF 360 [barron2022mip360] and Blender [mildenhall2021nerf] results comparing PSNRs obtained via NerfBaselines with those reported in the original papers. We show the difference in PSNR; in most cases it is $<1\%$. Instant-NGP [muller2022ingp] and Mip-Splatting [yu2023mip3dgs] consistently underperform because their papers used different evaluation protocols.
  • Figure 5: Qualitative results. We compare methods on views close to and far from the training trajectory. Top: Mip-NeRF 360/stump scene; bottom: T&T/Auditorium.
  • ...and 1 more figure
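
To complement Figure 1, the minimal sketch below shows the arithmetic behind a rank inversion. Every PSNR value is invented purely for illustration; none is a measurement from the paper. It shows that per-method shifts well under 0.2 dB, of the size a changed downscaling filter can introduce, are enough to move a method up 3 places, as Gsplat does in Figure 1.

```python
# Toy numbers only: all PSNR values below are invented to illustrate the
# mechanics of a rank inversion, not measured. Per-method shifts of < 0.15 dB
# between the two "protocols" move method "D" from 4th place to 1st.
scores_protocol_a = {"A": 27.45, "B": 27.41, "C": 27.38, "D": 27.33}
scores_protocol_b = {"A": 27.40, "B": 27.39, "C": 27.42, "D": 27.46}


def ranking(scores: dict) -> list:
    """Methods ordered from best to worst PSNR."""
    return sorted(scores, key=scores.get, reverse=True)


print("protocol A:", ranking(scores_protocol_a))  # ['A', 'B', 'C', 'D']
print("protocol B:", ranking(scores_protocol_b))  # ['D', 'C', 'A', 'B']
```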