NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the Wild

Weining Ren, Zihan Zhu, Boyang Sun, Jiaqi Chen, Marc Pollefeys, Songyou Peng

TL;DR

NeRF On-the-go tackles the problem of reconstructing static scene radiance fields from casually captured dynamic environments by predicting per-pixel uncertainty from DINOv2 features to suppress distractors during NeRF training. It introduces a decoupled optimization framework with an SSIM-based uncertainty loss and a dilated patch sampling strategy to enhance context and convergence speed; the uncertainty predictor is regularized to be spatially and temporally coherent. Empirical results across indoor/outdoor scenes, including the RobustNeRF and On-the-go datasets, show significant improvements over state-of-the-art methods in PSNR, SSIM, and LPIPS, along with up to an order-of-magnitude faster convergence. The approach yields robust distractor removal in diverse real-world conditions and demonstrates applicability to static scenes while highlighting remaining challenges in regions with strong view-dependent effects.
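To make the dilated patch sampling idea concrete, here is a minimal sketch of how one such patch of ray coordinates might be chosen. The function name `dilated_patch_coords`, its parameters, and the example patch size and dilation are illustrative assumptions, not the authors' implementation or settings. The point it shows is that spacing the sampled pixels `dilation` apart lets a fixed number of rays cover a larger image region, which is the "enhanced context" the summary refers to.

```python
import random


def dilated_patch_coords(height, width, patch_size, dilation, rng=random):
    """Return (row, col) pixel coordinates for one dilated patch.

    Hypothetical helper: a patch_size x patch_size grid whose neighbouring
    samples are `dilation` pixels apart. dilation=1 gives an ordinary patch.
    """
    # Side length, in pixels, of the image region the dilated patch spans.
    extent = (patch_size - 1) * dilation + 1
    if extent > height or extent > width:
        raise ValueError("dilated patch does not fit inside the image")

    # Pick the top-left corner uniformly so the whole patch stays in bounds.
    top = rng.randrange(height - extent + 1)
    left = rng.randrange(width - extent + 1)

    return [
        (top + i * dilation, left + j * dilation)
        for i in range(patch_size)
        for j in range(patch_size)
    ]


# Example with illustrative values: 16 rays spanning a 10x10 pixel region.
coords = dilated_patch_coords(64, 64, patch_size=4, dilation=3,
                              rng=random.Random(0))
```

With the same ray budget, `dilation=1` would cover only a 4x4 region, so the dilated variant trades local density for wider spatial context.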

Abstract

Neural Radiance Fields (NeRFs) have shown remarkable success in synthesizing photorealistic views from multi-view images of static scenes, but face challenges in dynamic, real-world environments with distractors like moving objects, shadows, and lighting changes. Existing methods manage controlled environments and low occlusion ratios but fall short in render quality, especially under high occlusion scenarios. In this paper, we introduce NeRF On-the-go, a simple yet effective approach that enables the robust synthesis of novel views in complex, in-the-wild scenes from only casually captured image sequences. Delving into uncertainty, our method not only efficiently eliminates distractors, even when they are predominant in captures, but also achieves a notably faster convergence speed. Through comprehensive experiments on various scenes, our method demonstrates a significant improvement over state-of-the-art techniques. This advancement opens new avenues for NeRF in diverse and dynamic real-world applications.

Paper Structure (48 sections, 18 equations, 15 figures, 10 tables)

This paper contains 48 sections, 18 equations, 15 figures, 10 tables.

Figures (15)

  • Figure 1: NeRF On-the-go. Given casually captured image sequences or videos in the wild as inputs, the goal of this paper is to train a NeRF for static scenes and effectively remove all dynamic elements in the scenes (cars, trams, pedestrians, etc.), i.e., distractors. Unlike existing methods such as NeRF-W and RobustNeRF, which produce imperfect results, our method leverages the predicted uncertainty maps to effectively remove those distractors. This results in high-fidelity novel view synthesis on challenging dynamic scenes.
  • Figure 2: Pipeline. A pre-trained DINOv2 network extracts feature maps from posed images, followed by a dilated patch sampler that selects rays. The uncertainty MLP $G$ then takes the DINOv2 features of these rays as inputs to generate the uncertainties $\beta(\mathbf{r})$. Three losses (on the right) are used to optimize $G$ and the NeRF model. Note that the training process is facilitated by detaching the gradient flows as indicated by the colored dashed lines.
  • Figure 3: SSIM Can Effectively Distinguish Distractors. In this scene from RobustNeRF, the 3 wooden robots are the dynamic elements. SSIM pinpoints distractors by leveraging discrepancies in three measurements: luminance, contrast, and structure. Conversely, relying solely on the $\ell_2$ error between RGB values (luminance error) proves challenging, especially when the distractors and background have similar colors. The color bar on the right side indicates the correspondence for error interpretation.
  • Figure 4: Comparison of Different Ray Sampling Strategies. In contrast to random sampling and patch sampling, dilated patch sampling can improve training efficiency and uncertainty learning.
  • Figure 5: On-the-go Dataset. Sample training images showing the distractors in several scenes of our self-captured dataset.
  • ...and 10 more figures
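
The three SSIM measurements named in the Figure 3 caption (luminance, contrast, and structure) can be illustrated with a small sketch. It uses the standard SSIM component definitions on two flattened grayscale patches. The helper name `ssim_components`, the toy patches, and the constants (the common defaults $C_1 = (0.01L)^2$, $C_2 = (0.03L)^2$, $C_3 = C_2/2$ for dynamic range $L$) are assumptions for illustration; how the paper combines these terms into its uncertainty loss is not reproduced here.

```python
import math


def ssim_components(x, y, dynamic_range=1.0):
    """Return (luminance, contrast, structure) for two equal-length patches.

    Hypothetical helper using the standard SSIM component formulas with
    population statistics over flattened grayscale patches.
    """
    if len(x) != len(y) or not x:
        raise ValueError("patches must be non-empty and of equal length")
    n = len(x)
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    c3 = c2 / 2.0

    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)

    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    structure = (cov_xy + c3) / (sd_x * sd_y + c3)
    return luminance, contrast, structure


# Toy patches: same mean brightness and contrast, but inverted texture.
rendered = [0.4, 0.6] * 8
observed = [0.6, 0.4] * 8
lum, con, struct = ssim_components(rendered, observed)
```

In this toy case the luminance and contrast terms are close to 1 because the two patches share the same mean and spread, while the structure term is negative because their textures are anti-correlated. This mirrors the caption's point that structural cues can flag a mismatch even when average colors agree.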

Theorems & Definitions (1)

  • proof