NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows
Zhenggang Tang, Zhongzheng Ren, Xiaoming Zhao, Bowen Wen, Jonathan Tremblay, Stan Birchfield, Alexander Schwing
TL;DR
NeRFDeformer tackles transforming a NeRF from a single RGBD observation of a non-rigidly transformed scene by modeling the transformation as a 3D scene flow. The flow is defined as a weighted linear blend of rigid transformations anchored at surface mesh vertices, enabling both forward ($F^{A\rightarrow B}$) and backward ($F^{B\rightarrow A}$) mappings that link the original scene $A$ to the transformed scene $B$ and support rendering of $B$ from novel viewpoints. A robust NeRF-based correspondence pipeline combines dense 2D matches (via ASpanFormer) with 3D filtering, grounding anchor points and informing an embedded deformation graph optimized with $L_{ARAP}$ and a consistency loss $L_{Con}$. The authors contribute a new 113-scene Objaverse-derived dataset, demonstrate superior performance over NeRF editing and diffusion baselines on both geometry and appearance metrics, and show ablations that validate the design choices. This work enables automatic, single-view, non-rigid NeRF editing with practical implications for robotics and dynamic scene manipulation without re-capturing the entire scene.
Abstract
We present a method for automatically modifying a NeRF representation based on a single observation of a non-rigidly transformed version of the original scene. Our method defines the transformation as a 3D flow, specifically as a weighted linear blending of rigid transformations of 3D anchor points that are defined on the surface of the scene. In order to identify anchor points, we introduce a novel correspondence algorithm that first matches RGB-based pairs, then leverages multi-view information and 3D reprojection to robustly filter false positives in two steps. We also introduce a new dataset for exploring the problem of modifying a NeRF scene through a single observation. Our dataset (https://github.com/nerfdeformer/nerfdeformer) contains 113 synthetic scenes built from 47 3D assets. We show that our proposed method outperforms NeRF editing methods as well as diffusion-based methods, and we also explore different methods for filtering correspondences.
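To make the phrase "weighted linear blending of rigid transformations of 3D anchor points" concrete, the following is a minimal NumPy sketch of one way such a flow could be evaluated. It is not the authors' implementation: the function names `blend_weights` and `apply_flow`, the normalized Gaussian weighting, and the `sigma` parameter are all assumptions made for illustration, and the paper's actual weighting scheme may differ. Each anchor $g_j$ carries a rotation $R_j$ and a translation $t_j$, and a point $x$ is mapped to $\sum_j w_j(x)\,[R_j (x - g_j) + g_j + t_j]$.

```python
import numpy as np

def blend_weights(points, anchors, sigma=0.5):
    """Normalized Gaussian weights of each point w.r.t. each anchor (assumed scheme).

    points:  (N, 3) query points
    anchors: (K, 3) anchor positions
    returns: (N, K) weights, each row summing to 1
    """
    d2 = ((points[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=-1)
    # Subtract the per-point minimum so at least one weight is exactly 1,
    # which avoids a zero denominator for points far from every anchor.
    d2 = d2 - d2.min(axis=1, keepdims=True)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return w / w.sum(axis=1, keepdims=True)

def apply_flow(points, anchors, rotations, translations, sigma=0.5):
    """Map points by a weighted linear blend of per-anchor rigid transforms.

    rotations:    (K, 3, 3) rotation matrix per anchor
    translations: (K, 3) translation per anchor
    returns:      (N, 3) transformed points
    """
    w = blend_weights(points, anchors, sigma)
    local = points[:, None, :] - anchors[None, :, :]            # (N, K, 3)
    # Per-anchor rigid transform of every point: R_j (x - g_j) + g_j + t_j
    moved = (np.einsum("kij,nkj->nki", rotations, local)
             + anchors[None, :, :] + translations[None, :, :])  # (N, K, 3)
    # Blend the K candidate positions with the per-point weights.
    return np.einsum("nk,nki->ni", w, moved)

# Hypothetical usage: three anchors that all translate by the same offset.
anchors = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
rotations = np.tile(np.eye(3), (3, 1, 1))
translations = np.tile(np.array([0.1, 0.2, 0.3]), (3, 1))
points = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]])
warped = apply_flow(points, anchors, rotations, translations)
```

Because the weights sum to one, identical rigid transforms at every anchor reduce to a single rigid motion of the whole scene; non-rigid deformation arises only when the per-anchor transforms differ. Under this sketch, a backward flow from scene $B$ to scene $A$ would be evaluated the same way with its own anchors and transforms.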
