
NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows

Zhenggang Tang, Zhongzheng Ren, Xiaoming Zhao, Bowen Wen, Jonathan Tremblay, Stan Birchfield, Alexander Schwing

TL;DR

NeRFDeformer transforms an existing NeRF using a single RGBD observation of a non-rigidly transformed scene, modeling the transformation as a 3D scene flow. The flow is a weighted linear blend of rigid transformations anchored at surface mesh vertices. It provides both a forward mapping ($F^{A\rightarrow B}$) and a backward mapping ($F^{B\rightarrow A}$) between the original scene $A$ and the transformed scene $B$, which allows $B$ to be rendered from novel viewpoints. A robust NeRF-based correspondence pipeline combines dense 2D matches (from ASpanFormer) with 3D filtering. The surviving matches ground the anchor points and constrain an embedded deformation graph optimized with an as-rigid-as-possible loss $L_{ARAP}$ and a consistency loss $L_{Con}$. The authors contribute a new 113-scene dataset derived from Objaverse, report better geometry and appearance metrics than NeRF editing and diffusion-based baselines, and present ablations that support their design choices. This work enables automatic, single-view, non-rigid NeRF editing without re-capturing the entire scene, with practical implications for robotics and dynamic scene manipulation.
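One way to write the flow, regularizer, and rendering rule summarized above is the classic embedded deformation graph formulation (Sumner et al.). The exact weights $w_k$, the precise form of $L_{ARAP}$, and the rendering rule below are assumptions for illustration, not equations quoted from the paper. Each anchor $\mathbf{v}_k$ carries a rigid transformation $\xi_k = (R_k, \mathbf{t}_k)$:

$$
F^{A\rightarrow B}(\mathbf{x}) = \sum_{k \in \mathcal{N}_K(\mathbf{x})} w_k(\mathbf{x}) \left[ R_k(\mathbf{x} - \mathbf{v}_k) + \mathbf{v}_k + \mathbf{t}_k \right], \qquad \sum_{k \in \mathcal{N}_K(\mathbf{x})} w_k(\mathbf{x}) = 1
$$

$$
L_{ARAP} = \sum_{i} \sum_{j \in \mathcal{N}(i)} \left\| R_i(\mathbf{v}_j - \mathbf{v}_i) + \mathbf{v}_i + \mathbf{t}_i - (\mathbf{v}_j + \mathbf{t}_j) \right\|_2^2
$$

$$
\sigma^{B}(\mathbf{x}) = \sigma^{A}\!\left(F^{B\rightarrow A}(\mathbf{x})\right), \qquad \mathbf{c}^{B}(\mathbf{x}, \mathbf{d}) = \mathbf{c}^{A}\!\left(F^{B\rightarrow A}(\mathbf{x}), \mathbf{d}\right)
$$

Here $\mathcal{N}_K(\mathbf{x})$ is the set of the $K$ nearest anchors to $\mathbf{x}$, $\mathcal{N}(i)$ denotes the graph neighbors of anchor $i$, and $\sigma$, $\mathbf{c}$ are the NeRF density and color. The ARAP term penalizes neighboring anchors whose transformations disagree, which keeps the deformation locally rigid. The last line renders $B$ by pulling each sample point back into $A$ with the backward flow and querying the original NeRF; for simplicity, this sketch leaves the view direction $\mathbf{d}$ unwarped.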

Abstract

We present a method for automatically modifying a NeRF representation based on a single observation of a non-rigidly transformed version of the original scene. Our method defines the transformation as a 3D flow, specifically as a weighted linear blend of rigid transformations of 3D anchor points that are defined on the surface of the scene. To identify anchor points, we introduce a novel correspondence algorithm that first matches RGB-based pairs, then leverages multi-view information and 3D reprojection to robustly filter false positives in two steps. We also introduce a new dataset for exploring the problem of modifying a NeRF scene through a single observation. Our dataset (https://github.com/nerfdeformer/nerfdeformer) contains 113 synthetic scenes leveraging 47 3D assets. We show that our proposed method outperforms NeRF editing methods as well as diffusion-based methods, and we also explore different methods for filtering correspondences.
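The forward flow described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `forward_flow`, the Gaussian weighting with a bandwidth equal to the distance of the $K$-th nearest anchor, and the default `k=3` (chosen to match the $K=3$ of Figure 3) are all hypothetical choices. Each query point is moved by every one of its $K$ nearest anchors' rigid transforms, and the resulting candidate positions are linearly blended.

```python
import numpy as np


def forward_flow(x, anchors, rotations, translations, k=3):
    """Hypothetical sketch of F^{A->B}: warp query points from scene A to scene B.

    x:            (M, 3) query points in scene A.
    anchors:      (N, 3) anchor points v_i (e.g. surface mesh vertices).
    rotations:    (N, 3, 3) per-anchor rotations R_i.
    translations: (N, 3) per-anchor translations t_i.
    k:            number of nearest anchors blended per query point.
    """
    x = np.atleast_2d(np.asarray(x, dtype=float))
    # Distances from every query point to every anchor: (M, N).
    dist = np.linalg.norm(x[:, None, :] - anchors[None, :, :], axis=-1)
    # Indices and distances of the k nearest anchors: (M, k).
    idx = np.argsort(dist, axis=1)[:, :k]
    d_knn = np.take_along_axis(dist, idx, axis=1)
    # Assumed weighting: Gaussian falloff whose bandwidth is the distance to
    # the k-th nearest anchor, normalized so the weights sum to one.
    sigma = d_knn[:, -1:] + 1e-8
    w = np.exp(-((d_knn / sigma) ** 2))
    w /= w.sum(axis=1, keepdims=True)
    v, R, t = anchors[idx], rotations[idx], translations[idx]
    # Rigid transform about each anchor: R_i (x - v_i) + v_i + t_i -> (M, k, 3).
    moved = np.einsum("mkij,mkj->mki", R, x[:, None, :] - v) + v + t
    # Weighted linear blend of the k candidate positions -> (M, 3).
    return np.einsum("mk,mki->mi", w, moved)
```

If every anchor carries the same rigid motion, the blend reproduces that motion exactly, which is a useful sanity check; non-rigid deformations arise when neighboring anchors carry different transforms.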

Paper Structure (13 sections, 10 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Problem definition. Given a NeRF of the original scene, and a single RGBD image of the transformed scene, we are interested in producing novel views and exporting a mesh of this transformed scene. Here we visualize the NeRF (top left) and a transformation of the scene (bottom left). We then show how the scene is re-rendered given a new camera pose in the transformed scene (top right) and its scene mesh (bottom right).
  • Figure 2: Overview of our method: we use two linked flows, $F^{A\rightarrow B}$ for transformed geometry reconstruction (bottom) and $F^{B\rightarrow A}$ for rendering the transformed scene (top).
  • Figure 3: Forward flow of our method in the 2D case. Green dots are the anchor points $v_i$; the purple $\times$ is a query point, connected to the transformations $\xi$ of its $K$ nearest anchor points ($K=3$ here). Blue dashed lines indicate the warp of the 2D space.
  • Figure 4: The transformed-space image (a) is first matched with the input NeRF scene via 2D dense matching between the transformed image and original images $I^A_1,...,I^A_N$ rendered from the NeRF (b). Pixel-space filtering (c) is then applied; only selected matches are shown (red and blue lines represent bad and good matches, respectively). A given pixel in $I^B$ can be matched to multiple views (see the small green, yellow, and red circles). Of the multiple matches, we keep the one with the largest continuous patch of matched pixels; e.g., the green circle has 2 matched neighbors in $I^A_1$ but 8 in $I^A_2$, so we keep the latter. The points are then unprojected into 3D (d), and we keep pairs that are physically close in the original space and behave similarly in the transformed space.
  • Figure 5: Qualitative results comparing our method to prior work. The left-most columns show the original scene and the transformed view. The other columns show different renderings of the transformed scene: ground truth in blue, DreamGaussian [tang2023dreamgaussian] in green, SINE [bao2023sine] in yellow, and our method in red (lexicographic order within each $2 \times 2$ block).
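The two filtering stages in Figure 4 can be sketched as follows. This is a hedged illustration, not the paper's code. The function names `select_views` and `filter_3d` and the parameters `radius`, `tol`, and `min_support` are hypothetical. The "largest continuous patch" criterion is approximated by counting matched pixels in the 8-neighborhood, following the caption's 2-versus-8 example. The 3D stage reads "physically close in the original space while behaving similarly in the transformed space" as local distance preservation, which is one plausible interpretation.

```python
import numpy as np


def select_views(match_masks):
    """Pixel-space filter: per pixel of I^B, pick the view with the largest matched patch.

    match_masks: (V, H, W) bool; entry [j, y, x] is True if pixel (y, x) of I^B
                 has a dense match in rendered view I^A_j.
    Returns an (H, W) int array holding the chosen view index, or -1 if unmatched.
    """
    _, H, W = match_masks.shape
    padded = np.pad(match_masks.astype(int), ((0, 0), (1, 1), (1, 1)))
    # Count matched pixels in the 8-neighborhood (a proxy for patch size).
    counts = sum(
        padded[:, 1 + dy : 1 + dy + H, 1 + dx : 1 + dx + W]
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    counts = np.where(match_masks, counts, -1)  # views without a match never win
    best = counts.argmax(axis=0)
    return np.where(match_masks.any(axis=0), best, -1)


def filter_3d(pa, pb, radius=0.1, tol=0.02, min_support=0.5):
    """3D filter: keep correspondences whose local distances are preserved.

    pa, pb: (N, 3) unprojected matched points in scene A and scene B.
    A correspondence survives if, among its neighbors within `radius` in A,
    at least a `min_support` fraction keep their distance to it in B within `tol`.
    """
    da = np.linalg.norm(pa[:, None, :] - pa[None, :, :], axis=-1)
    db = np.linalg.norm(pb[:, None, :] - pb[None, :, :], axis=-1)
    near = (da < radius) & ~np.eye(len(pa), dtype=bool)
    consistent = near & (np.abs(da - db) < tol)
    n_near = near.sum(axis=1)
    support = consistent.sum(axis=1) / np.maximum(n_near, 1)
    # Isolated points cannot be verified, so they are dropped as well.
    return (n_near > 0) & (support >= min_support)
```

The distance-preservation test uses all pairs, so it scales quadratically with the number of matches; a spatial index (e.g. a k-d tree) would be the natural replacement for large match sets.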