Template-free Articulated Neural Point Clouds for Reposable View Synthesis

Lukas Uzolas, Elmar Eisemann, Petr Kellnhofer

TL;DR

This work tackles reposing of dynamic NeRFs without object-specific templates by introducing a forward-warping, point-based representation driven by a learned Linear Blend Skinning (LBS) skeleton. A two-stage pipeline first pre-trains a backbone NeRF to obtain a canonical feature point cloud, then jointly learns skinning weights, skeletal joints, a pose regressor, and a point decoder to forward-warp the canonical points to the observed poses. The method delivers state-of-the-art novel-view synthesis with substantially shorter training times and enables reposing and pose editing across diverse articulated objects, including humans, without manual templates. Results are demonstrated on multiple datasets (Blender, Robots, ZJU-MoCap) and through extensive ablations, and a skeleton-simplification option makes the learned skeleton easier to use in practical animation applications.

Abstract

Dynamic Neural Radiance Fields (NeRFs) achieve remarkable visual quality when synthesizing novel views of time-evolving 3D scenes. However, the common reliance on backward deformation fields makes reanimation of the captured object poses challenging. Moreover, state-of-the-art dynamic models are often limited by low visual fidelity, long reconstruction times, or specificity to narrow application domains. In this paper, we present a novel method utilizing a point-based representation and Linear Blend Skinning (LBS) to jointly learn a Dynamic NeRF and an associated skeletal model from even sparse multi-view video. Our forward-warping approach achieves state-of-the-art visual fidelity when synthesizing novel views and poses while significantly reducing the necessary learning time compared to existing work. We demonstrate the versatility of our representation on a variety of articulated objects from common datasets and obtain reposable 3D reconstructions without the need for object-specific skeletal templates. Code will be made available at https://github.com/lukasuz/Articulated-Point-NeRF.

Paper Structure

This paper contains 40 sections, 9 equations, 18 figures, and 9 tables.

Figures (18)

  • Figure 1: Overview of our method: a) First, we pre-train a NeRF backbone to initialize a feature point cloud $P^c$ for a selected canonical timestamp and to extract an initial skeleton. b) During the main training stage, $P^c$ is forward-warped using LBS consisting of learned time-invariant skinning weights $\hat{w}_i$ and time-dependent pose transformations from an MLP regressor $\Phi_r$. The image is obtained by integration and decoding of features aggregated from points along each camera ray. In summary, we fine-tune the initial neural point features $\mathbf{f}_i$, skinning weights $\hat{w}_i$, joints $J$, and the density and color regressors $\Phi_d$ and $\Phi_c$ of the backbone. We train the pose regressor $\Phi_r$ and feature point decoder $\Phi_p$ from scratch. At test time, we modify the pose transformations to obtain novel poses.
  • Figure 2: Examples of poses difficult for backward warping. (a) Ill-defined projection from an observation to the canonical space. Both ambiguous solutions (green and magenta) correctly pass through the semantically corresponding surface points B and C. (b) Projection ambiguity for points of contact between two surfaces. Note that in contrast, it is trivial to forward warp the object points from a well-chosen canonical space to any observation space.
  • Figure 3: Qualitative comparison displaying two held-out view/frame combinations of scenes from the Robots dataset rendered by WIM (Noguchi et al., 2022) and by our method after 2 and 10 hours of training, together with the PSNR scores.
  • Figure 4: Quality of unseen view synthesis during training with 95% confidence intervals on the Robots dataset (Noguchi et al., 2022). The initial plateau of WIM matches the 10k warm-up steps used by its authors before training with all data. Our onset time corresponds to the 70 minutes required for pre-training of the backbone. Training of our method was terminated after $2.5$ hours.
  • Figure 5: Effect of the backbone initialization pre-training steps on the final result of our method when trained on the Jumping Jacks scene from the Blender dataset. Our final choice of 20k iterations corresponds to approximately 70 minutes of real time.
  • ...and 13 more figures
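
The forward warping described in the Figure 1 caption can be illustrated with a minimal Linear Blend Skinning sketch. This is not the authors' implementation: the function names (`rodrigues`, `bone_transform`, `lbs_forward_warp`), the axis-angle pose parameterization, and the treatment of each bone as an independent rotation about its joint (no kinematic chain) are all assumptions made for illustration. In the paper, the per-bone transformations come from the pose regressor $\Phi_r$ and the weights $\hat{w}_i$ are learned; here both are set by hand.

```python
import numpy as np

def rodrigues(axis_angle):
    """Convert an axis-angle vector of shape (3,) to a 3x3 rotation matrix."""
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-8:
        return np.eye(3)
    k = axis_angle / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def bone_transform(rotation, joint):
    """Build a 4x4 rigid transform that rotates about the joint position.

    Assumption: bones are independent (no parent-child chain).
    """
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = joint - rotation @ joint
    return T

def lbs_forward_warp(points, weights, transforms):
    """Warp canonical points into an observed pose.

    points:     (N, 3) canonical point positions
    weights:    (N, B) skinning weights, each row summing to 1
    transforms: (B, 4, 4) per-bone rigid transforms
    """
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    # Blend the bone transforms per point, then apply the blended matrix.
    blended = np.einsum("nb,bij->nij", weights, transforms)
    return np.einsum("nij,nj->ni", blended, homo)[:, :3]

# Hand-made example: bone 0 stays fixed, bone 1 rotates 90 degrees
# about the z-axis around a joint located at (1, 0, 0).
canonical = np.array([[0.0, 0.0, 0.0],
                      [1.5, 0.0, 0.0],
                      [2.0, 0.0, 0.0]])
weights = np.array([[1.0, 0.0],
                    [0.5, 0.5],
                    [0.0, 1.0]])
joint = np.array([1.0, 0.0, 0.0])
transforms = np.stack([
    np.eye(4),
    bone_transform(rodrigues(np.array([0.0, 0.0, np.pi / 2])), joint),
])
warped = lbs_forward_warp(canonical, weights, transforms)
print(np.round(warped, 3))
```

Because each canonical point is mapped directly to the observed pose, editing the entries of `transforms` at test time yields a novel pose without inverting any deformation, which is the property the Figure 2 caption contrasts with backward warping.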