DeVRF: Fast Deformable Voxel Radiance Fields for Dynamic Scenes

Jia-Wei Liu, Yan-Pei Cao, Weijia Mao, Wenqiao Zhang, David Junhao Zhang, Jussi Keppo, Ying Shan, Xiaohu Qie, Mike Zheng Shou

TL;DR

DeVRF tackles slow dynamic NeRF training by introducing a deformable voxel radiance field that separately models a 3D canonical space and a 4D deformation field. A static→dynamic learning paradigm leverages static multi-view priors to bootstrap learning from few-view dynamic sequences, aided by coarse-to-fine optimization and regularization techniques. The approach trains up to 100x faster on a single GPU, using dynamic sequences from only four cameras, while delivering on-par or better novel-view fidelity across synthetic and real scenes. Limitations include a large model size and a canonical space that is not optimized jointly with the deformation field during the dynamic stage, which suggests joint canonical and deformation-field optimization as future work.

Abstract

Modeling dynamic scenes is important for many applications such as virtual reality and telepresence. Despite achieving unprecedented fidelity for novel view synthesis in dynamic scenes, existing methods based on Neural Radiance Fields (NeRF) suffer from slow convergence (i.e., model training time measured in days). In this paper, we present DeVRF, a novel representation to accelerate learning dynamic radiance fields. The core of DeVRF is to model both the 3D canonical space and the 4D deformation field of a dynamic, non-rigid scene with explicit and discrete voxel-based representations. However, it is quite challenging to train such a representation, whose large number of model parameters often leads to overfitting. To overcome this challenge, we devise a novel static-to-dynamic learning paradigm together with a new data capture setup that is convenient to deploy in practice. This paradigm unlocks efficient learning of deformable radiance fields by using the 3D volumetric canonical space learnt from multi-view static images to ease the learning of the 4D voxel deformation field from only few-view dynamic sequences. To further improve the efficiency of DeVRF and the quality of its synthesized novel views, we conduct thorough explorations and identify a set of strategies. We evaluate DeVRF on both synthetic and real-world dynamic scenes with different types of deformation. Experiments demonstrate that DeVRF achieves a speedup of two orders of magnitude (100x faster) with on-par high-fidelity results compared to the previous state-of-the-art approaches. The code and dataset will be released at https://github.com/showlab/DeVRF.
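
To make the parameter-count concern concrete, the following is a rough back-of-the-envelope sketch, not the paper's accounting. It assumes a dense canonical grid that stores one density and three color values per voxel, and a deformation field that stores one 3D displacement per voxel per frame. The function name, the grid resolution, and the frame count are hypothetical values chosen only for illustration.

```python
def voxel_param_counts(resolution, num_frames, color_channels=3):
    """Count parameters of dense voxel grids (illustrative assumption, not DeVRF's exact layout)."""
    voxels = resolution ** 3
    # Canonical space: one density value plus color channels per voxel.
    canonical = voxels * (1 + color_channels)
    # Deformation field: one 3D displacement per voxel for every time step.
    deformation = num_frames * voxels * 3
    return canonical, deformation


# Hypothetical setting: a 160^3 grid and 50 time steps.
canonical, deformation = voxel_param_counts(resolution=160, num_frames=50)
ratio = deformation / canonical
```

Under these assumptions the deformation field has tens of times more parameters than the canonical grid, which is consistent with the abstract's point that a few dynamic views alone would constrain it poorly.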

Paper Structure (15 sections, 8 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: The 3D canonical space (a) and the 4D deformation field (b) of DeVRF for neural modeling of a non-rigid scene (c). (d): The comparison between DeVRF and SOTA approaches.
  • Figure 2: Overview of our method. In the first stage, DeVRF learns a 3D volumetric canonical prior (b) from multi-view static images (a). In the second stage, a 4D deformation field (d) is jointly optimized from few-view dynamic sequences (c) and the 3D canonical prior (b). For ray points sampled from a deformed frame, their deformation to canonical space can be efficiently queried from the 4D backward deformation field (d). The scene properties (i.e., density, color) of these deformed points can then be obtained through linear interpolation in the 3D volumetric canonical space, and novel views (f) can be synthesized by volume rendering (e) using these deformed sample points.
  • Figure 3: Qualitative comparisons of baselines and DeVRF on synthetic and real-world scenes.
  • Figure 4: Ablation evaluation on the number of dynamic training views: (a) PSNR, (b) LPIPS.
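
The rendering path described in the Figure 2 caption (query the backward deformation, interpolate the canonical grid, then volume-render) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `trilinear` and `render_ray` are hypothetical, the deformation field is represented as one dense grid of 3D offsets for a single time step, points are expressed directly in voxel coordinates, and color is stored as raw RGB without view dependence.

```python
import numpy as np


def trilinear(grid, pts):
    """Trilinearly interpolate grid (X, Y, Z, C) at voxel coordinates pts (N, 3)."""
    res = np.array(grid.shape[:3])
    pts = np.clip(pts, 0.0, res - 1.0)
    i0 = np.minimum(np.floor(pts).astype(int), res - 2)  # lower corner index
    f = pts - i0                                          # fractional offsets in [0, 1]
    out = np.zeros((pts.shape[0], grid.shape[3]))
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                wx = f[:, 0] if dx else 1.0 - f[:, 0]
                wy = f[:, 1] if dy else 1.0 - f[:, 1]
                wz = f[:, 2] if dz else 1.0 - f[:, 2]
                corner = grid[i0[:, 0] + dx, i0[:, 1] + dy, i0[:, 2] + dz]
                out += (wx * wy * wz)[:, None] * corner
    return out


def render_ray(origin, direction, t_vals, density_grid, color_grid, deform_grid):
    """Render one ray of a deformed frame through a canonical voxel grid."""
    # Sample points along the ray in the deformed (observed) frame.
    pts = origin + t_vals[:, None] * direction
    # Backward deformation: per-point offset that maps into canonical space.
    canon = pts + trilinear(deform_grid, pts)
    # Scene properties are read from the canonical grids.
    sigma = trilinear(density_grid, canon)[:, 0]
    rgb = trilinear(color_grid, canon)
    # Standard volume-rendering quadrature along the ray.
    delta = np.append(np.diff(t_vals), t_vals[-1] - t_vals[-2])
    alpha = 1.0 - np.exp(-sigma * delta)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0), weights


# Toy scene: uniform density and color, zero deformation (all values hypothetical).
density_grid = np.full((8, 8, 8, 1), 2.0)
color_grid = np.full((8, 8, 8, 3), 0.5)
deform_grid = np.zeros((8, 8, 8, 3))
t_vals = np.linspace(0.0, 6.0, 13)
color, weights = render_ray(np.array([1.0, 3.5, 3.5]), np.array([1.0, 0.0, 0.0]),
                            t_vals, density_grid, color_grid, deform_grid)
```

In this sketch only `deform_grid` would change between time steps, while the density and color grids stay fixed, which mirrors the separation of canonical space and deformation field that the caption describes.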