V4D: Voxel for 4D Novel View Synthesis

Wanshui Gan, Hongbin Xu, Yi Huang, Shifeng Chen, Naoto Yokoya

TL;DR

This work introduces V4D, a voxel-based 4D neural radiance field for dynamic scenes that addresses the limited capacity and efficiency of MLP-based methods. It models density and texture on separate voxel grids conditioned on time, uses a conditional positional encoding (CPE) to recover high-frequency details, and adds a plug-in LUT-based pixel refinement guided by a pseudo-surface depth. Results on synthetic and real dynamic datasets show state-of-the-art or competitive performance at lower computational cost, and ablations confirm the utility of total variation (TV) regularization, CPE, and LUT refinement. The refinement module is plug-and-play; the paper also discusses memory considerations and suggests tensor-factorized representations as future work for scaling to larger scenes.
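To make the "pseudo-surface depth" mentioned above more concrete, the following is a minimal sketch of standard volume-rendering compositing along one ray, where the expected termination depth is accumulated alongside the color. Whether V4D derives its pseudo-surface exactly this way is an assumption; the function name `render_ray` and its arguments are hypothetical and chosen only for illustration.

```python
import math

def render_ray(sigmas, colors, ts):
    """Composite samples along a single ray (illustrative sketch only).

    sigmas: density at each sample
    colors: (r, g, b) at each sample
    ts:     increasing distances of the samples along the ray
    Returns (rgb, expected_depth, accumulated_opacity).
    """
    transmittance = 1.0
    rgb = [0.0, 0.0, 0.0]
    depth = 0.0
    opacity = 0.0
    for i, (sigma, color, t) in enumerate(zip(sigmas, colors, ts)):
        # Spacing to the next sample; the last sample reuses the previous spacing.
        if i + 1 < len(ts):
            delta = ts[i + 1] - t
        else:
            delta = ts[i] - ts[i - 1] if len(ts) > 1 else 1.0
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        rgb = [acc + weight * ch for acc, ch in zip(rgb, color)]
        depth += weight * t       # weighted depth: a proxy for the surface position
        opacity += weight
        transmittance *= 1.0 - alpha
    return rgb, depth, opacity
```

With a single nearly opaque sample on the ray, the expected depth collapses to that sample's distance, which is why it can act as a surface-like guidance signal for a later 2D refinement step.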

Abstract

Neural radiance fields have made a remarkable breakthrough in novel view synthesis for static 3D scenes. However, in the 4D setting (e.g., dynamic scenes), the performance of existing methods is still limited by the capacity of the neural network, typically a multilayer perceptron (MLP). In this paper, we use 3D voxels to model the 4D neural radiance field, V4D for short, where the 3D voxels take two forms. The first models the 3D space regularly and then uses the sampled local 3D feature together with the time index to model the density field and the texture field with a tiny MLP. The second takes the form of look-up tables (LUTs) for pixel-level refinement, where the pseudo-surface produced by volume rendering serves as guidance to learn a 2D pixel-level refinement mapping. The proposed LUT-based refinement module achieves a performance gain at little computational cost and can serve as a plug-and-play module for novel view synthesis. Moreover, we propose a more effective conditional positional encoding for 4D data that achieves a performance gain with negligible computational burden. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance at a low computational cost.
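The first voxel format described in the abstract (sample a local feature from a regular 3D grid, attach the time index, decode with a tiny MLP) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the grid is a plain nested list, the MLP has a single hidden layer, only the density branch is shown (a texture branch would follow the same pattern), and the conditional positional encoding is omitted. All names (`trilinear`, `tiny_mlp`, `query_density`) are hypothetical.

```python
import math

def trilinear(grid, p):
    """Interpolate a feature vector from grid[i][j][k] at point p.

    p is given in voxel-index coordinates and is clamped to the grid bounds.
    Each grid axis must have at least 2 entries.
    """
    dims = (len(grid), len(grid[0]), len(grid[0][0]))
    base, frac = [], []
    for value, n in zip(p, dims):
        v = min(max(value, 0.0), n - 1.0)
        b = min(int(v), n - 2)          # lower corner index along this axis
        base.append(b)
        frac.append(v - b)              # fractional offset in [0, 1]
    out = [0.0] * len(grid[0][0][0])
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                w = 1.0
                for d, f in zip((di, dj, dk), frac):
                    w *= f if d else 1.0 - f
                corner = grid[base[0] + di][base[1] + dj][base[2] + dk]
                out = [o + w * c for o, c in zip(out, corner)]
    return out

def tiny_mlp(x, w1, b1, w2, b2):
    """One hidden ReLU layer followed by a linear output layer."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]

def query_density(grid, params, p, t):
    """Sample the local voxel feature, append time t, decode a density."""
    feature = trilinear(grid, p)
    raw = tiny_mlp(feature + [t], *params)[0]
    # Numerically stable softplus keeps the density non-negative.
    return max(raw, 0.0) + math.log1p(math.exp(-abs(raw)))
```

The design point is that most of the capacity sits in the grid, so the MLP that decodes the time-conditioned feature can stay very small.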

Paper Structure

This paper contains 14 sections, 4 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overview of the voxel for 4D novel view synthesis (V4D). Given a single-view video clip, we present a voxel-based method for 4D novel view synthesis. The conditional positional encoding (CPE) and the look-up tables refinement module (LUT refine) are introduced in detail in Section \ref{v4d}.
  • Figure 2: LUTs refinement module. Given the coarse RGB value as input, we learn $M=5$ basic LUTs to model a 2D pixel-level refinement mapping with guidance from the pseudo-surface. We apply the refinement recurrently for $Z=3$ iterations, which gives the best result. The detailed introduction is in \ref{lut}.
  • Figure 3: The variant architectures of V4D used in the ablation study. For SV, we unify the density volume and the texture volume, with volume size $160\times160\times160\times24$. SF is an NSVF-like structure \cite{liu2020neural}, but not in the sparse voxel format. For a fair comparison, we keep the same implementation settings (e.g., the width and depth of the MLPs) apart from the architectural differences illustrated above.
  • Figure 4: Visual comparisons on the dynamic dataset. Please zoom in for better observation.
  • Figure 5: Visual comparisons between our method and TiNeuVox \cite{TiNeuVox} on the real dynamic scenes of \cite{park2021hypernerf}. Please zoom in for better observation.
  • ...and 2 more figures
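
The LUT refinement described in the Figure 2 caption (several basic LUTs combined into one mapping and applied recurrently to the coarse RGB value) can be illustrated with the sketch below. It makes several assumptions that the caption does not confirm: each basic LUT is a small 3D RGB lattice, the basic LUTs are blended by a weighted sum, the blended table is read with trilinear interpolation, and the blend weights are simply passed in rather than predicted from the pseudo-surface guidance. All function names are hypothetical.

```python
def identity_lut(n):
    """n x n x n lattice that maps every RGB lattice point to itself."""
    step = 1.0 / (n - 1)
    return [[[[r * step, g * step, b * step] for b in range(n)]
             for g in range(n)] for r in range(n)]

def blend_luts(luts, weights):
    """Weighted sum of basic LUTs that share the same lattice size."""
    n = len(luts[0])
    return [[[[sum(w * lut[r][g][b][c] for lut, w in zip(luts, weights))
               for c in range(3)]
              for b in range(n)] for g in range(n)] for r in range(n)]

def apply_lut(lut, rgb):
    """Read the LUT at an RGB value in [0, 1]^3 with trilinear interpolation."""
    n = len(lut)
    coords = [min(max(v, 0.0), 1.0) * (n - 1) for v in rgb]
    base = [min(int(c), n - 2) for c in coords]
    frac = [c - b for c, b in zip(coords, base)]
    out = [0.0, 0.0, 0.0]
    for dr in (0, 1):
        for dg in (0, 1):
            for db in (0, 1):
                w = ((frac[0] if dr else 1.0 - frac[0])
                     * (frac[1] if dg else 1.0 - frac[1])
                     * (frac[2] if db else 1.0 - frac[2]))
                entry = lut[base[0] + dr][base[1] + dg][base[2] + db]
                out = [o + w * e for o, e in zip(out, entry)]
    return out

def refine_pixel(coarse_rgb, luts, weights, iterations=3):
    """Blend the basic LUTs once, then apply the result `iterations` times."""
    lut = blend_luts(luts, weights)
    rgb = list(coarse_rgb)
    for _ in range(iterations):
        rgb = apply_lut(lut, rgb)
    return rgb
```

For example, calling `refine_pixel(rgb, [identity_lut(4)] * 5, [0.2] * 5)` with five identity tables leaves the pixel unchanged, which is a convenient sanity check before plugging in learned tables. Because the refinement is a per-pixel table read, its cost is small compared with sampling the volume along each ray.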