DiVa-360: The Dynamic Visual Dataset for Immersive Neural Fields

Cheng-You Lu, Peisen Zhou, Angela Xing, Chandradeep Pokhariya, Arnab Dey, Ishaan Shah, Rugved Mavidipalli, Dylan Hu, Andrew Comport, Kefan Chen, Srinath Sridhar

TL;DR

We benchmark state-of-the-art dynamic neural field methods on DiVa-360 and provide insights into existing methods and the future challenges of long-duration neural field capture.

Abstract

Advances in neural fields are enabling high-fidelity capture of the shape and appearance of dynamic 3D scenes. However, their capabilities lag behind those offered by conventional representations such as 2D videos because of algorithmic challenges and the lack of large-scale multi-view real-world datasets. We address the dataset limitation with DiVa-360, a real-world 360$^\circ$ dynamic visual dataset that contains synchronized high-resolution and long-duration multi-view video sequences of table-scale scenes captured using a customized low-cost system with 53 cameras. It contains 21 object-centric sequences categorized by different motion types, 25 intricate hand-object interaction sequences, and 8 long-duration sequences for a total of 17.4 M image frames. In addition, we provide foreground-background segmentation masks, synchronized audio, and text descriptions. We benchmark state-of-the-art dynamic neural field methods on DiVa-360 and provide insights into existing methods and the future challenges of long-duration neural field capture.

Paper Structure (18 sections, 29 figures, 13 tables)

Figures (29)

  • Figure 1: DiVa-360 is a real-world 360$^\circ$ multi-view visual dataset of dynamic tabletop scenes captured using a customized low-cost capture system consisting of 53 cameras. DiVa-360 contains 21 diverse moving object sequences, 25 hand-object interaction sequences, and 8 long-duration sequences (2-3 mins). DiVa-360 provides (1) 360$^\circ$ coverage of dynamic scenes, (2) foreground-background segmentation masks, and (3) diverse table-scale scenes with intricate motions. DiVa-360 aims to facilitate research in dynamic long-duration neural fields.
  • Figure 2: DiVa-360 provides multi-view dynamic sequences for dynamic neural field methods. The dataset contains a variety of object and motion types. Here, we showcase reconstruction results across time steps from PF I-NGP [mueller2022instant], MixVoxels [wang2023mixed], and K-Planes [kplanes_2023] trained on our dataset. Surprisingly, the rendering results of PF I-NGP, a method that does not directly utilize temporal information from adjacent frames, are better than those of MixVoxels and K-Planes. MixVoxels struggles with complex motion data, such as hands, while K-Planes suffers from floaters in the background. We show more visualization results in Section 2 of the supplementary material.
  • Figure 3: (a) BRICS is a refrigerator-sized aluminum frame, mounted on wheels for mobility, that supports a 1 m$^3$ capture volume. Each side wall of the capture volume is divided into a 3$\times$3 grid, with each grid square containing sensors, LEDs, single-board computers (SBCs), and light diffusers. (b) Two walls of the capture volume act as doors for easy access. (c) We can acquire 360$^\circ$ RGB views of dynamic objects and intricate hand-object interactions in this capture volume (6 views shown).
  • Figure 4: DiVa-360 covers diverse object and hand-object interaction data. Our object sequences represent a variety of motion types, while our hand-object interaction data contain intricate and realistic motions.
  • Figure 5: Rendering quality for different numbers of chunks on object and interaction data. Each circle marker indicates the storage size of the models in GB. MixVoxels performs better with less temporal information, while K-Planes performs better with more.
  • ...and 24 more figures