ImViD: Immersive Volumetric Videos for Enhanced VR Engagement

Zhengxian Yang, Shi Pan, Shengqi Wang, Haoxiang Wang, Li Lin, Guanjun Li, Zhengqi Wen, Borong Lin, Jianhua Tao, Tao Yu

TL;DR

ImViD addresses the lack of immersive volumetric video datasets that provide full 360° foreground/background capture, high-resolution multi-view video with synchronized audio, and a large interaction space for 6-DoF VR. The authors introduce a moving 46-camera rig that captures 5K, 60 FPS sequences across indoor and outdoor scenes, and propose a baseline pipeline for reconstructing coupled light and sound fields. They benchmark existing dynamic light-field methods (e.g., 4DGS, STG/STG++) and present a training-free sound-field reconstruction approach, validating the dataset through quantitative metrics, qualitative analysis, and a user study. The work enables realistic multimodal VR experiences and offers a resource for advancing immersive volumetric video research.

Abstract

User engagement is greatly enhanced by fully immersive multi-modal experiences that combine visual and auditory stimuli. Consequently, the next frontier in VR/AR technologies lies in immersive volumetric videos with complete scene capture, a large 6-DoF interaction space, multi-modal feedback, and high-resolution, high-frame-rate content. To stimulate the reconstruction of immersive volumetric videos, we introduce ImViD, a multi-view, multi-modal dataset featuring complete space-oriented data capture and various indoor/outdoor scenarios. Our capture rig supports multi-view video-audio capture while on the move, a capability absent in existing datasets, significantly enhancing the completeness, flexibility, and efficiency of data capture. The captured multi-view videos (with synchronized audio) are in 5K resolution at 60 FPS, last from 1 to 5 minutes, and include rich foreground-background elements and complex dynamics. We benchmark existing methods on our dataset and establish a base pipeline for constructing immersive volumetric videos from multi-view audiovisual inputs for 6-DoF multi-modal immersive VR experiences. The benchmark, together with the reconstruction and interaction results, demonstrates the effectiveness of our dataset and baseline method, which we believe will stimulate future research on immersive volumetric video production.

Paper Structure

This paper contains 38 sections, 11 equations, 14 figures, 8 tables.

Figures (14)

  • Figure 1: We introduce ImViD, a dataset for immersive volumetric videos. ImViD records dynamic scenes using a multi-view audio-video capture rig moving in a space-oriented manner, which provides a new benchmark for volumetric video reconstruction and its application.
  • Figure 2: The pipeline for realizing multimodal 6-DoF immersive VR experiences. (a) A carefully designed rig simultaneously captures multi-view video and audio. (b1) shows our reconstruction of the dynamic light field based on STG [li2024spacetime], while (b2) shows the construction of the sound field. By incorporating affine color transformation and t-dimensional density control, we achieve better results than the original algorithm in long-term dynamic scenes. Ultimately, we achieve a 6-DoF immersive experience in both light and sound fields, and also benchmark recent representative methods such as 4DGS [wu20244d] and 4Drotor [duan20244d] to demonstrate the effectiveness of both our dataset and baseline method.
  • Figure 3: Our rig supports two capture strategies for high-resolution, high-frame-rate, 360° dynamic data acquisition.
  • Figure 4: Calculation method for spatiotemporal capture density. 1) The capture strategy of a handheld monocular camera. 2) A fixed camera array. 3) Our rig, which covers a volume of $0.6 \times \pi \times 0.5^2\ \text{m}^3$ over 5 seconds.
  • Figure 5: Comparison of the rendering results of four baselines on Scene 1 Opera, Scene 2 Laboratory, and Scene 6 Puppy.
  • ...and 9 more figures
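
As a quick check of the volume quoted in the Figure 4 caption, the arithmetic can be worked out as below. This assumes the expression describes a cylinder of height 0.6 m and radius 0.5 m, which is an interpretation of the caption and not something it states; the symbols $h$, $r$, and $V$ are introduced here only for illustration.

$$
V = h \cdot \pi r^2 = 0.6 \times \pi \times 0.5^2\ \text{m}^3 = 0.15\,\pi\ \text{m}^3 \approx 0.471\ \text{m}^3
$$

Under that reading, the rig sweeps roughly 0.47 m³ during the 5-second window mentioned in the caption.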