Ego-1K -- A Large-Scale Multiview Video Dataset for Egocentric Vision

Jae Yong Lee, Daniel Scharstein, Akash Bapat, Hao Hu, Andrew Fu, Haoru Zhao, Paul Sammut, Xiang Li, Stephen Jeapes, Anik Gupta, Lior David, Saketh Madhuvarasu, Jay Girish Joshi, Jason Wither

Abstract

We present Ego-1K, a large-scale collection of time-synchronized egocentric multiview videos designed to advance neural 3D video synthesis and dynamic scene understanding. The dataset contains nearly 1,000 short egocentric videos captured with a custom rig in which 12 synchronized cameras surround a 4-camera VR headset worn by the user. Scene content focuses on hand motions and hand-object interactions in a variety of settings. We describe the rig design, data processing, and calibration. The dataset enables new ways to benchmark egocentric scene reconstruction methods, an important research area as smart glasses with multiple cameras become ubiquitous. Our experiments show that the dataset poses unique challenges for existing 3D and 4D novel view synthesis methods, because close dynamic objects and rig egomotion cause large disparities and image motion. Our dataset supports future research in this challenging domain. It is available at https://huggingface.co/datasets/facebook/ego-1k.

Paper Structure

This paper contains 8 sections, 1 equation, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: Left: photo and rendering of our multi-camera rig integrating 12 global-shutter RGB fisheye cameras and a Quest 3 headset with 4 forward-facing cameras. All cameras are synchronized, enabling the capture of dynamic egocentric multiview videos at 60 Hz. Middle: a sample frame from a dynamic scene, captured by the 12 rig cameras; each horizontal pair is stereo-rectified. Right: overlays of the 4 corner views visualizing the disparity range; the average of all 12 views is shown in the center.
  • Figure 2: Sample frames from our dataset, illustrating the range of settings and hand motions.
  • Figure 3: Top: a sample frame from a recording in our research dataset, consisting of 6 rectified stereo pairs. Bottom: the same frame from the raw VRS with 12+4 fisheye camera streams.
  • Figure 4: Rig stereo pairs and target pair. The arrows show the 6 rectified stereo pairs; note that the baselines of most pairs are significantly larger than the human interocular distance (which roughly matches the baseline of the headset cameras and of pair 9-10). The target pair (3-4) is shown in green. We warp the disparity maps of the other 5 pairs to the target pair and evaluate their consistency.
  • Figure 5: Stereo-guided 3DGS. We use stereo to compute surfaces, and sample surface points to initialize 3D Gaussians. Then, we fine-tune the 3D Gaussians to minimize photometric loss.
  • ...and 2 more figures
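
The Figure 5 caption mentions sampling surface points from stereo to initialize 3D Gaussians. As a rough illustration of that idea only, the sketch below back-projects a rectified disparity map into 3D points using the standard pinhole stereo relation Z = f * B / d. It assumes an ideal rectified pinhole pair; the function name `disparity_to_points`, its parameters, and all numeric values are hypothetical and are not taken from the paper or the dataset, whose actual pipeline may differ.

```python
from typing import List, Optional, Tuple

Point3D = Tuple[float, float, float]


def disparity_to_points(
    disparity: List[List[Optional[float]]],
    focal_px: float,
    baseline_m: float,
    cx: float,
    cy: float,
    stride: int = 1,
) -> List[Point3D]:
    """Back-project a rectified disparity map to 3D points (left camera frame).

    Hypothetical helper for illustration: depth follows Z = f * B / d, and
    pixels with missing (None) or non-positive disparity are skipped.
    """
    points: List[Point3D] = []
    for v in range(0, len(disparity), stride):
        row = disparity[v]
        for u in range(0, len(row), stride):
            d = row[u]
            if d is None or d <= 0.0:
                continue  # no valid stereo match at this pixel
            z = focal_px * baseline_m / d      # depth in meters
            x = (u - cx) * z / focal_px        # lateral offset in meters
            y = (v - cy) * z / focal_px        # vertical offset in meters
            points.append((x, y, z))
    return points


# Toy 2x2 disparity map in pixels; every number here is illustrative only.
toy_disparity = [[50.0, None], [0.0, 100.0]]
seeds = disparity_to_points(
    toy_disparity, focal_px=500.0, baseline_m=0.1, cx=0.5, cy=0.5
)
```

In this toy case only two pixels carry valid disparity, so `seeds` holds two points that could serve as initial Gaussian centers. Increasing `stride` subsamples the map, which is one simple way to control how many seed points are produced.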