EPIC Fields: Marrying 3D Geometry and Video Understanding

Vadim Tschernezki, Ahmad Darkhalil, Zhifan Zhu, David Fouhey, Iro Laina, Diane Larlus, Dima Damen, Andrea Vedaldi

TL;DR

EPIC Fields extends EPIC-KITCHENS by adding per-frame 3D camera intrinsics and extrinsics to enable 3D grounding of egocentric actions. It introduces a frame-filtered SfM pipeline to reconstruct camera trajectories over long, dynamic sequences, achieving high reconstruction coverage (e.g., 96% across 671 videos and 19M frames in 45 kitchens) without additional hardware. The paper defines three benchmarks—Dynamic New-View Synthesis, Unsupervised Dynamic Object Segmentation, and Semi-Supervised Video Object Segmentation—evaluating NeRF-W, NeuralDiff, and T-NeRF+ alongside 2D baselines, and reveals clear gaps in handling dynamic content, while illustrating the benefits of 3D geometry for segmentation tasks. Together, EPIC Fields provides a public dataset and benchmarks that empower research at the intersection of 3D geometry and video understanding in realistic, long-form egocentric data.

Abstract

Neural rendering is fuelling a unification of learning, 3D geometry and video understanding that has been waiting for more than two decades. Progress, however, is still hampered by a lack of suitable datasets and benchmarks. To address this gap, we introduce EPIC Fields, an augmentation of EPIC-KITCHENS with 3D camera information. Like other datasets for neural rendering, EPIC Fields removes the complex and expensive step of reconstructing cameras using photogrammetry, and allows researchers to focus on modelling problems. We illustrate the challenges of photogrammetry in egocentric videos of dynamic actions and propose innovations to address them. Compared to other neural rendering datasets, EPIC Fields is better tailored to video understanding because it is paired with labelled action segments and the recent VISOR segment annotations. To further motivate the community, we also evaluate two benchmark tasks in neural rendering and segmenting dynamic objects, with strong baselines that showcase what is not possible today. We also highlight the advantage of geometry in semi-supervised video object segmentation on the VISOR annotations. EPIC Fields reconstructs 96% of videos in EPIC-KITCHENS, registering 19M frames in 99 hours recorded in 45 kitchens.

Paper Structure (48 sections, 15 figures, 6 tables)

Figures (15)

  • Figure 1: We propose EPIC Fields, which extends EPIC-KITCHENS with 3D information, including full frame-rate camera pose trajectories (top). These are obtained directly from dynamic sequences of object interactions (sampled frames) without additional modalities or pre-scans. We showcase EPIC Fields through several benchmarks (bottom) that use the fusion of geometric and semantic cues.
  • Figure 2: EPIC Fields unlocks applications that combine interactions with 3D information. We showcase examples of actions grounded in 3D (top row), and examples of integrating single-image 3D hands [rong2020frankmocap] into the kitchen reconstruction during interactions (bottom row).
  • Figure 3: 3D reconstructions with different sampling. We compare three scenes reconstructed using either uniform frame selection or our homography-based pipeline. Uniform sampling yields partial reconstructions with limited coverage. Our pipeline successfully registers more viewpoints, resulting in better coverage.
  • Figure 4: Definition of the three difficulty levels for the task of dynamic new-view synthesis. Validation and test frames are selected to meet three reconstruction difficulty levels. In-Action frames (Hard) happen during an action and are harder to reconstruct due to the dynamics. Out-of-Action (Medium) frames happen outside an action, but are far from a training frame. Out-of-Action (Easy) frames are near training frames. Frames in an orange bounding box are validation or test frames. Frames marked with a cross are discarded to create a larger time gap around each validation or test frame (medium and hard levels). All other frames can be used for training.
  • Figure 5: Dynamic new-view synthesis. We compare the outputs of the 3D methods NeRF-W [martinbrualla2020nerfw], T-NeRF+ [gao2022monocular], and NeuralDiff [tschernezki22neural] for novel viewpoints, across three different complexity levels. The predictions are more accurate when the motion is less difficult, as shown in the first and second rows. The task becomes more challenging for our hard samples.
  • ...and 10 more figures
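
The three difficulty levels described in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' selection code: the function names, the `widen_gap` set (which out-of-action frames should be treated as "far from a training frame"), and the `gap` size are all hypothetical choices made for illustration. The sketch only shows the rule stated in the caption: frames inside an action are hard, out-of-action frames with a widened gap are medium, the rest are easy, and neighbours of medium and hard frames are dropped from training.

```python
from typing import Dict, List, Set, Tuple

# Inclusive (start, end) frame indices of one labelled action segment.
Segment = Tuple[int, int]


def in_action(frame: int, segments: List[Segment]) -> bool:
    """Return True if the frame falls inside any action segment."""
    return any(start <= frame <= end for start, end in segments)


def build_split(
    num_frames: int,
    eval_frames: List[int],
    widen_gap: Set[int],      # hypothetical: out-of-action frames chosen as "medium"
    segments: List[Segment],
    gap: int,                 # hypothetical: neighbours discarded on each side
) -> Tuple[Dict[int, str], Set[int], List[int]]:
    """Assign a difficulty level to each val/test frame and derive the training set."""
    levels: Dict[int, str] = {}
    discarded: Set[int] = set()
    for f in eval_frames:
        if in_action(f, segments):
            levels[f] = "hard"      # In-Action
        elif f in widen_gap:
            levels[f] = "medium"    # Out-of-Action, far from training frames
        else:
            levels[f] = "easy"      # Out-of-Action, near training frames
        if levels[f] != "easy":
            # Discard neighbours to create a larger time gap around the frame.
            lo, hi = max(0, f - gap), min(num_frames - 1, f + gap)
            discarded.update(range(lo, hi + 1))
    discarded -= set(eval_frames)   # val/test frames are held out, not discarded
    train = [f for f in range(num_frames)
             if f not in discarded and f not in levels]
    return levels, discarded, train


levels, discarded, train = build_split(
    num_frames=20, eval_frames=[3, 10, 16], widen_gap={10},
    segments=[(14, 18)], gap=2,
)
```

In this toy run, frame 3 stays easy because its immediate neighbours remain in the training set, while frames 10 and 16 lose two neighbours on each side.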