HD-EPIC: A Highly-Detailed Egocentric Video Dataset

Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Parida, Kaiting Liu, Prajwal Gatti, Siddhant Bansal, Kevin Flanagan, Jacob Chalk, Zhifan Zhu, Rhodri Guerrier, Fahd Abdelazim, Bin Zhu, Davide Moltisanti, Michael Wray, Hazel Doughty, Dima Damen

TL;DR

HD-EPIC tackles the gap between lab-grade annotated datasets and real-world egocentric perception by delivering 41 hours of unscripted kitchen video with dense, 3D-grounded ground truth. The dataset combines recipe steps, ingredient nutrition, fine-grained actions, gaze, audio, and long-term object trajectories within Blender-based digital twins of 9 kitchens, supporting validation of video-only and video-language models. A 26K-question VQA benchmark spanning 7 annotation types reveals substantial gaps in current state-of-the-art models: Gemini Pro achieves 38.5%, while humans reach ~90%. HD-EPIC also provides action recognition, sound recognition, and long-term video-object segmentation (VOS) benchmarks, highlighting modality-specific challenges and offering a comprehensive validation resource for embodied AI in unconstrained environments.

Abstract

We present a validation dataset of newly-collected kitchen-based egocentric videos, manually annotated with highly detailed and interconnected ground-truth labels covering: recipe steps, fine-grained actions, ingredients with nutritional values, moving objects, and audio annotations. Importantly, all annotations are grounded in 3D through digital twinning of the scene, fixtures, object locations, and primed with gaze. Footage is collected from unscripted recordings in diverse home environments, making HD-EPIC the first dataset collected in-the-wild but with detailed annotations matching those in controlled lab environments. We show the potential of our highly-detailed annotations through a challenging VQA benchmark of 26K questions assessing the capability to recognise recipes, ingredients, nutrition, fine-grained actions, 3D perception, object motion, and gaze direction. The powerful long-context Gemini Pro only achieves 38.5% on this benchmark, showcasing its difficulty and highlighting shortcomings in current VLMs. We additionally assess action recognition, sound recognition, and long-term video-object segmentation on HD-EPIC. HD-EPIC is 41 hours of video in 9 kitchens with digital twins of 413 kitchen fixtures, capturing 69 recipes, 59K fine-grained actions, 51K audio events, 20K object movements and 37K object masks lifted to 3D. On average, we have 263 annotations per minute of our unscripted videos.

Paper Structure

This paper contains 41 sections, 1 equation, 25 figures, 9 tables.

Figures (25)

  • Figure 1: Annotation Highlights. We capture multi-day recordings of unscripted activities. Centre-Top: Recipes are recorded with steps and their preparation temporally annotated, along with ingredient addition. Ingredients are weighed and nutrition recorded. Centre-Middle: Dense fine-grained narrations detailing what, how, and why are parsed and clustered. Audio events are also annotated. Centre-Bottom: Object movements are temporally annotated with bounding boxes and hands and object masks. Right-Top: All annotations are temporally grounded in a 3D digital twin. We show trajectories of 3 (masked) objects: Sweet potato, Food processor and Spoon, highlighting relevant kitchen fixtures. Right-Bottom: Gaze captures when objects are primed (i.e. looked at) before being taken/placed.
  • Figure 2: Diversity in HD-EPIC, which is filmed over 3 days in-the-wild, resulting in many objects, activities and recipes.
  • Figure 3: Recipe modification in ingredients and steps.
  • Figure 4: For the 'Carbonara' recipe, we visualise the prep and step time segments for three consecutive steps (left), along with sample frames with corresponding action narrations (top). The interleaving of different preps/steps is evident in the annotations.
  • Figure 5: Frequency of verb clusters (top) and noun clusters (bottom) in narrated sentences by category, shown on a logarithmic scale.
  • ...and 20 more figures