Scaling Egocentric Vision: The EPIC-KITCHENS Dataset

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray

TL;DR

EPIC-KITCHENS addresses the lack of large-scale egocentric datasets by collecting 55 hours of video in the native kitchens of 32 participants across 4 cities, and by annotating actions from the participants' own narrations to capture true intention. The work introduces an annotation pipeline comprising action-segment annotations, active-object bounding boxes, verb and noun classes, and quality-assurance checks, resulting in 39.6K action segments and 454.3K bounding boxes. It defines three benchmarks (object detection, action recognition, and action anticipation) with seen and unseen kitchen test splits and public leaderboards to gauge generalization and progress. Baseline results using Faster R-CNN and TSN show how difficult the dataset is and point to its potential to advance fine-grained, multi-task egocentric video understanding and human-robot interaction in daily-living contexts.

Abstract

First-person vision is gaining interest as it offers a unique viewpoint on people's interaction with objects, their attention, and even intention. However, progress in this challenging domain has been relatively slow due to the lack of sufficiently large datasets. In this paper, we introduce EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. Our videos depict nonscripted daily activities: we simply asked each participant to start recording every time they entered their kitchen. Recording took place in 4 cities (in North America and Europe) by participants belonging to 10 different nationalities, resulting in highly diverse cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labeled for a total of 39.6K action segments and 454.3K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos (after recording), thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens. Dataset and Project page: http://epic-kitchens.github.io
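
To give a rough sense of scale, the headline figures in the abstract can be turned into a few derived ratios. The sketch below is only back-of-the-envelope arithmetic on the rounded numbers quoted above. The constant names and the `dataset_ratios` helper are illustrative assumptions, not part of any EPIC-KITCHENS tooling, and the ratios are averages that say nothing about how annotations are distributed across videos.

```python
# Rounded headline figures as quoted in the abstract.
HOURS = 55                 # total video duration
FRAMES = 11.5e6            # total number of frames
ACTION_SEGMENTS = 39.6e3   # annotated action segments
BOUNDING_BOXES = 454.3e3   # annotated object bounding boxes
PARTICIPANTS = 32          # number of participants / kitchens


def dataset_ratios():
    """Return simple average ratios derived from the rounded figures.

    Hypothetical helper for illustration only; the values are coarse
    averages, not statistics reported by the authors.
    """
    seconds = HOURS * 3600
    return {
        "avg_fps": FRAMES / seconds,
        "segments_per_hour": ACTION_SEGMENTS / HOURS,
        "boxes_per_segment": BOUNDING_BOXES / ACTION_SEGMENTS,
        "hours_per_participant": HOURS / PARTICIPANTS,
    }


if __name__ == "__main__":
    for name, value in dataset_ratios().items():
        print(f"{name}: {value:.2f}")
```

Under these rounded inputs the averages come out to roughly 58 frames per second, 720 action segments per hour of video, and about 11.5 bounding boxes per action segment.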

Paper Structure

This paper contains 14 sections, 2 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: From top: Frames from the 32 environments; Narrations by participants used to annotate action segments; Active object bounding box annotations
  • Figure 2: Head-mounted GoPro used in dataset recording
  • Figure 3: Instructions used to collect video narrations from our participants
  • Figure 4: Top (left to right): time of day of the recording, pie chart of high-level goals, histogram of sequence durations and dataset logo; Bottom: Wordles of narrations in native languages (English, Italian, Spanish, Greek and Chinese)
  • Figure 5: An example of annotated action segments for 2 consecutive actions
  • ...and 5 more figures