CaptainCook4D: A Dataset for Understanding Errors in Procedural Activities

Rohith Peddi, Shivvrat Arya, Bharath Challa, Likhitha Pallapothula, Akshay Vyas, Bhavya Gouripeddi, Jikai Wang, Qifan Zhang, Vasundhara Komaragiri, Eric Ragan, Nicholas Ruozzi, Yu Xiang, Vibhav Gogate

TL;DR

CaptainCook4D introduces a large egocentric 4D dataset of procedural activities recorded in real kitchens, covering both normal and erroneous executions, to advance procedural understanding. It provides task graphs, multimodal sensor data (GoPro and HoloLens2), and detailed annotations (coarse and fine-grained steps, plus error categories) across 384 recordings of 24 WikiHow recipes. The authors benchmark error recognition, multi-step localization, and procedure learning using supervised and self-supervised methods, demonstrating the value of longer temporal context and multimodal fusion while highlighting current gaps in zero-shot and robust error handling. The work offers a rich resource for developing and evaluating robust procedural reasoning systems with potential transfer to high-stakes domains like medicine and chemistry, and it lays groundwork for extending such datasets to broader domains and modalities.

Abstract

Following step-by-step procedures is an essential component of various activities carried out by individuals in their daily lives. These procedures serve as a guiding framework that helps to achieve goals efficiently, whether it is assembling furniture or preparing a recipe. However, the complexity and duration of procedural activities inherently increase the likelihood of making errors. Understanding such procedural activities from a sequence of frames is a challenging task that demands an accurate interpretation of visual information and the ability to reason about the structure of the activity. To this end, we collect a new egocentric 4D dataset, CaptainCook4D, comprising 384 recordings (94.5 hours) of people performing recipes in real kitchen environments. This dataset consists of two distinct types of activity: one in which participants adhere to the provided recipe instructions and another in which they deviate and induce errors. We provide 5.3K step annotations and 10K fine-grained action annotations and benchmark the dataset for the following tasks: supervised error recognition, multistep localization, and procedure learning.

Paper Structure (78 sections, 26 figures, 19 tables)

Figures (26)

  • Figure 1: Overview. Top: We constructed task graphs for the selected recipes. These graphs facilitated sampling topological orders (sequences of cooking steps) for participants to follow. While executing these steps, participants induced errors, both intentional and unintentional. Bottom Left: We present the sensors employed for data collection. Bottom Right: We describe the modalities of the data collected while the participant performs the recipe.
  • Figure 2: Snapshots of steps and recorded errors while preparing the recipe Cucumber Raita. Three of the four errors were intentional, but the participant missed the Peeling step unintentionally.
  • Figure 3: Error Categories. Left: We present a categorization of participant-induced errors derived from the annotated error descriptions of the recordings. Right: We display frames captured from various recordings, highlighting correct and erroneous executions. Bottom Right: We present statistics on the error categories in the dataset derived from the compiled annotations of all recordings.
  • Figure 4: Statistics. Top: We present video and step duration statistics on the left and right, respectively. Bottom: We present the total count and the durations of normal and error recordings for each recipe.
  • Figure 5: Supervised ER architectures of 3 baselines.
  • ...and 21 more figures
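
The Figure 1 caption mentions sampling topological orders from a recipe's task graph. As a rough illustration of that idea, the sketch below draws one valid step ordering from a small directed acyclic graph using a randomized variant of Kahn's algorithm. The toy graph, the step names, and the function name `sample_topological_order` are hypothetical and are not taken from the dataset or its code. The random choice among ready steps does not sample uniformly over all valid orders, and the authors' actual sampling procedure may differ.

```python
import random


def sample_topological_order(graph, rng):
    """Return one randomly chosen topological order of a DAG.

    `graph` maps each step to the list of steps that must come after it.
    """
    # Count the unmet prerequisites of every step.
    indegree = {step: 0 for step in graph}
    for successors in graph.values():
        for nxt in successors:
            indegree[nxt] = indegree.get(nxt, 0) + 1

    # Steps with no unmet prerequisites can be performed right away.
    ready = sorted(step for step, deg in indegree.items() if deg == 0)
    order = []
    while ready:
        # Pick any ready step at random; this is what varies the order.
        step = ready.pop(rng.randrange(len(ready)))
        order.append(step)
        for nxt in graph.get(step, []):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(indegree):
        raise ValueError("task graph contains a cycle")
    return order


# Hypothetical toy task graph, loosely modeled on a raita-style recipe.
task_graph = {
    "peel cucumber": ["grate cucumber"],
    "grate cucumber": ["mix"],
    "whisk yogurt": ["mix"],
    "mix": ["serve"],
    "serve": [],
}

order = sample_topological_order(task_graph, random.Random(0))
print(order)
```

Every order returned this way respects the precedence constraints, so "whisk yogurt" may appear before or after the cucumber steps, but "mix" always follows all three.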