IndEgo: A Dataset of Industrial Scenarios and Collaborative Work for Egocentric Assistants

Vivek Chavan, Yasmina Imgrund, Tung Dao, Sanwantri Bai, Bosong Wang, Ze Lu, Oliver Heimann, Jörg Krüger

TL;DR

IndEgo tackles the scarcity of industrial, collaborative, long-horizon egocentric datasets by introducing 3,460 egocentric videos (≈197 hours) and 1,092 exocentric videos (≈97 hours) with rich multimodal data (eye gaze, narration, audio, motion) and two-person collaboration. It provides detailed annotations, task graphs, and benchmarks for procedural/non-procedural task understanding, Mistake Detection (MD), and reasoning-based Question Answering (QA), revealing significant challenges for current multimodal models. The paper reports baseline results across MD, QA, and collaborative task understanding, showing the value of joint ego-exo views and modality-aware ablations. IndEgo’s release aims to spur research in instruction following, human-AI collaboration, and embodied AI for safe, productive industrial operations, with data and code openly accessible on Hugging Face and GitHub.

Abstract

We introduce IndEgo, a multimodal egocentric and exocentric dataset addressing common industrial tasks, including assembly/disassembly, logistics and organisation, inspection and repair, woodworking, and others. The dataset contains 3,460 egocentric recordings (approximately 197 hours), along with 1,092 exocentric recordings (approximately 97 hours). A key focus of the dataset is collaborative work, where two workers jointly perform cognitively and physically intensive tasks. The egocentric recordings include rich multimodal data and added context via eye gaze, narration, sound, motion, and others. We provide detailed annotations (actions, summaries, mistake annotations, narrations), metadata, processed outputs (eye gaze, hand pose, semi-dense point cloud), and benchmarks on procedural and non-procedural task understanding, Mistake Detection, and reasoning-based Question Answering. Baseline evaluations for Mistake Detection, Question Answering and collaborative task understanding show that the dataset presents a challenge for the state-of-the-art multimodal models. Our dataset is available at: https://huggingface.co/datasets/FraunhoferIPK/IndEgo
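
As a rough sanity check on the scale figures quoted above, the short sketch below derives the mean recording length per view. The recording counts and hour totals come from the abstract; the averages are derived here and are not reported by the authors, and since the hour totals are approximate, so are the results. The function name `mean_minutes` is illustrative only.

```python
# Counts and (approximate) hour totals as quoted in the abstract.
EGO_VIDEOS, EGO_HOURS = 3460, 197
EXO_VIDEOS, EXO_HOURS = 1092, 97


def mean_minutes(num_videos: int, total_hours: float) -> float:
    """Average recording length in minutes (hypothetical helper)."""
    return total_hours * 60 / num_videos


ego_mean = mean_minutes(EGO_VIDEOS, EGO_HOURS)
exo_mean = mean_minutes(EXO_VIDEOS, EXO_HOURS)

# Derived values, not figures reported in the paper.
print(f"egocentric: {ego_mean:.1f} min, exocentric: {exo_mean:.1f} min")
```

Under these numbers the egocentric recordings average a little over three minutes and the exocentric ones a little over five. This suggests that many recordings are short clips, while the longer task sequences in Figure 4 sit in the tail of the distribution.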

Paper Structure

This paper contains 42 sections, 12 figures, 42 tables.

Figures (12)

  • Figure 1: Examples from the IndEgo dataset showing different industrial cases. a. Assembly/Disassembly and Collaborative Work (elaborated further in Figure \ref{disassembly_table}), b. Logistics and Organisation, c. Woodworking, d. Inspection and Repair (the worker's narration and the annotated actions are also shown).
  • Figure 2: A scenario of a disassembly process from the IndEgo dataset. The two participants work collaboratively on the task. The semi-dense point cloud and the user trajectories are generated by processing the raw data from the Aria device [engel2023projectarianewtool, FacebookResearch2023ARK]. The egocentric perspective of the two participants with the projected eye gaze point can be seen in relation to the 3D environment and the exocentric view. Bottom: The annotations from each worker's perspective, and the keysteps in the process. Right: The corresponding task graph for the procedure. The flow of activities is from top to bottom, and dependencies are shown with an arrow. denotes labour-intensive steps.
  • Figure 3: Grouped bar charts of frequencies (left axis) and durations (right axis) for the fine-grained action annotations: Top 20 nouns (left), verbs (middle), and adjectives (right). Our dataset covers diverse industrial contexts, which are not represented by current egocentric/exocentric datasets. This highlights the multimodality and human-centric attributes of IndEgo.
  • Figure 4: Histogram of egocentric video durations (minutes) of medium-longer task sequences for each category: assembly/disassembly, inspection/repair, logistics/organisation, and woodworking. The hatched regions represent two-person collaborative tasks. Details on the Miscellaneous category are provided in the Appendix (\ref{scenarios} and \ref{mistake}).
  • Figure 5: Task graphs for some scenarios from the MD benchmark, along with some commonly seen mistakes (not exhaustive). The flow of actions is from top to bottom, and dependencies are shown with an arrow. denotes labour-intensive steps. Left: Miscellaneous sequence #3 at the top, and Assembly sequence #23 at the bottom. Right: Logistics sequence #21. The graph is shortened due to spatial constraints.
  • ...and 7 more figures
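
The task graphs in Figures 2 and 5 encode dependencies between steps. The sketch below illustrates how such a graph could be represented and used to flag ordering mistakes in an observed step sequence. It is not the paper's Mistake Detection baseline: the step names, the `TASK_GRAPH` dictionary and the `find_order_mistakes` helper are all hypothetical, and the sketch covers only order violations, not other mistake types.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical task graph: each step maps to the set of steps that must
# be completed before it (the arrows in a task graph, read top to bottom).
TASK_GRAPH: Dict[str, Set[str]] = {
    "loosen_screws": set(),
    "remove_cover": {"loosen_screws"},
    "unplug_cable": {"remove_cover"},
    "lift_out_module": {"remove_cover", "unplug_cable"},
}


def find_order_mistakes(
    graph: Dict[str, Set[str]], observed: List[str]
) -> List[Tuple[str, str]]:
    """Return (step, missing_prerequisite) pairs for every observed step
    that was performed before one of its prerequisites was completed."""
    done: Set[str] = set()
    mistakes: List[Tuple[str, str]] = []
    for step in observed:
        # Prerequisites not yet completed when this step is performed.
        for prereq in sorted(graph.get(step, set()) - done):
            mistakes.append((step, prereq))
        done.add(step)
    return mistakes


correct_run = ["loosen_screws", "remove_cover", "unplug_cable", "lift_out_module"]
faulty_run = ["loosen_screws", "remove_cover", "lift_out_module", "unplug_cable"]

print(find_order_mistakes(TASK_GRAPH, correct_run))
print(find_order_mistakes(TASK_GRAPH, faulty_run))
```

Storing prerequisites per step, rather than a single fixed order, lets the check accept any valid ordering of independent steps. It reports only the steps that are performed before their dependencies.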