A Multimodal Dataset for Enhancing Industrial Task Monitoring and Engagement Prediction

Naval Kishore Mehta, Arvind, Himanshu Kumar, Abeer Banerjee, Sumeet Saurav, Sanjay Singh

TL;DR

The paper tackles the challenge of monitoring operator actions, engagement, and object interactions in dynamic industrial workflows, where unimodal approaches fall short. It introduces the Multimodal Industrial Activity Monitoring (MIAM) dataset, which combines multi-view RGB, depth, and IMU data across 22 sessions (290 minutes) with annotations for action localization and engagement, and it proposes a multimodal fusion network for engagement prediction. The model uses an I3D backbone for RGB, Gramian Angular Field representations of IMU data processed by ResNet-18, and skeleton embeddings, combined through intra- and cross-modality attention and gated fusion. It achieves up to 86.79% accuracy in Engaged/Disengaged prediction on a split of 1,572 training and 386 test samples. The dataset and associated code are released under a CC BY 4.0 license as a resource for human-robot collaboration research in real-world industrial environments.
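To make the Gramian Angular Field step concrete, the following is a minimal sketch of the summation variant (GASF) for a single one-dimensional signal, such as one IMU axis. It is an illustration under stated assumptions, not the authors' code: the function name, the min-max rescaling to [-1, 1], and the choice of the summation variant are assumptions, since the summary does not specify them.

```python
import math


def gramian_angular_summation_field(series):
    """Encode a 1-D series as a GASF matrix (hypothetical helper, not from the paper).

    Steps: rescale to [-1, 1], map each value to an angle phi = arccos(x),
    then build the matrix G[i][j] = cos(phi_i + phi_j).
    """
    lo, hi = min(series), max(series)
    if hi == lo:
        # A constant signal has no range; map it to 0 by convention (assumption).
        scaled = [0.0 for _ in series]
    else:
        scaled = [2.0 * (x - lo) / (hi - lo) - 1.0 for x in series]

    # Clamp before arccos to guard against floating-point overshoot.
    phis = [math.acos(max(-1.0, min(1.0, v))) for v in scaled]

    return [[math.cos(p_i + p_j) for p_j in phis] for p_i in phis]


if __name__ == "__main__":
    # Toy signal standing in for one IMU axis over three time steps.
    field = gramian_angular_summation_field([0.0, 1.0, 2.0])
    for row in field:
        print([round(v, 3) for v in row])
```

The result is a square, symmetric matrix with one row and column per time step, which is what allows an image backbone such as ResNet-18 to consume the signal. How the individual IMU axes are stacked into image channels is not stated here and would be a further design choice.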

Abstract

Detecting and interpreting operator actions, engagement, and object interactions in dynamic industrial workflows remains a significant challenge in human-robot collaboration research, especially within complex, real-world environments. Traditional unimodal methods often fall short of capturing the intricacies of these unstructured industrial settings. To address this gap, we present a novel Multimodal Industrial Activity Monitoring (MIAM) dataset that captures realistic assembly and disassembly tasks, facilitating the evaluation of key meta-tasks such as action localization, object interaction, and engagement prediction. The dataset comprises multi-view RGB, depth, and Inertial Measurement Unit (IMU) data collected from 22 sessions, amounting to 290 minutes of untrimmed video, annotated in detail for task performance and operator behavior. Its distinctiveness lies in the integration of multiple data modalities and its emphasis on real-world, untrimmed industrial workflows, which are key for advancing research in human-robot collaboration and operator monitoring. Additionally, we propose a multimodal network that fuses RGB frames, IMU data, and skeleton sequences to predict engagement levels during industrial tasks. Our approach improves the accuracy of recognizing engagement states, providing a robust solution for monitoring operator performance in dynamic industrial environments. The dataset and code can be accessed from https://github.com/navalkishoremehta95/MIAM/.

Paper Structure

This paper contains 8 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Recording setup for assembly tasks with visual and motion tracking.
  • Figure 2: MIAM dataset sample with engagement levels, action labels, RGB and depth views, and IMU data during assembly tasks.
  • Figure 3: Overview of the multimodal fusion network for engagement recognition, demonstrating the flow from RGB frames, IMU sensors, and pose keypoints through the attention blocks to the final fusion and classification layers.
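The final fusion stage named in Figure 3 can be illustrated with a generic gated fusion of two feature vectors. This is a sketch only, under several assumptions: the function names, the two-modality setup, and the specific gate form g = sigmoid(W[a; b] + bias) are illustrative and not confirmed by the source, which fuses three modalities and does not give its gating equations here.

```python
import math


def sigmoid(z):
    """Logistic function used to squash gate pre-activations into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(feat_a, feat_b, weights, bias):
    """Fuse two equal-length feature vectors with a learned gate (hypothetical form).

    gate  = sigmoid(W @ concat(feat_a, feat_b) + bias)
    fused = gate * feat_a + (1 - gate) * feat_b   (element-wise)

    `weights` has one row per output dimension and 2 * len(feat_a) columns.
    """
    if len(feat_a) != len(feat_b):
        raise ValueError("feature vectors must have the same length")

    concat = list(feat_a) + list(feat_b)
    fused = []
    for i in range(len(feat_a)):
        pre_activation = sum(w * x for w, x in zip(weights[i], concat)) + bias[i]
        gate = sigmoid(pre_activation)
        fused.append(gate * feat_a[i] + (1.0 - gate) * feat_b[i])
    return fused


if __name__ == "__main__":
    rgb_feat = [1.0, 2.0]   # stand-in for an RGB embedding
    imu_feat = [3.0, 6.0]   # stand-in for an IMU embedding
    zero_weights = [[0.0] * 4 for _ in range(2)]
    # With zero weights and bias, every gate is 0.5, so the output is the mean.
    print(gated_fusion(rgb_feat, imu_feat, zero_weights, [0.0, 0.0]))
```

The gate lets the network weight each modality per feature dimension instead of using a fixed average or plain concatenation. In a trained model the weights and bias would be learned, and a third modality such as the skeleton embedding would need either an additional gate or a softmax over modalities.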