A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains

Antonio Finocchiaro, Alessandro Sebastiano Catinello, Michele Mazzamuto, Rosario Leonardi, Antonino Furnari, Giovanni Maria Farinella

TL;DR

This work tackles real-time egocentric hand-object interaction (HOI) detection in industrial environments by decoupling the problem into online action recognition (AR) and conditional object detection (OD). It introduces a Mamba-based AR module with a distilled EfficientNetV2 backbone and a fine-tuned YOLOWorld OD module. On ENIGMA-51, the AR module reaches 38.52% p-AP at 30 fps and the OD module reaches 85.13% AP on hands and objects, while an oracle baseline provides an upper bound on performance. The approach is demonstrated on a Meta Quest 3, illustrating practical edge deployment for wearable industrial assistance.

Abstract

Hand-object interaction detection remains an open challenge in real-time applications, where intuitive user experiences depend on fast and accurate detection of interactions with surrounding objects. We propose an efficient approach for detecting hand-object interactions from streaming egocentric vision that operates in real time. Our approach consists of an action recognition module and an object detection module for identifying active objects upon confirmed interaction. Our action recognition model, a Mamba architecture with an EfficientNetV2 backbone, achieves 38.52% p-AP on the ENIGMA-51 benchmark at 30 fps, while our fine-tuned YOLOWorld reaches 85.13% AP on hands and objects. We implement our models in a cascaded architecture where the action recognition and object detection modules operate sequentially. When the action recognition module predicts a contact state, it activates the object detection module, which in turn performs inference on the relevant frame to detect and classify the active object.
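
The cascaded control flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `cascade`, `predict_contact`, `detect_objects`, and the 0.5 threshold are hypothetical names and values chosen for illustration, and the per-frame stub stands in for a streaming model that would keep temporal state across frames.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

# Hypothetical detection format: (class label, confidence).
Detection = Tuple[str, float]


def cascade(
    frames: Iterable[object],
    predict_contact: Callable[[object], float],
    detect_objects: Callable[[object], List[Detection]],
    contact_threshold: float = 0.5,  # assumed value, not taken from the paper
) -> Iterator[Tuple[int, Optional[List[Detection]]]]:
    """Yield (frame index, detections or None) for each frame in the stream."""
    for index, frame in enumerate(frames):
        # Stage 1: the action recognition stand-in scores every frame.
        contact_score = predict_contact(frame)
        if contact_score >= contact_threshold:
            # Stage 2: the detector runs only on frames flagged as contact.
            yield index, detect_objects(frame)
        else:
            # No contact predicted, so the detector is skipped for this frame.
            yield index, None
```

Because the detector is invoked only on frames flagged as contact, its cost is paid on a subset of the stream rather than on every frame, which is the motivation for the sequential design.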

Paper Structure

This paper contains 14 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Schema of the proposed approach. The AR module predicts the contact state online. Whenever a contact is detected, the object detection module identifies the active object class by evaluating the Intersection over Union (IoU) between objects and hands.
  • Figure 2: Representative samples of YOLOWorldM predictions on some validation images of ENIGMA-51. Blue boxes represent objects, green boxes represent hands.
  • Figure 3: Representative samples of active object retrieval performed by YOLOWorld. Green frames highlight success cases, red frames indicate failure cases.
  • Figure 4: Representation of the full pipeline, from recording input frames to returning model predictions to the Meta Quest 3.
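
The Figure 1 caption states that the active object is identified by evaluating the IoU between object and hand boxes. The sketch below shows one plausible way to do this in Python. The corner box format, the function names, and the rule of picking the object with the highest hand overlap above `min_iou` are assumptions made for illustration; the caption does not specify the exact selection rule.

```python
from typing import List, Optional, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_active_object(
    hands: List[Box],
    objects: List[Tuple[str, Box]],
    min_iou: float = 0.0,  # hypothetical threshold, not from the paper
) -> Optional[str]:
    """Return the label of the object that overlaps a hand the most, if any."""
    best_label: Optional[str] = None
    best_iou = min_iou
    for label, object_box in objects:
        for hand_box in hands:
            overlap = iou(hand_box, object_box)
            # Keep the object with the strictly largest overlap seen so far.
            if overlap > best_iou:
                best_label, best_iou = label, overlap
    return best_label
```

With one hand at `(0, 0, 10, 10)` and candidate objects `("pliers", (5, 5, 15, 15))` and `("board", (50, 50, 60, 60))`, the function returns `"pliers"`, since the board does not overlap the hand at all.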