Exploiting Multimodal Synthetic Data for Egocentric Human-Object Interaction Detection in an Industrial Scenario

Rosario Leonardi; Francesco Ragusa; Antonino Furnari; Giovanni Maria Farinella

TL;DR

This work tackles egocentric human-object interaction detection in industrial environments by introducing a synthetic data generation pipeline and EgoISM-HOI, a multimodal dataset combining synthetic and real EHOI data with RGB, depth, and instance segmentation annotations. The authors design a multimodal EHOI detector that leverages RGB, depth, and masks, plus hand-side, hand-state, and offset-vector modules to predict <hand, contact state, active object> triplets, followed by a matching step to select active objects. They demonstrate that pretraining on domain-specific synthetic data substantially boosts performance on real-world data, especially for active-object metrics, and show further gains from full multimodal training and from pretraining on external datasets. The approach outperforms class-agnostic baselines (HiC, VISOR) across multiple settings, and the authors publicly release EgoISM-HOI data, code, and pretrained models to support research and practical deployment. Overall, synthetic-domain pretraining paired with multimodal hand-object modeling offers a practical path to robust EHOI detection in real industrial workflows.
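The matching step mentioned above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each detected hand carries a predicted side, a binary contact state, and a 2D offset vector in pixels pointing from the hand center toward its active object, and that the active object is chosen as the detected object whose box center lies closest to the offset-shifted hand center. The names `Hand`, `center`, and `match_active_objects` are hypothetical and are not taken from the released code; the authors' actual matching rule may differ.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def center(box: Box) -> Tuple[float, float]:
    """Return the center point of an axis-aligned box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


@dataclass
class Hand:
    box: Box
    side: str                     # "left" or "right" (assumed labels)
    in_contact: bool              # assumed binary contact state
    offset: Tuple[float, float]   # assumed (dx, dy) from hand center to object


def match_active_objects(
    hands: List[Hand], object_boxes: List[Box]
) -> List[Tuple[str, bool, Optional[int]]]:
    """Build <hand, contact state, active object> triplets.

    The active object is given as an index into object_boxes,
    or None when the hand is not in contact or no object was detected.
    """
    triplets = []
    for hand in hands:
        if not hand.in_contact or not object_boxes:
            triplets.append((hand.side, hand.in_contact, None))
            continue
        cx, cy = center(hand.box)
        # Point where the offset vector says the active object should be.
        tx, ty = cx + hand.offset[0], cy + hand.offset[1]

        def distance(i: int) -> float:
            ox, oy = center(object_boxes[i])
            return hypot(ox - tx, oy - ty)

        best = min(range(len(object_boxes)), key=distance)
        triplets.append((hand.side, True, best))
    return triplets
```

In this sketch a hand that is not in contact yields a triplet with no active object, so downstream metrics can treat the hand and the active object separately.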

Abstract

In this paper, we tackle the problem of Egocentric Human-Object Interaction (EHOI) detection in an industrial setting. To overcome the lack of public datasets in this context, we propose a pipeline and a tool for generating synthetic images of EHOIs paired with several annotations and data signals (e.g., depth maps or segmentation masks). Using the proposed pipeline, we present EgoISM-HOI, a new multimodal dataset composed of synthetic EHOI images in an industrial environment with rich annotations of hands and objects. To demonstrate the utility and effectiveness of synthetic EHOI data produced by the proposed tool, we designed a new method that predicts and combines different multimodal signals to detect EHOIs in RGB images. Our study shows that exploiting synthetic data to pre-train the proposed method significantly improves performance when tested on real-world data. Moreover, to fully understand the usefulness of our method, we conducted an in-depth analysis in which we compared the proposed approach with different state-of-the-art class-agnostic methods and highlighted its superiority. To support research in this field, we publicly release the datasets, source code, and pre-trained models at https://iplab.dmi.unict.it/egoism-hoi.

Paper Structure (35 sections, 2 equations, 10 figures, 10 tables)

Introduction
Related Work
Datasets for Human-Object Interaction Detection
Human-Object Interaction simulators and synthetic datasets
Methods for Detecting Human-Object Interactions
Proposed EHOI Generation Pipeline
EgoISM-HOI dataset
EgoISM-HOI-Synth
EgoISM-HOI-Real
Proposed approach
Backbone
Object detector branch
Instance segmentation branch
Monocular depth estimation branch
Hand side classifier
...and 20 more sections

Figures (10)

Figure 1: Synthetic EHOI image generation pipeline. (a) We use 3D scanners to acquire 3D models of the objects and environment. (b) We then use the proposed data generation tool to create the synthetic dataset.
Figure 2: A picture of the ENIGMA Lab.
Figure 3: 3D models of the 19 objects considered for the experiments.
Figure 4: Examples of synthetic images (left) with the corresponding annotations (center) and depth maps (right) generated with the proposed tool.
Figure 5: Our tool is able to randomize different aspects of the virtual scene, such as the camera and user positions or the shirt's texture and color.
...and 5 more figures
