Exploiting Multimodal Synthetic Data for Egocentric Human-Object Interaction Detection in an Industrial Scenario
Rosario Leonardi, Francesco Ragusa, Antonino Furnari, Giovanni Maria Farinella
TL;DR
This work tackles egocentric human-object interaction detection in industrial environments by introducing a synthetic data generation pipeline and EgoISM-HOI, a multimodal dataset combining synthetic and real EHOI data with RGB, depth, and instance segmentation annotations. The authors design a multimodal EHOI detector that leverages RGB, depth, and masks, plus hand-side, hand-state, and offset-vector modules to predict <hand, contact state, active object> triplets, followed by a matching step to select active objects. They demonstrate that pretraining on domain-specific synthetic data substantially boosts performance on real-world data, especially for active-object metrics, and show further gains from full multimodal training and from pretraining on external datasets. The approach outperforms class-agnostic baselines (HiC, VISOR) across multiple settings, and the authors publicly release EgoISM-HOI data, code, and pretrained models to support research and practical deployment. Overall, synthetic-domain pretraining paired with multimodal hand-object modeling offers a practical path to robust EHOI detection in real industrial workflows.
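The matching step mentioned above can be illustrated with a minimal sketch. It assumes a common formulation in which each hand's predicted offset vector points from the hand-box center toward the center of the object it touches, and the detected object nearest to that point is selected. The `Hand` class, the `match_active_object` function, and the nearest-center rule are hypothetical names and choices made for illustration; they are not taken from the authors' released code.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center of a bounding box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


@dataclass
class Hand:
    """Hypothetical container for per-hand predictions."""
    box: Box
    side: str                     # "left" or "right"
    in_contact: bool              # predicted contact state
    offset: Tuple[float, float]   # assumed (dx, dy) from hand center to object center


def match_active_object(hand: Hand, object_boxes: List[Box]) -> Optional[int]:
    """Return the index of the object box closest to hand center + offset.

    Returns None when the hand is not in contact or no objects were detected.
    This nearest-center rule is an illustrative assumption.
    """
    if not hand.in_contact or not object_boxes:
        return None
    hx, hy = center(hand.box)
    tx, ty = hx + hand.offset[0], hy + hand.offset[1]  # point the offset vector targets
    distances = []
    for box in object_boxes:
        ox, oy = center(box)
        distances.append(hypot(ox - tx, oy - ty))
    return min(range(len(distances)), key=distances.__getitem__)


# Example inputs: one hand in contact, two candidate object detections.
objects = [(0.0, 20.0, 10.0, 30.0), (20.0, 0.0, 30.0, 10.0)]
hand = Hand(box=(0.0, 0.0, 10.0, 10.0), side="right", in_contact=True, offset=(20.0, 0.0))
active_index = match_active_object(hand, objects)
```

Under these assumptions, the `<hand, contact state, active object>` triplet for a hand would be assembled from `hand.box`, `hand.in_contact`, and `objects[active_index]` whenever the returned index is not `None`.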
Abstract
In this paper, we tackle the problem of Egocentric Human-Object Interaction (EHOI) detection in an industrial setting. To overcome the lack of public datasets in this context, we propose a pipeline and a tool for generating synthetic images of EHOIs paired with several annotations and data signals (e.g., depth maps or segmentation masks). Using the proposed pipeline, we present EgoISM-HOI, a new multimodal dataset composed of synthetic EHOI images in an industrial environment with rich annotations of hands and objects. To demonstrate the utility and effectiveness of the synthetic EHOI data produced by the proposed tool, we designed a new method that predicts and combines different multimodal signals to detect EHOIs in RGB images. Our study shows that exploiting synthetic data to pre-train the proposed method significantly improves performance when tested on real-world data. Moreover, to fully understand the usefulness of our method, we conducted an in-depth analysis comparing the proposed approach with different state-of-the-art class-agnostic methods and highlighting its superiority over them. To support research in this field, we publicly release the datasets, source code, and pre-trained models at https://iplab.dmi.unict.it/egoism-hoi.
