Leveraging Synthetic Data for Enhancing Egocentric Hand-Object Interaction Detection

Rosario Leonardi, Antonino Furnari, Francesco Ragusa, Giovanni Maria Farinella

Abstract

In this work, we explore the role of synthetic data in improving the detection of Hand-Object Interactions (HOI) from egocentric images. Through extensive experimentation and comparative analysis on the VISOR, EgoHOS, and ENIGMA-51 datasets, our findings demonstrate the potential of synthetic data to significantly improve HOI detection, particularly when real labeled data are scarce or unavailable. By using synthetic data together with only 10% of the real labeled data, we achieve improvements in Overall AP over models trained exclusively on real data, with gains of +5.67% on VISOR, +8.24% on EgoHOS, and +11.69% on ENIGMA-51. Furthermore, we systematically study the effect of aligning synthetic data to specific real-world benchmarks with respect to objects, grasps, and environments, showing that the effectiveness of synthetic data consistently improves with better synthetic-real alignment. As a result of this work, we release a new data generation pipeline and the new HOI-Synth benchmark, which augments existing datasets with synthetic images of hand-object interactions. These data are automatically annotated with hand-object contact states, bounding boxes, and pixel-wise segmentation masks. All data, code, and tools for synthetic data generation are available at: https://fpv-iplab.github.io/HOI-Synth/.
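To make the released annotations concrete, the record below is a minimal sketch of what one labeled frame could look like. It is an assumption based only on the annotation types named in the abstract (contact states, bounding boxes, pixel-wise masks, hand-object relations); the field names and layout are illustrative, not the released HOI-Synth schema.

```python
# Illustrative annotation record for one egocentric frame.
# Field names and layout are assumptions, not the released HOI-Synth schema.
annotation = {
    "image_id": 42,
    "hand": {
        "bbox": [210.0, 118.0, 96.0, 84.0],   # [x, y, w, h] in pixels
        "mask": "<RLE pixel-wise segmentation>",
        "contact_state": "in_contact",        # vs. "no_contact"
    },
    "object": {
        "bbox": [268.0, 140.0, 120.0, 110.0],
        "mask": "<RLE pixel-wise segmentation>",
    },
    "relation": {"hand": 0, "object": 0},     # which hand holds which object
}
```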


Paper Structure

This paper contains 35 sections, 13 figures, and 10 tables.

Figures (13)

  • Figure 1: We explore the role of synthetic data in egocentric hand-object interaction detection by generating and automatically labeling synthetic datasets (left). We then analyze domain adaptation scenarios where models are trained using both synthetic and real unlabeled data, with varying amounts of labeled real data (right).
  • Figure 2: The proposed data generation pipeline. (a) An object-grasp pair is selected from DexGraspNet [wang2023dexgraspnet] and integrated with a randomly generated human model. (b) The human + object model is placed in an environment randomly selected from the Habitat-Matterport 3D dataset [ramakrishnan2021habitat]. (c) Egocentric data of hand-object interactions is generated and automatically labeled. Labels include bounding boxes and segmentation masks of hands and interacted objects, contact states, and hand-object relations. A minimal sketch of this generation loop is given after the figure list.
  • Figure 3: Qualitative examples generated by our HOI-Synth Data Generation Pipeline. For each sample, the pipeline automatically generates rich ground-truth annotations, including 2D bounding boxes, pixel-wise segmentation masks, and interaction metadata describing the relationship between hands and manipulated objects.
  • Figure 4: An ENIGMA-51 image (left), a synthetic in-domain image (center), and a synthetic out-of-domain image (right).
  • Figure 5: The architecture of the domain adaptation approach used in our analysis. The method is based on the Adaptive Teacher framework [li2022cross], extended with HOS recognition modules [VISOR2022]. A generic teacher-update sketch follows the figure list.
  • ...and 8 more figures
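As noted in the Figure 2 caption, the generation loop has three stages: select an object-grasp pair, place the resulting human + object model in a 3D environment, and render automatically labeled egocentric frames. The following is a minimal runnable sketch of that control flow; the asset lists, the render call, and the label values are placeholder assumptions standing in for DexGraspNet grasps, Habitat-Matterport 3D scenes, and a real 3D rendering engine.

```python
import random

# Placeholder assets standing in for DexGraspNet object-grasp pairs and
# Habitat-Matterport 3D environments (names are illustrative).
OBJECT_GRASP_PAIRS = [("mug", "power_grasp"), ("knife", "precision_grasp")]
ENVIRONMENTS = ["hm3d_kitchen_0001", "hm3d_office_0042"]

def generate_labeled_sample():
    # (a) select an object-grasp pair to attach to a random human model
    obj, grasp = random.choice(OBJECT_GRASP_PAIRS)
    # (b) place the human + object model in a randomly selected environment
    env = random.choice(ENVIRONMENTS)
    # (c) render an egocentric frame; labels come for free from the 3D scene
    frame = f"render({env}, {obj}, {grasp})"  # stand-in for the actual renderer
    labels = {
        "hand_bbox": [210.0, 118.0, 96.0, 84.0],      # placeholder values
        "object_bbox": [268.0, 140.0, 120.0, 110.0],
        "contact_state": "in_contact",
        "relation": {"hand": "right", "object": obj},
    }
    return frame, labels

frame, labels = generate_labeled_sample()
```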
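The Adaptive Teacher framework behind Figure 5 trains a student detector on labeled synthetic images and on teacher-generated pseudo-labels for unlabeled real images, while the teacher is updated as an exponential moving average (EMA) of the student. Below is a generic sketch of that EMA update, assuming PyTorch; the momentum value is illustrative and this is not the authors' implementation.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               momentum: float = 0.9996) -> None:
    """Mean-teacher update: teacher = momentum * teacher + (1 - momentum) * student."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)

# Toy usage: any pair of identically shaped modules works.
teacher, student = torch.nn.Linear(4, 2), torch.nn.Linear(4, 2)
ema_update(teacher, student)
```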