ChildPlay-Hand: A Dataset of Hand Manipulations in the Wild

Arya Farkhondeh, Samy Tafasca, Jean-Marc Odobez

TL;DR

The authors benchmark various spatio-temporal and segmentation networks on ChildPlay-Hand, exploring body vs. hand-region information and comparing pose and RGB modalities. Their findings suggest that the dataset is a challenging new benchmark for modeling HOI in the wild.

Abstract

Hand-Object Interaction (HOI) is gaining significant attention, particularly with the creation of numerous egocentric datasets driven by AR/VR applications. However, third-person view HOI has received less attention, especially in terms of datasets. Most third-person view datasets are curated for action recognition tasks and feature pre-segmented clips of high-level daily activities, leaving a gap for in-the-wild datasets. To address this gap, we propose ChildPlay-Hand, a novel dataset that includes person and object bounding boxes, as well as manipulation actions. ChildPlay-Hand is unique in: (1) providing per-hand annotations; (2) featuring videos in uncontrolled settings with natural interactions, involving both adults and children; and (3) including gaze labels from the ChildPlay-Gaze dataset for joint modeling of manipulations and gaze. The manipulation actions cover the main stages of an HOI cycle, such as grasping, holding or operating, and different types of releasing. To illustrate the value of the dataset, we study two tasks: object-in-hand detection (OiH), i.e., whether a person has an object in their hand, and manipulation stages (ManiS), which is more fine-grained and targets the main stages of manipulation. We benchmark various spatio-temporal and segmentation networks, exploring body vs. hand-region information and comparing pose and RGB modalities. Our findings suggest that ChildPlay-Hand is a challenging new benchmark for modeling HOI in the wild.

Paper Structure

This paper contains 22 sections, 2 equations, 9 figures, and 3 tables.

Figures (9)

  • Figure 1: Sample instances from the ChildPlay-Hand dataset with person bounding boxes and the per-hand object bounding boxes and corresponding action classes.
  • Figure 2: Distribution of hand action classes in the dataset. We show the distribution in frames (top) and events (bottom).
  • Figure 3: Distribution of event duration (in frames) per action class. The violin plot shows the min, max and median values of each distribution.
  • Figure 4: Class-wise frame-based: Precision, Recall, and F1.
  • Figure 5: Class-wise segmental: Precision, Recall, and F1.
  • ...and 4 more figures