DynaHOI: Benchmarking Hand-Object Interaction for Dynamic Target

BoCheng Hu, Zhonghan Zhao, Kaiyue Zhou, Hongwei Wang, Gaoang Wang

TL;DR

DynaHOI addresses the lack of benchmarks for dynamic hand-object interaction by introducing DynaHOI-Gym, an online closed-loop evaluation platform, and DynaHOI-10M, a large-scale dataset of moving targets spanning 8 motion types and 22 subtypes. The framework supports both observe-before-act and direct-act rollouts and evaluates diverse models, including diffusion policies and generalist VLMs, with a hierarchical metric suite covering localization, grasping, trajectory quality, and completion speed. The ObAct baseline combines short-term temporal observations with spatiotemporal attention and achieves an 8.1% improvement in location success rate. The results show that even state-of-the-art policies struggle with accurate anticipation and robust contact-rich grasping under motion, underscoring the need for motion-aware architectures and temporal context in dexterous manipulation.

Abstract

Most existing hand motion generation benchmarks for hand-object interaction (HOI) focus on static objects, leaving dynamic scenarios with moving targets and time-critical coordination largely untested. To address this gap, we introduce DynaHOI-Gym, a unified online closed-loop platform with parameterized motion generators and rollout-based metrics for dynamic capture evaluation. Built on DynaHOI-Gym, we release DynaHOI-10M, a large-scale benchmark with 10M frames and 180K hand capture trajectories, whose target motions are organized into 8 major categories and 22 fine-grained subcategories. We also provide a simple observe-before-act baseline (ObAct) that integrates short-term observations with the current frame via spatiotemporal attention to predict actions, achieving an 8.1% improvement in location success rate.
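To make the closed-loop setting concrete, the following is a minimal sketch of a parameterized motion generator and the two rollout formats (observe-before-act and direct-act). It is not the DynaHOI-Gym API: every name here (`linear_motion`, `circular_motion`, `lead_policy`, `rollout`), the point-mass hand model, the speed cap, and the 0.05 distance threshold are assumptions chosen for illustration. The policy uses a finite-difference velocity estimate in place of ObAct's spatiotemporal attention.

```python
import math
from typing import Callable, List, Tuple

Vec3 = Tuple[float, float, float]


def linear_motion(start: Vec3, velocity: Vec3) -> Callable[[float], Vec3]:
    """Hypothetical kinematic primitive: constant-velocity target."""
    def position(t: float) -> Vec3:
        return tuple(s + v * t for s, v in zip(start, velocity))
    return position


def circular_motion(center: Vec3, radius: float, omega: float) -> Callable[[float], Vec3]:
    """Hypothetical kinematic primitive: circle in the xy-plane."""
    def position(t: float) -> Vec3:
        return (center[0] + radius * math.cos(omega * t),
                center[1] + radius * math.sin(omega * t),
                center[2])
    return position


def lead_policy(history: List[Vec3], hand: Vec3, dt: float, max_speed: float) -> Vec3:
    """Aim ahead of the target using a velocity estimate from the last two frames.

    With a single observed frame no velocity can be estimated, so the policy
    falls back to aiming at the current target position.
    """
    target = history[-1]
    if len(history) < 2:
        return target
    velocity = tuple((a - b) / dt for a, b in zip(history[-1], history[-2]))
    time_to_go = math.dist(hand, target) / max_speed  # rough intercept time
    return tuple(p + v * time_to_go for p, v in zip(target, velocity))


def rollout(target_fn, policy, hand_start: Vec3, observe_steps: int,
            max_steps: int, dt: float = 0.05, max_speed: float = 1.0,
            loc_threshold: float = 0.05) -> Tuple[bool, int]:
    """Closed-loop rollout; observe_steps=0 corresponds to direct-act.

    Returns (success, step), where success means the hand came within
    loc_threshold of the target (an assumed localization criterion).
    """
    history: List[Vec3] = []
    hand = hand_start
    for step in range(max_steps):
        target = target_fn(step * dt)
        history.append(target)
        if step < observe_steps:
            continue  # observation phase: watch the target, do not move
        goal = policy(history, hand, dt, max_speed)
        dist = math.dist(hand, goal)
        if dist > 0.0:
            step_len = min(dist, max_speed * dt)  # speed-limited motion
            hand = tuple(h + (g - h) / dist * step_len for h, g in zip(hand, goal))
        if math.dist(hand, target) <= loc_threshold:
            return True, step
    return False, max_steps


if __name__ == "__main__":
    target = linear_motion((1.0, 0.0, 0.0), (0.0, 0.5, 0.0))
    print("direct-act:", rollout(target, lead_policy, (0.0, 0.0, 0.0), 0, 300))
    print("observe-before-act:", rollout(target, lead_policy, (0.0, 0.0, 0.0), 10, 300))
```

In this toy setup the observation window only affects whether the first action already has a velocity estimate available; swapping `linear_motion` for `circular_motion` gives a target whose future position is harder to extrapolate from two frames.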

Paper Structure

This paper contains 52 sections, 7 equations, 17 figures, 3 tables.

Figures (17)

  • Figure 1: Dynamic capture in DynaHOI-10M: 11 moving objects and 22 motion types (left). Given an instruction and a short observation window, the model must predict the interception point and capture the moving target (middle). Existing policy models and generalist VLMs achieve low localization success rates on our benchmark, highlighting the challenge of motion-aware anticipation (right).
  • Figure 2: The framework of the DynaHOI-10M benchmark. Left: Model zoo with three families: VLA, diffusion policies, and VLM-based controllers. Middle: DynaHOI-Gym supports trajectories (kinematic primitives, physically constrained, stochastic & complex) and two rollout formats: observe-before-act and direct-act. Right: Multi-dimensional scoring includes runtime $R_{\text{time}}$, trajectory quality (spatial $Q_{\text{line}}$, temporal $Q_{\text{smooth}}$), and overall metrics (location/grasp success $S_{\text{loc}},S_{\text{gra}}$; deviations $E_{\text{loc}},E_{\text{gra}}$).
  • Figure 3: Motion diversity in the DynaHOI-10M benchmark. DynaHOI-10M spans 3 categories and 8 major motion types, ranging from kinematic primitives to physics-constrained and stochastic dynamics. The illustrated trajectories demonstrate rich and diverse manipulation behaviors over moving targets, enabling comprehensive evaluation of dynamic object grasping.
  • Figure 4: Data statistics and diversity of DynaHOI-10M. (a) Object motions are organized into a 3-level hierarchy with 8 major categories and 22 fine-grained subcategories. (b) Hand poses cover diverse object scales: larger objects correspond to smaller grasp magnitudes and smaller objects to larger grasp magnitudes; each object supports both near- and far-range manipulation. (c-d) Distributions of episode durations (frames) and trajectory lengths across the benchmark.
  • Figure 5: ObAct incorporates observation frames and spatiotemporal attention to condition action prediction on object dynamics.
  • ...and 12 more figures