DynaHOI: Benchmarking Hand-Object Interaction for Dynamic Target
BoCheng Hu, Zhonghan Zhao, Kaiyue Zhou, Hongwei Wang, Gaoang Wang
TL;DR
DynaHOI addresses the lack of benchmarks for dynamic hand-object interaction by introducing DynaHOI-Gym, an online closed-loop evaluation platform, and DynaHOI-10M, a large-scale dataset with moving targets across 8 motion types and 22 subtypes. The framework supports both observe-before-act and direct-act rollouts and evaluates diverse models, including policy-based diffusion controllers and generalist VLMs, using a hierarchical metric suite that covers localization, grasping, trajectory quality, and completion speed. The ObAct baseline shows the benefit of motion awareness: by combining short-term temporal observations with spatiotemporal attention, it improves on prior methods, including an 8.1% improvement in location success rate. The findings show that even state-of-the-art dynamic policies struggle with accurate anticipation and robust contact-rich grasping under motion, underscoring the need for motion-aware architectures and temporal context in dexterous manipulation tasks.
Abstract
Most existing hand motion generation benchmarks for hand-object interaction (HOI) focus on static objects, leaving dynamic scenarios with moving targets and time-critical coordination largely untested. To address this gap, we introduce DynaHOI-Gym, a unified online closed-loop platform with parameterized motion generators and rollout-based metrics for dynamic capture evaluation. Built on DynaHOI-Gym, we release DynaHOI-10M, a large-scale benchmark with 10M frames and 180K hand capture trajectories, whose target motions are organized into 8 major categories and 22 fine-grained subcategories. We also provide a simple observe-before-act baseline (ObAct) that integrates short-term observations with the current frame via spatiotemporal attention to predict actions, achieving an 8.1% improvement in location success rate.
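To make the closed-loop setting concrete, the following is a minimal Python sketch of a rollout against a parameterized moving target. It is not the DynaHOI-Gym API: every name here (`circular_target`, `rollout`, `chase_current`, `lead_target`) is hypothetical, the "hand" is reduced to a speed-limited 2D point, and the anticipating policy uses plain linear extrapolation rather than the spatiotemporal attention used by ObAct. The `observe_steps` argument is an assumed way to contrast direct-act rollouts (`observe_steps=0`) with observe-before-act rollouts (`observe_steps>0`).

```python
import math
from typing import Callable, List, Optional, Tuple

Vec = Tuple[float, float]
Policy = Callable[[List[Vec], float], Vec]


def circular_target(radius: float, omega: float) -> Callable[[float], Vec]:
    """Hypothetical parameterized motion generator: uniform circular motion."""
    def position(t: float) -> Vec:
        return (radius * math.cos(omega * t), radius * math.sin(omega * t))
    return position


def chase_current(history: List[Vec], dt: float) -> Vec:
    """Aim at the most recent observed target position (no anticipation)."""
    return history[-1]


def lead_target(history: List[Vec], dt: float, lead_time: float = 0.2) -> Vec:
    """Aim ahead of the target using velocity estimated from the last two frames."""
    if len(history) < 2:
        return history[-1]
    (x0, y0), (x1, y1) = history[-2], history[-1]
    vx, vy = (x1 - x0) / dt, (y1 - y0) / dt
    return (x1 + vx * lead_time, y1 + vy * lead_time)


def rollout(
    target: Callable[[float], Vec],
    policy: Policy,
    observe_steps: int = 0,
    max_steps: int = 200,
    dt: float = 0.05,
    max_speed: float = 2.0,
    tol: float = 0.05,
) -> Optional[int]:
    """Run one closed-loop episode.

    Returns the step index at which the hand is within `tol` of the target,
    or None if that never happens within `max_steps`.
    """
    hand: Vec = (0.0, 0.0)
    history: List[Vec] = []
    for step in range(max_steps):
        obs = target(step * dt)          # environment emits a new observation
        history.append(obs)
        if math.dist(hand, obs) <= tol:  # success check on the current frame
            return step
        if step < observe_steps:         # observe-before-act: watch, do not move
            continue
        goal = policy(history, dt)       # policy sees everything observed so far
        dx, dy = goal[0] - hand[0], goal[1] - hand[1]
        d = math.hypot(dx, dy)
        scale = min(1.0, max_speed * dt / d) if d > 0 else 0.0
        hand = (hand[0] + dx * scale, hand[1] + dy * scale)
    return None
```

A call such as `rollout(circular_target(1.0, 1.0), lead_target, observe_steps=3)` runs an observe-before-act episode, while `rollout(circular_target(1.0, 1.0), chase_current)` runs a direct-act one. The returned step index is only a stand-in for a completion-speed measurement; the benchmark's actual metrics and observation format are defined by the paper, not by this sketch.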
