Learning Physics-Based Full-Body Human Reaching and Grasping from Brief Walking References
Yitang Li, Mingxian Lin, Zhuo Lin, Yipeng Deng, Yue Cao, Li Yi
TL;DR
The paper tackles the data bottleneck in physics-based full-body reaching and grasping by learning from brief walking MoCap references. It introduces an active data generation strategy and a local feature alignment mechanism to transfer natural walking patterns to task-specific grasping motions, implemented via a two-policy RL framework (low-level skill space and high-level task policy). Key contributions include a locality-aware critic architecture, an active augmentation pipeline that targets hard tasks, and a Mahalanobis-distance-based feature alignment to preserve walk-like dynamics during generation. Experiments across diverse scenes and unseen objects demonstrate high grasp success and natural motion, with ablations confirming the importance of both data augmentation and feature alignment. This work reduces dependence on large, task-specific motion datasets and has potential implications for animation, AR/VR, and humanoid robotics.
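To make the Mahalanobis-distance idea concrete, here is a minimal sketch of how the distance between a generated local feature and a distribution of walking features could be measured. It is an illustration under stated assumptions, not the authors' implementation: the function names (`fit_reference_stats`, `mahalanobis`), the feature layout (one row per walking sample), and the covariance regularizer `eps` are all hypothetical.

```python
import numpy as np

def fit_reference_stats(walk_features, eps=1e-6):
    """Estimate the mean and inverse covariance of reference (walking) features.

    walk_features: array of shape (num_samples, feature_dim); assumed layout.
    eps: small diagonal regularizer (an assumption) to keep the covariance invertible.
    """
    walk_features = np.asarray(walk_features, dtype=float)
    mean = walk_features.mean(axis=0)
    cov = np.cov(walk_features, rowvar=False)
    cov = cov + eps * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)

def mahalanobis(x, mean, cov_inv):
    """Mahalanobis distance of a single feature vector x from the reference distribution."""
    diff = np.asarray(x, dtype=float) - mean
    return float(np.sqrt(diff @ cov_inv @ diff))

# Toy reference set standing in for local walking features (hypothetical data).
walk = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
mean, cov_inv = fit_reference_stats(walk)

d_center = mahalanobis([1.0, 1.0], mean, cov_inv)  # a feature at the reference mean
d_offset = mahalanobis([3.0, 1.0], mean, cov_inv)  # a feature away from the mean
```

A small distance indicates that a generated feature resembles the walking data. One plausible use, again an assumption, is to penalize large distances during training so that the policy stays close to walk-like dynamics.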
Abstract
Existing motion generation methods based on mocap data are often limited by data quality and coverage. In this work, we propose a framework that generates diverse, physically feasible full-body human reaching and grasping motions using only brief walking mocap data. The framework builds on two observations: walking data captures valuable movement patterns that transfer across tasks, and advanced kinematic methods can generate diverse grasping poses, which can then be interpolated into motions that serve as task-specific guidance. Our approach incorporates an active data generation strategy to maximize the utility of the generated motions, along with a local feature alignment mechanism that transfers natural movement patterns from walking data to improve both the success rate and the naturalness of the synthesized motions. By combining the fidelity and stability of natural walking with the flexibility and generalizability of task-specific generated data, our method demonstrates strong performance and robust adaptability in diverse scenes and with unseen objects.
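The step of interpolating a kinematically generated grasping pose into a motion can be sketched as follows. This is a simplified illustration, not the paper's procedure: it assumes poses are flat vectors of joint angles and blends them linearly, whereas a full pipeline would more likely interpolate joint rotations (for example with spherical interpolation). The function name `interpolate_poses` and the toy poses are hypothetical.

```python
import numpy as np

def interpolate_poses(start_pose, goal_pose, num_frames):
    """Linearly blend from start_pose to goal_pose over num_frames frames.

    Poses are assumed to be 1-D vectors of joint angles (a simplification).
    Returns an array of shape (num_frames, pose_dim).
    """
    start = np.asarray(start_pose, dtype=float)
    goal = np.asarray(goal_pose, dtype=float)
    # Blend weights from 0 (start) to 1 (goal), one per frame.
    t = np.linspace(0.0, 1.0, num_frames)[:, None]
    return (1.0 - t) * start + t * goal

# Toy example: a 2-DoF "pose" moving from rest to a grasp configuration.
motion = interpolate_poses([0.0, 0.0], [1.0, 2.0], num_frames=5)
```

The resulting frame sequence is purely kinematic. In the setting described above it would serve only as task-specific guidance, with physical feasibility left to the physics-based policy.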
