Learning Physics-Based Full-Body Human Reaching and Grasping from Brief Walking References

Yitang Li, Mingxian Lin, Zhuo Lin, Yipeng Deng, Yue Cao, Li Yi

TL;DR

The paper tackles the data bottleneck in physics-based full-body reaching and grasping by learning from brief walking MoCap references. It introduces an active data generation strategy and a local feature alignment mechanism to transfer natural walking patterns to task-specific grasping motions, implemented via a two-policy RL framework (low-level skill space and high-level task policy). Key contributions include a locality-aware critic architecture, an active augmentation pipeline that targets hard tasks, and a Mahalanobis-distance-based feature alignment to preserve walk-like dynamics during generation. Experiments across diverse scenes and unseen objects demonstrate high grasp success and natural motion, with ablations confirming the importance of both data augmentation and feature alignment. This work reduces dependence on large, task-specific motion datasets and has potential implications for animation, AR/VR, and humanoid robotics.
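The Mahalanobis-distance-based feature alignment mentioned above can be illustrated with a minimal sketch. It assumes the features are plain NumPy vectors and that a distance to the walking-feature distribution is turned into a bounded reward with an exponential. The function names, the `eps` regularizer, the `scale` parameter, and the exponential shaping are all illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np


def fit_reference_stats(walk_feats, eps=1e-6):
    """Mean and regularized inverse covariance of reference (walking) features.

    walk_feats: array of shape (num_samples, feat_dim).
    eps: small diagonal term (assumption) that keeps the covariance invertible.
    """
    mu = walk_feats.mean(axis=0)
    cov = np.cov(walk_feats, rowvar=False)
    cov = cov + eps * np.eye(cov.shape[0])
    return mu, np.linalg.inv(cov)


def mahalanobis(feat, mu, cov_inv):
    """Mahalanobis distance of one feature vector from the reference distribution."""
    d = feat - mu
    # Clamp at zero to guard against tiny negative values from round-off.
    return float(np.sqrt(max(d @ cov_inv @ d, 0.0)))


def alignment_reward(feat, mu, cov_inv, scale=1.0):
    """Hypothetical reward shaping: 1 at the reference mean, decaying with distance."""
    return float(np.exp(-scale * mahalanobis(feat, mu, cov_inv)))
```

Under this shaping, a generated motion whose features sit near the walking distribution earns a reward close to 1, while features far from it are penalized more strongly along directions in which the walking data varies little.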

Abstract

Existing motion generation methods based on mocap data are often limited by data quality and coverage. In this work, we propose a framework that generates diverse, physically feasible full-body human reaching and grasping motions using only brief walking mocap data. We build on two observations: walking data captures valuable movement patterns that transfer across tasks, and advanced kinematic methods can generate diverse grasping poses, which can then be interpolated into motions that serve as task-specific guidance. Our approach incorporates an active data generation strategy to maximize the utility of the generated motions, along with a local feature alignment mechanism that transfers natural movement patterns from walking data to enhance both the success rate and naturalness of the synthesized motions. By combining the fidelity and stability of natural walking with the flexibility and generalizability of task-specific generated data, our method demonstrates strong performance and robust adaptability in diverse scenes and with unseen objects.
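As a rough illustration of how a generated grasping pose might be interpolated into a guidance motion, the sketch below blends linearly from a starting pose to a target pose. It assumes each pose is a flat vector of joint parameters for which linear blending is acceptable; real joint rotations would normally call for spherical interpolation, and the abstract does not specify the interpolation scheme. The function name and arguments are hypothetical.

```python
import numpy as np


def interpolate_motion(start_pose, grasp_pose, num_frames):
    """Linearly blend from start_pose to grasp_pose over num_frames frames.

    start_pose, grasp_pose: 1-D arrays of equal length (assumed pose vectors).
    num_frames: number of frames, at least 2, including both endpoints.
    Returns an array of shape (num_frames, pose_dim).
    """
    start_pose = np.asarray(start_pose, dtype=float)
    grasp_pose = np.asarray(grasp_pose, dtype=float)
    # Column of blend weights going from 0 (start) to 1 (grasp).
    t = np.linspace(0.0, 1.0, num_frames)[:, None]
    return (1.0 - t) * start_pose + t * grasp_pose
```

The resulting frames are purely kinematic; in a physics-based setting they would act only as reference guidance, not as the final motion.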

Paper Structure

This paper contains 54 sections, 21 equations, 13 figures, 11 tables.

Figures (13)

  • Figure 1: In this work, we design a framework that generates diverse, physically feasible full-body human reaching and grasping motions using only brief walking MoCap data.
  • Figure 2: Comparison of our modified critic architecture.
  • Figure 3: t-SNE plots of features extracted at different levels of the critic network: There is clear clustering within the MoCap data in shallow layers and this phenomenon is less evident in deeper layers.
  • Figure 4: Overview of our framework: We propose a pipeline that generates diverse reaching and grasping motions from brief walking MoCap data through multi-iteration training. In each iteration, with the imitation and discovery objectives specified by the discriminator and the encoder respectively, we first train a low-level policy $\pi_{L}(a|z,s)$ to map a latent variable $z$ to motions in the dataset. Next, using a task-specific reward, a high-level policy $\pi_H(z|s)$ is trained to select $z$ so that the low-level policy outputs actions for downstream tasks. After the first iteration, the motion space (represented by the low-level policy) contains only limited walking motions, restricting performance in challenging reaching and grasping tasks. To address this, we estimate the performance, identify hard cases, and actively generate interpolated data to augment the dataset. In subsequent iterations, we fine-tune the low-level policy on the augmented dataset to expand the motion space, using a feature alignment mechanism to regularize the output motions and provide an additional reward $r^{feats}$. We then train the next-iteration high-level policy, and this iterative process continues until we achieve satisfactory results.
  • Figure 5: Visualization of the overall task compared to baselines: We visualize our method and the baselines. Our method yields natural reaching and grasping in various scenes and tasks, while the baselines show significantly unnatural movements.
  • ...and 8 more figures
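
The hard-case identification step described in the Figure 4 caption can be sketched as a simple budget allocation: tasks whose estimated success rate falls below a threshold receive new interpolated data in proportion to their failure rate. The threshold, the proportional rule, and the function name are assumptions made for illustration; the caption does not state how hard cases are scored or how much data is generated for each.

```python
def allocate_generation_budget(success_rates, budget, threshold=0.5):
    """Split a data-generation budget across hard tasks.

    success_rates: dict mapping a task id to its estimated success rate in [0, 1].
    budget: total number of new motions to generate (assumed to be an integer).
    threshold: tasks below this success rate count as hard (assumption).
    Returns a dict mapping each hard task to its number of new motions.
    """
    # Weight each hard task by how often the current policy fails on it.
    weights = {task: 1.0 - rate
               for task, rate in success_rates.items() if rate < threshold}
    total = sum(weights.values())
    if total == 0:
        return {}
    return {task: int(round(budget * w / total)) for task, w in weights.items()}
```

For example, with hypothetical success rates of 0.9, 0.2, and 0.4 for three task groups and a budget of 14 motions, only the last two groups are treated as hard, and the group with the lower success rate receives the larger share.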