
Modeling Dynamic Hand-Object Interactions with Applications to Human-Robot Handovers

Sammy Christen

TL;DR

This dissertation advances the modeling of dynamic hand-object interactions (HOI) by introducing two core tasks, dynamic grasp synthesis (D-Grasp) and bimanual grasping and articulation (ArtiGrasp), both addressed with reinforcement learning in a physics-based simulation. It then applies these HOI synthesis capabilities to human-to-robot handovers, proposing end-to-end training frameworks that leverage simulated human-in-the-loop data and synthetic motion generation (SynH2R) to scale training diversity. A two-stage teacher-student framework enables sim-to-real transfer of handover policies. SynH2R generates large-scale synthetic handover motions, enabling generalization to unseen objects and motions, and real-robot experiments show promising sim-to-real transfer. User studies show that policies trained on purely synthetic data perform on par with those trained on real motion data, supporting the feasibility of synthetic data for scalable human-robot collaboration. Overall, the dissertation demonstrates physically plausible 4D HOI synthesis and its practical utility for scalable, human-aware robotic systems in embodied AI, human-robot interaction (HRI) training, and data augmentation for perception and planning.

Abstract

Humans frequently grasp, manipulate, and move objects. Interactive systems assist humans in these tasks, enabling applications in Embodied AI, human-robot interaction, and virtual reality. However, current methods in hand-object synthesis often neglect dynamics and focus on generating static grasps. The first part of this dissertation introduces dynamic grasp synthesis, where a hand grasps and moves an object to a target pose. We approach this task using physical simulation and reinforcement learning. We then extend this to bimanual manipulation and articulated objects, requiring fine-grained coordination between hands. In the second part of this dissertation, we study human-to-robot handovers. We integrate captured human motion into simulation and introduce a student-teacher framework that adapts to human behavior and transfers from sim to real. To overcome data scarcity, we generate synthetic interactions, increasing training diversity by 100x. Our user study finds no difference between policies trained on synthetic vs. real motions.
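The student-teacher idea can be illustrated with a generic policy-distillation loop. The sketch below is a hypothetical toy, not the dissertation's training scheme: a teacher with privileged access to the true object offset labels every state, while a student that only sees a noisy observation is refit on the aggregated data after each round of its own rollouts (a DAgger-style scheme). The one-dimensional state, the linear policies, `gain`, and `noise` are all illustrative assumptions.

```python
import random


def teacher_action(true_offset: float, gain: float = 0.8) -> float:
    # Hypothetical privileged teacher: sees the exact object offset and
    # pushes it toward zero with a proportional gain.
    return -gain * true_offset


def observe(true_offset: float, rng: random.Random, noise: float) -> float:
    # Hypothetical student input: the offset corrupted by Gaussian noise,
    # standing in for an imperfect real-world observation.
    return true_offset + rng.gauss(0.0, noise)


def distill(rounds: int = 5, steps: int = 50, noise: float = 0.01, seed: int = 0) -> float:
    """Return the weight w of the student policy: action = w * observation."""
    rng = random.Random(seed)
    w = 0.0  # untrained student
    observations: list[float] = []
    labels: list[float] = []
    for _ in range(rounds):
        x = rng.uniform(-1.0, 1.0)  # initial true offset for this rollout
        for _ in range(steps):
            o = observe(x, rng, noise)
            observations.append(o)
            labels.append(teacher_action(x))  # teacher labels the visited state
            x += w * o  # ...but the student's own action drives the rollout
        # Refit the student on the aggregated dataset (1D least squares, no bias).
        num = sum(o * a for o, a in zip(observations, labels))
        den = sum(o * o for o in observations)
        w = num / den
    return w


print(round(distill(), 3))  # close to the teacher's gain of -0.8
```

Because the student is trained on the states its own policy visits, it learns to act from observations only; in a real handover system the observation would be something like a perception-based estimate rather than a noisy scalar.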

Dissertation Structure

This dissertation contains 166 sections, 53 equations, 39 figures, and 18 tables.

Figures (39)

  • Figure 1: Dynamic Grasp Synthesis Overview. We are given a task goal, such as a target 6D object pose, and a grasp reference of a static hand-object pose, which are passed to a model. The aim of dynamic grasp synthesis is to generate a sequence of hand and object poses that fulfill the task objective. We assume that the model's predictions are passed to an environment which returns an updated state of the hand and object.
  • Figure 2: Human-to-Robot Handover Overview. We set up a simulation environment (left) that contains a tabletop setting with different objects, a human hand model (green), and a robotic manipulator. The human grasps an object and moves it into a handover pose, from which the robot should securely grasp it and move it to a target location. A model (middle) is trained in this simulation environment before deployment to a real robotic platform (right).
  • Figure 3: The interaction between agent and environment in reinforcement learning (RL) [sutton1998introduction].
  • Figure 4: In deep reinforcement learning, both the Q-values and the policy can be approximated with neural networks.
  • Figure 5: Dynamic Grasp Synthesis: Our method learns diverse grasps from static grasp labels (shown in insets) that originate from existing datasets, grasp synthesis, or image-based estimates. Our approach then synthesizes diverse dynamic sequences with the objects in hand. We decompose the task into stable grasping, followed by the synthesis of a 3D global motion that moves the object into a 6D target pose. The hand pose is continuously adjusted to ensure a stable grasp, leading to physically plausible and human-like sequences.
  • ...and 34 more figures
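Figures 1, 3, and 4 describe the same loop: a model acts, the environment returns an updated state, and a value function scores actions. The sketch below is a hypothetical toy that makes this loop concrete with tabular Q-learning on a one-dimensional "move the held object to a target position" task. It is far simpler than the physics simulation, 6D target poses, and neural-network function approximators the figures refer to; `ToyObjectEnv`, the reward values, and all hyperparameters are illustrative assumptions.

```python
import random

ACTIONS = (-1, 0, 1)  # move the held object left, keep it, move it right


class ToyObjectEnv:
    """Hypothetical 1D stand-in for a physics simulator: the state is the
    discrete position of a held object, and the task goal is a target cell."""

    def __init__(self, size: int = 6, target: int = 4):
        self.size, self.target, self.pos = size, target, 0

    def reset(self, start: int = 0) -> int:
        self.pos = start
        return self.pos

    def step(self, action: int):
        # Apply the action, clip to the workspace, and return the updated
        # state, a reward, and whether the goal was reached.
        self.pos = min(max(self.pos + action, 0), self.size - 1)
        done = self.pos == self.target
        reward = 1.0 if done else -0.1  # hypothetical reward: goal bonus, step cost
        return self.pos, reward, done


def greedy(q, s):
    # Index of the action with the highest Q-value (ties -> lowest index).
    return max(range(len(ACTIONS)), key=lambda i: q[s][i])


def q_learning(env, episodes=500, alpha=0.5, gamma=0.9, eps=0.2, max_steps=20, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in range(env.size)]  # tabular Q(s, a)
    for _ in range(episodes):
        s = env.reset(start=rng.randrange(env.size))
        for _ in range(max_steps):
            # Epsilon-greedy exploration.
            a = rng.randrange(len(ACTIONS)) if rng.random() < eps else greedy(q, s)
            s_next, r, done = env.step(ACTIONS[a])
            # TD target: no bootstrapping from a terminal state.
            td_target = r if done else r + gamma * max(q[s_next])
            q[s][a] += alpha * (td_target - q[s][a])
            s = s_next
            if done:
                break
    return q


def greedy_rollout(env, q, start=0, max_steps=20):
    # Follow the learned greedy policy and record the visited positions.
    s = env.reset(start)
    path = [s]
    for _ in range(max_steps):
        s, _, done = env.step(ACTIONS[greedy(q, s)])
        path.append(s)
        if done:
            break
    return path


env = ToyObjectEnv()
q = q_learning(env)
print(greedy_rollout(env, q, start=0))
```

Replacing the table `q` with a neural network that maps states to Q-values or to action distributions gives the deep RL setting of Figure 4; in the actual task, continuous object poses and hand joint targets would replace the three discrete actions.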