Train Robots in a JIF: Joint Inverse and Forward Dynamics with Human and Robot Demonstrations

Gagan Khandate, Boxuan Wang, Sarah Park, Weizhe Ni, Joaquin Palacios, Kathyrn Lampo, Philippe Wu, Rosh Ho, Eric Chang, Matei Ciocarlie

TL;DR

The paper tackles data efficiency in robot manipulation by pre-training on multi-modal human demonstrations. It introduces Joint Inverse and Forward dynamics (JIF), learned in a latent space using a ViTacT encoder that fuses vision and touch, guided by a Dynamo loss and a teacher–student EMA to avoid latent collapse. A diffusion policy is then fine-tuned on a small set of robot demonstrations, achieving strong task success and generalization, particularly when tactile information from instrumented human demonstrations is available. The approach improves data efficiency and robustness for contact-rich manipulation, offering a scalable pathway toward broader imitation-learning foundations for robotics.
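To make the joint inverse/forward idea concrete, here is a minimal NumPy sketch of one loss evaluation and one EMA teacher update. It is an illustration under stated assumptions, not the paper's implementation: the linear encoders stand in for ViTacT, the cosine-distance objective is one plausible form of a latent prediction loss, and every name and dimension (`jif_loss`, `ema_update`, `OBS_DIM`, and so on) is hypothetical. The gradient step on the student is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM, ACTION_DIM = 8, 4, 2  # hypothetical sizes


def init(shape):
    """Small random weights for the toy linear models."""
    return rng.normal(scale=0.1, size=shape)


# Student (online) encoder and its EMA teacher (target) encoder.
# Linear maps are placeholders for the real vision+touch encoder.
student_W = init((OBS_DIM, LATENT_DIM))
teacher_W = student_W.copy()
# Inverse dynamics: (z_t, z_next) -> latent action (no action labels needed).
inverse_W = init((2 * LATENT_DIM, ACTION_DIM))
# Forward dynamics: (z_t, latent action) -> predicted z_next.
forward_W = init((LATENT_DIM + ACTION_DIM, LATENT_DIM))


def jif_loss(obs_t, obs_next):
    """Cosine distance between predicted and teacher-encoded next latents."""
    z_t = obs_t @ student_W
    z_next = obs_next @ student_W
    # Teacher output is the target; in training it would carry no gradient.
    target = obs_next @ teacher_W
    a_hat = np.concatenate([z_t, z_next], axis=-1) @ inverse_W
    z_pred = np.concatenate([z_t, a_hat], axis=-1) @ forward_W
    norms = np.linalg.norm(z_pred, axis=-1) * np.linalg.norm(target, axis=-1)
    cos = np.sum(z_pred * target, axis=-1) / (norms + 1e-8)
    return float(np.mean(1.0 - cos))


def ema_update(teacher, student, decay=0.99):
    """Move the teacher weights slowly toward the student weights."""
    return decay * teacher + (1.0 - decay) * student


# One evaluation on a random batch of consecutive observations.
obs_t = rng.normal(size=(5, OBS_DIM))
obs_next = rng.normal(size=(5, OBS_DIM))
loss = jif_loss(obs_t, obs_next)
teacher_W = ema_update(teacher_W, student_W)
```

Because the teacher changes slowly and supplies the prediction target, the student cannot trivially minimize the loss by mapping every observation to the same latent, which is the usual motivation for an EMA target in this kind of objective.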

Abstract

Pre-training on large datasets of robot demonstrations is a powerful technique for learning diverse manipulation skills but is often limited by the high cost and complexity of collecting robot-centric data, especially for tasks requiring tactile feedback. This work addresses these challenges by introducing a novel method for pre-training with multi-modal human demonstrations. Our approach jointly learns inverse and forward dynamics to extract latent state representations that are specific to manipulation. This enables efficient fine-tuning with only a small number of robot demonstrations, significantly improving data efficiency. Furthermore, our method allows for the use of multi-modal data, such as a combination of vision and touch for manipulation. By leveraging latent dynamics modeling and tactile sensing, this approach paves the way for scalable robot manipulation learning based on human demonstrations.

Paper Structure

This paper contains 18 sections, 9 equations, 6 figures, 1 table.

Figures (6)

  • Figure 2: Human and Robot Data Collection Setup. The robot setup includes three camera views (two side views and one wrist view) and a two-fingered gripper with tactile sensors embedded in the fingertips. The human data collection setup mirrors the robot's camera configuration, with tactile data collected using a fingertip cap device equipped with a Singletact capacitive sensor on the index finger and the thumb.
  • Figure 3: Objects used for pre-training, fine-tuning, and generalization evaluation. As illustrated, we pre-train using human demonstrations collected for all five objects but fine-tune the imitation learning model using only one object. To evaluate generalization, we test on the remaining four objects, making the evaluation out-of-distribution relative to the robot fine-tuning phase but in-distribution with respect to the human demonstrations used during pre-training.
  • Figure 4: Grasping success rate of our method against the baselines. While both pre-training approaches improve demonstration complexity (the number of robot demonstrations needed), JIF pre-training achieves the highest success rate.
  • Figure 5: The importance of instrumented human demonstrations with tactile sensing is evaluated using a challenging peg-in-hole insertion task. The figure illustrates both human and robot demonstrations during the placement of a cube into a square hole.
  • Figure 6: The results underscore the importance of instrumented human demonstrations with tactile sensing, as evidenced by a success rate comparison across different tactile sensing configurations. The left plot shows notable improvements in grasping success rates, while the right plot highlights the successful execution of the challenging peg-in-hole insertion task, albeit with a lower success rate. Incorporating tactile feedback, particularly in both human and robot demonstrations, enhances performance and achieves higher success rates with fewer robot demonstrations.
  • ...and 1 more figure