InteRACT: Transformer Models for Human Intent Prediction Conditioned on Robot Actions

Kushal Kedia, Atiksh Bhardwaj, Prithwish Dan, Sanjiban Choudhury

TL;DR

InteRACT predicts a human partner's future intent in collaborative manipulation by conditioning the prediction on the robot's planned actions. It uses a transformer-based architecture that is pre-trained on large-scale human-human data and transferred to human-robot data through a representation-alignment scheme, aided by CoMaD, a paired dataset collected via tele-operation. The approach introduces an action-conditioned decoding mechanism and two alignment losses to bridge human and robot representations. It improves accuracy over marginal baselines in both human-human and human-robot tasks, which supports safer, more proactive robot planning. The work contributes a novel dataset, an effective transfer learning pipeline, and practical insights for coordinating human-robot teams in real-world manipulation tasks.

Abstract

In collaborative human-robot manipulation, a robot must predict human intents and adapt its actions accordingly to smoothly execute tasks. However, the human's intent in turn depends on actions the robot takes, creating a chicken-or-egg problem. Prior methods ignore such inter-dependency and instead train marginal intent prediction models independent of robot actions. This is because training conditional models is hard given a lack of paired human-robot interaction datasets. Can we instead leverage large-scale human-human interaction data that is more easily accessible? Our key insight is to exploit a correspondence between human and robot actions that enables transfer learning from human-human to human-robot data. We propose a novel architecture, InteRACT, that pre-trains a conditional intent prediction model on large human-human datasets and fine-tunes on a small human-robot dataset. We evaluate on a set of real-world collaborative human-robot manipulation tasks and show that our conditional model improves over various marginal baselines. We also introduce new techniques to tele-operate a 7-DoF robot arm and collect a diverse range of human-robot collaborative manipulation data, which we open-source.
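
The distinction between marginal and conditional intent prediction can be written out explicitly. The following is a sketch in the notation of the Figure 2 caption (scene history $\phi$, future robot action $a_{R}$, human intent $z_{H}$); the paper's own equations may be stated differently, and treating $a_{R}$ as a discrete variable is an assumption made here only to keep the sum simple.

$$
\underbrace{P(z_{H} \mid \phi)}_{\text{marginal model}}
\;=\; \sum_{a_{R}} \underbrace{P(z_{H} \mid \phi, a_{R})}_{\text{conditional model}} \, P(a_{R} \mid \phi)
$$

A marginal model therefore averages the human's response over every action the robot might take, whereas a conditional model can be queried with the specific action the robot is planning. This is why the conditional form is the useful one inside a planner: each candidate $a_{R}$ yields its own prediction of $z_{H}$.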

Paper Structure (13 sections, 5 equations, 7 figures)

Figures (7)

  • Figure 1: We present InteRACT, a model that predicts future human intent conditioned on the future robot action. Left: When a human passes an object over, InteRACT conditions on the future object handover action of one human and predicts that the other human will move towards it. Right: In this human-robot interaction, given the robot's plan to reach for the can on the right, InteRACT predicts the human will reach for the pepper. We transfer a model trained on human-human interactions to human-robot interactions.
  • Figure 2: InteRACT Model Architecture. The scene history $\phi$ is encoded by the local and global transformer encoders. The future action $a_{R}$ of the robot is passed as a query to the transformer decoder to generate an action-conditioned human intent prediction $z_{H}$. The robot pose embeddings are aligned with paired human pose embeddings via an alignment loss.
  • Figure 3: Collaborative Manipulation Dataset (CoMaD) consists of Human-Human and Human-Robot interaction data. We collect data on three different H-H tasks and three different H-R tasks across several subjects. The bottom right image shows our tele-operation setup for paired human-robot data collection.
  • Figure 4: All Joints Final Displacement Error (FDE) across all tasks in CoMaD H-H. InteRACT predictions have the lowest FDE.
  • Figure 5: Top: Final Displacement Error (FDE) of all joints over time in a test-set episode of object handover. Highlighted windows indicate all object handovers in the episode, where we observe higher errors for Marginal. Bottom: Visualizations of the predictions when the error is at its peak (1 s before the RGB image) show that InteRACT anticipates the other human's action and moves towards the handover location.
  • ...and 2 more figures
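
The Final Displacement Error reported in Figures 4 and 5 can be illustrated with a short sketch. It uses the standard definition of FDE (Euclidean distance between predicted and ground-truth positions at the last forecast timestep) and assumes that "all joints" means an unweighted mean over joints; the paper's exact joint set, units, and averaging scheme are not given in this section. The function name and the nested-list data layout are hypothetical.

```python
import math


def final_displacement_error(pred, target):
    """Mean Euclidean distance over joints at the final forecast timestep.

    pred, target: list of timesteps; each timestep is a list of joints;
    each joint is an (x, y, z) tuple. (Layout is an assumption.)
    """
    if not pred or len(pred) != len(target):
        raise ValueError("pred and target need the same non-zero horizon")
    last_pred, last_true = pred[-1], target[-1]
    if not last_pred or len(last_pred) != len(last_true):
        raise ValueError("pred and target need the same non-zero joint count")
    # Only the last timestep matters for FDE; earlier steps are ignored.
    dists = [math.dist(p, t) for p, t in zip(last_pred, last_true)]
    return sum(dists) / len(dists)


# Two timesteps, two joints. At the final step, joint 0 is off by a
# 3-4-5 triangle (distance 5) and joint 1 is exact (distance 0).
pred = [[(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
        [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]]
true = [[(9.0, 9.0, 9.0), (9.0, 9.0, 9.0)],
        [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]]
print(final_displacement_error(pred, true))  # 2.5
```

Because only the final timestep enters the metric, FDE rewards a forecast that ends in the right place even if intermediate poses drift, which suits intent prediction where the target location (for example, a handover point) is what matters.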