InteRACT: Transformer Models for Human Intent Prediction Conditioned on Robot Actions
Kushal Kedia, Atiksh Bhardwaj, Prithwish Dan, Sanjiban Choudhury
TL;DR
InteRACT predicts a human partner's future intent in collaborative manipulation by conditioning the prediction on the robot's planned actions. It uses a transformer-based architecture that is pre-trained on large-scale human-human data and transferred to human-robot data through a representation-alignment scheme, aided by CoMaD, a paired dataset collected via tele-operation. The approach introduces an action-conditioned decoding mechanism and two alignment losses that bridge human and robot representations. It improves accuracy over marginal baselines on both human-human and human-robot tasks, which enables safer, more proactive robot planning. The work contributes a new dataset, a transfer learning pipeline, and practical insights for coordinating human-robot teams in real-world manipulation tasks.
Abstract
In collaborative human-robot manipulation, a robot must predict human intents and adapt its actions accordingly to smoothly execute tasks. However, the human's intent in turn depends on actions the robot takes, creating a chicken-or-egg problem. Prior methods ignore such inter-dependency and instead train marginal intent prediction models independent of robot actions. This is because training conditional models is hard given a lack of paired human-robot interaction datasets. Can we instead leverage large-scale human-human interaction data that is more easily accessible? Our key insight is to exploit a correspondence between human and robot actions that enables transfer learning from human-human to human-robot data. We propose a novel architecture, InteRACT, that pre-trains a conditional intent prediction model on large human-human datasets and fine-tunes on a small human-robot dataset. We evaluate on a set of real-world collaborative human-robot manipulation tasks and show that our conditional model improves over various marginal baselines. We also introduce new techniques to tele-operate a 7-DoF robot arm and collect a diverse range of human-robot collaborative manipulation data, which we open-source.
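The difference between the marginal and conditional models described above can be written compactly. The notation here is an assumption introduced for illustration and is not taken from the abstract: \phi denotes the observed interaction history, \xi_H the human's future motion, \xi_R the robot's planned future actions, and \xi_{H'} the second human's future motion in a human-human dataset.

```latex
% Marginal intent prediction: ignores what the robot plans to do
P_\theta(\xi_H \mid \phi)

% Conditional intent prediction: the human's future depends on the robot's plan
P_\theta(\xi_H \mid \phi, \xi_R)

% Transfer idea (sketch): pre-train with a human partner in the conditioning slot,
% then fine-tune with the robot in that slot
\underbrace{P_\theta(\xi_H \mid \phi, \xi_{H'})}_{\text{pre-train on human-human data}}
\;\longrightarrow\;
\underbrace{P_\theta(\xi_H \mid \phi, \xi_R)}_{\text{fine-tune on human-robot data}}
```

Under this reading, the correspondence between human and robot actions is what allows \xi_{H'} and \xi_R to share the same conditioning input, so the model learned from human-human data can be reused with a small amount of human-robot data.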
