Learning Manipulation by Predicting Interaction

Jia Zeng, Qingwen Bu, Bangjun Wang, Wenke Xia, Li Chen, Hao Dong, Haoming Song, Dong Wang, Di Hu, Ping Luo, Heming Cui, Bin Zhao, Xuelong Li, Yu Qiao, Hongyang Li

TL;DR

MPI is an interaction-oriented pre-training framework for robotic manipulation. It learns how to interact and where to interact by predicting unseen transition frames and detecting interaction objects from keyframes, conditioned on language. The approach combines a multi-modal transformer encoder with causality modeling and two decoders, a Prediction Transformer and a Detection Transformer, which jointly optimize the two complementary tasks and reinforce each other through cross-attention. Evaluations on real-world robots, Franka Kitchen, Meta-World, and a grounding task show state-of-the-art performance and robustness to distractions and variations, with ablations highlighting the benefits of keyframe-based data, decoupled encoders, and the joint decoder design. The work advances data-efficient, interaction-aware representation learning for visuomotor control and vision-language robotics, with publicly available code and checkpoints.

Abstract

Representation learning approaches for robotic manipulation have boomed in recent years. Due to the scarcity of in-domain robot data, prevailing methodologies tend to leverage large-scale human video datasets to extract generalizable features for visuomotor policy learning. Despite the progress achieved, prior endeavors disregard the interactive dynamics that capture behavior patterns and physical interaction during the manipulation process, resulting in an inadequate understanding of the relationship between objects and the environment. To this end, we propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction (MPI) and enhances the visual representation. Given a pair of keyframes representing the initial and final states, along with language instructions, our algorithm predicts the transition frame and detects the interaction object, respectively. These two learning objectives achieve superior comprehension towards "how-to-interact" and "where-to-interact". We conduct a comprehensive evaluation of several challenging robotic tasks. The experimental results demonstrate that MPI exhibits remarkable improvement by 10% to 64% compared with previous state-of-the-art in real-world robot platforms as well as simulation environments. Code and checkpoints are publicly shared at https://github.com/OpenDriveLab/MPI.
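
To make the two objectives in the abstract concrete, the sketch below shows one way a pre-training sample and a joint loss could be laid out. It is an illustration under stated assumptions, not the released implementation: the names `PretrainSample`, `joint_loss`, and `det_weight` are hypothetical, frames are flattened lists of floats, the box is a normalized (cx, cy, w, h) tuple, and mean-squared error and L1 are stand-ins for whatever losses the paper actually uses.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PretrainSample:
    """Hypothetical container for one pre-training example."""
    initial_frame: List[float]     # keyframe showing the initial state (input)
    final_frame: List[float]       # keyframe showing the final state (input)
    instruction: str               # language instruction (input)
    transition_frame: List[float]  # unseen frame to predict ("how-to-interact")
    object_box: Sequence[float]    # interaction object box ("where-to-interact")


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean-squared error, used here as an assumed frame-prediction loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def l1(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error, used here as an assumed box-regression loss."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def joint_loss(pred_frame: Sequence[float],
               pred_box: Sequence[float],
               sample: PretrainSample,
               det_weight: float = 1.0) -> float:
    """Sum of the prediction and detection terms; det_weight is an assumed knob."""
    prediction_term = mse(pred_frame, sample.transition_frame)
    detection_term = l1(pred_box, sample.object_box)
    return prediction_term + det_weight * detection_term


# Toy example with a two-pixel "image" and a single box.
sample = PretrainSample(
    initial_frame=[0.0, 0.0],
    final_frame=[1.0, 1.0],
    instruction="open the drawer",
    transition_frame=[0.0, 1.0],
    object_box=(0.5, 0.5, 0.2, 0.2),
)
loss = joint_loss(pred_frame=[0.0, 0.0], pred_box=(0.5, 0.5, 0.2, 0.4), sample=sample)
```

Summing the two terms means a single backward pass updates the shared encoder from both tasks, which is the sense in which the objectives are optimized jointly in this sketch.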

Paper Structure (60 sections, 4 equations, 11 figures, 14 tables)

Figures (11)

  • Figure 1: MPI is an interaction-oriented representation learning pipeline for robotic manipulation. Unlike prior work based on (a) Contrastive Learning, (b) Masked Signal Modeling, or (c) Video Prediction using random frames, our proposed approach in (d) trains the model to predict transition frames and detect manipulated objects with keyframes as input. As such, the model develops a better comprehension of "how-to-interact" and "where-to-interact". MPI acquires more informative representations during pre-training and achieves evident improvement across downstream tasks.
  • Figure 2: The pipeline for pre-training. MPI comprises a multi-modal transformer encoder and a transformer decoder designed for predicting the image of the target interaction state and detecting interaction objects, respectively. We achieve synergistic modeling and optimization of the two tasks through information exchange between the prediction and detection transformers. The decoder is used only during the pre-training phase and is discarded for downstream adaptation.
  • Figure 3: Real-world robot experiments. (a) Illustrations of real-world experiments in the kitchen environment. (b) Detailed success rates of ten tasks against a clean background. (c) Results of five tasks in the complex kitchen environment. (d) MPI outperforms the previous state-of-the-art with an average improvement of 26.3% in success rate across 15 tasks.
  • Figure 4: Illustration of real-world validation on generalization to (b) background (BG.) distraction when we put a banana into the drawer, and (c) object variation when we lift the lid.
  • Figure 5: Franka Kitchen simulation environment as defined in R3M (nair2023r3m). In this environment, we include tasks of turning the stovetop knob, opening the microwave, sliding the right door open, turning on the light, and opening the left door. All tasks are trained with 25 demonstrations.
  • ...and 6 more figures