Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning

Jiachen Li, Qiaozi Gao, Michael Johnston, Xiaofeng Gao, Xuehai He, Suhaila Shakiah, Hangjie Shi, Reza Ghanadan, William Yang Wang

TL;DR

This work addresses robotic manipulation with multimodal prompts by introducing MIDAS, a decoder-only policy trained through inverse dynamics pretraining followed by multi-task finetuning. Two key design choices are a multimodal prompt encoder augmented with a residual connection to preserve fine-grained visual cues, and action tokens, one per action dimension, decoded autoregressively to capture dependencies between initial and target poses. Empirical results on VIMA-BENCH show state-of-the-art performance (a 10% improvement in success rate) and strong in-context learning capabilities, including improved generalization to unseen tasks with in-prompt demonstrations. The approach advances multimodal understanding in robotics, enabling more robust instruction following and potential for richer human-robot collaboration.

Abstract

Prompt-based learning has been demonstrated as a compelling paradigm contributing to the tremendous success of large language models (LLMs). Inspired by their success in language tasks, existing research has leveraged LLMs in embodied instruction following and task planning. In this work, we tackle the problem of training a robot to understand multimodal prompts that interleave vision signals with text descriptions. This type of task poses a major challenge to a robot's capability to understand the interconnection and complementarity between vision and language signals. We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts from multi-task expert trajectories. Our method consists of a two-stage training pipeline that performs inverse dynamics pretraining and multi-task finetuning. To facilitate multimodal understanding, we design our multimodal prompt encoder by augmenting a pretrained LM with a residual connection to the visual input, and we model the dependencies among action dimensions. Empirically, we evaluate the efficacy of our method on VIMA-BENCH and establish a new state of the art (10% improvement in success rate). Moreover, we demonstrate that our model exhibits remarkable in-context learning ability. Project page: https://midas-icml.github.io/
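
The residual connection in the prompt encoder can be illustrated with a minimal sketch. This is not the authors' implementation: `encode_prompt`, `toy_lm`, and the `is_object` mask are hypothetical names, the pretrained LM is replaced by a toy function that averages all tokens, and adding the residual only at object-token positions is an assumption. The point is only that adding the input object tokens back onto the LM output keeps per-object detail that the LM alone may wash out.

```python
from typing import Callable, List

Vector = List[float]


def encode_prompt(
    tokens: List[Vector],
    is_object: List[bool],
    lm: Callable[[List[Vector]], List[Vector]],
) -> List[Vector]:
    """Run the LM over the prompt, then add each input object token
    back onto the LM output at the same position (residual connection)."""
    lm_out = lm(tokens)
    encoded = []
    for tok, out, obj in zip(tokens, lm_out, is_object):
        if obj:
            # Residual path: LM output + original object token.
            encoded.append([o + t for o, t in zip(out, tok)])
        else:
            # Text tokens pass through the LM unchanged by the residual.
            encoded.append(list(out))
    return encoded


def toy_lm(tokens: List[Vector]) -> List[Vector]:
    """Hypothetical stand-in for a pretrained LM: every position gets the
    mean of all tokens, so per-token detail is lost on purpose."""
    n, dim = len(tokens), len(tokens[0])
    mean = [sum(t[d] for t in tokens) / n for d in range(dim)]
    return [mean[:] for _ in tokens]


prompt = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # text, object, text
mask = [False, True, False]
encoded = encode_prompt(prompt, mask, toy_lm)
```

In this toy run the two text positions receive the same averaged vector, while the object position additionally carries its original embedding, which is the effect the residual connection is meant to have.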

Paper Structure (24 sections, 4 equations, 11 figures, 23 tables, 1 algorithm)

This paper contains 24 sections, 4 equations, 11 figures, 23 tables, 1 algorithm.

Figures (11)

  • Figure 1: Model architecture of MIDAS. Our model adopts a decoder-only architecture. The multimodal prompt embeddings are concatenated with history observation and action tokens. We model each action dimension as an individual token and predict the tokens autoregressively.
  • Figure 2: (a) The Object Encoder proposed in VIMA consists of a ViT (Dosovitskiy et al., 2020) that extracts visual embeddings from cropped object images and an MLP that encodes bounding boxes. The two embeddings are concatenated before passing through a Fusion MLP to get the object tokens. (b) The Multimodal Prompt Encoder adds a residual connection (RC) from the input object tokens to the pretrained LM output.
  • Figure 3: Task samples from VIMA-BENCH. We refer readers to Appendix B of the VIMA paper (Jiang et al., 2023) for detailed task descriptions.
  • Figure 4: Given any robot trajectory, we can always formulate a motion-following task that requires the agent to replicate the demonstration trajectory.
  • Figure 5: At $t = 2$, the robot should move either the heart or the cross block. As the policy predicts each action dimension independently, different dimensions do not consistently manipulate the same object, resulting in a task failure.
  • ...and 6 more figures
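
The contrast drawn in the captions of Figures 1 and 5, between predicting action dimensions independently and decoding them one token at a time, can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's decoder: `POSES`, `predict_next`, `decode_autoregressive`, and `decode_independent` are hypothetical names, the learned model is replaced by a lookup table, and an action is reduced to two dimensions.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical 2-D poses for the two candidate blocks from the Figure 5 scenario.
POSES: Dict[str, Tuple[int, int]] = {"heart": (2, 5), "cross": (7, 1)}


def decode_independent(per_dim_choice: List[str]) -> List[int]:
    """Each dimension picks its own object with no knowledge of the others,
    so dimensions may end up referring to different objects."""
    return [POSES[name][d] for d, name in enumerate(per_dim_choice)]


def predict_next(context: Dict[str, str], prefix: List[int]) -> int:
    """Toy stand-in for the learned decoder: the first dimension commits to
    an object, later dimensions are chosen to agree with the decoded prefix."""
    dim = len(prefix)
    if dim == 0:
        return POSES[context["first_choice"]][0]
    for pose in POSES.values():
        if list(pose[:dim]) == prefix:
            return pose[dim]
    raise ValueError("prefix matches no known pose")


def decode_autoregressive(
    context: Dict[str, str],
    num_dims: int,
    step: Callable[[Dict[str, str], List[int]], int],
) -> List[int]:
    """Decode one action-dimension token at a time, feeding back the prefix."""
    action: List[int] = []
    for _ in range(num_dims):
        action.append(step(context, action))
    return action


mixed = decode_independent(["heart", "cross"])
consistent = decode_autoregressive({"first_choice": "heart"}, 2, predict_next)
```

Here `mixed` combines the x-coordinate of one block with the y-coordinate of the other and so matches neither object, while `consistent` stays on the block chosen by the first token because each later dimension is conditioned on what was already decoded.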