Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals

Moritz Reuss, Ömer Erdinç Yağmurlu, Fabian Wenzel, Rudolf Lioutikov

TL;DR

MDT targets versatile, long-horizon robotic manipulation from multimodal goals when language annotations are sparse. It combines a diffusion-based transformer backbone with a latent, modality-agnostic goal representation and two self-supervised losses: Contrastive Latent Alignment, which aligns image- and language-goal embeddings, and Masked Generative Foresight, which forecasts future states. The approach achieves state-of-the-art results on the CALVIN and LIBERO benchmarks, remains data-efficient with less than 2% language annotations, and is validated in real-robot experiments on partially labeled play data. The work highlights a scalable path for multimodal goal-conditioned policy learning and points to future pretraining on large, partially labeled datasets.

Abstract

This work introduces the Multimodal Diffusion Transformer (MDT), a novel diffusion policy framework that excels at learning versatile behavior from multimodal goal specifications with few language annotations. MDT leverages a diffusion-based multimodal transformer backbone and two self-supervised auxiliary objectives to master long-horizon manipulation tasks based on multimodal goals. The vast majority of imitation learning methods learn from a single goal modality only, e.g. either language or goal images. However, existing large-scale imitation learning datasets are only partially labeled with language annotations, which prevents current methods from learning language-conditioned behavior from these datasets. MDT addresses this challenge by introducing a latent goal-conditioned state representation that is trained simultaneously on multimodal goal instructions. This state representation aligns image- and language-based goal embeddings and encodes sufficient information to predict future states. The representation is trained via two self-supervised auxiliary objectives, which enhance the performance of the presented transformer backbone. MDT shows exceptional performance on 164 tasks provided by the challenging CALVIN and LIBERO benchmarks, including a LIBERO version that contains less than $2\%$ language annotations. Furthermore, MDT establishes a new record on the CALVIN manipulation challenge, demonstrating an absolute performance improvement of $15\%$ over prior state-of-the-art methods that require large-scale pretraining and contain $10\times$ more learnable parameters. MDT demonstrates its ability to solve long-horizon manipulation from sparsely annotated data in both simulated and real-world environments. Demonstrations and code are available at https://intuitive-robots.github.io/mdt_policy/.
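
The alignment of image- and language-based goal embeddings described in the abstract can be illustrated with a generic contrastive objective. The following is a minimal sketch, assuming a symmetric InfoNCE loss over a batch of paired latent vectors; the function name `info_nce_alignment`, the temperature value, and the use of one pooled vector per state are illustrative assumptions, not the exact loss used by MDT.

```python
import numpy as np

def info_nce_alignment(img_goal_latents, lang_goal_latents, temperature=0.1):
    """Symmetric InfoNCE loss between two batches of latent vectors.

    Assumption: row i of both arrays encodes the same state, conditioned
    once on an image goal and once on a language goal. Matching rows are
    positives; all other rows in the batch act as negatives.
    """
    # L2-normalize so the dot product becomes cosine similarity.
    a = img_goal_latents / np.linalg.norm(img_goal_latents, axis=1, keepdims=True)
    b = lang_goal_latents / np.linalg.norm(lang_goal_latents, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # shape (batch, batch)

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i.
        return -np.mean(np.diag(log_probs))

    # Average the image-to-language and language-to-image directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))
```

In this sketch the loss is small when each image-goal latent is closest to its own language-goal counterpart, and it grows when the pairs are mismatched. Only states that carry both goal types can contribute pairs, which is consistent with the sparse-annotation setting described above.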

Paper Structure (26 sections, 8 equations, 10 figures, 13 tables, 2 algorithms)


Figures (10)

  • Figure 1: (Left) Overview of the proposed multimodal Transformer-Encoder-Decoder Diffusion Policy used in MDT. (Right) Specialized Diffusion Transformer Block for the denoising of the action sequence. MDT learns a goal-conditioned latent state representation from multiple image observations and multimodal goals. The camera images are processed either via frozen Voltron Encoders with a Perceiver or via ResNets. The separate GPT denoising module iteratively denoises an action sequence of $10$ steps with a Transformer Decoder with Causal Attention. It consists of several Denoising Blocks, as visualized on the right side. These blocks process noisy action tokens with self-attention and fuse the conditioning information from the latent state representation via cross-attention. MDT applies adaLN conditioning (Peebles & Xie, 2023) to condition the blocks on the current noise level. In addition, it aligns the latent representation tokens of the same state with different goal specifications using self-supervised contrastive learning. The latent representation tokens are also used as a context input for the masked Image Decoder module to reconstruct masked-out patches from future images.
  • Figure 2: The Masked Generative Foresight Auxiliary Task enhances the model. It starts by encoding the current observation and goal using the Encoder. The resulting latent state representations then serve as conditional inputs for the Future Image-Decoder. This decoder receives encoded patches of future camera images along with mask tokens. Its task is to reconstruct the occluded patches in future frames.
  • Figure 3: Overview of the different environments used to test MDT: (Left) CALVIN Benchmark consisting of four environments, each with unique positions and textures for slider, drawer, LED, and lightbulb. (Middle) Overview of the different tasks and scene diversity in the LIBERO benchmark, which is divided into $5$ different task suites. (Right) Example tasks from the real robot experiments at a toy kitchen, where models are tested after training on partially labeled play data.
  • Figure 4: Study on the performance of our proposed Masked Generative Foresight Loss and the Contrastive Latent Alignment Loss for our proposed policy. We analyse the impact of both auxiliary tasks on the ABCD CALVIN challenge. The results show the average rollout length over 1000 instruction chains averaged over 3 seeds.
  • Figure 5: Study on the performance of our proposed Masked Generative Foresight and Contrastive Latent Alignment objectives for pretraining on action-free data. We pretrain MDT on LIBERO-90 with these objectives and test the average performance on all LIBERO-Long tasks with different numbers of demonstrations. The results show the success rate averaged over 20 rollouts for all 10 tasks and 3 seeds. LfS refers to models trained from scratch and PrT to pretrained models.
  • ...and 5 more figures
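
The Denoising Block described in the Figure 1 caption (adaLN conditioning on the noise level, causal self-attention over noisy action tokens, cross-attention to the latent state tokens) can be sketched as follows. This is a simplified single-head NumPy illustration under stated assumptions: the parameter names in `params`, the placement of adaLN before self-attention only, and the omission of the feed-forward layer and multi-head projections are all choices made for brevity, not details taken from the paper.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token over its feature dimension (no learned affine).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, causal=False):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # Token i may only attend to tokens 0..i.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

def denoising_block(actions, latents, noise_emb, params):
    """One simplified denoising block.

    actions:   (T, d) noisy action tokens
    latents:   (N, d) latent state representation tokens
    noise_emb: (e,)   embedding of the current noise level
    params:    dict of hypothetical weight matrices (illustrative names)
    """
    # adaLN: predict a per-feature shift and scale from the noise embedding.
    shift, scale = np.split(noise_emb @ params["w_ada"], 2)
    h = layer_norm(actions) * (1.0 + scale) + shift
    # Causal self-attention over the action sequence, with residual.
    h = attention(h @ params["wq"], h @ params["wk"], h @ params["wv"], causal=True)
    actions = actions + h
    # Cross-attention: action tokens query the latent state tokens.
    h = layer_norm(actions)
    h = attention(h @ params["cq"], latents @ params["ck"], latents @ params["cv"])
    return actions + h

def make_params(d, e, rng):
    # Hypothetical random initialization, for demonstration only.
    p = {name: 0.1 * rng.normal(size=(d, d))
         for name in ("wq", "wk", "wv", "cq", "ck", "cv")}
    p["w_ada"] = 0.1 * rng.normal(size=(e, 2 * d))
    return p
```

With `T = 10` action tokens, this mirrors the action sequence length mentioned in the caption. Because of the causal mask, changing a later action token leaves the outputs for earlier tokens unchanged, which is the property the causal attention in the caption refers to.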