Grounding Multimodal Large Language Models in Actions

Andrew Szot, Bogdan Mazoure, Harsh Agrawal, Devon Hjelm, Zsolt Kira, Alexander Toshev

TL;DR

This work studies how to ground Multimodal Large Language Models (MLLMs) in embodied action spaces using a unified Action Space Adapter (ASA) framework. It systematically compares seven ASAs across five embodied environments and finds that a learned tokenization (RVQ) handles continuous actions best, while semantically aligned language tokens (SemLang) work best for discrete actions. The study reports strong results, including $84\%$ on Meta-World, $72\%$ on CALVIN, and $51\%$ on LangR, indicating that exploiting the MLLM’s multimodal knowledge through appropriate tokenization or semantic grounding yields significant performance gains. The findings advance practical grounding of MLLMs for embodied tasks and point to avenues for improving real-world deployment: broader MLLM baselines, more diverse training regimes, and safety considerations.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated a wide range of capabilities across many domains, including Embodied AI. In this work, we study how to best ground an MLLM into different embodiments and their associated action spaces, with the goal of leveraging the multimodal world knowledge of the MLLM. We first generalize a number of methods through a unified architecture and the lens of action space adapters. For continuous actions, we show that a learned tokenization allows for sufficient modeling precision, yielding the best performance on downstream tasks. For discrete actions, we demonstrate that semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance. We arrive at these lessons via a thorough study of seven action space adapters on five different environments, encompassing over 114 embodied tasks.
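
To make the discrete-action idea concrete, the following is a minimal sketch of mapping actions to semantically related language tokens. It is an illustration under stated assumptions, not the paper's implementation: the action names, their text descriptions, and the whitespace "tokenizer" with its toy vocabulary are all hypothetical stand-ins for an MLLM's real tokenizer.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical discrete actions paired with hand-written text descriptions.
ACTION_TEXT: Dict[str, str] = {
    "pick(apple)": "pick apple",
    "pick(cup)": "pick cup",
    "place(sink)": "place sink",
    "nav(fridge)": "navigate fridge",
}


def build_vocab(texts: Iterable[str]) -> Dict[str, int]:
    """Toy whitespace vocabulary; a real system would reuse the MLLM tokenizer."""
    vocab: Dict[str, int] = {}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab


VOCAB = build_vocab(ACTION_TEXT.values())


def action_to_tokens(action: str) -> List[int]:
    """Encode an action as the token ids of its language description."""
    return [VOCAB[word] for word in ACTION_TEXT[action].split()]


# Reverse lookup used to decode a predicted token sequence into an action.
TOKENS_TO_ACTION: Dict[Tuple[int, ...], str] = {
    tuple(action_to_tokens(a)): a for a in ACTION_TEXT
}


def tokens_to_action(tokens: List[int]) -> Optional[str]:
    """Return the matching action, or None if the sequence is not a valid action."""
    return TOKENS_TO_ACTION.get(tuple(tokens))


if __name__ == "__main__":
    for name in ACTION_TEXT:
        print(name, "->", action_to_tokens(name))
```

In this sketch, related actions such as `pick(apple)` and `pick(cup)` share the token for "pick", which is the kind of overlap with the language token space that a semantic alignment relies on.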

Paper Structure (29 sections, 3 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: We empirically analyze how to ground MLLMs in actions across 114 tasks in continuous and discrete action spaces. In each environment, we train a multi-task policy with different Action Space Adapters (ASAs) to re-parameterize the MLLM to output actions. For continuous actions, learning a tokenization with several tokens per action performs best (Residual VQ). For discrete actions, mapping actions to semantically related language tokens performs best (Semantic Tokenization).
  • Figure 2: Generic architecture studied here for adapting MLLMs for action-specific decision making. The MLLM takes the embedding of the task instruction, prompt, and visual tokens as input. The MLLM then autoregressively predicts a sequence of $m$ action tokens. These action tokens are then decoded into an environment-specific action.
  • Figure 3: Comparing ASAs for continuous and discrete action spaces across 5 environments. For continuous actions, the RVQ tokenization performs best. For discrete actions, SemLang performs best. Each bar gives the average over all tasks in the environment, with the full breakdown in \ref{sec:per-task}.
  • Figure 4: (a, b) show the effect of the number of codes in the codebook for RVQ and VQ on (a) final policy success rate and (b) reconstruction of unseen action trajectories. (c, d) show the effect of the number of codebooks on (c) final policy success rate and (d) action reconstruction. All metrics are computed on Meta-World.
  • Figure 5: RVQ and VQ success rates per task grouping (defined in Supp. \ref{sec:task-groupings}) on CALVIN and Meta-World.
  • ...and 3 more figures
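
The captions for Figures 1, 2, and 4 describe encoding a continuous action as several tokens, one per codebook, and decoding those tokens back into an action. The sketch below illustrates the general residual vector quantization idea under loud assumptions: the two codebooks are hand-picked toy values rather than learned ones, and the function names are hypothetical, not taken from the paper's code.

```python
from typing import List, Sequence

Vector = List[float]


def nearest_code(codebook: Sequence[Vector], v: Vector) -> int:
    """Index of the codebook entry closest to v (squared Euclidean distance)."""
    def sq_dist(code: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(code, v))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(codebooks: Sequence[Sequence[Vector]], action: Vector) -> List[int]:
    """Quantize the action, then repeatedly quantize what is left over."""
    residual = list(action)
    tokens: List[int] = []
    for codebook in codebooks:
        idx = nearest_code(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(codebooks: Sequence[Sequence[Vector]], tokens: Sequence[int]) -> Vector:
    """Reconstruct the action as the sum of the selected codes."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, tokens):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy, hand-picked codebooks (assumption): one coarse level, one fine level.
COARSE = [[-0.5, -0.5], [-0.5, 0.5], [0.5, -0.5], [0.5, 0.5]]
FINE = [[-0.1, -0.1], [-0.1, 0.1], [0.1, -0.1], [0.1, 0.1]]
CODEBOOKS = [COARSE, FINE]

if __name__ == "__main__":
    action = [0.42, -0.58]
    tokens = rvq_encode(CODEBOOKS, action)
    print("tokens:", tokens)
    print("reconstruction:", rvq_decode(CODEBOOKS, tokens))
```

In this toy setup each action becomes two tokens, matching the number of codebooks, and the second codebook only has to model the error left by the first. That is one intuition for why adding codebooks can improve reconstruction precision.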