Grounding Multimodal Large Language Models in Actions

Andrew Szot; Bogdan Mazoure; Harsh Agrawal; Devon Hjelm; Zsolt Kira; Alexander Toshev

TL;DR

This work tackles grounding Multimodal Large Language Models (MLLMs) into embodied action spaces using a unified Action Space Adapter (ASA) framework. It systematically compares seven ASAs across five embodied environments, revealing that learned tokenization (RVQ) best handles continuous actions while semantic-aligned language tokens (SemLang) excel for discrete actions. The study reports strong benchmarks, including $84\%$ on Meta-World, $72\%$ on CALVIN, and $51\%$ on LangR, demonstrating that exploiting the MLLM’s multimodal knowledge via appropriate tokenization or semantic grounding yields significant performance gains. The findings advance practical grounding of MLLMs for embodied tasks and highlight avenues for improving real-world deployment through broader MLLM baselines, more diverse training regimes, and safety considerations.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated a wide range of capabilities across many domains, including Embodied AI. In this work, we study how to best ground an MLLM into different embodiments and their associated action spaces, with the goal of leveraging the multimodal world knowledge of the MLLM. We first generalize a number of methods through a unified architecture and the lens of action space adapters. For continuous actions, we show that a learned tokenization allows for sufficient modeling precision, yielding the best performance on downstream tasks. For discrete actions, we demonstrate that semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance. We arrive at these lessons via a thorough study of seven action space adapters on five different environments, encompassing over 114 embodied tasks.
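
To make the idea of tokenizing a continuous action with several tokens concrete, here is a minimal sketch of residual vector quantization (RVQ) in NumPy. It is an illustration under stated assumptions, not the authors' implementation: the codebooks are hand-written rather than learned, and the names `rvq_encode`, `rvq_decode`, and `codebooks` are hypothetical. Each codebook quantizes whatever the previous ones left unexplained, so one action becomes one token per codebook.

```python
import numpy as np


def rvq_encode(action, codebooks):
    """Return one token index per codebook for a continuous action vector."""
    residual = np.asarray(action, dtype=np.float64)
    tokens = []
    for codebook in codebooks:
        # Pick the code closest (squared Euclidean distance) to the residual.
        distances = np.sum((codebook - residual) ** 2, axis=1)
        index = int(np.argmin(distances))
        tokens.append(index)
        # The next codebook only has to explain what is still left over.
        residual = residual - codebook[index]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct the action by summing the selected code from each codebook."""
    action = np.zeros(codebooks[0].shape[1], dtype=np.float64)
    for index, codebook in zip(tokens, codebooks):
        action = action + codebook[index]
    return action


# Toy, hand-written codebooks (assumption: in practice these would be learned).
codebooks = [
    np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),  # coarse codes
    np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]),  # fine corrections
]

tokens = rvq_encode([1.0, 0.1], codebooks)
reconstruction = rvq_decode(tokens, codebooks)
```

In this toy setup the first token captures the coarse direction of the action and the second token refines it, which is the intuition behind using several tokens per action.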

Paper Structure (29 sections, 3 equations, 8 figures, 7 tables)

Introduction
Related Work
Method
Problem Setting
From Vision and Language to Action
Discrete Action Spaces
Continuous Action Space Adaptors
Training
Experiments
Experimental Settings
Continuous Action Space Adapter Comparison
Discrete Action Adapter Comparison
Empirical Comparison to Prior Work
Limitations and Conclusion
Prior Work Comparison
...and 14 more sections

Figures (8)

Figure 1: We empirically analyze how to ground MLLMs in actions across 114 tasks in continuous and discrete action spaces. In each environment, we train a multi-task policy with different Action Space Adapters (ASAs) to re-parameterize the MLLM to output actions. For continuous actions, learning a tokenization with several tokens per-action performs best (Residual VQ). For discrete actions, mapping actions to semantically related language tokens performs best (Semantic Tokenization).
Figure 2: Generic architecture studied here for adapting MLLMs for action-specific decision making. The MLLM takes the embedding of the task instruction, prompt, and visual tokens as input. The MLLM then autoregressively predicts a sequence of $m$ action tokens. These action tokens are then decoded into an environment-specific action.
Figure 3: Comparing ASAs for continuous and discrete action spaces across 5 environments. For continuous actions, the RVQ tokenization performs best. For discrete actions, SemLang performs best. Each bar gives the average over all tasks in the environment with the full breakdown in \ref{sec:per-task}.
Figure 4: (a,b) show the effect of the number of codes in the codebook for RVQ and VQ on final policy success rate (a) and on reconstruction of unseen action trajectories (b). (c,d) show the effect of the number of codebooks on final policy success rate (c) and on action reconstruction (d). All metrics are computed on Meta-World.
Figure 5: RVQ and VQ success per-task grouping (defined in Supp. \ref{sec:task-groupings}) on CALVIN and Meta-World.
...and 3 more figures
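
The Figure 1 caption notes that, for discrete actions, mapping actions to semantically related language tokens performs best. The sketch below illustrates that idea only in spirit: the action names, their text descriptions, and the word-level vocabulary are all hypothetical stand-ins (a real setup would use the MLLM's own tokenizer), and none of this is taken from the paper's code.

```python
# Hypothetical discrete actions paired with short natural-language descriptions.
ACTION_TEXT = {
    "pick_apple": "pick apple",
    "pick_pear": "pick pear",
    "place_sink": "place sink",
}


def build_vocab(texts):
    """Toy word-level vocabulary standing in for the MLLM tokenizer (assumption)."""
    words = sorted({word for text in texts for word in text.split()})
    return {word: index for index, word in enumerate(words)}


VOCAB = build_vocab(ACTION_TEXT.values())


def action_to_tokens(action):
    """Encode an action as the token ids of its language description."""
    return [VOCAB[word] for word in ACTION_TEXT[action].split()]


# Reverse lookup from a full token sequence back to the action it encodes.
TOKENS_TO_ACTION = {tuple(action_to_tokens(a)): a for a in ACTION_TEXT}


def tokens_to_action(tokens):
    """Return the action whose token sequence matches, or None if there is none."""
    return TOKENS_TO_ACTION.get(tuple(tokens))
```

Because related actions share tokens (both pick actions start with the same "pick" token here), the token sequences carry some of the structure of the action space, in contrast to assigning each action an arbitrary new token.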
