Grounding Multimodal Large Language Models in Actions
Andrew Szot, Bogdan Mazoure, Harsh Agrawal, Devon Hjelm, Zsolt Kira, Alexander Toshev
TL;DR
This work studies how to ground Multimodal Large Language Models (MLLMs) in embodied action spaces using a unified Action Space Adapter (ASA) framework. It systematically compares seven ASAs across five embodied environments and finds that learned tokenization (RVQ) best handles continuous actions, while semantically aligned language tokens (SemLang) work best for discrete actions. The study reports strong benchmark results, including $84\%$ on Meta-World, $72\%$ on CALVIN, and $51\%$ on LangR, indicating that exploiting the MLLM’s multimodal knowledge through appropriate tokenization or semantic grounding yields significant performance gains. The findings advance practical grounding of MLLMs for embodied tasks and point to avenues for improving real-world deployment: broader MLLM baselines, more diverse training regimes, and safety considerations.
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated a wide range of capabilities across many domains, including Embodied AI. In this work, we study how to best ground an MLLM into different embodiments and their associated action spaces, with the goal of leveraging the multimodal world knowledge of the MLLM. We first generalize a number of methods through a unified architecture and the lens of action space adapters. For continuous actions, we show that a learned tokenization allows for sufficient modeling precision, yielding the best performance on downstream tasks. For discrete actions, we demonstrate that semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance. We arrive at these lessons via a thorough study of seven action space adapters on five different environments, encompassing over 114 embodied tasks.
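To make the idea of tokenizing continuous actions concrete, the following is a minimal sketch of residual vector quantization, which is what the TL;DR's "RVQ" is assumed to refer to. It is not the authors' implementation. The function names, codebook sizes, and the use of random codebooks are all hypothetical; in the setting the abstract describes, the codebooks would be learned from data. Each stage picks the code nearest to what the previous stages left unexplained, so a continuous action becomes a short sequence of discrete tokens.

```python
import numpy as np


def make_random_codebooks(num_stages, num_codes, action_dim, seed=0):
    """Build illustrative codebooks (ASSUMPTION: random, not learned)."""
    rng = np.random.default_rng(seed)
    codebooks = []
    for stage in range(num_stages):
        scale = 0.5 ** stage  # later stages model finer corrections
        codebook = rng.uniform(-scale, scale, size=(num_codes, action_dim))
        # Include a zero code so a stage can never increase the residual.
        codebook[0] = 0.0
        codebooks.append(codebook)
    return codebooks


def rvq_encode(action, codebooks):
    """Greedily quantize the residual; returns one token index per stage."""
    residual = np.asarray(action, dtype=np.float64).copy()
    tokens = []
    for codebook in codebooks:  # each codebook: (num_codes, action_dim)
        distances = np.linalg.norm(codebook - residual, axis=1)
        index = int(np.argmin(distances))
        tokens.append(index)
        residual = residual - codebook[index]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct the action by summing the selected code vectors."""
    selected = [codebook[index] for codebook, index in zip(codebooks, tokens)]
    return np.sum(selected, axis=0)


# Hypothetical 4-dimensional action encoded as 3 discrete tokens.
codebooks = make_random_codebooks(num_stages=3, num_codes=16, action_dim=4)
action = np.array([0.3, -0.2, 0.7, 0.05])
tokens = rvq_encode(action, codebooks)
reconstruction = rvq_decode(tokens, codebooks)
```

In a sketch like this, the token indices are what a language model would be trained to emit, and the decoder maps them back to a continuous command. Adding stages trades a longer token sequence for finer precision.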
