MLA: A Multisensory Language-Action Model for Multimodal Understanding and Forecasting in Robotic Manipulation

Zhuoyang Liu, Jiaming Liu, Jiadong Xu, Nuowei Han, Chenyang Gu, Hao Chen, Kaichen Zhou, Renrui Zhang, Kai Chin Hsieh, Kun Wu, Zhengping Che, Jian Tang, Shanghang Zhang

TL;DR

MLA addresses a key limitation of existing vision-language-action models, which mostly rely on vision and language alone, by jointly integrating vision, geometry, and tactile sensing through encoder-free multimodal alignment that repurposes the LLM as a unified perception module. A future multisensory generation post-training strategy enables joint forecasting of images, point clouds, and tactile signals to improve action generation, while preserving inference efficiency. The model is trained in three stages (large-scale pretraining, supervised fine-tuning with cross-modal alignment, and post-training for future-state prediction) and evaluated on six real-world tasks and RLBench, achieving state-of-the-art results and strong generalization to unseen configurations. This work advances a multisensory foundation-model paradigm for robotic manipulation with robust perception and dynamics modeling.

Abstract

Vision-language-action models (VLAs) have shown generalization capabilities in robotic manipulation tasks by inheriting from vision-language models (VLMs) and learning action generation. Most VLA models focus on interpreting vision and language to generate actions, whereas robots must perceive and interact within the spatial-physical world. This gap highlights the need for a comprehensive understanding of robotic-specific multisensory information, which is crucial for achieving complex and contact-rich control. To this end, we introduce a multisensory language-action (MLA) model that collaboratively perceives heterogeneous sensory modalities and predicts future multisensory objectives to facilitate physical world modeling. Specifically, to enhance perceptual representations, we propose an encoder-free multimodal alignment scheme that innovatively repurposes the large language model itself as a perception module, directly interpreting multimodal cues by aligning 2D images, 3D point clouds, and tactile tokens through positional correspondence. To further enhance MLA's understanding of physical dynamics, we design a future multisensory generation post-training strategy that enables MLA to reason about semantic, geometric, and interaction information, providing more robust conditions for action generation. For evaluation, the MLA model outperforms the previous state-of-the-art 2D and 3D VLA methods by 12% and 24% in complex, contact-rich real-world tasks, respectively, while also demonstrating improved generalization to unseen configurations. Project website: https://sites.google.com/view/open-mla
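
The abstract describes aligning image, point-cloud, and tactile tokens through positional correspondence. As a rough illustration of what a token-level contrastive objective of this kind can look like, the sketch below implements a symmetric InfoNCE loss between two modalities whose tokens are matched by position. This is an assumed formulation, not the paper's stated loss: the function name `token_contrastive_loss`, the temperature value, the token shapes, and the random toy data are all hypothetical.

```python
import numpy as np


def token_contrastive_loss(tokens_a, tokens_b, temperature=0.07):
    """Symmetric InfoNCE over position-matched tokens of two modalities.

    tokens_a, tokens_b: arrays of shape (num_tokens, dim). Row i of each is
    assumed to describe the same spatial position (the positive pair); every
    other row acts as a negative. This is an illustrative assumption, not
    the loss defined in the paper.
    """
    # L2-normalize so the dot product becomes a cosine similarity.
    a = tokens_a / np.linalg.norm(tokens_a, axis=1, keepdims=True)
    b = tokens_b / np.linalg.norm(tokens_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # shape (num_tokens, num_tokens)

    def cross_entropy_diagonal(scores):
        # Numerically stable row-wise log-softmax; the target is the diagonal.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the a->b and b->a directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits.T))


# Toy data (hypothetical): 16 positions, 32-dimensional tokens.
rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(16, 32))
aligned_point_tokens = image_tokens + 0.01 * rng.normal(size=(16, 32))
shuffled_point_tokens = aligned_point_tokens[rng.permutation(16)]

loss_aligned = token_contrastive_loss(image_tokens, aligned_point_tokens)
loss_shuffled = token_contrastive_loss(image_tokens, shuffled_point_tokens)
```

In this toy setup the loss should be lower when tokens at the same position agree across modalities than when the correspondence is shuffled, which is the property a positional alignment objective relies on.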

Paper Structure

This paper contains 18 sections, 3 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: (a) Unlike vanilla VLA methods that rely on 2D images and human instructions to generate actions, (b) we propose MLA, a multisensory language–action model that collaboratively processes diverse robotic-specific modalities and predicts their future states to enhance physical dynamics modeling in robotic control. (c) MLA achieves SOTA performance across a variety of real-world and simulation tasks.
  • Figure 2: Overall framework of MLA. (a) Beyond language instructions and robot states, MLA introduces an innovative encoder-free multimodal alignment mechanism that directly enables the LLM to integrate RGB images, point clouds, and tactile signals, aligning them through token-level contrastive learning. At the output level, MLA further incorporates a future multisensory generation post-training strategy, allowing the model to generate future multisensory states and providing more robust conditions for action generation. (b) MLA adopts a three-stage training paradigm: large-scale pretraining, supervised fine-tuning with cross-modal alignment, and post-training with future state prediction.
  • Figure 3: Real-world results. All models are evaluated over 15 rollouts from different manipulated object positions on the tabletop, with task completion determined by human judgment.
  • Figure 4: Visualization of real-world task progress and attention heatmaps from the final-layer output features of MLA.
  • Figure 5: Ablation study. We systematically analyze the contributions of each component in the MLA model.
  • ...and 4 more figures