MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation

Rongyu Zhang, Menghang Dong, Yuan Zhang, Liang Heng, Xiaowei Chi, Gaole Dai, Li Du, Yuan Du, Shanghang Zhang

TL;DR

MoLe-VLA tackles the heavy computation of multimodal large language models in robotic manipulation by introducing dynamic layer activation through a Spatial-Temporal Aware Router (STAR) and a Cognition self-Knowledge Distillation (CogKD) framework. Treating each LLM layer as an expert, MoLe selectively activates layers to maintain task-relevant semantic processing while reducing FLOPs by up to 5.6× and boosting mean success by up to 8% on RLBench and real-world tasks. Grounded in the Shallow Brain Hypothesis, the approach combines SBH-inspired routing with self-distillation to preserve cognitive cues, enabling efficient, adaptable embodied AI on resource-constrained hardware. The results demonstrate strong efficiency-performance trade-offs and robust generalization across simulation and real-world robotic manipulation.

Abstract

Multimodal Large Language Models (MLLMs) excel in understanding complex language and visual data, enabling generalist robotic systems to interpret instructions and perform embodied tasks. Nevertheless, their real-world deployment is hindered by substantial computational and storage demands. Recent insights into the homogeneous patterns across LLM layers have inspired sparsification techniques to address these challenges, such as early exit and token pruning. However, these methods often neglect the critical role of the final layers, which encode the semantic information most relevant to downstream robotic tasks. Aligning with the recent breakthrough of the Shallow Brain Hypothesis (SBH) in neuroscience and the mixture of experts in model sparsification, we conceptualize each LLM layer as an expert and propose a Mixture-of-Layers Vision-Language-Action model (MoLe-VLA, or simply MoLe) architecture for dynamic LLM layer activation. We introduce a Spatial-Temporal Aware Router (STAR) for MoLe to selectively activate only a subset of the layers based on the robot's current state, mimicking the brain's distinct signal pathways specialized for cognition and causal reasoning. Additionally, to compensate for the cognitive ability of LLMs lost in MoLe, we devise a Cognition Self-Knowledge Distillation (CogKD) framework. CogKD enhances the understanding of task demands and improves the generation of task-relevant action sequences by leveraging cognitive features. Extensive experiments conducted in both RLBench simulation and real-world environments demonstrate the superiority of MoLe-VLA in both efficiency and performance. Specifically, MoLe-VLA achieves an 8% improvement in the mean success rate across ten tasks while reducing computational costs by up to 5.6× compared to standard LLMs.
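
To make the idea of treating each LLM layer as an expert more concrete, the following is a minimal sketch of top-k layer routing, not the authors' STAR implementation. Everything in it is an assumption for illustration: the toy residual `make_layer`, the dot-product scoring in `route_top_k`, the per-layer `router_weights`, and the choice to route once from the input features rather than per layer. It only shows the general mechanism: a router scores all layers, a subset is activated, and the skipped layers cost no compute.

```python
import math
import random


def make_layer(scale):
    """Toy stand-in for an LLM layer (assumption): residual update x + scale * tanh(x)."""
    def layer(x):
        return [xi + scale * math.tanh(xi) for xi in x]
    return layer


def route_top_k(x, router_weights, k):
    """Score every layer from the input features and keep the k best.

    router_weights[i] is a hypothetical weight vector for layer i; the score
    is a plain dot product with the input. Selected indices are returned in
    depth order so the active layers still run shallow-to-deep.
    """
    scores = [sum(w * xi for w, xi in zip(weights, x)) for weights in router_weights]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def forward_with_skipping(x, layers, router_weights, k):
    """Run only the routed layers; skipped layers are never evaluated."""
    active = route_top_k(x, router_weights, k)
    for i in active:
        x = layers[i](x)
    return x, active


if __name__ == "__main__":
    rng = random.Random(0)
    num_layers, dim = 8, 4
    layers = [make_layer(0.1 * (i + 1)) for i in range(num_layers)]
    router_weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_layers)]
    features = [0.5, -0.2, 0.1, 0.9]  # placeholder for fused vision-language features
    output, active = forward_with_skipping(features, layers, router_weights, k=4)
    print("active layers:", active)
    print("output:", output)
```

In this sketch, layer compute scales with `k / num_layers`, so activating 4 of 8 layers halves the per-layer cost. The efficiency figures reported in the abstract come from the paper's own experiments and are not reproduced by this toy example.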

Paper Structure

This paper contains 33 sections, 19 equations, 8 figures, 10 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of our proposed MoLe-VLA. The framework integrates dynamic layer activation, a novel Spatial-Temporal Aware Router (STAR), and self-knowledge distillation (CogKD) to achieve efficient and adaptive performance in robotic applications. MoLe reduces computational costs while enhancing model performance, enabling resource-constrained platforms to benefit from MLLMs.
  • Figure 2: The overall framework of MoLe-VLA. Our proposed Mixture of Layers (MoLe) architecture consists of a Spatial-Temporal Aware Router (STAR) and a devised Cognition self-Knowledge Distillation (CogKD) for vision language action models.
  • Figure 3: Detailed illustration of our proposed CogKD loss.
  • Figure 4: Efficiency analysis against state-of-the-art baselines in terms of FLOPs and inference time. (Left) Success rate vs. FLOPs reduction relative to the model backbone. (Right) Inference time per iteration for different numbers of MoLe layers and for the model backbones.
  • Figure 5: Qualitative results of MoLe-VLA in RLBench and in the real world, showing the manipulation progress and the end state at task completion in both environments. More visualizations can be found in the Appendix.
  • ...and 3 more figures