DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, Gao Huang

TL;DR

DeeR is a dynamic early-exit framework for robotic MLLMs that activates only as much of the multimodal LLM as the current situation requires. It combines a multi-exit architecture, an action-consistency termination criterion, and a temporal-training scheme. On the CALVIN LH-MTLC benchmarks, the method reduces average LLM FLOPs by approximately $5.2$-$6.5\times$ and LLM GPU memory by around $2$-$6\times$ while maintaining competitive performance. Budgeted inference and online threshold optimization let the exit criteria be tuned to a given compute and memory budget, which makes large vision-language models more practical for control on resource-constrained robotic platforms.

Abstract

MLLMs have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks. However, developing MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms. In contrast, the inference of MLLMs involves storing billions of parameters and performing tremendous computation, imposing significant hardware demands. In our paper, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM based on each situation at hand. The approach leverages a multi-exit architecture in MLLMs, which allows the model to terminate processing once a proper size of the model has been activated for a specific situation, thus avoiding further redundant computation. Additionally, we develop novel algorithms that establish early-termination criteria for DeeR, conditioned on predefined demands such as average computational cost (i.e., power consumption), as well as peak computational consumption (i.e., latency) and GPU memory usage. These enhancements ensure that DeeR operates efficiently under varying resource constraints while maintaining competitive performance. On the CALVIN robot manipulation benchmark, DeeR demonstrates significant reductions in computational costs of LLM by 5.2-6.5x and GPU memory of LLM by 2-6x without compromising performance. Code and checkpoints are available at https://github.com/yueyang130/DeeR-VLA.
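
The early-exit idea described above can be illustrated with a minimal sketch. It assumes an action-consistency rule of the kind named in the TL;DR: processing stops once the actions predicted at two adjacent exits differ by less than a threshold. The names `early_exit_action`, `blocks`, `action_head`, and `thresholds` are hypothetical and are not taken from the DeeR codebase; the toy blocks stand in for groups of LLM layers.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two action vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def early_exit_action(
    features: Vector,
    blocks: Sequence[Callable[[Vector], Vector]],
    action_head: Callable[[Vector], Vector],
    thresholds: Sequence[float],
) -> Tuple[Vector, int]:
    """Run blocks in order and stop once adjacent exits agree on the action.

    Assumption (illustrative only): exit i terminates when the distance
    between its action and the action from exit i-1 is below thresholds[i].
    The last exit always terminates.
    """
    if not blocks:
        raise ValueError("at least one block is required")
    x = features
    prev_action = None
    action: Vector = []
    for i, block in enumerate(blocks):
        x = block(x)              # activate one more portion of the model
        action = action_head(x)   # predict an action from this exit
        is_last = i == len(blocks) - 1
        if prev_action is not None and not is_last:
            if l2_distance(action, prev_action) < thresholds[i]:
                return action, i  # predictions are consistent: exit early
        prev_action = action
    return action, len(blocks) - 1


# Toy model: each block moves the feature halfway toward 1.0,
# so successive exits produce increasingly similar actions.
toy_blocks = [lambda v: [0.5 * (e + 1.0) for e in v]] * 4
identity_head = lambda v: list(v)

action, exit_idx = early_exit_action([0.0], toy_blocks, identity_head, [0.2] * 4)
print(action, exit_idx)
```

In this toy setting, raising the thresholds makes the loop exit earlier (less computation), while a threshold of zero forces every block to run. Choosing the thresholds is what ties the criterion to a compute or memory budget.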

Paper Structure

This paper contains 21 sections, 9 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Left: Dynamic inference of DeeR. For inference, we adaptively activate an appropriate size of MLLM based on an exit criterion $c$, which accounts for the current situation (including task instruction $l$ and observation $o_t$) and predefined computational and GPU memory budgets. The language instruction and gripper camera image, not shown in this figure, are also inputs to the MLLM. An action is then obtained using the intermediate feature $\tilde{x}^{c(t)}_t$ and historical information. Right: Training of DeeR. We randomly sample features from all exits during training. This strategy helps minimize the discrepancy between training and dynamic inference. Moreover, we employ several auxiliary action heads (AuxH) to better optimize the MLLM.
  • Figure 2: Multi-exit MLLM architecture for robot.
  • Figure 3: Results atop OpenFlamingo 3B. Upper: Avg. successful len vs. avg. LLM GFLOPs. Bottom: Peak GFLOPs and GPU memory for LLM. Different colors indicate different peak FLOPs and GPU memory budgets, denoted as DeeR-S and DeeR-B (they share a fixed model). DeeR preserves all the architecture and hyperparameters from RoboFlamingo++ for fair comparisons, except for our dynamic early-exit paradigm.
  • Figure 4: Results atop OpenFlamingo 9B. Left: Avg. successful len vs. avg. LLM GFLOPs. Right: Maximum GFLOPs and GPU memory budget for DeeR-S and DeeR-B. The activated LLM in DeeR-S and DeeR-B consumes 12GB memory, whereas RoboFlamingo 9B requires 32GB.
  • Figure 5: Visualization of DeeR rollouts in the CALVIN environment. Please zoom in to view details. The numbers indicate the termination exit index. Situations with a lower exit index are recognized as 'easier' ones.
  • ...and 1 more figure