From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models

Zhuofan Li; Hongkun Yang; Zhenyang Chen; Yangxuan Chen; Yingyan Lin; Chaojian Li

Abstract

Vision-Language-Action (VLA) models have recently enabled embodied agents to perform increasingly complex tasks by jointly reasoning over visual, linguistic, and motor modalities. However, we find that the prevailing notion of "efficiency" in current VLA research, characterized by parameters, FLOPs, or token decoding throughput, does not reflect actual performance on robotic platforms. In real-world execution, efficiency is determined by system-level embodied behaviors such as task completion time, trajectory smoothness, cumulative joint rotation, and motion energy. Through controlled studies across model compression, token sparsification, and action sequence compression, we make several observations that challenge common assumptions. (1) Methods that reduce computation under conventional metrics often increase end-to-end execution cost or degrade motion quality, despite maintaining task success rates. (2) System-level embodied efficiency metrics reveal performance differences in the learned action policies that remain hidden under conventional evaluations. (3) Common adaptation methods such as in-context prompting or supervised fine-tuning show only mild and metric-specific improvements in embodied efficiency. While these methods can reduce targeted embodied-efficiency metrics such as jerk or action rate, the resulting gains may come with trade-offs in other metrics, such as longer completion time. Taken together, our results suggest that conventional inference efficiency metrics can overlook important aspects of embodied execution. Incorporating embodied efficiency provides a more complete view of policy behavior and practical performance, enabling fairer and more comprehensive comparisons of VLA models.

Paper Structure (15 sections, 2 equations, 5 figures, 5 tables)

Introduction
Related Work
VLA for Robotic Control
Efficient VLA
Preliminaries
From Model Inference to Robotic Actuation
Evaluation Metrics for Embodied Efficiency
Inference Efficiency vs. Embodied Efficiency
Experiments on Model Compression
Experiments on Token Compression
Experiments on Action Compression
Evaluating Adaptation Strategies in Terms of Embodied Efficiency
Experiments on In-Context Learning
Experiments on Supervised Fine-Tuning
Conclusion

Figures (5)

Figure 1: Comparison of $\pi_{0.5}$ [black2025pi05] VLA models with 5% and 20% of weights pruned using magnitude pruning [han2015learning], evaluated on a Libero [liu2023libero] task with the instruction "pick up the cream cheese and place it in the basket." The comparison includes task success rate (averaged over 50 test rollout trajectories, following [liu2023libero]), parameter count, completion time, and end-effector height over time. Notably, in this extreme case, although the 20% pruned model achieves a larger parameter reduction without any drop in success rate, it fails to execute the grasp smoothly, as shown in the zoomed-in sequence. The result is a longer completion time and therefore higher overall system energy consumption, which makes the model "inference-efficient" but not "embodied-efficient".
Figure 2: An overview of the deployment of a VLA model on embodied robotic platforms. Conventional model inference efficiency metrics (e.g., number of parameters, FLOPs, and decoding throughput) apply only to the model inference stage, where the VLA model processes captured images and language instructions to output corresponding action tokens. However, assessing embodied efficiency for the robotic actuation stage requires a different set of metrics, including end-effector path length, joint-space path length, action smoothness, and task completion time.
Figure 3: Visualization of the trajectories of Model A (a baseline $\pi_{0}$) and Model B (a $\pi_{0}$ model with 10% of weights pruned) when performing the 4th task of the Libero-Goal suite [liu2023libero]. The trajectories are projected onto the X-Y, Y-Z, and X-Z planes for clarity, revealing that Model B induces a longer path.
Figure 4: Comparison of the baseline $\pi_{0}$ [black2024pi0, fang2025intention] and its pruned variant (5% of weights removed via magnitude pruning [han2015learning]) on the Bridge benchmark [walke2024bridgedatav2datasetrobot].
Figure 5: Evaluation of VLA models compressed with visual token pruning [yang_efficientvla_2025] across different embodied efficiency metrics. Each subplot corresponds to a specific model-task-suite pair (e.g., "$\pi_{0.5}$ - Libero-Goal" represents evaluating the $\pi_{0.5}$ model [black2025pi05] on the Libero-Goal [liu2023libero] task suite under different token pruning ratios). All embodied efficiency metrics are shown as percentages normalized to each model's unpruned baseline.
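
The captions above name several embodied efficiency metrics (end-effector path length, joint-space path length, action smoothness, task completion time) and report them as percentages of an unpruned baseline. The sketch below shows one plausible way to compute such quantities from a logged rollout. The paper's exact definitions are not reproduced here, so every formula, function name (`embodied_metrics`, `percent_of_baseline`), and input layout is an assumption made for illustration: smoothness is approximated by the mean magnitude of a finite-difference jerk estimate, and completion time by the number of control steps times a fixed step duration.

```python
import numpy as np


def embodied_metrics(ee_pos, joint_pos, dt):
    """Illustrative embodied-efficiency metrics for one rollout (assumed definitions).

    ee_pos:    (T, 3) end-effector positions, one row per control step.
    joint_pos: (T, J) joint angles in radians, one row per control step.
    dt:        duration of one control step in seconds.
    """
    ee_pos = np.asarray(ee_pos, dtype=float)
    joint_pos = np.asarray(joint_pos, dtype=float)

    # End-effector path length: sum of Euclidean distances between steps.
    ee_path = np.linalg.norm(np.diff(ee_pos, axis=0), axis=1).sum()

    # Joint-space path length: cumulative absolute rotation over all joints.
    joint_path = np.abs(np.diff(joint_pos, axis=0)).sum()

    # Smoothness proxy: mean norm of the third finite difference (jerk).
    jerk = np.diff(ee_pos, n=3, axis=0) / dt**3
    mean_jerk = float(np.linalg.norm(jerk, axis=1).mean()) if len(jerk) else 0.0

    # Completion time: number of executed steps times the step duration.
    completion_time = (len(ee_pos) - 1) * dt

    return {
        "ee_path_length": float(ee_path),
        "joint_path_length": float(joint_path),
        "mean_jerk": mean_jerk,
        "completion_time": float(completion_time),
    }


def percent_of_baseline(metrics, baseline):
    """Express each metric as a percentage of the baseline model's value."""
    return {
        key: 100.0 * value / baseline[key] if baseline[key] != 0 else float("nan")
        for key, value in metrics.items()
    }
```

Calling `embodied_metrics` on the rollouts of a baseline and a compressed model, then passing both results to `percent_of_baseline`, yields values above 100 wherever the compressed model moves farther, less smoothly, or for longer than the baseline, which is the kind of difference a success-rate comparison alone would not show.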
