From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models

Zhuofan Li, Hongkun Yang, Zhenyang Chen, Yangxuan Chen, Yingyan Lin, Chaojian Li

Abstract

Vision-Language-Action (VLA) models have recently enabled embodied agents to perform increasingly complex tasks by jointly reasoning over visual, linguistic, and motor modalities. However, we find that the prevailing notion of "efficiency" in current VLA research, characterized by parameters, FLOPs, or token decoding throughput, does not reflect actual performance on robotic platforms. In real-world execution, efficiency is determined by system-level embodied behaviors such as task completion time, trajectory smoothness, cumulative joint rotation, and motion energy. Through controlled studies across model compression, token sparsification, and action sequence compression, we make several observations that challenge common assumptions. (1) Methods that reduce computation under conventional metrics often increase end-to-end execution cost or degrade motion quality, despite maintaining task success rates. (2) System-level embodied efficiency metrics reveal performance differences in the learned action policies that remain hidden under conventional evaluations. (3) Common adaptation methods such as in-context prompting or supervised fine-tuning show only mild and metric-specific improvements in embodied efficiency. While these methods can reduce targeted embodied-efficiency metrics such as jerk or action rate, the resulting gains may come with trade-offs in other metrics, such as longer completion time. Taken together, our results suggest that conventional inference efficiency metrics can overlook important aspects of embodied execution. Incorporating embodied efficiency provides a more complete view of policy behavior and practical performance, enabling fairer and more comprehensive comparisons of VLA models.
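
To make the embodied metrics named in the abstract concrete, the following is a minimal sketch of how path length, cumulative joint rotation, a jerk-based smoothness score, and completion time could be computed from a logged rollout. The function names, the fixed control period `dt`, and the finite-difference jerk definition are assumptions made for illustration; the paper's own metric definitions are not reproduced here and may differ.

```python
import math

def path_length(points):
    """Sum of Euclidean distances between consecutive positions."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

def cumulative_joint_rotation(joint_angles):
    """Sum of absolute per-joint angle changes (radians) over the rollout."""
    return sum(
        abs(b - a)
        for prev, cur in zip(joint_angles, joint_angles[1:])
        for a, b in zip(prev, cur)
    )

def mean_squared_jerk(points, dt):
    """Mean squared norm of the third finite difference of position.

    Assumed smoothness proxy: lower values mean smoother motion.
    """
    squared_norms = []
    for p0, p1, p2, p3 in zip(points, points[1:], points[2:], points[3:]):
        jerk = [(d - 3 * c + 3 * b - a) / dt**3
                for a, b, c, d in zip(p0, p1, p2, p3)]
        squared_norms.append(sum(x * x for x in jerk))
    return sum(squared_norms) / len(squared_norms) if squared_norms else 0.0

def completion_time(points, dt):
    """Elapsed time, assuming one sample per control step of length dt."""
    return (len(points) - 1) * dt

# Hypothetical end-effector positions (metres) and joint angles (radians).
corner_path = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]
straight_path = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
joints = [[0.0, 0.0], [0.5, -0.5], [0.5, 0.0]]

print(path_length(corner_path))
print(cumulative_joint_rotation(joints))
print(mean_squared_jerk(straight_path, dt=0.1))
print(completion_time(corner_path, dt=0.5))
```

Under these assumed definitions, a constant-velocity straight line has zero jerk, while a path with sharp corners does not, so two policies with the same success rate can still be separated by these quantities.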

Paper Structure

This paper contains 15 sections, 2 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Comparison of $\pi_{0.5}$ [black2025pi05] VLA models with 5% and 20% of weights removed by magnitude pruning [han2015learning], evaluated on a Libero [liu2023libero] task with the instruction "pick up the cream cheese and place it in the basket." The comparison covers task success rate (averaged over 50 test rollout trajectories, following [liu2023libero]), parameter count, completion time, and end-effector height over time. In this extreme case, the 20% pruned model achieves a larger parameter reduction with no drop in success rate, yet it fails to execute the grasp smoothly, as shown in the zoomed-in sequence. The result is a longer completion time and therefore higher overall system energy consumption, which makes the model "inference-efficient" but not "embodied-efficient".
  • Figure 2: An overview of the deployment of a VLA model on embodied robotic platforms. Conventional model inference efficiency metrics (e.g., number of parameters, FLOPs, and decoding throughput) apply only to the model inference stage, where the VLA model processes captured images and language instructions to output corresponding action tokens. However, assessing embodied efficiency for the robotic actuation stage requires a different set of metrics, including end-effector path length, joint-space path length, action smoothness, and task completion time.
  • Figure 3: Trajectories of Model A (a baseline $\pi_{0}$) and Model B (a $\pi_{0}$ model with 10% of weights pruned) when performing the 4th task of the Libero-Goal suite [liu2023libero]. The trajectories are projected onto the X-Y, Y-Z, and X-Z planes for clarity, revealing that Model B follows a longer path.
  • Figure 4: Comparison of the baseline $\pi_{0}$ [black2024pi0, fang2025intention] and its pruned variant (5% of weights removed via magnitude pruning [han2015learning]) on the Bridge benchmark [walke2024bridgedatav2datasetrobot].
  • Figure 5: Evaluation of VLA models compressed with visual token pruning [yang_efficientvla_2025] across different embodied efficiency metrics. Each subplot corresponds to a specific model-task-suite pair (e.g., "$\pi_{0.5}$ - Libero-Goal" represents evaluating the $\pi_{0.5}$ model [black2025pi05] on the Libero-Goal [liu2023libero] task suite under different token pruning ratios). All embodied efficiency metrics are shown as percentages normalized to each model’s unpruned baseline.
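
The captions above refer to two simple operations: removing a fixed percentage of weights by magnitude pruning, and reporting a metric as a percentage of the unpruned baseline. The sketch below illustrates both on a flat list of numbers. It assumes global, unstructured pruning over a single weight vector; the captions do not say whether the paper prunes globally or per layer, and the names `magnitude_prune` and `percent_of_baseline` are hypothetical helpers, not part of any released code.

```python
def magnitude_prune(weights, fraction):
    """Return a copy with the smallest-magnitude `fraction` of entries zeroed.

    Assumption: global unstructured pruning over one flat weight list.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    num_to_prune = int(len(weights) * fraction)
    # Indices ordered by absolute value; the stable sort breaks ties by position.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_prune]:
        pruned[i] = 0.0
    return pruned

def percent_of_baseline(value, baseline):
    """Express a metric as a percentage of the unpruned baseline's value."""
    return 100.0 * value / baseline

# Hypothetical weights; pruning 20% removes the two smallest magnitudes.
weights = [0.5, -0.1, 0.3, -0.8, 0.05, 0.9, -0.2, 0.4, 0.6, -0.7]
print(magnitude_prune(weights, 0.2))

# Hypothetical completion times (seconds): pruned model vs. baseline.
print(percent_of_baseline(12.0, 10.0))
```

With this normalization, a value above 100 for a cost-like metric such as completion time or path length means the compressed model is worse than its baseline on that metric, even if its parameter count is lower.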