Reshaping Action Error Distributions for Reliable Vision-Language-Action Models
Shuanghao Bai, Dakai Wang, Cheng Chi, Wanqi Zhou, Jing Lyu, Xiaoguang Zhao, Pengwei Wang, Zhongyuan Wang, Lei Xing, Shanghang Zhang, Badong Chen
TL;DR
This work addresses the limitation of traditional MSE-based supervision in continuous-action Vision-Language-Action (VLA) models by introducing trajectory-level Minimum Error Entropy (T-MEE) and two weighted variants. By treating trajectory-level action errors as samples from a shared distribution and applying a quadratic Rényi entropy objective, the method reshapes the entire error distribution rather than individual predictions, yielding more compact and structured error patterns. The authors provide theoretical analyses showing similarity-weighted error interactions, bounded influence of outliers, and controllable cross-task coupling, and they empirically validate the approach across LIBERO, SimplerEnv, and real-robot tasks with architectures ranging from small to large. Across near-balanced, few-shot, and noisy scenarios, T-MEE-based supervision improves success rates and robustness with negligible training overhead, making it a practical, architecture-agnostic enhancement for scalable VLA training.
Abstract
In robotic manipulation, vision-language-action (VLA) models have emerged as a promising paradigm for learning generalizable and scalable robot policies. Most existing VLA frameworks rely on standard supervised objectives, typically cross-entropy for discrete actions and mean squared error (MSE) for continuous action regression, which impose strong pointwise constraints on individual predictions. In this work, we focus on continuous-action VLA models and move beyond conventional MSE-based regression by reshaping action error distributions during training. Drawing on information-theoretic principles, we introduce Minimum Error Entropy (MEE) into modern VLA architectures and propose a trajectory-level MEE objective, together with two weighted variants, combined with MSE for continuous-action VLA training. We evaluate our approaches across standard, few-shot, and noisy settings on multiple representative VLA architectures, using simulation benchmarks such as LIBERO and SimplerEnv as well as real-world robotic manipulation tasks. Experimental results demonstrate consistent improvements in success rates and robustness across these settings. Under imbalanced data regimes, the gains persist within a well-characterized operating range. The approach incurs negligible additional training cost and has no impact on inference efficiency. We further provide theoretical analyses that explain why MEE-based supervision is effective and characterize its practical range. Project Page: https://cognition2actionlab.github.io/VLA-TMEE.github.io/
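To make the idea of an entropy term over trajectory-level errors concrete, the sketch below uses the classical Parzen-window estimator of quadratic Rényi entropy with a Gaussian kernel and adds it to MSE. It is an illustration under stated assumptions, not the authors' implementation: the function names `quadratic_renyi_entropy` and `mse_plus_mee`, the kernel bandwidth `sigma`, the mixing weight `lam`, and the choice to flatten each trajectory's error into one sample are all hypothetical, and the two weighted variants are not shown.

```python
import numpy as np


def quadratic_renyi_entropy(errors, sigma=1.0):
    """Parzen-window estimate of quadratic Renyi entropy of error samples.

    errors: array of shape (N, D), one row per error sample.
    sigma:  Gaussian kernel bandwidth (an assumed hyperparameter).
    """
    e = np.asarray(errors, dtype=float)
    if e.ndim == 1:
        e = e[:, None]
    n, d = e.shape
    # Pairwise squared distances between all error samples: shape (N, N).
    diff = e[:, None, :] - e[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Convolving two Gaussian kernels of variance sigma^2 gives variance 2*sigma^2.
    var = 2.0 * sigma ** 2
    kernel = np.exp(-sq_dist / (2.0 * var)) / ((2.0 * np.pi * var) ** (d / 2.0))
    # "Information potential": mean kernel value over all sample pairs.
    info_potential = kernel.mean()
    return -np.log(info_potential)


def mse_plus_mee(pred, target, lam=0.1, sigma=1.0):
    """MSE plus a weighted entropy term over per-trajectory errors.

    pred, target: arrays of shape (B, T, A) = (batch, horizon, action dim).
    lam:          assumed weight of the entropy term.
    """
    err = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    mse = np.mean(err ** 2)
    # Assumption: each trajectory's flattened error is one sample of a shared distribution.
    per_traj = err.reshape(err.shape[0], -1)
    return mse + lam * quadratic_renyi_entropy(per_traj, sigma)
```

The entropy estimate depends only on pairwise differences between errors, so it does not change when every error is shifted by the same constant. It can therefore make the error distribution more compact but cannot by itself center it at zero, which is one standard motivation for pairing an entropy term with MSE.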
