Reshaping Action Error Distributions for Reliable Vision-Language-Action Models

Shuanghao Bai; Dakai Wang; Cheng Chi; Wanqi Zhou; Jing Lyu; Xiaoguang Zhao; Pengwei Wang; Zhongyuan Wang; Lei Xing; Shanghang Zhang; Badong Chen

TL;DR

This work addresses the limitations of MSE-based supervision in continuous-action Vision-Language-Action (VLA) models by introducing a trajectory-level Minimum Error Entropy (T-MEE) objective and two weighted variants. The method treats trajectory-level action errors as samples from a shared distribution and minimizes a quadratic Rényi entropy objective, so it reshapes the entire error distribution rather than constraining individual predictions, yielding more compact and structured error patterns. The authors provide theoretical analyses covering similarity-weighted error interactions, the bounded influence of outliers, and controllable cross-task coupling, and they validate the approach empirically on LIBERO, SimplerEnv, and real-robot tasks with both small- and large-scale architectures. Across near-balanced, few-shot, and noisy scenarios, T-MEE-based supervision improves success rates and robustness with negligible training overhead, making it a practical, architecture-agnostic enhancement for scalable VLA training.
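For readers unfamiliar with the quadratic Rényi entropy objective mentioned above, the textbook form from information-theoretic learning is sketched here. This summary does not give the paper's exact formulation, so the symbols are assumptions for illustration: $e_i$ denotes the $i$-th action error in a pool of $N$ errors, $\sigma$ is a Parzen-window bandwidth, and $G$ is a Gaussian kernel. The weighted variants are not covered.

```latex
% Quadratic Renyi entropy of the error density p_e
H_2(e) = -\log \int p_e(\epsilon)^2 \, d\epsilon

% Parzen-window estimate with a Gaussian kernel of bandwidth \sigma.
% Convolving two such kernels gives a Gaussian of standard deviation \sigma\sqrt{2}.
\hat{H}_2(e) = -\log \left( \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N}
    G_{\sigma\sqrt{2}}(e_i - e_j) \right)
```

Minimizing $\hat{H}_2$ maximizes the double sum, which grows when errors lie close to one another. This is the sense in which the objective acts on the error distribution as a whole rather than on each prediction separately.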

Abstract

In robotic manipulation, vision-language-action (VLA) models have emerged as a promising paradigm for learning generalizable and scalable robot policies. Most existing VLA frameworks rely on standard supervised objectives, typically cross-entropy for discrete actions and mean squared error (MSE) for continuous action regression, which impose strong pointwise constraints on individual predictions. In this work, we focus on continuous-action VLA models and move beyond conventional MSE-based regression by reshaping action error distributions during training. Drawing on information-theoretic principles, we introduce Minimum Error Entropy (MEE) into modern VLA architectures and propose a trajectory-level MEE objective, together with two weighted variants, combined with MSE for continuous-action VLA training. We evaluate our approaches across standard, few-shot, and noisy settings on multiple representative VLA architectures, using simulation benchmarks such as LIBERO and SimplerEnv as well as real-world robotic manipulation tasks. Experimental results demonstrate consistent improvements in success rates and robustness across these settings. Under imbalanced data regimes, the gains persist within a well-characterized operating range, while incurring negligible additional training cost and no impact on inference efficiency. We further provide theoretical analyses that explain why MEE-based supervision is effective and characterize its practical range. Project Page: https://cognition2actionlab.github.io/VLA-TMEE.github.io/
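As a rough illustration of combining an entropy term with MSE, the NumPy sketch below uses the standard Parzen estimator of quadratic Rényi entropy. It is not the authors' implementation. The function names, the bandwidth `sigma`, the mixing weight `lam`, and the choice to pool every action vector in a batch into one error set are all assumptions made for this example.

```python
import numpy as np

def quadratic_renyi_entropy(errors, sigma=1.0):
    """Parzen estimate of quadratic Renyi entropy for errors of shape (N, D).

    The Gaussian normalizing constant is dropped, so the value is correct
    only up to an additive constant (which does not affect gradients).
    """
    errors = np.asarray(errors, dtype=float)
    # Pairwise differences between all error vectors: shape (N, N, D).
    diffs = errors[:, None, :] - errors[None, :, :]
    sq_dist = np.sum(diffs ** 2, axis=-1)
    # Gaussian kernel with variance 2 * sigma^2 (two Parzen windows convolved).
    kernel = np.exp(-sq_dist / (4.0 * sigma ** 2))
    information_potential = kernel.mean()
    return -np.log(information_potential)

def mse_plus_entropy(pred, target, sigma=1.0, lam=0.1):
    """Hypothetical combined objective: MSE plus a weighted entropy term."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # Pool every action vector (last axis = action dimension) into one error set.
    errors = (pred - target).reshape(-1, pred.shape[-1])
    mse = np.mean(errors ** 2)
    return mse + lam * quadratic_renyi_entropy(errors, sigma)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.normal(size=(4, 8, 7))  # 4 trajectories, 8 steps, 7-D actions
    pred = target + 0.1 * rng.normal(size=target.shape)
    print(mse_plus_entropy(pred, target))
```

Because each kernel value lies in (0, 1], an error far from all others contributes a near-zero term to the double sum instead of a quadratically growing penalty. This matches the intuition behind the bounded outlier influence mentioned in the summary. The pairwise computation is O(N²) in the number of pooled errors, so in practice it would be applied per batch.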

Paper Structure (42 sections, 29 equations, 11 figures, 7 tables)

Introduction
Related Work
Preliminaries
Method
Adapt MEE to VLA Models
Model Architecture
Theoretical Analysis
Similarity-Weighted Interaction Between Trajectory Errors
Robustness to Non-Gaussian Noise and Outliers
Interaction Structure in Multi-Task Settings
Experiments
Experiment Setup
Main Results under Near-Balanced Data
More Analyses
Conclusion
...and 27 more sections

Figures (11)

Figure 1: PCA visualization of action error distributions with and without trajectory-level MEE (T-MEE). Each point represents an action error at a specific time step along a trajectory. The top-10 most extreme outliers are highlighted with numeric labels, while red circles indicate compact action error clusters. Results are shown for BC-Transformer and GR00T trained with standard MSE-based behavior cloning and with the proposed T-MEE objective on LIBERO-Object. Per-task success rates (SR) for the two visualized tasks are annotated in the figure. For reference, the overall SR on the full 10-task LIBERO-Object suite improves from 57.4% to 68.2% for BC-Transformer and from 94.4% to 97.8% for GR00T. Across both architectures and tasks, incorporating T-MEE leads to more compact and coherent action error distributions in the projected space.
Figure 2: Architectural taxonomy of continuous-action VLA models evaluated in this work. We summarize representative small- and large-scale VLA architectures. (a–b) Small-scale models regress actions from multimodal features using lightweight backbones: (a) BC-RNN / BC-Transformer with MLP policy heads, and (b) BC-DP with a diffusion-based action expert. (c–f) Large-scale models build upon pretrained VLMs: (c) OFT introduces learnable action queries into autoregressive VLMs; (d) GR00T conditions an action expert (AE) on final-layer VLM features; (e) $\pi_0$ variants enable tighter VLM–action coupling via multi-layer conditioning or shared attention; and (f) DS-VLA adopts a dual-system design with a fast System 1 for action execution and a slower System 2 for contextual guidance. Here, V denotes image tokens, L denotes language tokens, Q denotes query tokens, and N denotes noise inputs.
Figure 3: Performance comparison of MEE-based variants on LIBERO. We compare the baseline regression objective with three information-theoretic variants, including T-MEE, Chunk-weighted T-MEE (Cw-TMEE), and Element-weighted T-MEE (Ew-TMEE). Results are reported for representative continuous-action VLA architectures. All MEE-based objectives consistently improve performance over the baseline, while different variants exhibit complementary advantages across architectures and task suites, highlighting the flexibility of distribution-level error shaping.
Figure 4: Real-world evaluation. (a) Real-world robotic setup and representative manipulation tasks. (b) Task success rates comparing GR00T and GR00T + T-MEE, showing consistent performance gains from T-MEE across all tasks.
Figure 5: Average success rates of GR00T with and without T-MEE under different noise corruptions on LIBERO.
...and 6 more figures
