ALOE: Action-Level Off-Policy Evaluation for Vision-Language-Action Model Post-Training

Rushuai Yang, Hecheng Wang, Chiming Liu, Xiaohan Yan, Yunlong Wang, Xuan Du, Shuoyu Yue, Yongcheng Liu, Chuheng Zhang, Lizhe Qi, Yi Chen, Wei Shan, Maoqing Yao

TL;DR

This work addresses the challenge of reintroducing off-policy reinforcement learning for real-world vision-language-action (VLA) policies by enabling action-level value estimation. It introduces ALOE, a framework that uses TD bootstrapping on action chunks, a pessimistic ensemble of $K$ critics, and advantage-weighted policy updates to achieve stable, data-efficient policy improvements from heterogeneous, human-in-the-loop data. The approach is demonstrated on three real-world robotic tasks—Pack Smart Phone, Folding Laundry, and Product Sorting—where ALOE achieves higher success rates, improved robustness, and better generalization than trajectory-based or imitation-centric baselines. These results indicate that action-level off-policy RL can be reliably integrated into real-world VLA post-training, enabling more flexible credit assignment and faster learning in complex manipulation scenarios.

Abstract

We study how to improve large foundation vision-language-action (VLA) systems through online reinforcement learning (RL) in real-world settings. Central to this process is the value function, which provides learning signals to guide VLA learning from experience. In practice, the value function is estimated from trajectory fragments collected from different data sources, including historical policies and intermittent human interventions. Estimating the value function of current behavior quality from the mixture data is inherently an off-policy evaluation problem. However, prior work often adopts conservative on-policy estimation for stability, which avoids direct evaluation of the current high-capacity policy and limits learning effectiveness. In this paper, we propose ALOE, an action-level off-policy evaluation framework for VLA post-training. ALOE applies chunking-based temporal-difference bootstrapping to evaluate individual action sequences instead of predicting final task outcomes. This design improves effective credit assignment to critical action chunks under sparse rewards and supports stable policy improvement. We evaluate our method on three real-world manipulation tasks, including smartphone packing as a high-precision task, laundry folding as a long-horizon deformable-object task, and bimanual pick-and-place involving multi-object perception. Across all tasks, ALOE improves learning efficiency without compromising execution speed, showing that off-policy RL can be reintroduced in a reliable manner for real-world VLA post-training. Videos and additional materials are available at our project website.
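The chunking-based temporal-difference bootstrapping described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `chunked_td_target`, the discount `gamma`, and the use of the ensemble minimum as the pessimistic aggregate are all assumptions made for the example.

```python
import numpy as np

def chunked_td_target(rewards, done, q_next_ensemble, gamma=0.99):
    """Sketch of a TD target for one action chunk (hypothetical helper).

    rewards:          shape (h,), rewards observed while the chunk of h actions ran.
    done:             True if the episode ended inside this chunk.
    q_next_ensemble:  shape (K,), each critic's estimate for the next state
                      and the next action chunk.
    """
    h = len(rewards)
    # Discounted sum of the rewards collected inside the chunk.
    discounts = gamma ** np.arange(h)
    chunk_return = float(np.dot(discounts, rewards))
    # Assumption: pessimism is taken as the minimum over the K critics.
    bootstrap = 0.0 if done else float(np.min(q_next_ensemble))
    # Bootstrap from the value h steps ahead rather than from the final outcome.
    return chunk_return + (gamma ** h) * bootstrap

# Sparse reward of 1 on the last step of a terminal chunk.
target = chunked_td_target(np.array([0.0, 0.0, 1.0]), True, np.array([5.0, 7.0]), gamma=0.5)
```

Because the target bootstraps from the critic after every chunk, a chunk that precedes a failure can receive a low value even when the rest of the trajectory looks similar to a successful one, which is the credit-assignment property the abstract emphasizes.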

Paper Structure

This paper contains 36 sections, 1 theorem, 27 equations, 9 figures, 4 tables, 1 algorithm.

Key Result

Theorem 4.1

For a given state $s$, let $\pi_{\mathrm{ref}}$ be the implicit behavior policy representing the data distribution in $\mathcal{D}$. The optimal solution to the advantage-weighted objective is the analytical solution to the following constrained optimization problem: where the constraint $\epsilon$ is implicitly controlled by the temperature parameter $\beta$ (distinct from the clipping parameter
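The equation for the constrained problem is missing from the statement as shown here. For orientation only, advantage-weighted methods in the literature are commonly derived from a KL-constrained problem of the following form; that Theorem 4.1 matches it term by term is an assumption, and the advantage $A(s, a)$ is notation introduced for this sketch.

```latex
\begin{aligned}
\max_{\pi}\quad & \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[ A(s, a) \big] \\
\text{s.t.}\quad & D_{\mathrm{KL}}\big( \pi(\cdot \mid s) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid s) \big) \le \epsilon,
\qquad \int \pi(a \mid s)\, \mathrm{d}a = 1 .
\end{aligned}
```

Under this standard form, the Lagrangian solution is $\pi^{*}(a \mid s) \propto \pi_{\mathrm{ref}}(a \mid s)\, \exp\!\big( A(s, a) / \beta \big)$, where the temperature $\beta$ plays the role of the multiplier on the KL constraint. A smaller $\beta$ corresponds to a looser constraint $\epsilon$ and a policy that moves further from the data distribution.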

Figures (9)

  • Figure 1: Overview of our real-world actor–critic framework for VLA post-training. Left: Our method adopts an actor–critic framework in which the actor is a flow-matching–based foundation VLA model and the critic is a lightweight ensemble Q-network. The actor outputs action sequences for online real-world rollouts, while the critic predicts ensemble Q-values to assess the quality of the actor’s action chunks under the current observation. Middle: Real-world RL is conducted in three stages. (1) Data collection under human intervention: the VLA policy is first warm-started with offline behavior cloning and then deployed on real robots. Both successful and failed rollouts are stored in a buffer. When failures or unsafe behaviors occur, a human intervenes via teleoperation to take over from the failure state and guide the robot to the goal state. (2) Off-policy critic estimation: the critic is trained on the aggregated dataset using Q-chunking TD updates. (3) Policy improvement: the actor is optimized using pessimistic value estimation and advantage-weighted maximum likelihood. Right: We evaluate the method on real-world manipulation tasks and demonstrate improvements in task success rate, robustness to disturbances, and zero-shot generalization to unseen objects.
  • Figure 2: Illustrations of the real-world manipulation tasks and robot setup used in our experiments. We evaluate on phone packing, laundry folding, and multi-object pick and place, showing representative task stages, object variations, and the physical robot platform with three RGB cameras.
  • Figure 3: Visualization of Q-values. Q-values predicted by the learned off-policy critic along a representative trajectory from the target policy (50 Hz). The critic assigns sharply lower values to actions leading to failure and higher values to successful behaviors, demonstrating fine-grained action-level credit assignment in long-horizon manipulation tasks.
  • Figure 4: The average success rate on the three manipulation tasks. We evaluate the baselines in a real-world setting over several runs and report the average success rate.
  • Figure 5: Evaluation of efficiency, zero-shot generalization, and robustness across tasks. The left plot reports execution throughput on the phone-case task, where higher throughput indicates more efficient task completion. The middle plot evaluates generalization to unseen objects in a pick-and-place setting. The right plot measures robustness by injecting disturbances into the robot’s actions and assessing its ability to recover and successfully complete the task.
  • ...and 4 more figures
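Stage (3) in the Figure 1 caption, policy improvement with pessimistic value estimation and advantage-weighted maximum likelihood, might look roughly like the sketch below. The names `advantage_weights` and `weighted_loss`, the state-value baseline `v_baseline`, the ensemble minimum as the pessimistic estimate, and the `max_weight` clip are all assumptions for illustration; the paper's actual aggregation rule and loss for the flow-matching actor are not given in this summary.

```python
import numpy as np

def advantage_weights(q_ensemble, v_baseline, beta=1.0, max_weight=20.0):
    """Per-sample weights exp(A / beta), clipped for stability (hypothetical helper).

    q_ensemble: shape (K, N), Q-estimates from K critics for N action chunks.
    v_baseline: shape (N,), baseline value for the corresponding states.
    """
    # Assumption: the pessimistic Q-value is the minimum over the ensemble.
    q_pessimistic = q_ensemble.min(axis=0)
    advantage = q_pessimistic - v_baseline
    # Temperature beta controls how sharply good chunks are up-weighted.
    return np.minimum(np.exp(advantage / beta), max_weight)

def weighted_loss(per_sample_loss, weights):
    """Weighted average of the actor's per-sample loss (e.g. a flow-matching loss)."""
    return float(np.mean(weights * per_sample_loss))

q = np.array([[1.0, 2.0], [0.0, 3.0]])   # K=2 critics, N=2 chunks
v = np.array([0.0, 2.0])
w = advantage_weights(q, v, beta=1.0)     # both advantages are 0 here
loss = weighted_loss(np.array([2.0, 4.0]), w)
```

Weighting the actor's existing supervised loss, rather than differentiating through the critic, keeps the update close to behavior cloning on the better-than-baseline chunks in the buffer.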

Theorems & Definitions (2)

  • Theorem 4.1: Constraint Policy Improvement
  • proof