VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators

Hengtao Li, Pengxiang Ding, Runze Suo, Yihao Wang, Zirui Ge, Dongyuan Zang, Kexian Yu, Mingyang Sun, Hongyin Zhang, Donglin Wang, Weihua Su

TL;DR

Vision-Language-Action models trained by imitation struggle under distribution shift. The authors propose VLA-RFT, which uses a data-driven world model as a controllable simulator to roll out action sequences and generate dense, goal-aligned rewards, optimized with GRPO. Stage I pretrains both the world model and a VLA policy; Stage II fine-tunes the policy via world-model interactions and verified rewards, with a stochastic extension to the flow-matching head. Experiments on LIBERO show that VLA-RFT achieves strong performance with under 400 fine-tuning steps, surpassing supervised baselines and exceeding simulator-based RL in efficiency, while also demonstrating robustness to perturbations. This approach offers a practical, data-efficient post-training paradigm to improve the generalization and robustness of VLA systems.
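The summary names GRPO as the optimizer but does not spell out how rewards become a learning signal. The sketch below shows the group-relative advantage that GRPO is generally built around: several rollouts for the same input are scored, and each reward is normalized against the group's mean and standard deviation. The function name, the `eps` constant, the use of the population standard deviation, and the example rewards are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group.

    Assumption: population standard deviation and a small eps for
    numerical stability; the paper's exact normalization may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four action chunks sampled from the same
# initial frame and instruction (values invented for illustration).
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Rollouts that score above the group mean receive a positive advantage and are reinforced; those below receive a negative one. If every rollout in a group earns the same reward, all advantages are zero and that group contributes no gradient.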

Abstract

Vision-Language-Action (VLA) models enable embodied decision-making but rely heavily on imitation learning, leading to compounding errors and poor robustness under distribution shift. Reinforcement learning (RL) can mitigate these issues yet typically demands costly real-world interactions or suffers from sim-to-real gaps. We introduce VLA-RFT, a reinforcement fine-tuning framework that leverages a data-driven world model as a controllable simulator. Trained from real interaction data, the simulator predicts future visual observations conditioned on actions, allowing policy rollouts with dense, trajectory-level rewards derived from goal-achieving references. This design delivers an efficient and action-aligned learning signal, drastically lowering sample requirements. With fewer than 400 fine-tuning steps, VLA-RFT surpasses strong supervised baselines and achieves greater efficiency than simulator-based RL. Moreover, it exhibits strong robustness under perturbed conditions, sustaining stable task execution. Our results establish world-model-based RFT as a practical post-training paradigm to enhance the generalization and robustness of VLA models. For more details, please refer to https://vla-rft.github.io/.

Paper Structure

This paper contains 17 sections, 14 equations, 8 figures, 10 tables, 1 algorithm.

Figures (8)

  • Figure 1: The Framework of VLA-RFT. A world model functions as a simulator that processes multi-rollout VLA action sequences to generate corresponding future states. By incorporating verified rewards through the GRPO framework, we perform end-to-end updates of the VLA. Our approach achieves superior performance with remarkably fewer optimization steps, requiring only 0.4K iterations compared to 150K iterations for a strong supervised baseline, and demonstrates advantages in both standard and perturbed environments. Furthermore, the method exhibits enhanced execution-time robustness, characterized by reliable failure recovery and retry capabilities. For more details, please refer to https://vla-rft.github.io/.
  • Figure 2: Training Paradigm of VLA-RFT. In the pre-training stage, both the world model and VLA policy are initialized, where the world model takes a 7-dimensional action input that is consistent in format with the VLA’s action output. In the reinforcement fine-tuning stage, the VLA generates action chunks based on an initial frame and language instruction, which are rolled out in the world model to predict future states. Verified rewards are then computed from the predicted states and used to optimize the VLA via GRPO.
  • Figure 3: Action distribution visualization of VLA-RFT and VLA-SFT. The plots show distributions along $X$ and $Z$ action dimensions: the left plot corresponds to the RFT-trained policy, and the right plot to the SFT-only base policy.
  • Figure 4: Illustration of perturbed task settings in LIBERO. We consider four perturbation types to evaluate out-of-distribution robustness: (Object Position) shifting the initial $(x,y)$ coordinates of the manipulated object; (Goal Position) displacing the target object in the $(x,y)$ plane; (Robot State) modifying the gripper’s vertical height and horizontal offset; and (Combination) applying all perturbations together. Each row shows the original setting (Origin), the perturbed variant (Disturb), and a side-by-side comparison (Contrast).
  • Figure 5: Illustration of World Model Generation. The initial image $I_0$ and input action sequence $a_{0:T-1}$ are first encoded into image and action tokens. These tokens are then fed into the world model to autoregressively predict the future state token sequence. Finally, decoders transform the generated image tokens into predicted future images $I_1, I_2, \dots, I_T$.
  • ...and 3 more figures
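
The rollout described in Figures 2 and 5 can be sketched as a simple loop: starting from an initial state, each action in the sequence $a_{0:T-1}$ is fed to the world model, and the predicted state becomes the input for the next step. The code below is a minimal illustration under loud assumptions. The "world model" is a toy additive function over a 7-dimensional vector (standing in for the image-token predictor), and the reward is a negative mean absolute difference to a reference trajectory (standing in for whatever verified reward the paper defines). All names are hypothetical.

```python
from typing import Callable, List, Sequence

State = List[float]
Action = Sequence[float]


def rollout(world_model: Callable[[State, Action], State],
            initial_state: State,
            actions: Sequence[Action]) -> List[State]:
    """Autoregressively predict one future state per action."""
    states = []
    state = initial_state
    for action in actions:
        state = world_model(state, action)  # prediction feeds the next step
        states.append(state)
    return states


def toy_world_model(state: State, action: Action) -> State:
    # Assumption: a stand-in dynamics model, not the paper's architecture.
    return [s + a for s, a in zip(state, action)]


def trajectory_reward(predicted: List[State], reference: List[State]) -> float:
    # Assumption: negative mean absolute difference as a placeholder metric.
    diffs = [abs(p - r)
             for ps, rs in zip(predicted, reference)
             for p, r in zip(ps, rs)]
    return -sum(diffs) / len(diffs)


initial = [0.0] * 7                                # 7-dimensional, as in Figure 2
policy_actions = [[1.0] * 7 for _ in range(3)]     # hypothetical policy chunk
reference_actions = [[1.0] * 7, [1.0] * 7, [2.0] * 7]  # hypothetical reference

predicted = rollout(toy_world_model, initial, policy_actions)
reference = rollout(toy_world_model, initial, reference_actions)
reward = trajectory_reward(predicted, reference)
print(predicted[-1], reward)
```

Because both trajectories pass through the same model, the reward here reflects only the difference between the policy's actions and the reference actions, which is one way to read the "action-aligned" learning signal mentioned in the abstract.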