What Can RL Bring to VLA Generalization? An Empirical Study

Jijia Liu, Feng Gao, Bingwen Wei, Xinlei Chen, Qingmin Liao, Yi Wu, Chao Yu, Yu Wang

TL;DR

The paper tackles generalization gaps in Vision-Language-Action models trained via supervised fine-tuning by introducing a demanding benchmark that probes Vision, Semantics, and Execution under distribution shifts. It systematically compares PPO, GRPO, and DPO for RL fine-tuning on a pick-and-place VLA task and proposes an efficient PPO-based training recipe with a shared actor-critic backbone, warm-up, and minimal epochs. Results show PPO-based RL markedly improves semantic grounding and execution robustness relative to SFT, while maintaining similar visual robustness, with additional preliminary sim-to-real evidence. These findings offer practical guidelines for RL fine-tuning of VLAs and underscore RL’s potential to yield more generalizable embodied agents.

Abstract

Large Vision-Language Action (VLA) models have shown significant potential for embodied AI. However, their predominant training via supervised fine-tuning (SFT) limits generalization due to susceptibility to compounding errors under distribution shifts. Reinforcement learning (RL) offers a path to overcome these limitations by optimizing for task objectives via trial-and-error, yet a systematic understanding of its specific generalization benefits for VLAs compared to SFT is lacking. To address this, our study introduces a comprehensive benchmark for evaluating VLA generalization and systematically investigates the impact of RL fine-tuning across diverse visual, semantic, and execution dimensions. Our extensive experiments reveal that RL fine-tuning, particularly with PPO, significantly enhances generalization in semantic understanding and execution robustness over SFT, while maintaining comparable visual robustness. We identify PPO as a more effective RL algorithm for VLAs than LLM-derived methods like DPO and GRPO. We also develop a simple recipe for efficient PPO training on VLAs, and demonstrate its practical utility for improving VLA generalization. The project page is at https://rlvla.github.io

Paper Structure

This paper contains 40 sections, 7 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Overview of our study for evaluating how RL enhances VLA generalization in terms of Vision, Semantics, and Execution: in out-of-distribution tests, RL yields substantial gains in Execution, moderate improvements in Semantics, and performance on par with SFT for Vision.
  • Figure 2: Architecture of the OpenVLA model, reproduced from the official open-source code and checkpoints.
  • Figure 3: Overview of VLA fine-tuning methods and a performance comparison of the RL fine-tuning algorithms. SFT learns from offline demonstrations, whereas DPO, GRPO, and PPO use RL-based updates: preference alignment, group-relative advantage estimation, and standard actor-critic PPO with generalized advantage estimation (GAE), respectively.
  • Figure 4: PPO with shared actor-critic backbone, where $V(s)$ is predicted by a three-layer MLP. We compare the performance of different critic designs, as well as a separate actor-critic architecture.
  • Figure 5: Performance comparison to verify our efficient training designs.
  • ...and 9 more figures
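
The Figure 3 caption contrasts two ways of estimating advantages: PPO's generalized advantage estimation (GAE), which relies on a learned critic, and GRPO's group-relative estimate, which normalizes returns across a group of rollouts. The sketch below illustrates the two textbook estimators in NumPy. It is not the authors' implementation; the function names, the default `gamma` and `lam`, and the `eps` stabilizer are assumptions chosen for illustration.

```python
import numpy as np


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for a single episode.

    `values` must hold len(rewards) + 1 critic predictions; the last
    entry is the bootstrap value of the state after the final step.
    The defaults for gamma and lam are illustrative, not from the paper.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    advantages = np.zeros_like(rewards)
    running = 0.0
    # Accumulate discounted TD residuals backwards through time.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def group_relative_advantages(returns, eps=1e-8):
    """Group-relative advantages: standardize episode returns across a
    group of rollouts of the same task. No critic is needed.
    `eps` is a hypothetical stabilizer for the all-equal-returns case.
    """
    returns = np.asarray(returns, dtype=float)
    return (returns - returns.mean()) / (returns.std() + eps)


# With gamma = lam = 1 and a zero critic, GAE reduces to the reward-to-go.
gae_example = gae_advantages([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0],
                             gamma=1.0, lam=1.0)

# Two successes and two failures in a group of four rollouts.
group_example = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

With a sparse success reward, the group-relative estimate assigns every step of a rollout the same advantage, whereas GAE can give per-step credit through the critic's value predictions.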