What Can RL Bring to VLA Generalization? An Empirical Study

Jijia Liu; Feng Gao; Bingwen Wei; Xinlei Chen; Qingmin Liao; Yi Wu; Chao Yu; Yu Wang

TL;DR

The paper tackles generalization gaps in Vision-Language-Action models trained via supervised fine-tuning by introducing a demanding benchmark that probes Vision, Semantics, and Execution under distribution shifts. It systematically compares PPO, GRPO, and DPO for RL fine-tuning on a pick-and-place VLA task and proposes an efficient PPO-based training recipe with a shared actor-critic backbone, warm-up, and minimal epochs. Results show PPO-based RL markedly improves semantic grounding and execution robustness relative to SFT, while maintaining similar visual robustness, with additional preliminary sim-to-real evidence. These findings offer practical guidelines for RL fine-tuning of VLAs and underscore RL’s potential to yield more generalizable embodied agents.

Abstract

Large Vision-Language-Action (VLA) models have shown significant potential for embodied AI. However, their predominant training via supervised fine-tuning (SFT) limits generalization due to susceptibility to compounding errors under distribution shifts. Reinforcement learning (RL) offers a path to overcome these limitations by optimizing for task objectives via trial and error, yet a systematic understanding of its specific generalization benefits for VLAs compared to SFT is lacking. To address this, our study introduces a comprehensive benchmark for evaluating VLA generalization and systematically investigates the impact of RL fine-tuning across diverse visual, semantic, and execution dimensions. Our extensive experiments reveal that RL fine-tuning, particularly with PPO, significantly enhances generalization in semantic understanding and execution robustness over SFT, while maintaining comparable visual robustness. We identify PPO as a more effective RL algorithm for VLAs than LLM-derived methods like DPO and GRPO. We also develop a simple recipe for efficient PPO training on VLAs, and demonstrate its practical utility for improving VLA generalization. The project page is at https://rlvla.github.io.
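
For readers unfamiliar with PPO, the following is a minimal sketch of the standard clipped-surrogate policy loss that PPO optimizes. It is not the paper's training recipe: the function name `ppo_clipped_loss`, the default `clip_eps=0.2`, and the plain-list inputs are illustrative assumptions, and the shared actor-critic backbone, warm-up, and epoch settings mentioned above are not modeled here.

```python
import math


def ppo_clipped_loss(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean PPO clipped-surrogate policy loss (a quantity to minimize).

    new_logps:  log-probabilities of the taken actions under the current policy
    old_logps:  log-probabilities of the same actions under the rollout policy
    advantages: advantage estimates for those actions
    clip_eps:   clipping range (0.2 is a common default, assumed here)
    """
    if not (len(new_logps) == len(old_logps) == len(advantages)):
        raise ValueError("inputs must have equal length")
    if len(advantages) == 0:
        raise ValueError("inputs must be non-empty")

    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Probability ratio between the current and the rollout policy.
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Pessimistic choice between the unclipped and clipped objectives.
        total += min(ratio * adv, clipped * adv)

    # Negate because the surrogate is maximized while a loss is minimized.
    return -total / len(advantages)
```

The clipping removes the incentive to move the policy far from the one that collected the data within a single update, which is one reason PPO tolerates reusing a rollout batch for a small number of epochs.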
