IRL-VLA: Training an Vision-Language-Action Policy via Reward World Model
Anqing Jiang, Yu Gao, Yiru Wang, Zhigang Sun, Shuo Wang, Yuwen Heng, Hao Sun, Shichen Tang, Lijuan Zhu, Jinhao Chai, Jijun Wang, Zichong Gu, Hao Jiang, Li Sun
TL;DR
This work introduces IRL-VLA, a three-stage framework that combines imitation policy learning, a lightweight Reward World Model learned via inverse reinforcement learning, and PPO-based reinforcement learning to train a closed-loop Vision-Language-Action driving policy. By replacing heavy simulators with a data-driven reward model, the method enables scalable, multi-objective optimization (safety, comfort, efficiency) in end-to-end driving. Results on NAVSIM v2 demonstrate state-of-the-art performance and robust improvements over prior VLA approaches, with strong ablations validating the contribution of hierarchical 3D/semantic reasoning and diffusion-based planning. The approach offers a practical path toward high-capacity VLA models capable of real-time, multi-target driving without reliance on simulated environments.
Abstract
Vision-Language-Action (VLA) models have demonstrated potential in autonomous driving. However, two critical challenges hinder their development: (1) existing VLA architectures are typically trained by imitation learning in an open-loop setup, which tends to reproduce the behaviors recorded in the dataset and leads to suboptimal, constrained performance; (2) closed-loop training relies heavily on high-fidelity sensor simulation, where domain gaps and computational inefficiencies pose significant barriers. In this paper, we introduce IRL-VLA, a novel closed-loop reinforcement learning framework that pairs a reward world model learned via Inverse Reinforcement Learning (IRL) with a self-built VLA approach. Our framework proceeds in three stages. In the first stage, we propose a VLA architecture and pretrain the VLA policy via imitation learning. In the second stage, we construct a lightweight reward world model via inverse reinforcement learning to enable efficient closed-loop reward computation. In the third stage, to further enhance planning performance, we design reinforcement learning guided by the reward world model, using PPO (Proximal Policy Optimization) to effectively balance safety, driving comfort, and traffic efficiency. Our approach achieves state-of-the-art performance on the NAVSIM v2 end-to-end driving benchmark and placed 1st runner-up in the CVPR 2025 Autonomous Grand Challenge. We hope that our framework will accelerate VLA research in closed-loop autonomous driving.
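
To make the third stage more concrete, here is a minimal sketch of how a scalar reward might be formed from several driving objectives and fed into the standard clipped PPO surrogate. It is an illustration under stated assumptions, not the authors' implementation: the function names, the objective keys, the weights, and the clipping value eps=0.2 are all hypothetical, and the per-objective scores are assumed to come from some learned reward model.

```python
import math

def weighted_reward(scores, weights):
    """Combine per-objective scores into one scalar reward.

    `scores` and `weights` are hypothetical dicts keyed by objective name
    (for example safety, comfort, efficiency); the weighting scheme is an
    assumption made for illustration only.
    """
    return sum(weights[k] * scores[k] for k in weights)

def ppo_clip_loss(new_logp, old_logp, advantages, eps=0.2):
    """Standard clipped PPO surrogate, returned as a loss to minimize.

    new_logp / old_logp: log-probabilities of the sampled actions under the
    current and the data-collecting policy; advantages: advantage estimates.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)                 # probability ratio
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))   # clip to [1-eps, 1+eps]
        total += min(ratio * adv, clipped * adv)          # pessimistic bound
    return -total / len(advantages)

# Hypothetical scores for one candidate trajectory.
scores = {"safety": 1.0, "comfort": 0.5, "efficiency": 0.0}
weights = {"safety": 0.5, "comfort": 0.3, "efficiency": 0.2}
reward = weighted_reward(scores, weights)
loss = ppo_clip_loss([math.log(2.0)], [0.0], [reward])
```

The clipping keeps a single update from moving the policy far from the one that generated the data, which matters when the reward comes from a learned model rather than a simulator.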
