IRL-VLA: Training an Vision-Language-Action Policy via Reward World Model

Anqing Jiang, Yu Gao, Yiru Wang, Zhigang Sun, Shuo Wang, Yuwen Heng, Hao Sun, Shichen Tang, Lijuan Zhu, Jinhao Chai, Jijun Wang, Zichong Gu, Hao Jiang, Li Sun

TL;DR

This work introduces IRL-VLA, a three-stage framework that combines imitation policy learning, a lightweight Reward World Model learned via inverse reinforcement learning, and PPO-based reinforcement learning to train a closed-loop Vision-Language-Action driving policy. By replacing heavy simulators with a data-driven reward model, the method enables scalable, multi-objective optimization (safety, comfort, efficiency) in end-to-end driving. Results on NAVSIM v2 demonstrate state-of-the-art performance and robust improvements over prior VLA approaches, with strong ablations validating the contribution of hierarchical 3D/semantic reasoning and diffusion-based planning. The approach offers a practical path toward high-capacity VLA models capable of real-time, multi-target driving without reliance on simulated environments.

Abstract

Vision-Language-Action (VLA) models have demonstrated potential in autonomous driving. However, two critical challenges hinder their development: (1) existing VLA architectures are typically trained by imitation learning in an open-loop setup, which tends to reproduce the behaviors recorded in the dataset and leads to suboptimal, constrained performance; (2) closed-loop training relies heavily on high-fidelity sensor simulation, where domain gaps and computational inefficiencies pose significant barriers. In this paper, we introduce IRL-VLA, a novel closed-loop reinforcement learning framework that pairs a reward world model learned via Inverse Reinforcement Learning (IRL) with a self-built VLA approach. Our framework proceeds in three stages. In the first stage, we propose a VLA architecture and pretrain the VLA policy via imitation learning. In the second stage, we construct a lightweight reward world model via inverse reinforcement learning to enable efficient closed-loop reward computation. In the third stage, to further enhance planning performance, we design a specialized reinforcement learning procedure, guided by the reward world model and optimized with PPO (Proximal Policy Optimization), that balances safety, driving comfort, and traffic efficiency. Our approach achieves state-of-the-art performance on the NAVSIM v2 end-to-end driving benchmark and placed 1st runner-up in the CVPR 2025 Autonomous Grand Challenge. We hope that our framework will accelerate VLA research in closed-loop autonomous driving.
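The third stage described in the abstract can be illustrated with a minimal sketch: a learned reward model scores a trajectory on several objectives, the scores are combined into one scalar reward, and the policy is updated with the standard PPO clipped surrogate. Everything named below is an assumption for illustration only: the weights in `REWARD_WEIGHTS`, the weighted-sum aggregation in `combined_reward`, and the default `clip_eps=0.2` are hypothetical and are not taken from the paper, whose actual reward aggregation and PPO settings are not given in this section.

```python
import math

# Hypothetical weights for the three objectives named in the abstract.
# These values are illustrative assumptions, not the paper's settings.
REWARD_WEIGHTS = {"safety": 0.5, "comfort": 0.2, "efficiency": 0.3}


def combined_reward(scores, weights=REWARD_WEIGHTS):
    """Weighted sum of per-objective scores (assumed to lie in [0, 1]).

    In this sketch the scores stand in for the outputs of a learned
    reward model; here they are simply passed in as a dict.
    """
    return sum(weights[name] * scores[name] for name in weights)


def ppo_clipped_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for a single sample.

    ratio = pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    The result is min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_loss(logps_new, logps_old, advantages, clip_eps=0.2):
    """Negative mean clipped surrogate over a batch (to be minimized)."""
    terms = [
        ppo_clipped_term(n, o, a, clip_eps)
        for n, o, a in zip(logps_new, logps_old, advantages)
    ]
    return -sum(terms) / len(terms)


if __name__ == "__main__":
    # Two hypothetical trajectories scored by the reward model.
    rewards = [
        combined_reward({"safety": 1.0, "comfort": 0.8, "efficiency": 0.6}),
        combined_reward({"safety": 0.2, "comfort": 0.9, "efficiency": 0.9}),
    ]
    # Simple baseline: subtract the batch mean to obtain advantages.
    baseline = sum(rewards) / len(rewards)
    advantages = [r - baseline for r in rewards]
    print(ppo_loss([-1.0, -1.2], [-1.1, -1.0], advantages))
```

The clipping keeps each update close to the policy that generated the trajectories: once the probability ratio leaves the interval [1 - eps, 1 + eps] in the direction favored by the advantage, the term stops growing, so a single highly rewarded trajectory cannot pull the policy arbitrarily far in one step.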

Paper Structure

This paper contains 13 sections, 12 equations, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: Different paradigms of VLA autonomous driving (AD). a) Imitation learning for VLA AD. b) Simulator-based reinforcement learning for VLA AD. c) IRL-VLA improves a high-capacity VLA with scalable reinforcement learning, without a heavy simulator.
  • Figure 2: Overview of the IRL-VLA framework. This figure illustrates the three-stage pipeline of our closed-loop Reinforcement Learning via Reward World Model framework for Vision-Language-Action (VLA) in autonomous driving. a) Imitation Policy Learning initializes the VLA model as a supervised policy from sensor input and planning trajectories. b) Inverse Environment Learning constructs the Reward World Model (RWM) from pretrained VLA planning trajectories. c) Closed-Loop Reinforcement Learning optimizes the policy using PPO and the RWM. Subfigures (e), (f), and (g) detail the Unified Diffusion Policy, Semantic Reasoning, and Reward World Model, respectively.