WPT: World-to-Policy Transfer via Online World Model Distillation

Guangfeng Jiang, Yueru Luo, Jun Liu, Yi Huang, Yiyao Zhu, Zhan Qu, Dave Zhenyu Chen, Bingbing Liu, Xu Yan

TL;DR

WPT introduces a training-time World-to-Policy Transfer framework that leverages an online world model to guide policy learning without incurring test-time overhead. A trainable reward model evaluates trajectories against predicted future world states, and two distillation schemes transfer this world-informed reasoning from a multi-modal teacher to a lightweight student suitable for real-time deployment. Across open-loop nuScenes and closed-loop Bench2Drive benchmarks, WPT achieves state-of-the-art planning accuracy and safety (e.g., low collision rates and high driving scores) while delivering up to 4.9x faster inference for the student. The approach demonstrates that end-to-end driving can combine world-model predictive power with efficient, deployable planning through online training-time supervision and distillation, offering interpretability and practical impact for real-time autonomous systems.

Abstract

Recent years have witnessed remarkable progress in world models, which primarily aim to capture the spatio-temporal correlations between an agent's actions and the evolving environment. However, existing approaches often suffer from tight runtime coupling or depend on offline reward signals, resulting in substantial inference overhead or hindering end-to-end optimization. To overcome these limitations, we introduce WPT, a World-to-Policy Transfer training paradigm that enables online distillation under the guidance of an end-to-end world model. Specifically, we develop a trainable reward model that infuses world knowledge into a teacher policy by aligning candidate trajectories with the future dynamics predicted by the world model. Subsequently, we propose policy distillation and world reward distillation to transfer the teacher's reasoning ability into a lightweight student policy, enhancing planning performance while preserving real-time deployability. Extensive experiments on both open-loop and closed-loop benchmarks show that WPT achieves state-of-the-art performance with a simple policy architecture: it attains a 0.11 collision rate (open-loop) and a 79.23 driving score (closed-loop), surpassing both world-model-based and imitation-learning methods in accuracy and safety. Moreover, the student achieves up to 4.9x faster inference while retaining most of the gains.
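
To make the two distillation signals named in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the use of mean squared error for both terms, and the weights `w_policy` and `w_reward` are assumptions chosen for illustration. The sketch only assumes that policy distillation pulls the student's planning features toward the teacher's, and that world reward distillation pulls the student's trajectory toward the teacher candidate with the highest world reward.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_feat, teacher_feat,
                      student_traj, teacher_trajs, teacher_rewards,
                      w_policy=1.0, w_reward=1.0):
    """Hypothetical combined loss; all names and weights are illustrative.

    student_feat, teacher_feat: flat lists of planning features.
    student_traj: list of (x, y) waypoints from the student.
    teacher_trajs: list of candidate trajectories from the teacher.
    teacher_rewards: one world-reward score per teacher candidate.
    """
    # Policy distillation (assumed form): align planning features.
    l_policy = mse(student_feat, teacher_feat)

    # World reward distillation (assumed form): regress the student
    # trajectory onto the teacher candidate with the highest reward.
    best = max(range(len(teacher_rewards)), key=teacher_rewards.__getitem__)
    target = teacher_trajs[best]
    l_reward = sum((sx - tx) ** 2 + (sy - ty) ** 2
                   for (sx, sy), (tx, ty) in zip(student_traj, target)) / len(target)

    return w_policy * l_policy + w_reward * l_reward, best


loss, best_idx = distillation_loss(
    student_feat=[0.0, 1.0],
    teacher_feat=[0.0, 3.0],
    student_traj=[(0.0, 0.0), (1.0, 0.0)],
    teacher_trajs=[[(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (2.0, 0.0)]],
    teacher_rewards=[0.2, 0.9],
)
```

In this toy call the teacher's second candidate has the higher reward, so the student is pulled toward it even though the student currently matches the first candidate exactly. Because both terms are used only during training, neither the teacher nor the world model is needed at inference time.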

Paper Structure

This paper contains 23 sections, 24 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Different training paradigms of AD policy with world model. (a) Imitation learning where the policy is trained using expert supervision. (b) World model (WM) directly integrated into the AD policy for enhanced feature evolution and trajectory reasoning. (c) Simulator-based reinforcement learning for AD policy training using a simulated world. (d) Our WPT, where the policy interacts with the WM during training, with both the teacher policy (T) and the student policy (S) leveraging the WM for knowledge transfer. After training, the WM will be discarded.
  • Figure 2: Overview of the WPT framework. During training (top), the pretrained world model predicts the future world state under given action conditions, and the teacher AD policy (T) generates multi-modal trajectories. The reward model evaluates these trajectories to produce a world reward. During distillation (bottom), the student AD policy (S) learns from the teacher through two mechanisms: (1) policy distillation, which aligns the planning representations between teacher and student; and (2) world reward distillation, which encourages the student to match the teacher’s optimal reward trajectory in the predicted future world.
  • Figure 3: Overview of the reward model. The reward model consists of multiple components: the world encoder processes the latent world representation, while the plan queries are refined through the plan decoder and plan head to generate multi-modal candidate trajectories. These trajectories are then passed to the trajectory encoder, which encodes them for evaluation by two distinct reward heads: the simulation reward head and the imitation reward head. The final reward is computed by combining these reward values, with the best trajectory selected via the argmax operation. The supervisory signals of the reward model come from simulation and imitation. For the detailed process, please refer to the reward model section of the paper.
  • Figure 4: Detailed structure of the occupancy-based world model, which predicts future world states in an autoregressive manner. The model utilizes an observation encoder to process multi-view images, a feature aggregation module to capture temporal consistency, and a world decoder to predict the future BEV embedding based on historical and current world features. This approach allows the model to predict future occupancy states.
  • Figure 5: Detailed structure of the instance-based world model, which predicts the future map elements and agent motion. The model uses a ResNet encoder to process multi-view images, a BEV encoder to transform the features into BEV space, and two decoders: the map decoder for static road elements and the world decoder for dynamic agents.
  • ...and 2 more figures
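
The selection step described in the Figure 3 caption (two reward heads, a combined reward, then argmax) can be sketched as follows. The caption does not say how the two rewards are combined, so the weighted sum and the weights `w_sim` and `w_im` below are assumptions, and `select_trajectory` is a hypothetical helper rather than anything from the paper's code.

```python
def select_trajectory(sim_rewards, im_rewards, w_sim=0.5, w_im=0.5):
    """Combine per-candidate rewards and pick the best candidate.

    sim_rewards: scores from the simulation reward head, one per candidate.
    im_rewards: scores from the imitation reward head, one per candidate.
    The weighted sum is an assumed combination rule, not the paper's.
    """
    assert len(sim_rewards) == len(im_rewards)
    combined = [w_sim * s + w_im * i for s, i in zip(sim_rewards, im_rewards)]
    # Argmax over candidates: index of the highest combined reward.
    best = max(range(len(combined)), key=combined.__getitem__)
    return best, combined


best_idx, combined = select_trajectory(
    sim_rewards=[0.9, 0.2, 0.6],
    im_rewards=[0.1, 0.8, 0.7],
)
```

Here the third candidate wins: it is not the top scorer under either head alone, but it has the highest combined reward, which is the kind of trade-off a two-head design allows.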