WPT: World-to-Policy Transfer via Online World Model Distillation
Guangfeng Jiang, Yueru Luo, Jun Liu, Yi Huang, Yiyao Zhu, Zhan Qu, Dave Zhenyu Chen, Bingbing Liu, Xu Yan
TL;DR
WPT is a training-time World-to-Policy Transfer framework that uses an online world model to guide policy learning without adding test-time overhead. A trainable reward model scores candidate trajectories against the future world states predicted by the world model, and two distillation schemes transfer this world-informed reasoning from a multi-modal teacher to a lightweight student suitable for real-time deployment. On the open-loop nuScenes and closed-loop Bench2Drive benchmarks, WPT achieves state-of-the-art planning accuracy and safety (a 0.11 collision rate and a 79.23 driving score, respectively), and the student runs up to 4.9x faster at inference. The results indicate that end-to-end driving can combine the predictive power of a world model with efficient, deployable planning by confining the world model to training-time supervision and distillation, while offering interpretability and practical value for real-time autonomous systems.
Abstract
Recent years have witnessed remarkable progress in world models, which primarily aim to capture the spatio-temporal correlations between an agent's actions and the evolving environment. However, existing approaches often suffer from tight runtime coupling or depend on offline reward signals, which results in substantial inference overhead or hinders end-to-end optimization. To overcome these limitations, we introduce WPT, a World-to-Policy Transfer training paradigm that enables online distillation under the guidance of an end-to-end world model. Specifically, we develop a trainable reward model that infuses world knowledge into a teacher policy by aligning candidate trajectories with the future dynamics predicted by the world model. We then propose policy distillation and world reward distillation to transfer the teacher's reasoning ability into a lightweight student policy, enhancing planning performance while preserving real-time deployability. Extensive experiments on both open-loop and closed-loop benchmarks show that WPT achieves state-of-the-art performance with a simple policy architecture: it attains a collision rate of 0.11 (open-loop) and a driving score of 79.23 (closed-loop), surpassing both world-model-based and imitation-learning methods in accuracy and safety. Moreover, the student runs up to 4.9x faster at inference while retaining most of the gains.
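To make the two distillation terms concrete, the following is a minimal, hypothetical sketch in plain Python. It assumes the teacher and student each produce one logit and one reward estimate per candidate trajectory. The loss forms (a KL divergence for policy distillation, a mean squared error for world reward distillation), the function names, and the weights `w_policy` and `w_reward` are illustrative assumptions, not the formulation used in WPT, which this section does not specify.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over per-trajectory logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def policy_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over candidate trajectories.

    Assumption: KL is a stand-in for whatever policy-matching loss the
    paper actually uses.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    eps = 1e-12  # guards against log(0)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def reward_distillation_loss(teacher_rewards, student_rewards):
    """Mean squared error between teacher and student reward estimates.

    Assumption: MSE is a stand-in for the paper's world reward distillation.
    """
    n = len(teacher_rewards)
    return sum((t - s) ** 2 for t, s in zip(teacher_rewards, student_rewards)) / n


def student_loss(teacher_logits, student_logits,
                 teacher_rewards, student_rewards,
                 w_policy=1.0, w_reward=1.0):
    """Weighted sum of the two distillation terms (weights are hypothetical)."""
    return (w_policy * policy_distillation_loss(teacher_logits, student_logits)
            + w_reward * reward_distillation_loss(teacher_rewards, student_rewards))


if __name__ == "__main__":
    # Three candidate trajectories; the teacher prefers the first.
    loss = student_loss(
        teacher_logits=[2.0, 0.5, -1.0],
        student_logits=[1.0, 1.0, 0.0],
        teacher_rewards=[0.9, 0.4, 0.1],
        student_rewards=[0.7, 0.5, 0.2],
    )
    print(f"student distillation loss: {loss:.4f}")
```

In this sketch only the student's outputs would receive gradients. The teacher and the world model act purely as training-time supervisors, which is consistent with the claim that the student incurs no world-model cost at inference.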
