Autonomous Vehicle Controllers From End-to-End Differentiable Simulation
Asen Nachkov, Danda Pani Paudel, Luc Van Gool
TL;DR
The paper addresses the poor generalization of autonomous-vehicle controllers trained by behavioural cloning. It trains controllers end-to-end with Analytic Policy Gradients (APG) in Waymax, a differentiable simulator built on the Waymo Open Motion Dataset (WOMD). Gradients are backpropagated through the simulator dynamics into a recurrent policy that combines a scene encoder with a multi-agent transformer, and gradient detachment and incremental learning are used to stabilize training. On large-scale WOMD-derived scenarios, APG outperforms behavioural cloning on average displacement error (ADE), collision (overlap), and offroad metrics, is more robust to noise in the dynamics, and requires fewer modes. The work argues that differentiable simulation can yield grounded, efficient AV controllers suitable for real-world deployment, and it outlines future directions: integration with IDM and further improvements in sim-to-real transfer.
Abstract
Current methods to learn controllers for autonomous vehicles (AVs) focus on behavioural cloning. Being trained only on exact historic data, the resulting agents often generalize poorly to novel scenarios. Simulators provide the opportunity to go beyond offline datasets, but they are still treated as complicated black boxes, only used to update the global simulation state. As a result, reinforcement learning (RL) algorithms that rely on them are slow, sample-inefficient, and prior-agnostic. In this work, we leverage a differentiable simulator and design an analytic policy gradients (APG) approach to training AV controllers on the large-scale Waymo Open Motion Dataset. Our proposed framework brings the differentiable simulator into an end-to-end training loop, where gradients of the environment dynamics serve as a useful prior to help the agent learn a more grounded policy. We combine this setup with a recurrent architecture that can efficiently propagate temporal information across long simulated trajectories. This APG method allows us to learn robust, accurate, and fast policies while requiring only widely available expert trajectories instead of scarce expert actions. We compare to behavioural cloning and find significant improvements in performance and in robustness to noise in the dynamics, as well as more intuitive, human-like handling overall.
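The core idea of the abstract, backpropagating a trajectory loss through the simulator into a recurrent policy, can be illustrated with a minimal JAX sketch. Everything here is an assumption made for illustration and not the authors' implementation: the toy point-mass dynamics in `step_dynamics` stand in for Waymax, the small tanh recurrent cell stands in for the paper's policy network, and the names `rollout_loss` and `train`, the step size, the learning rate, and the synthetic expert trajectory are all hypothetical. The loss compares only positions with the expert trajectory, so no expert actions are needed.

```python
import jax
import jax.numpy as jnp

DT = 0.1      # assumed simulation step (seconds)
HIDDEN = 16   # assumed size of the recurrent state


def step_dynamics(state, action):
    """Toy differentiable dynamics (assumption): state = (x, y, vx, vy), action = (ax, ay)."""
    pos, vel = state[:2], state[2:]
    vel = vel + action * DT
    pos = pos + vel * DT
    return jnp.concatenate([pos, vel])


def init_params(key):
    k_in, k_h, k_out = jax.random.split(key, 3)
    return {
        "w_in": 0.1 * jax.random.normal(k_in, (4, HIDDEN)),
        "w_h": 0.1 * jax.random.normal(k_h, (HIDDEN, HIDDEN)),
        "b_h": jnp.zeros(HIDDEN),
        "w_out": 0.1 * jax.random.normal(k_out, (HIDDEN, 2)),
        "b_out": jnp.zeros(2),
    }


def policy(params, hidden, obs):
    """Minimal recurrent policy: updates the hidden state and emits an acceleration."""
    hidden = jnp.tanh(obs @ params["w_in"] + hidden @ params["w_h"] + params["b_h"])
    action = hidden @ params["w_out"] + params["b_out"]
    return hidden, action


def rollout_loss(params, init_state, expert_xy):
    """Unroll policy and dynamics together; compare positions with the expert trajectory."""
    def body(carry, target_xy):
        state, hidden = carry
        hidden, action = policy(params, hidden, state)
        state = step_dynamics(state, action)  # gradients flow through this step
        err = jnp.sum((state[:2] - target_xy) ** 2)
        return (state, hidden), err

    carry0 = (init_state, jnp.zeros(HIDDEN))
    _, errs = jax.lax.scan(body, carry0, expert_xy)
    return jnp.mean(errs)


@jax.jit
def train_step(params, init_state, expert_xy, lr):
    # Analytic gradient of the trajectory loss w.r.t. the policy parameters.
    loss, grads = jax.value_and_grad(rollout_loss)(params, init_state, expert_xy)
    params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return params, loss


def train(num_steps=200, lr=0.05, horizon=20):
    # Synthetic "expert" trajectory (assumption): gentle constant acceleration from rest.
    t = jnp.arange(1, horizon + 1) * DT
    expert_xy = jnp.stack([0.5 * t**2, 0.1 * t**2], axis=1)
    init_state = jnp.zeros(4)
    params = init_params(jax.random.PRNGKey(0))
    losses = []
    for _ in range(num_steps):
        params, loss = train_step(params, init_state, expert_xy, lr)
        losses.append(float(loss))
    return params, losses


params, losses = train()
print(f"trajectory loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Because `jax.lax.scan` unrolls the policy and the dynamics in one differentiable computation, `jax.value_and_grad` returns the exact gradient of the trajectory error with respect to the policy parameters, whereas a black-box simulator would force that gradient to be estimated from sampled rollouts.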
