Autonomous Vehicle Controllers From End-to-End Differentiable Simulation

Asen Nachkov; Danda Pani Paudel; Luc Van Gool

TL;DR

The paper tackles the generalization gap in autonomous-vehicle control by training controllers end-to-end with Analytic Policy Gradients (APG) in the differentiable Waymax simulator, which is built atop the Waymo Open Motion Dataset (WOMD). It backpropagates through the environment dynamics into a recurrent policy with a scene encoder and a multi-agent transformer, and uses gradient detachment and incremental learning to stabilize training. On large-scale WOMD-derived scenarios, APG outperforms behavioural cloning on average displacement error (ADE), collision (overlap), and offroad metrics, is more robust to noise in the dynamics, and requires fewer modes. The work argues that differentiable simulation can yield grounded, efficient AV controllers as a step toward real-world deployment, and outlines future directions including integration with IDM and improved sim-to-real transfer.

Abstract

Current methods to learn controllers for autonomous vehicles (AVs) focus on behavioural cloning. Being trained only on exact historic data, the resulting agents often generalize poorly to novel scenarios. Simulators provide the opportunity to go beyond offline datasets, but they are still treated as complicated black boxes, only used to update the global simulation state. As a result, RL algorithms that rely on them are slow, sample-inefficient, and prior-agnostic. In this work, we leverage a differentiable simulator and design an analytic policy gradients (APG) approach to training AV controllers on the large-scale Waymo Open Motion Dataset. Our proposed framework brings the differentiable simulator into an end-to-end training loop, where gradients of the environment dynamics serve as a useful prior to help the agent learn a more grounded policy. We combine this setup with a recurrent architecture that can efficiently propagate temporal information across long simulated trajectories. This APG method allows us to learn robust, accurate, and fast policies while requiring only widely available expert trajectories instead of scarce expert actions. We compare to behavioural cloning and find significant improvements in performance and robustness to noise in the dynamics, as well as overall more intuitive, human-like handling.
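To make the idea of "gradients of the environment dynamics" concrete, here is a minimal sketch of an analytic policy gradient on a toy problem. It is not the paper's model or the Waymax API: the 1-D point-mass dynamics, the single-gain policy, and the names `rollout_loss_and_grad`, `train`, and `expert_positions` are all assumptions made for illustration. The gradient is derived by hand with the chain rule so the example runs in plain Python; in practice an autodiff framework would compute it. Note that only expert positions are supervised, never expert actions.

```python
def rollout_loss_and_grad(k, targets, s0=0.0, dt=1.0):
    """Roll out the toy policy a_t = k * (target_{t+1} - s_t) through the
    toy dynamics s_{t+1} = s_t + dt * a_t, and return (loss, dloss/dk).

    The derivative is propagated through every simulated step (no detachment).
    """
    s, ds_dk = s0, 0.0            # state and its sensitivity to the gain k
    loss, grad = 0.0, 0.0
    for target in targets:
        err = target - s
        a = k * err                       # policy output (assumed form)
        da_dk = err - k * ds_dk           # direct term + term via the state
        s = s + dt * a                    # differentiable dynamics step
        ds_dk = ds_dk + dt * da_dk        # chain rule through the dynamics
        diff = s - target
        loss += diff * diff               # supervise states, not actions
        grad += 2.0 * diff * ds_dk
    return loss, grad


def train(targets, k=0.0, lr=0.003, steps=5000):
    """Plain gradient descent on the trajectory loss."""
    for _ in range(steps):
        _, grad = rollout_loss_and_grad(k, targets)
        k -= lr * grad
    return k


# Hypothetical expert trajectory: constant speed of 0.1 per step.
expert_positions = [0.1 * (t + 1) for t in range(10)]
initial_loss, _ = rollout_loss_and_grad(0.0, expert_positions)
k_fit = train(expert_positions)
final_loss, _ = rollout_loss_and_grad(k_fit, expert_positions)
print(initial_loss, final_loss, k_fit)
```

In this toy setting a gain of `k = 1` reproduces the expert trajectory exactly, and gradient descent moves `k` toward that value using only the position error and the known dynamics. The small learning rate is a consequence of differentiating through the full rollout, where early steps influence all later errors.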

Paper Structure (9 sections, 2 equations, 8 figures, 5 tables)

This paper contains 9 sections, 2 equations, 8 figures, 5 tables.

Introduction
Related Work
Method
Differentiating through Waymax
End-to-end training with the simulator
Model overview
Experiments
Large-scale experiments in Waymax
Conclusion

Figures (8)

Figure 1: End-to-end learning of controllers. Our framework uses the gradients of the dynamics in a differentiable simulator to learn vehicle controllers from the corrections between the simulated new states and the target states.
Figure 2: Learning with and without the simulator. Left: learning by behavioural cloning, where we replay the GT trajectory and supervise the predicted actions. Middle: APG, where we roll out and supervise the trajectories without detaching gradients (gradients shown as colored arrows). Right: APG, where we detach gradients from past timesteps. The slanted arrows from $\textbf{a}_t$ to $\textbf{s}_{t+1}$ are the environment dynamics. The proposed detachment during simulation offers efficient and lightweight training.
Figure 3: Unrolling the model in time with gradient detachment inside the differentiable simulator. Starting from the simulator state $\mathbf{s}_t$, we obtain an observation $\mathbf{o}_t$ containing scene elements such as agent locations, traffic lights, and the roadgraph, which is encoded into features $\mathbf{x}_t$. An RNN (recurrent over time) with a policy head outputs actions $\mathbf{a}_t$, which are executed in the simulated environment to obtain the new state $\mathbf{s}_{t + 1}$. When a loss is applied between $\mathbf{s}_{t + 1}$ and the target state $\hat{\mathbf{s}}_{t + 1}$, the gradients flow back through the environment and update the policy head, RNN, and scene encoder. As in backpropagation through time (BPTT), gradients through the RNN hidden state accumulate. We do not backpropagate through the observation or the simulator state.
Figure 4: Resetting agent's state during incremental learning. Once every $n$ steps, we reset the agent's position to the corresponding log state. Blue arrows show the GT trajectory, red is the discontinuous simulated trajectory, and gray lines show the supervision.
Figure 5: Initial results on a toy task. The $x$ and $y$ axes represent spatial coordinates. Left: deterministic behavioural cloning fails to reproduce the trajectory on which it was fitted, while APG succeeds. Middle: APG struggles with a stochastic policy because of the sequential nature of the task. Right: APG with periodic resetting during training improves sample efficiency.
...and 3 more figures
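
The gradient detachment and periodic resetting described in the captions of Figures 2 to 4 can be sketched on a toy problem as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the 1-D dynamics, the single-gain policy, and the names `detached_rollout`, `train_detached`, `log_states`, and `reset_every` are hypothetical, there is no RNN or scene encoder, and the gradient is written out by hand in plain Python rather than computed by autodiff. Detachment here means that the incoming state is treated as a constant when differentiating each step.

```python
def detached_rollout(k, log_states, reset_every=None, dt=1.0):
    """Roll out a_t = k * (log_{t+1} - s_t) with s_{t+1} = s_t + dt * a_t.

    Returns (loss, grad), where grad ignores any dependence of s_t on k
    (gradients from past timesteps are detached). If reset_every is set,
    the simulated state is reset to the log state every reset_every steps.
    """
    s = log_states[0]
    loss, grad = 0.0, 0.0
    for t in range(len(log_states) - 1):
        if reset_every and t > 0 and t % reset_every == 0:
            s = log_states[t]             # snap back onto the logged trajectory
        target = log_states[t + 1]
        err = target - s
        s_next = s + dt * k * err         # one differentiable dynamics step
        diff = s_next - target
        loss += diff * diff
        grad += 2.0 * diff * dt * err     # d(s_next)/dk with s held constant
        s = s_next                        # carried forward as a plain value
    return loss, grad


def train_detached(log_states, k=0.0, lr=0.1, steps=500, reset_every=2):
    """Gradient descent using the detached, per-step gradient."""
    for _ in range(steps):
        _, grad = detached_rollout(k, log_states, reset_every)
        k -= lr * grad
    return k


# Hypothetical logged trajectory: constant speed of 0.1 per step.
log_states = [0.1 * t for t in range(11)]
loss_free, _ = detached_rollout(0.0, log_states)
loss_reset, _ = detached_rollout(0.0, log_states, reset_every=2)
k_fit = train_detached(log_states)
print(loss_free, loss_reset, k_fit)
```

With an untrained gain of `k = 0`, the rollout without resets drifts further from the log at every step, while resetting every two steps keeps the error bounded, which is the discontinuous red trajectory that Figure 4 depicts. Because each step's gradient depends only on that step, the cost per update stays constant in the rollout length in this toy.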
