Addressing the Waypoint-Action Gap in End-to-End Autonomous Driving via Vehicle Motion Models

Jorge Daniel Rodríguez-Vidal, Gabriel Villalonga, Diego Porres, Antonio M. López Peña

TL;DR

This work addresses the waypoint-action gap in end-to-end autonomous driving with a differentiable vehicle-model framework. A modular three-component lifting operator, $\mathcal{F}_\phi$, rolls action sequences out into ego-frame waypoint trajectories, so that action-based policies can be trained and evaluated within waypoint-based benchmarks. The operator is instantiated with the Kinematic Bicycle Model ($\mathcal{F}_{\text{KBM}}$), a Continuous Curvature Path Planner ($\mathcal{F}_{\text{CCPP}}$), and an MLP option; each provides a differentiable bridge that preserves vehicle dynamics while supporting gradient-based training. Training is end-to-end with a waypoint loss $L_{\mathrm{wp}}$ whose gradients flow through the lifting operator back to the policy. Across NAVSIM (navhard and navtest), Bench2Drive, and CARLA, the framework achieves state-of-the-art or near-state-of-the-art results for action-based policies, with improved training stability and stronger offline–online correlation with driving outcomes.
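The lifting idea can be illustrated with a minimal sketch of a kinematic-bicycle rollout, split into the three components the summary mentions. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names (`psi`, `f_kbm`, `h`, `lift`), the `tanh` squashing, the acceleration and steering limits, the wheelbase, and the explicit-Euler update. Only the horizon of 8 waypoints at 0.5 s spacing mirrors the setting quoted in the Figure 3 caption. The code is plain Python and runs the forward pass only; propagating gradients through it would require writing the same arithmetic in an autodiff framework.

```python
import math

def psi(raw_action, max_accel=3.0, max_steer=0.6):
    """Map an unbounded policy output to bounded controls (hypothetical limits)."""
    a_raw, s_raw = raw_action
    accel = max_accel * math.tanh(a_raw)   # longitudinal acceleration, m/s^2
    steer = max_steer * math.tanh(s_raw)   # front-wheel steering angle, rad
    return accel, steer

def f_kbm(state, control, dt=0.5, wheelbase=2.9):
    """One explicit-Euler step of a rear-axle kinematic bicycle model."""
    x, y, yaw, v = state
    accel, steer = control
    return (
        x + v * math.cos(yaw) * dt,
        y + v * math.sin(yaw) * dt,
        yaw + (v / wheelbase) * math.tan(steer) * dt,
        v + accel * dt,
    )

def h(state):
    """Read out the (x, y) waypoint from the internal state."""
    return (state[0], state[1])

def lift(actions, v0):
    """Roll a sequence of raw actions out into ego-frame waypoints."""
    state = (0.0, 0.0, 0.0, v0)  # ego frame: origin, heading along +x
    waypoints = []
    for raw in actions:
        state = f_kbm(state, psi(raw))
        waypoints.append(h(state))
    return waypoints

# Zero actions at 5 m/s: a straight line, 2.5 m per 0.5 s step.
straight = lift([(0.0, 0.0)] * 8, v0=5.0)
print(straight[-1])  # (20.0, 0.0)
```

Every operation in the rollout is smooth in the raw actions, which is the property that lets a loss defined on the waypoints be used to train the action outputs.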

Abstract

End-to-End Autonomous Driving (E2E-AD) systems are typically grouped by the nature of their outputs: (i) waypoint-based models that predict a future trajectory, and (ii) action-based models that directly output throttle, steering, and brake. Most recent benchmark protocols and training pipelines are waypoint-based, which makes action-based policies harder to train and compare and slows their progress. To bridge this waypoint-action gap, we propose a novel, differentiable vehicle-model framework that rolls out predicted action sequences to their corresponding ego-frame waypoint trajectories while supervising in waypoint space. Our approach enables action-based architectures to be trained and evaluated, for the first time, within waypoint-based benchmarks without modifying the underlying evaluation protocol. We extensively evaluate our framework across multiple challenging benchmarks and observe consistent improvements over the baselines. In particular, on NAVSIM navhard our approach achieves state-of-the-art performance. Our code will be made publicly available upon acceptance.

Paper Structure

This paper contains 77 sections, 1 theorem, 39 equations, 9 figures, 8 tables, 1 algorithm.

Key Result

Proposition 3.1

Let $\mathcal{F}_\phi: \mathbb{R}^{C_f \times d_u} \times \mathcal{S} \to \mathbb{R}^{C_f \times 2}$ be a lifting operator decomposed as $\psi_\phi: \mathbb{R}^{d_u} \to \mathcal{U}$, $f_\phi: \mathcal{Z} \times \mathcal{U} \to \mathcal{Z}$, and $h: \mathcal{Z} \to \mathbb{R}^2$, where $\mathcal{U}$

Figures (9)

  • Figure 1: Bridging the waypoint-action gap with our differentiable framework. Black arrows denote the forward pass and red arrows denote gradient flow. Here, $\mathcal{N}_\theta$ is the policy network, which takes as input the current observations $o_{t}$ and the high-level command $c_{t}$. Two paradigms exist for the policy outputs: planning-based, which outputs the future waypoints $\mathbf{w}_{t}$ the ego vehicle will follow, and control-based, which outputs low-level controls $\mathbf{a}_{t}$. The lifting operator $\mathcal{F}_\phi$ takes these actions together with the vehicle state $s_{t}$ and lifts them to a set of waypoints. This allows the gradients of a waypoint-based (planning) loss to train an action-based policy.
  • Figure 2: Pearson correlation of metric errors across different CIL++ models.
  • Figure 3: Numerical error vs. waypoint. Average and std. dev. of the $L_1$ waypoint error (meters) for $C_f=8$, $\Delta t = 0.5$s. CCPP maintains lower error throughout the horizon.
  • Figure 4: Qualitative trajectory predictions across diverse NAVSIM driving scenes. Each row shows a different example scene. Green waypoints represent ground-truth (GT), and red waypoints represent agent predictions.
  • Figure 5: LTF (Chitta et al., 2022) qualitative predictions across four NAVSIM scenes. From left to right: example scenes 1, 2, 3, and 4.
  • ...and 4 more figures
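
The per-waypoint statistic described in the Figure 3 caption (average and standard deviation of the $L_1$ waypoint error along the horizon) can be sketched as follows. This is an illustrative reading, not the paper's evaluation code: the function name `l1_error_per_waypoint` is hypothetical, the $L_1$ error is assumed to be $|\Delta x| + |\Delta y|$ in meters, and the population standard deviation is assumed.

```python
import statistics

def l1_error_per_waypoint(pred_trajs, gt_trajs):
    """Return one (mean, std) pair per waypoint index, taken over trajectories.

    pred_trajs, gt_trajs: lists of trajectories, each a list of (x, y) tuples
    of equal length (the horizon).
    """
    horizon = len(gt_trajs[0])
    stats = []
    for k in range(horizon):
        # L1 distance between predicted and ground-truth waypoint k (assumed form).
        errs = [
            abs(p[k][0] - g[k][0]) + abs(p[k][1] - g[k][1])
            for p, g in zip(pred_trajs, gt_trajs)
        ]
        stats.append((statistics.fmean(errs), statistics.pstdev(errs)))
    return stats

# Two toy trajectories with a horizon of two waypoints.
gt = [[(1.0, 0.0), (2.0, 1.0)], [(1.0, 0.0), (2.0, 1.0)]]
pred = [[(1.0, 0.0), (2.0, 1.0)], [(1.0, 0.0), (2.0, 3.0)]]
print(l1_error_per_waypoint(pred, gt))  # [(0.0, 0.0), (1.0, 1.0)]
```

Plotting the mean with a band of one standard deviation against the waypoint index gives a curve of the kind the caption describes.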

Theorems & Definitions (3)

  • Proposition 3.1: Determinism and differentiability of lifting operators
  • proof
  • proof: Proof of Proposition 3.1