Hidden Biases of End-to-End Driving Models

Bernhard Jaeger, Kashyap Chitta, Andreas Geiger

TL;DR

The paper investigates why end-to-end driving methods improve on CARLA and identifies two recurring biases: a lateral recovery shortcut tied to target-point conditioning, and longitudinal averaging of multi-modal future velocities that are encoded as waypoints. It shows that a transformer-based pooling mechanism and data augmentation mitigate the shortcut, and it proposes disentangling target speed from path prediction, using a confidence-weighted controller to handle uncertainty. Building on these insights, TF++ (TransFuser++) combines architectural refinements, two-stage training, and dataset scaling to achieve state-of-the-art results on the Longest6 and LAV benchmarks while reducing data requirements. The work argues that understanding architectural biases and representation ambiguities is important for robust, interpretable end-to-end driving systems, and it discusses limitations and broader implications for real-world deployment.
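To make the idea of a confidence-weighted target speed concrete, here is a minimal sketch. It assumes the model outputs a probability for each of a few discrete target-speed classes and that the controller receives the probability-weighted average. The bin values in `SPEED_BINS` and the function name `confidence_weighted_speed` are hypothetical and chosen for illustration; they are not taken from the paper.

```python
def confidence_weighted_speed(probs, speed_bins):
    """Return the expected target speed under the predicted class probabilities.

    probs      -- predicted probability per speed class (need not be normalized)
    speed_bins -- speed in m/s represented by each class (hypothetical values)
    """
    if len(probs) != len(speed_bins):
        raise ValueError("probs and speed_bins must have the same length")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Weight each candidate speed by the model's confidence in it.
    return sum(p * v for p, v in zip(probs, speed_bins)) / total


# Hypothetical speed classes in m/s (assumption, not from the paper).
SPEED_BINS = [0.0, 2.0, 5.0, 8.0]

# Confident prediction: almost all mass on the fastest class.
confident = confidence_weighted_speed([0.02, 0.03, 0.05, 0.90], SPEED_BINS)

# Uncertain prediction: mass split between "stop" and "drive fast".
uncertain = confidence_weighted_speed([0.45, 0.05, 0.05, 0.45], SPEED_BINS)
```

Under these assumptions, the uncertain prediction yields a lower commanded speed than the confident one, so the vehicle slows down when the model hesitates between stopping and driving. Unlike the implicit averaging of waypoints, this weighting is explicit and its inputs, the class probabilities, can be inspected.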

Abstract

End-to-end driving systems have recently made rapid progress, in particular on CARLA. Independent of their major contribution, they introduce changes to minor system components. Consequently, the source of improvements is unclear. We identify two biases that recur in nearly all state-of-the-art methods and are critical for the observed progress on CARLA: (1) lateral recovery via a strong inductive bias towards target point following, and (2) longitudinal averaging of multimodal waypoint predictions for slowing down. We investigate the drawbacks of these biases and identify principled alternatives. By incorporating our insights, we develop TF++, a simple end-to-end method that ranks first on the Longest6 and LAV benchmarks, gaining 11 driving score over the best prior work on Longest6.
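The second bias can be illustrated with a toy calculation. The sketch below assumes a deterministic predictor trained with a squared-error objective on a situation whose ground-truth future speed is bimodal (either stop or keep driving). The squared-error loss and the numbers are illustrative assumptions, not necessarily the loss or data used in the paper.

```python
import statistics


def l2_loss(pred, targets):
    """Mean squared error of a single constant prediction against all targets."""
    return sum((pred - t) ** 2 for t in targets) / len(targets)


def best_constant_l2(targets):
    """The constant minimizing mean squared error is the arithmetic mean."""
    return statistics.fmean(targets)


# Toy bimodal ground truth: in half of the look-alike situations the expert
# stops (0 m/s), in the other half it keeps driving (8 m/s).
targets = [0.0] * 5 + [8.0] * 5

pred = best_constant_l2(targets)   # lands between the two modes
loss_at_mean = l2_loss(pred, targets)
loss_at_stop = l2_loss(0.0, targets)
loss_at_drive = l2_loss(8.0, targets)
```

In this toy setting the loss-minimizing prediction is a speed that the expert never chose, which lies between the two modes. This is the interpolation that the abstract describes as longitudinal averaging: the resulting slow-down looks like caution, but it is a side effect of the objective rather than an explicit decision.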

Paper Structure

This paper contains 25 sections, 1 equation, 10 figures, and 17 tables.

Figures (10)

  • Figure 1: Hidden biases. (a) When outside their training distribution, current methods extrapolate waypoint predictions to the nearest target point, helping them recover. (b) The future velocity is multi-modal, but current methods commit to a single plan, which leads to interpolation.
  • Figure 2: Extrapolation to target point. In unknown situations, TP-conditioned methods extrapolate their waypoints towards target points. This periodically resets steering errors and is a form of implicit map-based recovery. However, relying on extrapolation is a shortcut that can lead to catastrophic errors in certain situations (see Fig. \ref{fig:target_point_failure_tcp} and Fig. \ref{fig:target_point_failure}).
  • Figure 3: Target point shortcut. When TP-conditioned methods extrapolate to spatially distant waypoints, they incur large steering errors. Replacing global average pooling in TransFuser with a cross-attention mechanism mitigates the issue.
  • Figure 4: Pooling. Existing approaches vectorize feature grids either by global average pooling (top, e.g. Chitta2022PAMI, Chen2022CVPRa) or with attention mechanisms (bottom, e.g. Shao2022CORL, Wu2022NeurIPS). The latter retains spatial information gained via auxiliary tasks.
  • Figure 5: TransFuser interpolates between modes.
  • ...and 5 more figures
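
The contrast drawn in the Figure 4 caption, global average pooling versus attention-based pooling, can be sketched on a tiny feature grid. This is a simplified illustration under stated assumptions: a single fixed query, additive positional embeddings, and no learned projections. The names `global_average_pool`, `attention_pool`, `POS`, and `QUERY` are hypothetical, and the sketch is not the TransFuser or TF++ architecture.

```python
import math


def global_average_pool(cells):
    """Average all cell feature vectors; cell positions do not affect the result."""
    n = len(cells)
    dim = len(cells[0])
    return [sum(cell[d] for cell in cells) / n for d in range(dim)]


def attention_pool(cells, pos, query):
    """Single-query attention over cells with additive positional embeddings."""
    # Tokens carry both content and position.
    tokens = [[c + p for c, p in zip(cell, pe)] for cell, pe in zip(cells, pos)]
    scores = [sum(q * t for q, t in zip(query, tok)) for tok in tokens]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(query)
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]


# Two flattened 1x3 grids with the same content at different locations.
grid_a = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # activation in the first cell
grid_b = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]]  # activation in the last cell

# Hypothetical fixed positional embeddings and query (assumptions).
POS = [[0.0, 0.0], [0.0, 0.5], [0.0, 1.0]]
QUERY = [1.0, 1.0]

gap_a, gap_b = global_average_pool(grid_a), global_average_pool(grid_b)
att_a, att_b = attention_pool(grid_a, POS, QUERY), attention_pool(grid_b, POS, QUERY)
```

In this toy example the averaged vectors for the two grids are identical, so the location of the activation is lost, while the attention-pooled vectors differ because each token carries its position. This is one simple way to see why attention-style pooling can preserve spatial information that averaging discards.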