Guiding Attention in End-to-End Driving Models

Diego Porres, Yi Xiao, Gabriel Villalonga, Alexandre Levy, Antonio M. López

TL;DR

This work addresses the opacity and data-inefficiency of vision-based end-to-end driving by guiding the model's attention during training through a KL-divergence–based loss that aligns Transformer self-attention with task-relevant semantic masks. The Attention Guidance Learning approach uses perfect or noisy masks that are needed only at training time, so it requires no masks at test time and no changes to the underlying architecture. Experiments on CARLA with the CIL++ model show improved driving performance, especially in low-data regimes, and yield more interpretable attention maps, indicating practical benefits for reliability and interpretability in autonomous driving. The method remains effective under mask noise and suggests avenues for real-world deployment and causal understanding of model behavior.
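
The KL-divergence–based attention loss described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the attention map and the mask have already been flattened to the same number of tokens, that the mask is the target distribution, and that the attention term is added to the driving loss with a scalar weight. The function names, the `weight` parameter, and the uniform fallback for empty masks are hypothetical.

```python
import math


def normalize(values):
    """Scale non-negative values so they sum to 1 (a discrete distribution)."""
    total = sum(values)
    if total <= 0.0:
        # Assumption: an all-zero mask falls back to a uniform target.
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0  # terms with p_i = 0 contribute nothing
    )


def attention_guidance_loss(attention, mask):
    """KL between the normalized mask (target) and the normalized attention.

    Assumption: both inputs are flat lists over the same token grid.
    """
    return kl_divergence(normalize(mask), normalize(attention))


def total_loss(action_loss, attention, mask, weight=1.0):
    """Driving loss plus the weighted attention term (weight is hypothetical)."""
    return action_loss + weight * attention_guidance_loss(attention, mask)


# Attention spread uniformly over 4 tokens, mask marking the middle two.
uniform_attention = [0.25, 0.25, 0.25, 0.25]
mask = [0.0, 1.0, 1.0, 0.0]
loss_uniform = attention_guidance_loss(uniform_attention, mask)
# Attention that already matches the mask gives a loss close to zero.
loss_aligned = attention_guidance_loss([0.0, 0.5, 0.5, 0.0], mask)
```

Because the mask enters only through the loss, nothing in this sketch is needed at inference time, which is consistent with the claim that masks are a training-only signal.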

Abstract

Vision-based end-to-end driving models trained by imitation learning can lead to affordable solutions for autonomous driving. However, training these models to perform well usually requires a huge amount of data, and they still lack explicit and intuitive activation maps that reveal their inner workings while driving. In this paper, we study how to guide the attention of these models to improve their driving quality and obtain more intuitive activation maps by adding a loss term during training using salient semantic maps. In contrast to previous work, our method neither requires these salient semantic maps to be available at test time nor requires modifying the architecture of the model to which it is applied. We test with both perfect and noisy salient semantic maps, the latter inspired by errors likely to be encountered with real data, and obtain encouraging results in both cases. Using CIL++ as a representative state-of-the-art model and the CARLA simulator with its standard benchmarks, we conduct experiments that show the effectiveness of our method in training better autonomous driving models, especially when data and computational resources are scarce.

Paper Structure (24 sections, 4 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Our proposed pipeline. Left: the CIL++ [Xiao:2023] architecture. Right: our proposed Attention Loss $\mathcal{L}_{\text{att}}$, obtained from masks using pre-computed data, on-board sensors, or a pre-trained network. For additional details, refer to Sections \ref{subsec:architecture} and \ref{subsec:loss_func}, respectively.
  • Figure 2: $\mathbf{x}_{c, t}$ (central RGB images at a timestep $t$) and their corresponding masks $\mathcal{M}_{c, t}$ for Town01, Town02, and Town03. For the single-lane towns (top rows), we use a maximum depth of 20 meters to generate the masks, whereas we use a maximum depth of 40 meters for the multi-lane towns. Note that the U$^2$-Net was trained only with data from Town01, so its failure to detect the lanes in Town03 is merely illustrative.
  • Figure 3: Comparison between the baseline (CIL++ default training) and our method (with $\mathcal{L}_{\text{att}}$) while increasing the amount of training data.
  • Figure 4: Driving results by incrementally adding a weather condition to the training set (2 hours of data per weather).
  • Figure 5: Visualization of the average attention map of the last layer of the Transformer Encoder using three RGB cameras as input for CIL++ (top row) and CIL++ with the Attention Loss $\mathcal{L}_{\text{att}}$ (bottom row), for Town01 (left column) and Town03 (right column).
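
The depth-limited masks mentioned in the Figure 2 caption can be illustrated with a small sketch. It assumes a per-pixel semantic class map and a per-pixel depth map in meters are available at training time, and that a pixel is salient when its class belongs to a chosen set and its depth does not exceed the maximum. The function name, the class IDs, and the choice of salient classes are hypothetical; only the 20 m and 40 m limits come from the caption.

```python
def salient_mask(semantic_ids, depth_m, salient_classes, max_depth_m):
    """Binary mask: 1 where the pixel class is salient and within max depth.

    semantic_ids and depth_m are 2D lists (rows of pixels) of equal shape.
    """
    return [
        [
            1 if (cls in salient_classes and depth <= max_depth_m) else 0
            for cls, depth in zip(cls_row, depth_row)
        ]
        for cls_row, depth_row in zip(semantic_ids, depth_m)
    ]


# Hypothetical class IDs: 1 = salient class, 2 = background.
semantic = [[1, 2], [1, 1]]
depth = [[10.0, 10.0], [30.0, 15.0]]

# 20 m limit (single-lane towns) versus 40 m limit (multi-lane towns).
mask_20 = salient_mask(semantic, depth, salient_classes={1}, max_depth_m=20.0)
mask_40 = salient_mask(semantic, depth, salient_classes={1}, max_depth_m=40.0)
```

With the larger limit, the salient pixel at 30 m is kept, which is the only difference between the two masks in this toy input.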