Guiding Attention in End-to-End Driving Models
Diego Porres, Yi Xiao, Gabriel Villalonga, Alexandre Levy, Antonio M. López
TL;DR
This work addresses the opacity and data inefficiency of vision-based end-to-end driving by guiding the model's attention during training with a KL-divergence-based loss that aligns Transformer self-attention with task-relevant semantic masks. The Attention Guidance Learning approach needs the masks, which may be noisy, only at training time; it requires no masks at test time and does not modify the underlying architecture. Experiments on CARLA with the CIL++ model show significant improvements in driving quality, especially in low-data regimes, and yield more interpretable attention maps, indicating practical benefits for reliability and interpretability in autonomous driving. The method is robust to mask noise and suggests avenues for real-world deployment and causal understanding of model behavior.
Abstract
Vision-based end-to-end driving models trained by imitation learning can lead to affordable solutions for autonomous driving. However, training such models to perform well usually requires a huge amount of data, and the resulting models still lack explicit, intuitive activation maps that reveal their inner workings while driving. In this paper, we study how to guide the attention of these models, both to improve their driving quality and to obtain more intuitive activation maps, by adding a loss term during training that uses salient semantic maps. In contrast to previous work, our method neither requires these salient semantic maps at test time nor requires modifying the architecture of the model to which it is applied. We test with both perfect and noisy salient semantic maps, the latter inspired by the errors likely to be encountered with real data, and obtain encouraging results in both cases. Using CIL++ as a representative state-of-the-art model and the CARLA simulator with its standard benchmarks, we conduct experiments that show the effectiveness of our method in training better autonomous driving models, especially when data and computational resources are scarce.
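
To make the idea of an attention-guiding loss term concrete, the following is a minimal NumPy sketch of a KL-divergence penalty between a saliency mask and an attention map. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `weight` parameter, the smoothing constant `eps`, the direction of the divergence (mask as target, attention as approximation), and the assumption that attention has already been reduced to one weight per image token are all hypothetical choices made for this example.

```python
import numpy as np


def attention_guidance_loss(attn, mask, eps=1e-8):
    """Mean KL(mask || attn) over a batch (hypothetical helper).

    attn: (B, N) non-negative attention weights over N image tokens.
    mask: (B, N) non-negative saliency values, e.g. a semantic mask
          downsampled to the token grid (assumed layout).
    """
    attn = np.asarray(attn, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)

    # Turn both inputs into probability distributions over tokens.
    # eps keeps the logarithms finite; an all-zero mask becomes uniform.
    p = (mask + eps) / (mask + eps).sum(axis=1, keepdims=True)
    q = (attn + eps) / (attn + eps).sum(axis=1, keepdims=True)

    # KL(p || q) per sample, then averaged over the batch.
    kl = (p * (np.log(p) - np.log(q))).sum(axis=1)
    return float(kl.mean())


def total_loss(driving_loss, attn, mask, weight=1.0):
    """Driving loss plus the weighted attention term (assumed form)."""
    return driving_loss + weight * attention_guidance_loss(attn, mask)
```

Because the mask enters only through the loss, a model trained this way needs no mask input at inference, which matches the property stated in the abstract. The term is close to zero when attention already concentrates on the salient tokens and grows as attention mass moves elsewhere.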
