Cognitive TransFuser: Semantics-guided Transformer-based Sensor Fusion for Improved Waypoint Prediction
Hwan-Soo Choi, Jongoh Jeong, Young Hoo Cho, Kuk-Jin Yoon, Jong-Hwan Kim
TL;DR
The paper tackles local waypoint prediction for autonomous driving in urban environments, addressing the fragility of single-sensor systems by fusing RGB and LiDAR data with a Transformer-based fusion network. It introduces Cognitive TransFuser, a semantics-guided multi-task network that integrates two auxiliary tasks (semantic segmentation and traffic-light recognition) via dedicated heads and uses imitation learning to predict waypoint deltas $\Delta w_t$, which accumulate into waypoints via $w_t = w_{t-1} + \Delta w_t$. On the CARLA Town05 benchmarks, the approach improves Driving Score and Route Completion over the baseline by a significant margin while running in real time at up to 44.2 FPS, demonstrating the value of early fusion with semantic features and auxiliary guidance. These results highlight the practical potential of semantics-guided multi-task fusion for safer, more reliable navigation in complex urban scenes.
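The accumulation rule $w_t = w_{t-1} + \Delta w_t$ can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `accumulate_waypoints`, the 2D `(x, y)` waypoint representation, and the choice of the ego-vehicle origin `(0.0, 0.0)` as $w_0$ are all assumptions made here for illustration.

```python
from typing import List, Sequence, Tuple

Waypoint = Tuple[float, float]


def accumulate_waypoints(
    deltas: Sequence[Waypoint],
    origin: Waypoint = (0.0, 0.0),  # assumed w_0: ego vehicle at the origin
) -> List[Waypoint]:
    """Turn predicted deltas into waypoints via w_t = w_{t-1} + delta_w_t."""
    waypoints: List[Waypoint] = []
    x, y = origin
    for dx, dy in deltas:
        # Each waypoint is the previous one shifted by the predicted delta.
        x, y = x + dx, y + dy
        waypoints.append((x, y))
    return waypoints


# Hypothetical deltas standing in for three successive network outputs.
example_deltas = [(1.0, 0.0), (1.0, 0.5), (0.5, 0.5)]
example_waypoints = accumulate_waypoints(example_deltas)
print(example_waypoints)
```

Because each waypoint depends on the one before it, an error in an early delta shifts every later waypoint by the same amount, so the first predicted steps matter most.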
Abstract
Sensor fusion approaches for intelligent self-driving agents remain key to driving scene understanding given visual global contexts acquired from input sensors. Specifically, for the local waypoint prediction task, single-modality networks are still limited by their strong dependency on the sensitivity of the input sensor, and recent works therefore promote feature-level fusion of multiple sensors in practice. While it is well known that multiple data modalities encourage mutual contextual exchange, deployment to practical driving scenarios requires global 3D scene understanding in real time with minimal computation, thereby placing greater significance on the training strategy given a limited number of practically usable sensors. In this light, we exploit carefully selected auxiliary tasks that are highly correlated with the target task of interest (e.g., traffic light recognition and semantic segmentation) by fusing auxiliary task features and also using auxiliary heads for waypoint prediction based on imitation learning. Our RGB-LiDAR-based multi-task feature fusion network, coined Cognitive TransFuser, augments and exceeds the baseline network by a significant margin for safer and more complete road navigation in the CARLA simulator. We validate the proposed network on the Town05 Short and Town05 Long benchmarks through extensive experiments, achieving real-time inference at up to 44.2 FPS.
