Cognitive TransFuser: Semantics-guided Transformer-based Sensor Fusion for Improved Waypoint Prediction

Hwan-Soo Choi, Jongoh Jeong, Young Hoo Cho, Kuk-Jin Yoon, Jong-Hwan Kim

TL;DR

The paper tackles local waypoint prediction for autonomous driving in urban environments, addressing the fragility of single-sensor systems by fusing RGB and LiDAR data with Transformer-based fusion. It introduces Cognitive TransFuser, a semantics-guided multi-task network that integrates two auxiliary tasks (semantic segmentation and traffic-light recognition) via dedicated heads and uses imitation learning to predict waypoint deltas $\Delta w_t$, which accumulate into waypoints $w_t$ through $w_t = w_{t-1} + \Delta w_t$. On the CARLA Town05 benchmarks, the approach achieves substantial gains over the baseline in Driving Score and Route Completion while running in real time at up to 44.2 FPS, demonstrating the value of early fusion with semantic features and auxiliary guidance. These results highlight the practical potential of semantics-guided multi-task fusion for safer, more reliable navigation in complex urban scenes.
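The recurrence $w_t = w_{t-1} + \Delta w_t$ can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `accumulate_waypoints`, the 2D `(x, y)` representation, the starting point at the ego origin, and the sample deltas are all assumptions made for illustration.

```python
def accumulate_waypoints(deltas, start=(0.0, 0.0)):
    """Turn per-step offsets into absolute waypoints via w_t = w_{t-1} + dw_t.

    `deltas` is a sequence of hypothetical (dx, dy) offsets; `start` is the
    assumed initial waypoint w_0 (here the ego-vehicle origin).
    """
    waypoints = []
    x, y = start
    for dx, dy in deltas:
        # Each waypoint is the previous waypoint plus the predicted delta.
        x, y = x + dx, y + dy
        waypoints.append((x, y))
    return waypoints


# Hypothetical deltas for four future time steps (illustrative values only).
deltas = [(0.0, 1.0), (0.5, 1.0), (0.5, 0.5), (1.0, 0.0)]
waypoints = accumulate_waypoints(deltas)
print(waypoints)  # [(0.0, 1.0), (0.5, 2.0), (1.0, 2.5), (2.0, 2.5)]
```

Predicting deltas rather than absolute positions means each step only has to describe a short displacement, and the running sum recovers the full local trajectory.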

Abstract

Sensor fusion approaches for intelligent self-driving agents remain key to driving scene understanding given visual global contexts acquired from input sensors. Specifically, for the local waypoint prediction task, single-modality networks are still limited by a strong dependency on the sensitivity of the input sensor, and recent works therefore promote feature-level fusion of multiple sensors in practice. While it is well known that multiple data modalities encourage mutual contextual exchange, deployment to practical driving scenarios requires global 3D scene understanding in real time with minimal computation, thereby placing greater significance on the training strategy given a limited number of practically usable sensors. In this light, we exploit carefully selected auxiliary tasks that are highly correlated with the target task of interest (e.g., traffic light recognition and semantic segmentation) by fusing auxiliary task features and also using auxiliary heads for waypoint prediction based on imitation learning. Our RGB-LiDAR-based multi-task feature fusion network, coined Cognitive TransFuser, augments and exceeds the baseline network by a significant margin for safer and more complete road navigation in the CARLA simulator. We validate the proposed network on the Town05 Short and Town05 Long benchmarks through extensive experiments, achieving real-time inference at up to 44.2 FPS.

Paper Structure

This paper contains 12 sections, 4 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Overview of Cognitive TransFuser. Given the front RGB image and the BEV LiDAR data, RGB is encoded by ResNet-34 backbone blocks in gray and LiDAR by ResNet-18 blocks in blue. We fuse the semantic segmentation feature map$^{\ast}$ into the first transformer fusion block, and sequentially extracted and merged features are used to predict the auxiliary traffic light classification label and local waypoints via a GRU sub-network. Please view in zoom and color for details.
  • Figure 2: Top: Sample observation instances of the ego vehicle at various traffic light signs (at a crossroad and along the sidewalk). Bottom: comparison of the driving scenes between the vanilla TransFuser and our Cognitive TransFuser on the same time frame (see supplementary video).
  • Figure 3: Sample semantic segmentation result using STDC-Seg50.