Exploring Contextual Representation and Multi-Modality for End-to-End Autonomous Driving

Shoaib Azam; Farzeen Munir; Ville Kyrki; Moongu Jeon; Witold Pedrycz

TL;DR

The paper addresses the challenge of achieving human-like contextual understanding in end-to-end autonomous driving by fusing three RGB camera views with top-down BEV semantic maps. It proposes a vision-transformer–based perception module that jointly encodes cross-modal spatial and temporal context and feeds a GRU-based auto-regressive waypoint predictor that generates future trajectories. Open-loop results on nuScenes show better displacement accuracy (avg L2 ≈ 0.66 m) than strong baselines, while closed-loop CARLA experiments on the Town05 Long and Longest6 benchmarks show improved driving scores and route completion with fewer infractions. The work highlights the practical benefit of combining BEV-derived context with transformer-based fusion for policy learning in autonomous driving, and suggests extending the framework with additional sensors and more sophisticated controllers in future work.
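The average L2 figure quoted above is a displacement metric: the mean Euclidean distance between predicted and ground-truth waypoints. The following is a minimal sketch of that calculation, not the paper's evaluation code. The function name `average_l2_error` and the waypoint values are hypothetical, and the exact horizons and averaging protocol used for nuScenes are not stated on this page.

```python
import math


def average_l2_error(predicted, ground_truth):
    """Mean Euclidean distance between matching 2D waypoints (same unit as input)."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need two non-empty waypoint lists of equal length")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


# Hypothetical waypoints in metres, made up purely for illustration.
predicted = [(0.0, 1.0), (0.0, 2.0), (3.0, 4.0)]
ground_truth = [(0.0, 1.0), (0.0, 3.0), (0.0, 0.0)]

# Per-waypoint errors are 0, 1 and 5 metres, so the average is 2.0.
error = average_l2_error(predicted, ground_truth)
print(error)
```

Lower values mean the predicted trajectory stays closer to the recorded one. Because the metric is computed against logged data, it says nothing about how the vehicle would recover from its own mistakes, which is what the closed-loop CARLA benchmarks measure.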

Abstract

Learning contextual and spatial environmental representations enhances an autonomous vehicle's hazard anticipation and decision-making in complex scenarios. Recent perception systems enhance spatial understanding with sensor fusion but often lack full environmental context. Humans, when driving, naturally employ neural maps that integrate various factors such as historical data, situational subtleties, and behavioral predictions of other road users to form a rich contextual understanding of their surroundings. This neural map-based comprehension is integral to making informed decisions on the road. In contrast, even with their significant advancements, autonomous systems have yet to fully harness this depth of human-like contextual understanding. Motivated by this, our work draws inspiration from human driving patterns and seeks to formalize the sensor fusion approach within an end-to-end autonomous driving framework. We introduce a framework that integrates three cameras (left, right, and center) to emulate the human field of view, coupled with top-down bird's-eye-view semantic data to enhance contextual representation. The sensor data is fused and encoded using a self-attention mechanism, and the encoded features are passed to an auto-regressive waypoint prediction module. We treat feature representation as a sequential problem, employing a vision transformer to distill the contextual interplay between sensor modalities. The efficacy of the proposed method is experimentally evaluated in both open-loop and closed-loop settings. Our method achieves a displacement error of 0.67 m in open-loop settings, surpassing current methods by 6.9% on the nuScenes dataset. In closed-loop evaluations on CARLA's Town05 Long and Longest6 benchmarks, the proposed method improves driving performance and route completion and reduces infractions.
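To make the fusion step concrete, here is a minimal sketch of self-attention over a sequence of per-sensor feature tokens. It is an illustration under strong simplifying assumptions, not the authors' architecture: each input contributes a single hypothetical 4-dimensional token, queries, keys and values are the tokens themselves (no learned projections), and there is one head with no positional or velocity embedding.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with Q = K = V = tokens."""
    dim = len(tokens[0])
    fused = []
    for query in tokens:
        # Similarity of this token to every token in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        # Each output token is a weighted mix of all input tokens.
        fused.append([sum(w * value[i] for w, value in zip(weights, tokens))
                      for i in range(dim)])
    return fused


# Hypothetical one-hot features, one token per input, for illustration only.
features = {
    "left_camera": [1.0, 0.0, 0.0, 0.0],
    "center_camera": [0.0, 1.0, 0.0, 0.0],
    "right_camera": [0.0, 0.0, 1.0, 0.0],
    "bev_semantic_map": [0.0, 0.0, 0.0, 1.0],
}
tokens = list(features.values())
fused_tokens = self_attention(tokens)
for name, token in zip(features, fused_tokens):
    print(name, [round(v, 3) for v in token])
```

Because every token attends to every other token, each fused token carries a non-zero contribution from all three cameras and the BEV map. This is the sense in which treating the features as a sequence lets attention capture the interplay between modalities.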

Paper Structure (16 sections, 9 equations, 4 figures, 4 tables)

This paper contains 16 sections, 9 equations, 4 figures, 4 tables.

Introduction
Related Work
Multi-modal End-to-end Learning Frameworks for Autonomous Driving
BEV Representation End-to-end Autonomous Driving
Transformer in End-to-End Autonomous driving
Method
Problem Formulation
Model Architecture
Perception Module
Waypoint Prediction Module
Experiments
Open-loop Experiments on nuScenes
Closed-loop Experiments on CARLA
Training Details
Results
...and 1 more sections

Figures (4)

Figure 1: The architecture of the proposed method, which comprises two modules: a perception block and a waypoint prediction block. The perception module extracts features from the three input RGB cameras (center, left, right) and the top-down semantic maps. These features are then embedded together with the velocity information and passed to the transformer encoder. The encoded features are then passed to the GRU-based waypoint prediction module, which generates the next waypoints. (Best viewed in color.)
Figure 2: Qualitative results for the proposed method in different driving conditions using nuScenes dataset in open-loop evaluation.
Figure 3: Qualitative results for the proposed method in different driving conditions on Town05 Long benchmark.
Figure 4: Qualitative results for the proposed method in different driving conditions on Longest6 benchmark.
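The waypoint prediction block described in the Figure 1 caption can be sketched as an auto-regressive rollout around a GRU cell. This is a hedged illustration, not the paper's implementation. The names `TinyGRUCell` and `rollout_waypoints`, the random weights, the hidden size, the omission of biases, and the choice to predict per-step offsets from the ego position are all assumptions made for the example.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyGRUCell:
    """Bias-free GRU cell with update (z), reset (r) and candidate (n) gates."""

    def __init__(self, input_size, hidden_size, rng):
        self.W = {g: random_matrix(hidden_size, input_size, rng) for g in "zrn"}
        self.U = {g: random_matrix(hidden_size, hidden_size, rng) for g in "zrn"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in
             zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in
             zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        n = [math.tanh(a + ri * b) for a, ri, b in
             zip(matvec(self.W["n"], x), r, matvec(self.U["n"], h))]
        # Blend the previous hidden state with the candidate state.
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def rollout_waypoints(cell, w_out, context, num_waypoints):
    """Feed each predicted waypoint back in as the next GRU input."""
    h = list(context)          # hidden state seeded with the encoded features
    waypoint = [0.0, 0.0]      # start at the ego vehicle position
    waypoints = []
    for _ in range(num_waypoints):
        h = cell.step(waypoint, h)
        dx, dy = matvec(w_out, h)  # assumed: the cell output maps to an offset
        waypoint = [waypoint[0] + dx, waypoint[1] + dy]
        waypoints.append(tuple(waypoint))
    return waypoints


rng = random.Random(0)
hidden_size = 8
cell = TinyGRUCell(input_size=2, hidden_size=hidden_size, rng=rng)
w_out = random_matrix(2, hidden_size, rng)
# Stand-in for the transformer encoder output (random values, illustration only).
context = [rng.uniform(-1, 1) for _ in range(hidden_size)]
trajectory = rollout_waypoints(cell, w_out, context, num_waypoints=4)
print(trajectory)
```

The loop is what makes the predictor auto-regressive: each step sees the waypoint produced by the previous step, so later waypoints depend on earlier ones instead of being regressed independently.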
