Deep Attention Driven Reinforcement Learning (DAD-RL) for Autonomous Decision-Making in Dynamic Environment
Jayabrata Chowdhury, Venkataramanan Shivaraman, Sumit Dangi, Suresh Sundaram, P. B. Sujit
TL;DR
The paper tackles autonomous-vehicle decision making in dynamic urban traffic by introducing DAD-RL, a lightweight framework that uses an ego-AV-centric Spatio-Temporal Attention Encoder (STAE) and a bird's-eye-view (BEV) Context Encoder (CE) to produce a compact state $s_t$ for reinforcement learning. It trains a policy with Soft Actor-Critic over mid-level actions $\mathcal{A}_t=[V_t^{target}, \Lambda_t]$, which pair a continuous target speed with a discrete lane command, and uses dense rewards to promote safety and progress. Evaluations on SMARTS show that DAD-RL outperforms state-of-the-art baselines, including the transformer-based Scene-Rep-Transformer, with a higher success rate and fewer collisions; ablations show that STAE and CE provide complementary benefits. The results suggest that a focused, attention-based state encoding can deliver competitive driving performance at lower computational cost, which supports more scalable, real-time decision making in complex traffic scenarios.
Abstract
Autonomous Vehicle (AV) decision making in urban environments is inherently challenging due to dynamic interactions with surrounding vehicles. For safe planning, an AV must understand the relative importance of the various spatio-temporal interactions in a scene. Contemporary works use large transformer architectures to encode interactions, mainly for trajectory prediction, which increases computational complexity. To address this issue without compromising spatio-temporal understanding or performance, we propose the simple Deep Attention Driven Reinforcement Learning (DAD-RL) framework, which dynamically assigns significance to surrounding vehicles and incorporates it into the ego vehicle's RL-driven decision-making process. We introduce an AV-centric spatio-temporal attention encoding (STAE) mechanism for learning the dynamic interactions with different surrounding vehicles. To capture map and route context, we employ a context encoder that extracts features from context maps. The spatio-temporal representations, combined with the contextual encoding, provide a comprehensive state representation. The resulting model is trained using the Soft Actor-Critic (SAC) algorithm. We evaluate the proposed framework on the SMARTS urban benchmarking scenarios without traffic signals and show that DAD-RL outperforms recent state-of-the-art methods. Furthermore, an ablation study underscores the importance of both the context encoder and the spatio-temporal attention encoder in achieving this performance.
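To make the idea of assigning significance to surrounding vehicles concrete, the sketch below shows generic scaled dot-product attention in which an ego feature acts as the query and neighbor features act as keys and values. This is only an illustration of the general technique, not the authors' STAE architecture: the function names (`softmax`, `ego_attention`), the feature dimensions, and the toy numbers are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ego_attention(ego_query, neighbor_keys, neighbor_values):
    """Hypothetical helper: weight neighbor features by similarity to the ego query.

    Returns (weights, context), where weights sum to 1 and context is the
    weighted sum of the neighbor value vectors.
    """
    dim = len(ego_query)
    # Scaled dot product between the ego query and each neighbor key.
    scores = [
        sum(q * k for q, k in zip(ego_query, key)) / math.sqrt(dim)
        for key in neighbor_keys
    ]
    weights = softmax(scores)
    value_dim = len(neighbor_values[0])
    # Attention-weighted sum of neighbor value vectors.
    context = [
        sum(w * value[j] for w, value in zip(weights, neighbor_values))
        for j in range(value_dim)
    ]
    return weights, context


# Toy inputs (assumed, not from the paper): one ego query, three neighbors.
ego = [1.0, 0.0]
keys = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights, context = ego_attention(ego, keys, values)
```

In this toy setup the first neighbor's key is most aligned with the ego query, so it receives the largest weight. In a learned encoder the queries, keys, and values would come from trained projections of vehicle histories rather than hand-picked vectors.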
