Bilevel Multi-Armed Bandit-Based Hierarchical Reinforcement Learning for Interaction-Aware Self-Driving at Unsignalized Intersections

Zengqi Peng; Yubin Wang; Lei Zheng; Jun Ma

TL;DR

This paper addresses interaction-aware decision-making for autonomous vehicles at unsignalized intersections, where surrounding vehicles (SVs) behave in uncertain, multi-modal ways. It proposes BiM-ACPPO, a hierarchical framework that combines automated curriculum scheduling, driven by an Exp3.S-based bilevel multi-armed bandit (BiMAB), with a high-level PPO policy that produces intermediate references for a low-level RL-guided MPC, improving sample efficiency and generalization. The bilevel curriculum models task difficulty through SV-count clusters and task-type arms, and uses rescaled rewards and stabilized updates to guide curriculum shifts. Experiments in CARLA show higher success rates than the baselines and robust generalization to unseen scenarios, including zero-shot single-lane intersections and few-shot overtaking tasks, which indicates the framework's practical potential for safe, efficient urban driving.

Abstract

In this work, we present BiM-ACPPO, a bilevel multi-armed bandit (BiMAB)-based hierarchical reinforcement learning (RL) framework for interaction-aware decision-making and planning at unsignalized intersections. The framework explicitly accounts for the uncertainties associated with surrounding vehicles (SVs), including those arising from driver intentions, interactive behaviors, and the varying number of SVs. Intermediate decision variables are introduced so that the high-level RL policy provides an interaction-aware reference that guides the low-level model predictive control (MPC) and further enhances the generalization ability of the framework. By leveraging the structured nature of self-driving at unsignalized intersections, we model the training of the RL policy as a bilevel curriculum learning task, which is solved by the proposed Exp3.S-based BiMAB algorithm. The training curricula are adjusted dynamically, which improves the sample efficiency of RL training. Comparative experiments in the high-fidelity CARLA simulator indicate that our approach outperforms all baseline methods. Furthermore, experimental results in two new urban driving scenarios demonstrate the generalization performance of the proposed method.
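To make the curriculum idea concrete, the following is a minimal sketch of a standard Exp3.S bandit (as defined by Auer et al.) stacked into two levels. It is not the authors' BiMAB implementation: the class name `Exp3S`, the helpers `sample_task` and `report`, the choice of three SV-count clusters and two task types, the hyperparameters `gamma` and `alpha`, and the decision to feed the same rescaled reward to both levels are all assumptions made for illustration.

```python
import math
import random


class Exp3S:
    """Standard Exp3.S adversarial bandit over k arms (illustrative)."""

    def __init__(self, k, gamma=0.1, alpha=0.01, seed=None):
        self.k = k
        self.gamma = gamma  # exploration rate in (0, 1]
        self.alpha = alpha  # weight-sharing rate; lets the bandit track drift
        self.weights = [1.0 / k] * k
        self.rng = random.Random(seed)

    def probabilities(self):
        # Mix the weight-proportional distribution with uniform exploration.
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.k
                for w in self.weights]

    def select(self):
        probs = self.probabilities()
        return self.rng.choices(range(self.k), weights=probs)[0]

    def update(self, arm, reward):
        """Update after pulling `arm`; `reward` must be rescaled to [0, 1]."""
        probs = self.probabilities()
        total = sum(self.weights)
        share = math.e * self.alpha / self.k * total
        new_weights = []
        for i, w in enumerate(self.weights):
            # Importance-weighted reward estimate: nonzero only for the pulled arm.
            est = reward / probs[arm] if i == arm else 0.0
            new_weights.append(w * math.exp(self.gamma * est / self.k) + share)
        # Renormalise to avoid overflow; the probabilities are scale-invariant.
        norm = sum(new_weights)
        self.weights = [w / norm for w in new_weights]


# Hypothetical two-level layout: one bandit over SV-count clusters,
# and one bandit over task types inside each cluster.
N_CLUSTERS, N_TASK_TYPES = 3, 2
cluster_bandit = Exp3S(k=N_CLUSTERS, seed=0)
task_bandits = [Exp3S(k=N_TASK_TYPES, seed=c + 1) for c in range(N_CLUSTERS)]


def sample_task():
    """Pick a cluster first, then a task type within that cluster."""
    cluster = cluster_bandit.select()
    task = task_bandits[cluster].select()
    return cluster, task


def report(cluster, task, reward):
    """Assumption: both levels receive the same rescaled reward."""
    cluster_bandit.update(cluster, reward)
    task_bandits[cluster].update(task, reward)
```

In a training loop, one would call `sample_task()` before each episode batch, train the policy on the selected task, and pass a learning-progress signal rescaled to [0, 1] to `report(...)`. How the paper actually defines and propagates that signal between the two levels is not stated in the abstract, so that part of the sketch should be read as a placeholder.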
