Bilevel Multi-Armed Bandit-Based Hierarchical Reinforcement Learning for Interaction-Aware Self-Driving at Unsignalized Intersections

Zengqi Peng, Yubin Wang, Lei Zheng, Jun Ma

TL;DR

This paper addresses interaction-aware decision-making for autonomous vehicles at unsignalized intersections, where surrounding vehicles (SVs) behave in uncertain, multi-modal ways. It proposes BiM-ACPPO, a hierarchical framework in which an Exp3.S-based bilevel multi-armed bandit (BiMAB) schedules the training curriculum automatically and a high-level PPO policy produces intermediate references for a low-level RL-guided MPC, with the aim of improving sample efficiency and generalization. The bilevel curriculum models task difficulty with SV-count clusters at the first level and task-type arms at the second, and uses rescaled rewards and stabilized updates to guide curriculum shifts. Experiments in CARLA show higher success rates than the baselines and generalization to unseen scenarios, including zero-shot single-lane intersections and few-shot overtaking tasks, which points to the framework’s practical potential for safe, efficient urban driving.

Abstract

In this work, we present BiM-ACPPO, a bilevel multi-armed bandit-based hierarchical reinforcement learning framework for interaction-aware decision-making and planning at unsignalized intersections. The framework explicitly accounts for the uncertainties associated with surrounding vehicles (SVs), namely those stemming from drivers' intentions, interactive behaviors, and the varying number of SVs. Intermediate decision variables are introduced so that the high-level RL policy can provide an interaction-aware reference that guides the low-level model predictive control (MPC) and improves the generalization ability of the framework. By leveraging the structured nature of self-driving at unsignalized intersections, the training problem of the RL policy is modeled as a bilevel curriculum learning task, which is addressed by the proposed Exp3.S-based BiMAB algorithm. The training curricula are adjusted dynamically, which improves the sample efficiency of RL training. Comparative experiments are conducted in the high-fidelity CARLA simulator, and the results indicate that our approach outperforms all baseline methods. Furthermore, experimental results in two new urban driving scenarios demonstrate the generalization performance of the proposed method.
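The bilevel curriculum scheduling described in the abstract can be illustrated with a minimal Python sketch. It assumes the standard Exp3.S update of Auer et al. (mixing probabilities with a uniform term and adding a weight-sharing term) and stacks one bandit over SV-count clusters on top of one bandit per cluster over task types. The class names `Exp3S` and `BilevelBandit`, the hyperparameters `gamma` and `alpha`, and the placeholder reward in [0, 1] are assumptions made for illustration; the paper's reward rescaling and stabilized updates are not reproduced here.

```python
import math
import random


class Exp3S:
    """Standard Exp3.S bandit over n_arms arms (illustrative, not the paper's code)."""

    def __init__(self, n_arms, gamma=0.1, alpha=0.01, rng=None):
        self.n_arms = n_arms
        self.gamma = gamma    # exploration rate (assumed value)
        self.alpha = alpha    # weight-sharing rate (assumed value)
        self.weights = [1.0] * n_arms
        self.rng = rng if rng is not None else random.Random()

    def probabilities(self):
        # Mix the normalized weights with a uniform distribution.
        total = sum(self.weights)
        k = self.n_arms
        return [(1 - self.gamma) * w / total + self.gamma / k for w in self.weights]

    def sample(self):
        probs = self.probabilities()
        arm = self.rng.choices(range(self.n_arms), weights=probs)[0]
        return arm, probs[arm]

    def update(self, arm, prob, reward):
        # reward is expected in [0, 1]; only the pulled arm gets a nonzero estimate.
        k = self.n_arms
        share = math.e * self.alpha / k * sum(self.weights)
        new_weights = []
        for j, w in enumerate(self.weights):
            estimate = reward / prob if j == arm else 0.0
            new_weights.append(w * math.exp(self.gamma * estimate / k) + share)
        # Renormalize to avoid overflow; the probabilities are scale-invariant.
        norm = sum(new_weights)
        self.weights = [w / norm for w in new_weights]


class BilevelBandit:
    """First level picks an SV-count cluster, second level picks a task type."""

    def __init__(self, n_clusters, n_tasks, seed=0):
        rng = random.Random(seed)
        self.top = Exp3S(n_clusters, rng=rng)
        self.low = [Exp3S(n_tasks, rng=rng) for _ in range(n_clusters)]

    def select(self):
        cluster, p_cluster = self.top.sample()
        task, p_task = self.low[cluster].sample()
        return (cluster, p_cluster), (task, p_task)

    def update(self, cluster_choice, task_choice, reward):
        cluster, p_cluster = cluster_choice
        task, p_task = task_choice
        self.top.update(cluster, p_cluster, reward)
        self.low[cluster].update(task, p_task, reward)
```

In a training loop, one would call `select()` before each episode, run the episode on the chosen cluster and task, and pass a reward signal scaled to [0, 1] to `update()`. What that signal should be (for example, a measure of learning progress) is a design choice this sketch leaves open.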

Paper Structure

This paper contains 25 sections, 33 equations, 8 figures, 3 tables, 2 algorithms.

Figures (8)

  • Figure 1: Overview of the task scenario. The EV is illustrated in red, while SVs are illustrated in blue. The uncertainties include the task and intention uncertainty of SVs and the varying number of SVs. All SVs will respond to the behavior of other vehicles.
  • Figure 2: Overview of the BiM-ACPPO approach for interaction-aware self-driving at unsignalized intersections with interactive SVs. The EV and SVs are illustrated in red and blue, respectively. The proposed BiMAB framework models the training process as a bilevel clustered structure. The variables in the first and second layers are the number of SVs and the task type of the EV, respectively. The solid car and the semi-transparent cars within the BiMAB module denote the start position and the target positions of the EV, respectively.
  • Figure 3: Potential collision points for EV (red car) crossing the intersection with the SV (blue car). L, S, and R represent the number of potential collision points when the EV performs diverse tasks.
  • Figure 4: Visualization of the designed action space of the RL agent.
  • Figure 5: Weight updates of four clusters in BiMAB during the training process.
  • ...and 3 more figures

Theorems & Definitions (1)

  • Remark 1