Multi-agent Uncertainty-Aware Pessimistic Model-Based Reinforcement Learning for Connected Autonomous Vehicles

Ruoqi Wen, Rongpeng Li, Xing Xu, Zhifeng Zhao

TL;DR

This work tackles sample inefficiency and reward-design challenges in autonomous-vehicle control by introducing MA-PMBRL, a fully decentralized, pessimistic, multi-agent model-based RL framework with PAC guarantees under partial data coverage and communication constraints. It combines a virtual dynamics model learned from fixed data with a constrained, pessimistic Soft Actor-Critic objective optimized via Projected Gradient Descent, which bounds potential failures and improves robustness. Theoretical results provide per-agent and group PAC bounds that leverage inter-agent communication through clique-cover structures, demonstrating improved sample efficiency in multi-agent settings. Empirically, MA-PMBRL achieves faster convergence and higher utility than competitive baselines in mixed-autonomy traffic scenarios, illustrating its practical value for scalable, reliable CAV decision-making under realistic communication limits.
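To make the clique-cover idea concrete, the following is a minimal sketch and not the paper's construction. It assumes vehicles sit at scalar positions along a road and that two vehicles can communicate when they are within range `d` of each other; the function name `greedy_clique_cover` and the toy positions are hypothetical. Under these assumptions, a left-to-right sweep groups vehicles so that every pair inside a group is within range, which makes each group a clique of the communication graph.

```python
def greedy_clique_cover(positions, d):
    """Partition vehicles into groups whose members are pairwise within range d.

    Assumption (not from the paper): positions are 1-D coordinates and the
    communication graph links any two vehicles at distance <= d.
    Returns a list of groups, each a list of vehicle indices.
    """
    order = sorted(range(len(positions)), key=lambda i: positions[i])
    cliques = []
    for i in order:
        # cliques[-1][0] is the leftmost vehicle of the current group; if the
        # new vehicle is within d of it, it is within d of every member.
        if cliques and positions[i] - positions[cliques[-1][0]] <= d:
            cliques[-1].append(i)
        else:
            cliques.append([i])
    return cliques


positions = [0.0, 1.0, 2.0, 10.0, 11.0, 30.0]  # toy vehicle positions
cover = greedy_clique_cover(positions, d=2.0)
print(cover)
```

A larger range `d` merges more vehicles into each group, so fewer groups are needed to cover the fleet.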

Abstract

Deep Reinforcement Learning (DRL) holds significant promise for achieving human-like Autonomous Vehicle (AV) capabilities, but suffers from low sample efficiency and challenges in reward design. Model-Based Reinforcement Learning (MBRL) offers improved sample efficiency and generalizability compared to Model-Free Reinforcement Learning (MFRL) in various multi-agent decision-making scenarios. Nevertheless, MBRL faces critical difficulties in estimating uncertainty during the model learning phase, thereby limiting its scalability and applicability in real-world scenarios. Additionally, most Connected Autonomous Vehicle (CAV) studies focus on single-agent decision-making, while existing multi-agent MBRL solutions lack computationally tractable algorithms with Probably Approximately Correct (PAC) guarantees, an essential factor for ensuring policy reliability with limited training data. To address these challenges, we propose MA-PMBRL, a novel Multi-Agent Pessimistic Model-Based Reinforcement Learning framework for CAVs, incorporating a max-min optimization approach to enhance robustness and decision-making. To mitigate the inherent subjectivity of uncertainty estimation in MBRL and avoid incurring catastrophic failures in AV, MA-PMBRL employs a pessimistic optimization framework combined with Projected Gradient Descent (PGD) for both model and policy learning. MA-PMBRL also employs general function approximations under partial dataset coverage to enhance learning efficiency and system-level performance. By bounding the suboptimality of the resulting policy under mild theoretical assumptions, we successfully establish PAC guarantees for MA-PMBRL, demonstrating that the proposed framework represents a significant step toward scalable, efficient, and reliable multi-agent decision-making for CAVs.
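The pessimistic step can be illustrated with a small Projected Gradient Descent sketch. This is an illustration under stated assumptions, not the authors' implementation: the constraint set is taken to be an L2 ball of radius `radius` around a fitted model parameter `phi_mle`, and the policy value is a toy linear function of the model parameter. The names `project_l2_ball`, `pessimistic_pgd`, and `value_grad` are hypothetical.

```python
import numpy as np


def project_l2_ball(phi, center, radius):
    """Euclidean projection of phi onto {x : ||x - center||_2 <= radius}."""
    diff = phi - center
    norm = np.linalg.norm(diff)
    if norm <= radius:
        return phi
    return center + diff * (radius / norm)


def pessimistic_pgd(value_grad, phi_mle, radius, lr=0.1, steps=200):
    """Minimize a value estimate over model parameters near phi_mle.

    Assumption: the set of plausible models is an L2 ball around the fitted
    parameter. Each iteration takes a gradient step that lowers the value,
    then projects back onto the ball.
    """
    phi = phi_mle.copy()
    for _ in range(steps):
        phi = phi - lr * value_grad(phi)
        phi = project_l2_ball(phi, phi_mle, radius)
    return phi


# Toy problem: V(phi) = w . phi, so the gradient is the constant vector w.
w = np.array([3.0, 4.0])
phi_mle = np.array([1.0, 1.0])
phi_pess = pessimistic_pgd(lambda phi: w, phi_mle, radius=0.5)
print(phi_pess)
```

For this linear toy value, the most pessimistic model lies on the boundary of the ball in the direction opposite to `w`. In a max-min scheme, a policy update would then maximize the value evaluated under `phi_pess`.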

Paper Structure

This paper contains 30 sections, 12 theorems, 45 equations, 7 figures, 3 tables, 1 algorithm.

Key Result

Lemma 1

Let $T$ denote the true MDP transition function, and define $\pi^\ast$ as the optimal policy. Meanwhile, let $\tilde{T}_\phi$ be the solution MDP obtained from solving the pessimistic model-based min-max RL problem for the dataset $\mathcal{D}$. Then, with probability at least $1 - \delta$, for any ... where ... Here, $C_{\pi^\ast}$ is defined according to Definition 2 (Concentrability coefficient), ...

Figures (7)

  • Figure 1: The illustration of the MA-PMBRL algorithm for CAVs.
  • Figure 2: The "Unprotected Intersection" scenario in the closed "Figure Eight" loop for Simulations. (a) presents an aerial view of the "Figure Eight" loop, while (b) provides the regional enlarged view of the "Unprotected Intersection".
  • Figure 3: Comparison of utility in the single-lane "Unprotected Intersection" scenario.
  • Figure 4: Comparison of utility of different choices of PGD in the single-lane "Unprotected Intersection" scenario.
  • Figure 5: Performance comparison under different communication range $d$.
  • ...and 2 more figures

Theorems & Definitions (14)

  • Definition 1: Partial Coverage from uehara2022pessimistic
  • Definition 2: Concentrability coefficient from uehara2022pessimistic
  • Lemma 1: Eq. (9) from uehara2022pessimistic
  • Lemma 2
  • Theorem 1: PAC bound for MA-PMBRL
  • Lemma 3: Policy improvement of Lemma 6.1 of Kakade2002ApproximatelyOA
  • Lemma 4: MLE guarantee from Section E of agarwal2020optimality
  • Lemma 5: Lemma 7 of hongpessimistic
  • Lemma 6: Boundedness of Discounted Return
  • Lemma 7: Distribution Conversion Lemma of JMLR:v22:19-736
  • ...and 4 more