Mix Q-learning for Lane Changing: A Collaborative Decision-Making Method in Multi-Agent Deep Reinforcement Learning

Xiaojun Bi, Mingjie He, Yiwen Sun

TL;DR

This work addresses multi-agent lane-changing under CVIS by introducing Mix Q-learning for Lane Changing (MQLC), a collaborative framework with two interconnected value networks: an individual Q and a global Q. It augments decision-making with a deep intent-prediction module and a trajectory-informed observation pipeline, and formalizes coordination through a consistency regularization loss $\mathcal{L}_{reg}$ weighted by $\lambda$ and an urgency-based priority $\varepsilon$ guiding joint-action arbitration. The approach leverages a GCN-enabled value network and a GCN+MLP encoder to capture spatial and environmental context, enabling robust, safe, and efficient lane changes across varying traffic densities. Empirical results in highway-env with B-GAP show MQLC outperforming strong baselines (including QCOMBO and DQN variants) in safety and reward metrics, highlighting the benefits of explicit inter-agent coordination for traffic efficiency.

Abstract

Lane-changing decisions, which are crucial for autonomous vehicle path planning, face practical challenges due to rule-based constraints and limited data. Deep reinforcement learning has become a major research focus due to its advantages in data acquisition and interpretability. However, current models often overlook collaboration, which not only impacts overall traffic efficiency but also hinders the vehicle's own normal driving in the long run. To address this issue, this paper proposes a method named Mix Q-learning for Lane Changing (MQLC) that integrates a hybrid value Q network, taking into account both collective and individual benefits. At the collective level, our method coordinates the individual Q and global Q networks by utilizing global information. This enables agents to effectively balance their individual interests with the collective benefit. At the individual level, we integrate a deep learning-based intent recognition module into our observation and enhance the decision network. These changes provide agents with richer decision information and more accurate feature extraction for improved lane-changing decisions. This strategy enables the multi-agent system to learn and formulate optimal decision-making strategies effectively. Extensive experiments show that MQLC outperforms other state-of-the-art multi-agent decision-making methods, achieving significantly safer and faster lane-changing decisions. The code is available at https://github.com/pku-smart-city/source_code/tree/main/MQLC.

Paper Structure

This paper contains 21 sections, 11 equations, 9 figures, 9 tables, and 2 algorithms.

Figures (9)

  • Figure 1: The importance of intent prediction and cooperative driving. Figure 1(a) highlights the importance of intent prediction. As autonomous vehicles approach an intersection, the ability to accurately predict the intentions of all nearby vehicles becomes critical. Figure 1(b) highlights the importance of cooperative driving for autonomous vehicles. This scenario shows that when there is a potential conflict between the intentions of the three autonomous vehicles, the unified coordination mechanism enables them to effectively obey traffic rules or resolve the conflict.
  • Figure 2: The overall framework of MQLC. The decision-making process of the intelligent agent can be described as follows: first, MQLC estimates the intentions of the surrounding vehicles based on their trajectories. Next, at the individual level, the agent uses an innovative network structure to optimise information embedding and estimate feasible decisions. Finally, at the global level, an advantage estimation is performed for the joint actions that all intelligent agents can take.
  • Figure 3: The architecture of the MQLC decision network. The figure shows the network architecture of the global Q function. When the input is $(o_t^i, a_t^i)$, the same network acts as an individual Q network that estimates the values of discrete actions from observations. In either case, the network additionally extracts vehicle position information and global traffic flow information from the input, and encodes them with a GCN and an MLP respectively. The final output is an estimate of the benefits of the actions.
  • Figure 4: An illustration of the execution process. This figure is a graphical explanation of Algorithm 1. During the execution phase, the agent's observations are fed into the decision framework. First, each intelligent agent evaluates the benefit values of different actions and prioritises its own decisions based on the observations. Then, a few feasible actions are selected probabilistically. The global Q-function evaluates the benefits of joint actions based on the state formed by aggregating all observations, and finally selects the optimal joint actions for execution by each individual agent.
  • Figure 5: The training process of the comparative experiment. The DDPG-type method performs the worst. QCOMBO achieves the highest reward at the beginning of training, which demonstrates the effectiveness of global decision-making. Building on this, our MQLC method further improves performance through additional observation acquisition and feature extraction, and reaches the highest level among the compared models in the later stage of training.
  • ...and 4 more figures
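
The execution process described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the softmax sampling with a `temperature` parameter, the candidate count `k`, and the toy global Q (sum of individual values minus a collision penalty) are all assumptions introduced here, and the urgency-based priority is omitted.

```python
import itertools
import math
import random


def sample_candidates(q_values, k, rng, temperature=1.0):
    """Sample up to k distinct actions, favouring actions with higher Q values.

    Assumption: sampling is softmax-weighted and without replacement.
    """
    remaining = list(range(len(q_values)))
    chosen = []
    for _ in range(min(k, len(remaining))):
        top = max(q_values[a] for a in remaining)
        # Subtract the max before exponentiating for numerical stability.
        weights = [math.exp((q_values[a] - top) / temperature) for a in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


def select_joint_action(individual_q, global_q, state, k=2, seed=0):
    """Pick the joint action with the highest global Q among sampled candidates.

    individual_q: one list of per-action Q values for each agent.
    global_q: callable (state, joint_action) -> float (hypothetical signature).
    """
    rng = random.Random(seed)
    candidates = [sample_candidates(q, k, rng) for q in individual_q]
    # Enumerate every combination of the agents' candidate actions.
    best = max(itertools.product(*candidates),
               key=lambda joint: global_q(state, joint))
    return best, candidates


def make_toy_global_q(individual_q, penalty=10.0):
    """Toy stand-in for a learned global Q network (illustration only).

    Actions: 0 = change left, 1 = keep lane, 2 = change right.
    The state is the list of current lane indices, one per agent.
    """
    def global_q(lanes, joint):
        value = sum(q[a] for q, a in zip(individual_q, joint))
        targets = [lane + (a - 1) for lane, a in zip(lanes, joint)]
        if len(set(targets)) < len(targets):
            value -= penalty  # two agents aim for the same lane
        return value
    return global_q
```

For example, with two agents in lanes 1 and 3 that both individually prefer to move into lane 2, the global evaluation rejects the conflicting joint action and picks the best non-conflicting one. The number of joint actions grows as `k` raised to the number of agents, which is one reason to shortlist only a few candidates per agent before the global evaluation.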