Cooperative Reward Shaping for Multi-Agent Pathfinding

Zhenyu Song, Ronghao Zheng, Senlin Zhang, Meiqin Liu

TL;DR

The paper tackles scalable cooperative path planning in Multi-Agent Pathfinding (MAPF) using distributed multi-agent reinforcement learning (MARL), where agents lack global information. It introduces Cooperative Reward Shaping (CoRS), a mechanism that augments Independent Q-Learning with a neighborhood-based cooperative signal and a tunable cooperation coefficient $\alpha$, within a decentralized-training, decentralized-execution (DTDE) framework enhanced by simple attention-based communication. Theoretical analysis shows that, under certain assumptions and with $\alpha = \tfrac{1}{2}$, the approach satisfies the Individual-Global Max (IGM) condition, aligning individual incentives with collective performance. Empirically, CoRS-DHC outperforms baselines such as DHC and DCC, achieving higher success rates and shorter makespans in dense MAPF scenarios without changing the underlying network architecture.
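To make the role of the cooperation coefficient $\alpha$ concrete, here is a minimal sketch of how an individual reward could be blended with a neighborhood signal. The convex combination, the use of the mean neighbor reward as the cooperative signal, and the function name `shaped_reward` are all assumptions made for illustration; this summary does not give the paper's actual shaping formula, which may differ.

```python
from typing import Sequence


def shaped_reward(own_reward: float,
                  neighbor_rewards: Sequence[float],
                  alpha: float = 0.5) -> float:
    """Blend an agent's own reward with a neighborhood signal.

    Hypothetical form: (1 - alpha) * own + alpha * mean(neighbors).
    This is an illustrative assumption, not the formula from the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not neighbor_rewards:
        # Assumption: with no neighbors, fall back to the individual reward.
        return own_reward
    neighbor_signal = sum(neighbor_rewards) / len(neighbor_rewards)
    return (1.0 - alpha) * own_reward + alpha * neighbor_signal


if __name__ == "__main__":
    # alpha = 0 ignores the neighbors; alpha = 1/2 weighs both sides equally.
    print(shaped_reward(1.0, [0.0, -1.0], alpha=0.0))
    print(shaped_reward(1.0, [0.0, -1.0], alpha=0.5))
```

In this sketch, $\alpha = 0$ reduces to plain Independent Q-Learning rewards, while larger values make an agent pay more attention to the consequences of its action for nearby agents.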

Abstract

The primary objective of Multi-Agent Pathfinding (MAPF) is to plan efficient and conflict-free paths for all agents. Traditional multi-agent path planning algorithms struggle to achieve efficient distributed path planning for multiple agents. In contrast, Multi-Agent Reinforcement Learning (MARL) has been demonstrated as an effective approach to achieve this objective. By modeling the MAPF problem as a MARL problem, agents can achieve efficient path planning and collision avoidance through distributed strategies under partial observation. However, MARL strategies often lack cooperation among agents due to the absence of global information, which subsequently leads to reduced MAPF efficiency. To address this challenge, this letter introduces a unique reward shaping technique based on Independent Q-Learning (IQL). The aim of this method is to evaluate the influence of one agent on its neighbors and integrate such an interaction into the reward function, leading to active cooperation among agents. This reward shaping method facilitates cooperation among agents while operating in a distributed manner. The proposed approach has been evaluated through experiments across various scenarios with different scales and agent counts. The results are compared with those from other state-of-the-art (SOTA) planners. The evidence suggests that the approach proposed in this letter parallels other planners in numerous aspects, and outperforms them in scenarios featuring a large number of agents.

Paper Structure

This paper contains 21 sections, 2 theorems, 10 equations, 9 figures, 7 tables.

Key Result

Theorem 1

Assume Assumptions 1 and 2 hold. Then, when $\alpha = \frac{1}{2}$, $\mathcal{Q}^i_{\pi^i_*} (\bar{s}_t, a^i)$, $\mathcal{Q}^{-i}_{\pi^{-i}_*} (\bar{s}_t, a^{-i})$ and $Q^{tot}_{\bar{\pi}_*} (\bar{s}_t, \{a^i, a^{-i}\})$ satisfy the IGM condition.
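
For reference, the IGM condition is usually stated in the value-decomposition literature as the requirement that the greedy joint action under the joint value function coincides with the greedy actions under the individual value functions. Written in the theorem's notation, the standard definition reads as follows. This is a sketch of the usual form; the paper's own statement is not reproduced in this summary and may differ in detail.

$$
\arg\max_{\{a^i, a^{-i}\}} Q^{tot}_{\bar{\pi}_*} (\bar{s}_t, \{a^i, a^{-i}\}) = \Big\{ \arg\max_{a^i} \mathcal{Q}^i_{\pi^i_*} (\bar{s}_t, a^i),\ \arg\max_{a^{-i}} \mathcal{Q}^{-i}_{\pi^{-i}_*} (\bar{s}_t, a^{-i}) \Big\}
$$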

Figures (9)

  • Figure 1: Multi-agent scenarios in the real world and on a grid map with numerous mobile robots. In the grid map, blue cells denote agents, yellow cells indicate their goals, and green cells signify agents that have reached their goals.
  • Figure 2: The combined framework of the CoRS and DHC algorithms. The CoRS component predominantly shapes rewards. The DHC component comprises communication blocks and dueling Q-networks. Notably, the communication block employs a multi-head attention mechanism. The framework uses parallel training for efficiency: multiple actors run simultaneously, generating experience data and uploading it to the global buffer. The learner then retrieves this data from the global buffer for training, which allows frequent updates to the actors' network.
  • Figure 3: An example of the reward shaping method.
  • Figure 4: Neighbors of $A^i$ when $d_n = 2$.
  • Figure 5: Training losses of two different reward shaping methods on a 10 $\times$ 10 map with 10 agents. Eq. (Reward Shaping) significantly reduces the training loss compared with the method that uses Eq. (original Ic).
  • ...and 4 more figures
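
The caption of Figure 4 refers to the neighbors of $A^i$ for a neighborhood range $d_n = 2$. As a rough sketch of what such a lookup could look like on a grid map: the function name `neighbors_within`, the dictionary of agent positions, and above all the distance metric are assumptions, since this summary does not say how the paper measures distance. The metric is therefore left as a parameter.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int]


def neighbors_within(agent_id: int,
                     positions: Dict[int, Position],
                     d_n: int = 2,
                     metric: str = "chebyshev") -> List[int]:
    """Return sorted ids of agents within distance d_n of the given agent.

    The metric is an assumption: "chebyshev" gives a square window around
    the agent, "manhattan" gives a diamond-shaped one.
    """
    ax, ay = positions[agent_id]
    found = []
    for other_id, (ox, oy) in positions.items():
        if other_id == agent_id:
            continue  # an agent is not its own neighbor
        dx, dy = abs(ox - ax), abs(oy - ay)
        if metric == "chebyshev":
            dist = max(dx, dy)
        elif metric == "manhattan":
            dist = dx + dy
        else:
            raise ValueError(f"unknown metric: {metric}")
        if dist <= d_n:
            found.append(other_id)
    return sorted(found)


if __name__ == "__main__":
    # Hypothetical agent positions on a grid, keyed by agent id.
    grid = {0: (5, 5), 1: (6, 7), 2: (7, 7), 3: (8, 5), 4: (5, 3)}
    print(neighbors_within(0, grid, d_n=2, metric="chebyshev"))
    print(neighbors_within(0, grid, d_n=2, metric="manhattan"))
```

The two metrics can give quite different neighbor sets for the same $d_n$, so the choice affects how many agents contribute to the cooperative signal.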

Theorems & Definitions (4)

  • Theorem 1
  • proof
  • Theorem 1
  • proof