Stochastic Prize-Collecting Games: Strategic Planning in Multi-Robot Systems

Malintha Fernando, Petter Ögren, Silun Zhang

TL;DR

Stochastic Prize-Collecting Games (SPCG) extend the Team Orienteering Problem (TOP) to competitive, stochastic, graph-based multi-robot planning in which agents are self-interested and operate under energy constraints. The authors prove that a unique pure Nash equilibrium exists on complete graphs under a rank-based conflict rule, and they introduce two learning methods, Ordinal Rank Search (ORS) and Fictitious Ordinal Response Learning (FORL), to obtain best-response policies from local observations. Empirical results on real road networks and synthetic graphs show that ordinal-rank conditioning improves scalability and generalization, with learned policies achieving 87–95% of the optimum of an equivalent TOP solved by mixed-integer linear programming (MILP). Overall, the paper advances distributed multi-agent reinforcement learning (MARL) for competitive routing in uncertain environments and reports strong performance for large teams and heterogeneous prize distributions.

Abstract

The Team Orienteering Problem (TOP) generalizes many real-world multi-robot scheduling and routing tasks that occur in autonomous mobility, aerial logistics, and surveillance applications. While many flavors of the TOP exist for planning in multi-robot systems, they assume that all the robots cooperate toward a single objective; thus, they do not extend to settings where the robots compete in reward-scarce environments. We propose Stochastic Prize-Collecting Games (SPCG) as an extension of the TOP to plan in the presence of self-interested robots operating on a graph, under energy constraints and stochastic transitions. A theoretical study on complete and star graphs establishes that there is a unique pure Nash equilibrium in SPCGs that coincides with the optimal routing solution of an equivalent TOP given a rank-based conflict resolution rule. This work proposes two algorithms: Ordinal Rank Search (ORS) to obtain the "ordinal rank" (one's effective rank in temporarily formed local neighborhoods during the games' stages), and Fictitious Ordinal Response Learning (FORL) to obtain best-response policies against one's senior-rank opponents. Empirical evaluations conducted on road networks and synthetic graphs under both dynamic and stationary prize distributions show that 1) the state-aliasing induced by OR-conditioning enables learning policies that scale more efficiently to large team sizes than those trained with the global index, and 2) policies trained with FORL generalize better to imbalanced prize distributions than those trained with other multi-agent training methods. Finally, the learned policies in the SPCG achieved between 87% and 95% optimality compared to an equivalent TOP solution obtained by mixed-integer linear programming.

Paper Structure

This paper contains 18 sections, 4 theorems, 11 equations, 6 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

There is a unique pure-strategy Nash equilibrium for all $\mathbf{p}^0(\omega_u), \forall \omega_u$ when $\mathcal{G}$ is complete and a fixed rule is available for breaking ties between multiple maximally rewarding strategies for an agent.
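The intuition behind this result can be illustrated with a deliberately simplified, one-stage toy model. The sketch below is not the authors' formulation or proof: it assumes a single move on a complete graph (so every node is reachable), deterministic prizes, no energy budget, and a conflict rule in which the most senior agent (lowest index) at a node takes the whole prize. The function names, the prize values, and the lowest-node-id tie-break are all hypothetical. Under these assumptions, letting agents choose in rank order gives a candidate equilibrium, and a brute-force enumeration checks whether any other pure profile is also an equilibrium.

```python
from itertools import product

def payoffs(profile, prizes):
    """Payoff per agent: the most senior agent (lowest index) at a node takes its prize."""
    claimed, out = set(), []
    for node in profile:  # agents are visited in rank order
        out.append(0.0 if node in claimed else prizes[node])
        claimed.add(node)
    return out

def rank_ordered_best_response(prizes, n_agents):
    """Each agent, in rank order, takes the best node not already taken by a senior agent.

    Assumes n_agents <= len(prizes).
    """
    taken, profile = set(), []
    for _ in range(n_agents):
        free = [v for v in range(len(prizes)) if v not in taken]
        # Fixed tie-break (an assumption): prefer the lowest node id among equal prizes.
        choice = max(free, key=lambda v: (prizes[v], -v))
        profile.append(choice)
        taken.add(choice)
    return tuple(profile)

def is_pure_nash(profile, prizes):
    """True if no agent can strictly gain by unilaterally switching nodes."""
    base = payoffs(profile, prizes)
    for i in range(len(profile)):
        for v in range(len(prizes)):
            deviation = profile[:i] + (v,) + profile[i + 1:]
            if payoffs(deviation, prizes)[i] > base[i] + 1e-12:
                return False
    return True

def all_pure_nash(prizes, n_agents):
    """Enumerate every pure profile and keep the equilibria (exponential; toy sizes only)."""
    nodes = range(len(prizes))
    return [p for p in product(nodes, repeat=n_agents) if is_pure_nash(p, prizes)]

prizes = [1.0, 2.5, 1.5, 0.5]  # hypothetical prize per node
profile = rank_ordered_best_response(prizes, n_agents=3)
equilibria = all_pure_nash(prizes, n_agents=3)
print(profile, equilibria)
```

With distinct positive prizes, the senior agent's choice is strictly dominant and each junior agent then has a single best reply, so the enumeration returns exactly the rank-ordered profile. With tied prizes, several profiles can survive the enumeration, which is where a fixed tie-breaking rule becomes necessary.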

Figures (6)

  • Figure 1: Left: A multi-robot prize-collecting problem on a graph. Here, $S$ and $D$ denote the start and terminal nodes. One robot gets priority during conflicts. In the TOP, the optimal policy (shown) routes the robots to maximize the total reward regardless of their ranks. However, when the robots are self-interested, the senior robot must move directly to the prize (2.5), ignoring the prize (1), to prevent the other robot from cutting in front of it, while the other must move to (1.5). Therefore, the maximum total reward in the SPCG (4) is less than that of the TOP (5). Right: A Manhattan road-network graph used for the experiments. The mean prize at a node depends on its proximity to the center. Any dead-end node may serve as a terminal, while $S$ is randomized.
  • Figure 2: A two-agent SPCG played on a graph defined by $\mathcal{V} = \{s,1,2,3,d\}$ without self-edges. The players start at $s$ at $t=0$ and share the same destination $d$. Consider $\mathcal{W}(u,v) = 1, \forall (u,v) \in \mathcal{E}$, and $L_{\max} = 3$. The prize at each node is indicated by $p_u$. Consider $p_s = 0$ and $p_d \gg p_u, \forall u \in \mathcal{V} \setminus \{d\}$. This game does not have a pure Nash equilibrium (PNE) for any $\alpha \in (0,1)$.
  • Figure 3: Stockholm (left) and Manhattan (right) road network graphs. The graphs were obtained by applying the Fruchterman-Reingold algorithm to the original road networks. The colors correspond to the mean prize ($\bar{p}_u, u \in \mathcal{V}$) at the nodes, where $\bar{p}_u \propto 1/\lVert\mathrm{pos}(u) - \mathrm{pos}(\bar{u})\rVert$, $\bar{u}$ is the map center, and $\mathrm{pos}(\cdot): \mathbb{R} \rightarrow \mathbb{R}^2$.
  • Figure 4: (a) Convergence plots for a dynamic prize distribution graph with 5 competing agents. For each training method, "OR" and "GS" refer to local observations with the Ordinal Rank (OR) and to global observations (GS), respectively. (b) Training time for each method. The training was conducted on a computer equipped with an Nvidia RTX 3090 GPU and an Intel 12700K CPU.
  • Figure 5: (a) The optimality gap between the SPCG and an equivalent TOP. (b) Stage-specific rewards of the agents in a complete graph under FORL.
  • ...and 1 more figure
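The inverse-distance relation in the Figure 3 caption, $\bar{p}_u \propto 1/\lVert\mathrm{pos}(u) - \mathrm{pos}(\bar{u})\rVert$, can be sketched in a few lines. This is only an illustration of the stated proportionality, not the authors' code: the node coordinates, the `scale` constant, and the `eps` guard for a node sitting exactly at the center are assumptions, and the caption does not say how the proportionality constant is chosen.

```python
import math

def mean_prizes(positions, center, scale=1.0, eps=1e-9):
    """Mean prize per node, proportional to 1 / (Euclidean distance to the map center).

    positions: dict mapping node id -> (x, y) layout coordinates (hypothetical input).
    center:    (x, y) coordinates of the map center.
    scale:     proportionality constant (an assumption; not given in the source).
    eps:       lower bound on the distance to avoid division by zero (an assumption).
    """
    out = {}
    for node, (x, y) in positions.items():
        dist = math.hypot(x - center[0], y - center[1])
        out[node] = scale / max(dist, eps)
    return out

# Hypothetical three-node layout with the center at the origin.
positions = {"a": (1.0, 0.0), "b": (0.0, 2.0), "c": (3.0, 4.0)}
prizes_by_node = mean_prizes(positions, center=(0.0, 0.0))
print(prizes_by_node)
```

Nodes closer to the center receive larger mean prizes, which matches the caption's description of the color gradient in the road-network figures.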

Theorems & Definitions (13)

  • Definition 1
  • Theorem 1
  • proof
  • Remark 1
  • Remark 2
  • Theorem 2
  • proof
  • Definition 2
  • Definition 3
  • Definition 4
  • ...and 3 more