Decision-making with Speculative Opponent Models

Jing Sun, Shuo Chen, Cong Zhang, Yining Ma, Jie Zhang

TL;DR

The paper tackles opponent modelling under limited information by introducing DOMAC, which combines an opponent model-aided actor (OMA) that speculates about opponents' actions with a centralized distributional critic (CDC) in a centralized-training, decentralized-execution (CTDE) setting. Relying solely on local information (the controlled agent's observations, actions, and rewards), DOMAC learns to predict opponents' actions and evaluates returns with a distributional objective, enabling more informed decisions without access to opponents' true observations or actions. Empirical results across multiple challenging domains show that DOMAC converges faster and performs better than strong baselines, and approaches an upper bound set by ground-truth opponent information. The work demonstrates that distributional RL can effectively guide opponent modelling, offering a practical approach to robust decision-making in partially observable multi-agent environments.

Abstract

Opponent modelling has proven effective in enhancing the decision-making of the controlled agent by constructing models of opponent agents. However, existing methods often rely on access to the observations and actions of opponents, a requirement that is infeasible when such information is either unobservable or challenging to obtain. To address this issue, we introduce Distributional Opponent-aided Multi-agent Actor-Critic (DOMAC), the first speculative opponent modelling algorithm that relies solely on local information (i.e., the controlled agent's observations, actions, and rewards). Specifically, the actor maintains a speculated belief about the opponents using the tailored speculative opponent models that predict the opponents' actions using only local information. Moreover, DOMAC features distributional critic models that estimate the return distribution of the actor's policy, yielding a more fine-grained assessment of the actor's quality. This thus more effectively guides the training of the speculative opponent models that the actor depends upon. Furthermore, we formally derive a policy gradient theorem with the proposed opponent models. Extensive experiments under eight different challenging multi-agent benchmark tasks within the MPE, Pommerman and StarCraft Multiagent Challenge (SMAC) demonstrate that our DOMAC successfully models opponents' behaviours and delivers superior performance against state-of-the-art methods with a faster convergence speed.
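
To make the idea of a distributional critic concrete, the following is a minimal sketch of a quantile-regression Huber loss, one common way to fit a return distribution rather than a single expected value. It is an illustration under stated assumptions, not the training objective confirmed by this summary: the function name `quantile_huber_loss`, the midpoint quantile fractions, and the threshold `kappa` are all hypothetical choices.

```python
def quantile_huber_loss(pred_quantiles, target_samples, kappa=1.0):
    """Quantile-regression Huber loss between predicted return quantiles
    and sampled target returns (assumed objective, for illustration only).

    pred_quantiles: list of N predicted quantile values of the return.
    target_samples: list of target return samples (e.g. bootstrapped targets).
    kappa: Huber threshold; errors below it are penalised quadratically.
    """
    n = len(pred_quantiles)
    # Midpoint quantile fractions tau_i = (i + 0.5) / N.
    taus = [(i + 0.5) / n for i in range(n)]
    total = 0.0
    for tau, q in zip(taus, pred_quantiles):
        for z in target_samples:
            u = z - q  # error between target sample and predicted quantile
            abs_u = abs(u)
            if abs_u <= kappa:
                huber = 0.5 * u * u
            else:
                huber = kappa * (abs_u - 0.5 * kappa)
            # Asymmetric weight: under- and over-estimation are penalised
            # differently depending on the quantile fraction tau.
            weight = abs(tau - (1.0 if u < 0 else 0.0))
            total += weight * huber / kappa
    # Sum over quantiles, average over target samples.
    return total / len(target_samples)


def expected_return(pred_quantiles):
    """Mean of the quantile values, a scalar summary of the distribution."""
    return sum(pred_quantiles) / len(pred_quantiles)
```

A critic trained this way keeps the full set of quantiles, so the actor and the opponent models can be assessed against the shape of the return distribution; `expected_return` recovers the usual scalar value when only a mean is needed.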

Paper Structure

This paper contains 20 sections, 1 theorem, 21 equations, 14 figures, 4 tables, and 1 algorithm.

Key Result

Theorem 4.1

In a POMG, under the opponent modelling framework defined in the paper, the theorem gives the gradient with respect to the parameters $\theta_i$ of agent $i$'s policy, and the gradient with respect to the parameters $\psi_{ik}$ of the opponent model $\mu_{\psi_{ik}}$. Both expressions are stated in terms of $\rho_{\theta_i, \psi_i}$, which is defined in Equation (4). The two gradient equations themselves are not preserved in this summary.

Figures (14)

  • Figure 1: The DOMAC framework. In OMA, the speculative opponent model $\mu_i$ takes the local observation $o^t_i$ as input and outputs a prediction of the opponents' behaviours. The agent policy network selects an action by considering the predicted opponents' actions. The CDC feeds the joint observation $\mathbf{o}^t$ and action $\mathbf{a}^t$ of the controlled agents into the agent critic network $G_{\phi_i}$, which outputs the return distribution $Z_i(\mathbf{o}^t,\mathbf{a}^t)$.
  • Figure 2: An illustration of the DOMAC network architecture. (a) Opponent model-aided actor: each actor contains $p$ speculative opponent models, which take the local observation $o^t_i$ and the opponent index $k$ as inputs to predict the opponents' actions. The action selection network $\pi_{\theta_i}$ then takes the joint predicted action $\{\hat{a}^t_i\}$ together with $o^t_i$ and $a^{t-1}_i$ as input, and outputs a distribution over agent $i$'s own actions, weighted according to the probabilities of the predicted opponents' actions $\mu_{\psi_{ik}}$ for $1 \leq k \leq p$. (b) CTDE architecture: we have access to the observations and actions of the team we control. (c) Centralized distributional critic: the agent critic network $G_{\phi_i}$ takes the joint observation $\mathbf{o}^t$ and action $\mathbf{a}^t$ of the controlled agents and outputs an approximation of the return distribution.
  • Figure 3: State visualization of benchmark environments. (a) The state of a PP-3v1 game, where blue and red vertices denote the predators and the prey, respectively. (b) The image-based state for the Pommerman environment.
  • Figure 4: State visualization of StarCraft II. (a) The state of an MMM2 game. (b) The state of a 5m_vs_6m game.
  • Figure 5: Performance of DOMAC and baselines in the Predator-prey and Pommerman environments.
  • ...and 9 more figures
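
The weighting described in the Figure 2 caption, where the actor's action distribution is weighted by the probabilities of the predicted opponents' actions, can be sketched as a marginalization over joint opponent actions. This is a minimal illustration and not the authors' implementation: it assumes the opponent predictions are independent, so a joint prediction's weight is the product of per-opponent probabilities, and the names `marginal_policy` and `toy_conditional_policy` are hypothetical.

```python
import itertools
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def marginal_policy(opponent_probs, conditional_policy):
    """Mix conditional policies over all joint predicted opponent actions.

    opponent_probs: list of p distributions, one per opponent model;
        opponent_probs[k][a] is the predicted probability that opponent k
        takes action a.
    conditional_policy: callable mapping a tuple of opponent actions to a
        distribution over the agent's own actions.
    Assumption: opponents are predicted independently, so the joint weight
    is the product of the per-opponent probabilities.
    """
    mixed = None
    action_ranges = [range(len(p)) for p in opponent_probs]
    for joint in itertools.product(*action_ranges):
        weight = math.prod(opponent_probs[k][a] for k, a in enumerate(joint))
        dist = conditional_policy(joint)
        if mixed is None:
            mixed = [0.0] * len(dist)
        for a, prob in enumerate(dist):
            mixed[a] += weight * prob
    return mixed


def toy_conditional_policy(joint_opponent_actions):
    """Hypothetical stand-in for the action selection network: three own
    actions whose logits shift with the predicted opponent actions."""
    shift = sum(joint_opponent_actions)
    return softmax([0.0, 0.5 * shift, -0.5 * shift])


# Two opponents with two actions each (illustrative numbers only).
opponent_probs = [[0.7, 0.3], [0.4, 0.6]]
policy = marginal_policy(opponent_probs, toy_conditional_policy)
```

Enumerating joint actions grows exponentially with the number of opponents $p$, so a practical system would likely sample predicted actions instead; the exhaustive loop is used here only because it makes the weighting explicit.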

Theorems & Definitions (1)

  • Theorem 4.1