Adaptive Opponent Policy Detection in Multi-Agent MDPs: Real-Time Strategy Switch Identification Using Running Error Estimation

Mohidul Haque Mridul; Mohammad Foysal Khan; Redwan Ahmed Rizvee; Md Mosaddek Khan

TL;DR

The paper tackles non-stationarity in multi-agent RL by introducing OPS-DeMo, an online policy-switch detection framework that uses running-error estimation against a bank of assumed opponent policies, paired with a bank of trained response policies. The method continuously updates its beliefs using a decay mechanism, which enables real-time detection of abrupt policy switches and rapid adaptation of responses, and it extends SAM concepts to PPO-based MARL. Key contributions are a formal metric for policy compliance, an online detection algorithm, and post-switch policy identification. These are demonstrated in a predator-prey setting with 2 predators and 2 prey, where OPS-DeMo improves mean episodic reward by about 49.6% over a standalone PPO model. The approach is robust to sudden opponent shifts and, because it relies only on observed actions and needs little storage, is suitable for edge devices and real-time decision making in dynamic, decentralized MARL applications. The framework applies to both competitive and cooperative multi-agent systems, where it supports more informed and stable policy adaptation under non-stationary conditions. Future work includes extending the framework to continuous learning and handling uniform action distributions. The decay mechanism is formalized as $d = \alpha e_f + (1 - \alpha) e_{nf}$, where $\alpha$ controls detection strictness and balances adherence versus deviation from the assumed policy.
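
As a rough illustration of the decay formula above, the Python sketch below evaluates $d$ for a few values of $\alpha$. It assumes that $e_f$ and $e_{nf}$ denote the expected per-step errors when the opponent follows, or does not follow, the assumed policy. The function name `error_decay` and the sample values are hypothetical and not taken from the paper.

```python
def error_decay(alpha: float, e_f: float, e_nf: float) -> float:
    """Blend d = alpha * e_f + (1 - alpha) * e_nf, as in the TL;DR formula."""
    if not 0.0 <= alpha <= 1.0:
        # Assumption of this sketch: alpha is a mixing weight in [0, 1].
        raise ValueError("alpha is assumed to lie in [0, 1]")
    return alpha * e_f + (1.0 - alpha) * e_nf


# Hypothetical expected errors (assumption: e_f < e_nf).
E_F, E_NF = 0.2, 0.8

for alpha in (0.0, 0.5, 1.0):
    print(alpha, error_decay(alpha, E_F, E_NF))
```

If the decay is subtracted from an accumulated error at every step, and $e_f < e_{nf}$ as assumed here, a larger $\alpha$ gives a smaller decay, so accumulated error fades more slowly and deviations are flagged sooner.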

Abstract

In Multi-agent Reinforcement Learning (MARL), accurately perceiving opponents' strategies is essential for both cooperative and adversarial contexts, particularly within dynamic environments. While Proximal Policy Optimization (PPO) and related algorithms such as Actor-Critic with Experience Replay (ACER), Trust Region Policy Optimization (TRPO), and Deep Deterministic Policy Gradient (DDPG) perform well in single-agent, stationary environments, they suffer from high variance in MARL due to non-stationary and hidden policies of opponents, leading to diminished reward performance. Additionally, existing methods in MARL face significant challenges, including the need for inter-agent communication, reliance on explicit reward information, high computational demands, and sampling inefficiencies. These issues render them less effective in continuous environments where opponents may abruptly change their policies without prior notice. Against this background, we present OPS-DeMo (Online Policy Switch-Detection Model), an online algorithm that employs dynamic error decay to detect changes in opponents' policies. OPS-DeMo continuously updates its beliefs using an Assumed Opponent Policy (AOP) Bank and selects corresponding responses from a pre-trained Response Policy Bank. Each response policy is trained against consistently strategizing opponents, reducing training uncertainty and enabling the effective use of algorithms like PPO in multi-agent environments. Comparative assessments show that our approach outperforms PPO-trained models in dynamic scenarios like the Predator-Prey setting, providing greater robustness to sudden policy shifts and enabling more informed decision-making through precise opponent policy insights.
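
The detection loop described in the abstract might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name `SwitchDetector`, the update rule (add the observed error $1 - p_a$, subtract a fixed decay, clamp to $[0, \text{threshold}]$), and the choice of the lowest running error as the post-switch policy are all hypothetical simplifications.

```python
from typing import Callable, Dict, Sequence

# Hypothetical type: maps a state to the action probabilities of one assumed policy.
Policy = Callable[[object], Sequence[float]]


class SwitchDetector:
    """Tracks one running error per assumed opponent policy (illustrative only)."""

    def __init__(self, bank: Dict[str, Policy], decay: float, threshold: float):
        self.bank = bank
        self.decay = decay
        self.threshold = threshold
        self.errors = {name: 0.0 for name in bank}
        self.current = next(iter(bank))  # start by assuming the first policy

    def observe(self, state: object, action: int) -> bool:
        """Update all running errors; return True when a switch is flagged."""
        for name, policy in self.bank.items():
            observed = 1.0 - policy(state)[action]
            updated = self.errors[name] + observed - self.decay
            # Assumption: clamp so that stale evidence cannot grow without bound.
            self.errors[name] = min(self.threshold, max(0.0, updated))
        if self.errors[self.current] >= self.threshold:
            # Assumption: adopt the policy that currently explains the actions best.
            self.current = min(self.errors, key=self.errors.get)
            return True
        return False


# Two hypothetical assumed policies over two actions, independent of state.
bank = {"left": lambda s: [0.9, 0.1], "right": lambda s: [0.1, 0.9]}
detector = SwitchDetector(bank, decay=0.3, threshold=2.0)

flags = [detector.observe(None, 0) for _ in range(10)]   # opponent plays "left"
flags += [detector.observe(None, 1) for _ in range(4)]   # opponent switches
print(flags.index(True), detector.current)
```

In use, the name stored in `detector.current` would index a matching entry in the response policy bank. Only the latest action and one scalar per assumed policy are stored, which is consistent with the action-only, storage-light design the TL;DR describes.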

Paper Structure (20 sections, 3 theorems, 7 equations, 8 figures, 1 table, 1 algorithm)

Introduction
Related Works
The Online Policy Switch Detection Model (OPS-DeMo)
Metric to Measure Policy Compliance
Architecture of the Model
Algorithm Description
Detection of Policy Switch
Error Decay
Identification of the Post-Switch Policy
Empirical Evaluation
Implementation
Environment Setup
Training Setup
Simulation of Policy Switch
Hyperparameters related to the Experiments
...and 5 more sections

Key Result

Lemma 1

Consider a timestep $t$ within the context of an MDP with a discrete action space of $n$ actions, wherein an agent follows a policy $\pi$ and selects an action $a_i$ from a Markovian state $s$. Within this framework, the observed error at $t$ can be formulated as $(1 - p_{a_i})$, where $p_{a_i}$ sign
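
The observed error in Lemma 1 can be illustrated with a short Python sketch. The helper names and the example distribution are hypothetical. The second function only evaluates the expectation $\sum_i p_{a_i}(1 - p_{a_i}) = 1 - \sum_i p_{a_i}^2$, which holds when the agent really samples its action from $\pi$; it is not a restatement of the truncated lemma text.

```python
from typing import Sequence


def observed_error(probs: Sequence[float], action: int) -> float:
    """Observed error 1 - p_{a_i} for the action actually taken."""
    return 1.0 - probs[action]


def expected_error_if_following(probs: Sequence[float]) -> float:
    """Expected observed error when actions are sampled from `probs` itself."""
    return 1.0 - sum(p * p for p in probs)


# Hypothetical policy output for one state with n = 3 actions.
pi_s = [0.5, 0.25, 0.25]

print(observed_error(pi_s, 0))            # likely action, small error
print(observed_error(pi_s, 2))            # unlikely action, larger error
print(expected_error_if_following(pi_s))  # average error under compliance
```

Actions the assumed policy considers unlikely produce a larger error, so a sustained run of them pushes any accumulated error well above its compliant average.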

Figures (8)

Figure 1: Architecture of OPS-DeMo
Figure 2: Predator-Prey Environment
Figure 3: Running errors of two probable policies of Predator B, based on observations from Predator A, with Predator B switching its policy every $100$ timesteps.
Figure 4: Running errors of two probable policies of Predator B, based on observations from Predator A, with Predator B switching its policy every $200$ timesteps.
Figure 5: Running errors of a probable policy of Predator B, based on observations from Predator A. Predator B switches its policy every $100$ timesteps, illustrating the impact of different strictness factors on running errors.
...and 3 more figures

Theorems & Definitions (6)

Lemma 1
proof
Lemma 2
proof
Lemma 3
proof
