
Self-Confirming Transformer for Belief-Conditioned Adaptation in Offline Multi-Agent Reinforcement Learning

Tao Li, Juan Guevara, Xinhong Xie, Quanyan Zhu

TL;DR

This paper introduces the Self-Confirming Transformer (SCT), a decoder-only transformer for offline multi-agent reinforcement learning that uses belief conditioning to adapt to nonstationary opponents. SCT first generates a fictitious opponent action $\hat{a}_{-i}^t$ as a belief, then conditions the ego action $\hat{a}_{i}^t$ on this belief and the current observations, so that opponent modeling and policy generation happen within a single model. The training objective, the Self-Confirming Loss, combines a belief consistency component and a best-response component inspired by self-confirming equilibrium, enabling the policy to act optimally given its belief about the opponent. Empirical results in the simple-tag and simple-world environments and in the Iterated Prisoner’s Dilemma show that SCT outperforms baselines and exhibits equilibrium-like behavior, including strong prediction accuracy against both seen and unseen strategies, indicating improved robustness to nonstationary opponents in offline MARL.

Abstract

Offline reinforcement learning (RL) suffers from distribution shift between the offline dataset and the online environment. In multi-agent RL (MARL), this shift may arise from nonstationary opponents who, during online testing, behave differently from those recorded in the offline dataset. Hence, the key to broader deployment of offline MARL is online adaptation to nonstationary opponents. Recent advances in foundation models, e.g., large language models, have demonstrated the generalization ability of the transformer, an emerging neural network architecture, in sequence modeling, of which offline RL is a special case. One naturally wonders whether offline-trained transformer-based RL policies can adapt to nonstationary opponents online. We propose a novel auto-regressive training scheme, based on the idea of self-augmented pre-conditioning, that equips transformer agents with online adaptability. The transformer agent first learns offline to predict the opponent's action from past observations. When deployed online, this fictitious opponent play, referred to as the belief, is fed back to the transformer together with other environmental feedback to generate future actions conditional on the belief. Motivated by self-confirming equilibrium in game theory, the training loss consists of a belief consistency loss, which requires the beliefs to match the opponent's actual actions, and a best response loss, which requires the agent to behave optimally under the belief. We evaluate the online adaptability of the proposed self-confirming transformer (SCT) in two settings: a structured environment, iterated prisoner's dilemma games, to demonstrate SCT's belief consistency and equilibrium behaviors, and more involved multi-particle environments to showcase its superior performance against nonstationary opponents over prior transformers and offline MARL baselines.
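
The two-part loss and the belief-then-action ordering described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The squared-error form of both terms, the weight `lam`, the use of the dataset's ego action as a stand-in for the best response, and the callables `belief_head` and `action_head` are all assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]


def squared_error(pred: Vector, target: Vector) -> float:
    """Sum of squared differences between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def self_confirming_loss(
    belief: Vector,       # predicted (fictitious) opponent action
    opp_action: Vector,   # opponent action recorded in the offline dataset
    ego_pred: Vector,     # ego action generated conditional on the belief
    ego_target: Vector,   # ego action in the dataset (assumed proxy for a best response)
    lam: float = 1.0,     # hypothetical weight between the two terms
) -> float:
    # Belief consistency: the belief should match the opponent's actual action.
    belief_consistency = squared_error(belief, opp_action)
    # Best response: the belief-conditioned action should match the target action.
    best_response = squared_error(ego_pred, ego_target)
    return belief_consistency + lam * best_response


def act_with_belief(
    belief_head: Callable[[list, Vector], Vector],
    action_head: Callable[[list, Vector, Vector], Vector],
    history: list,
    obs: Vector,
) -> Tuple[Vector, Vector]:
    """Generate the belief first, then the ego action conditioned on it."""
    belief = belief_head(history, obs)
    action = action_head(history, obs, belief)
    return belief, action
```

In this sketch the belief is an explicit input to `action_head`, mirroring the idea that the generated belief directs the action rather than being a side prediction.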

Paper Structure (6 sections, 8 equations, 5 figures, 7 tables)

This paper contains 6 sections, 8 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Self-augmented belief conditioning in the self-confirming transformer (SCT). SCT first generates a belief on the opponent’s action $a_{-i}^t$ (the green block), a fictitious token that is not observed from the environment. The transformer then generates its action based on this belief.
  • Figure 2: The predator-prey tasks in multi-agent particle environment.
  • Figure 3: The normalized scores (higher is better) of the MADT and MATD3 policies when played against the nonstationary opponent in simple-tag (left) and simple-world (right). The opponent employs a blend of MATD3 and the random policy, with the blending rate $p$ shown on the x-axis. The green dashed line indicates the benchmark performance of the testing task.
  • Figure 4: A comparison between RMADT and SCT operation. The belief generation in RMADT does not direct the action generation. Even though the two share the same loss function, RMADT only aims to accurately predict the opponent's action, and the resulting action is not self-confirming.
  • Figure 5: The normalized scores of SCT in simple-tag and simple-world environments. SCT outperforms MADT when facing nonstationary opponents.
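
The nonstationary opponent in Figure 3 blends a trained policy with a random one at rate $p$. A minimal Python sketch of such a blend follows. It assumes that $p$ is the per-step probability of choosing the random policy, which the caption does not specify, and the names `blended_opponent`, `trained_policy`, and `random_policy` are hypothetical.

```python
import random
from typing import Any, Callable

Policy = Callable[[Any], Any]


def blended_opponent(
    trained_policy: Policy,
    random_policy: Policy,
    p: float,
    rng: random.Random,
) -> Policy:
    """Return a policy that follows random_policy with probability p per step.

    Assumption: p is the probability of the random policy; the source only
    calls p the "blending rate".
    """
    def policy(obs: Any) -> Any:
        # rng.random() lies in [0, 1), so p = 0 never picks the random policy
        # and p = 1 always does.
        if rng.random() < p:
            return random_policy(obs)
        return trained_policy(obs)

    return policy
```

For example, `blended_opponent(trained, rand, 0.3, random.Random(0))` would yield an opponent that deviates from the trained policy on roughly 30% of steps under this assumption.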

Theorems & Definitions (1)

  • Definition 1: Self-Confirming Equilibrium
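
The paper's Definition 1 is not reproduced in this summary. As a rough illustration of the two conditions named in the abstract (the belief matches the opponent's actual action, and the ego action is optimal under the belief), here is a one-shot prisoner's dilemma check in Python. The payoff values and the function names are assumptions chosen for illustration, and the one-shot setting simplifies the iterated game used in the paper.

```python
# Row player's payoffs in a one-shot prisoner's dilemma, indexed by
# (ego action, opponent action). Illustrative values, not from the paper.
PAYOFF = {
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}
ACTIONS = ("C", "D")


def best_responses(belief: str) -> set:
    """Ego actions that maximize payoff if the opponent plays `belief`."""
    best = max(PAYOFF[(a, belief)] for a in ACTIONS)
    return {a for a in ACTIONS if PAYOFF[(a, belief)] == best}


def is_self_confirming(ego_action: str, belief: str, opp_action: str) -> bool:
    """Check belief consistency and best response for a single round."""
    belief_consistent = belief == opp_action
    best_response = ego_action in best_responses(belief)
    return belief_consistent and best_response
```

With these payoffs, defecting against a correctly predicted defector satisfies both conditions, while cooperating against a correctly predicted cooperator fails the best-response condition because defection pays more in the one-shot game.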