Best Response Shaping

Milad Aghajohari; Tim Cooijmans; Juan Agustin Duque; Shunichi Akatsuka; Aaron Courville

TL;DR

Best Response Shaping (BRS) targets multi-agent reinforcement learning in partially competitive settings by differentiating through a detective that approximates the opponent's best response and by conditioning this detective on the agent's policy via a state-aware question-answering mechanism. This approach addresses the horizon-exploitation vulnerability of prior methods like LOLA and POLA and introduces a cooperative self-play regularization to promote robust reciprocity. Empirical results in the Iterated Prisoner's Dilemma and the Coin Game show BRS yields strong retaliation against defection, full cooperation against best-response opponents approximated by MCTS, and superior self-cooperation compared with POLA. The work broadens the applicability of MARL to general-sum environments and provides a scalable route toward improved social welfare through learned reciprocity-based cooperation.

Abstract

We investigate the challenge of multi-agent deep reinforcement learning in partially competitive environments, where traditional methods struggle to foster reciprocity-based cooperation. LOLA and POLA agents learn reciprocity-based cooperative policies by differentiating through a few look-ahead optimization steps of their opponent. However, these techniques share a key limitation: because they consider only a few optimization steps, a learning opponent that takes many steps to optimize its return may exploit them. In response, we introduce a novel approach, Best Response Shaping (BRS), which differentiates through an opponent approximating the best response, termed the "detective." To condition the detective on the agent's policy in complex games, we propose a state-aware differentiable conditioning mechanism, facilitated by a question answering (QA) method that extracts a representation of the agent based on its behaviour on specific environment states. To empirically validate our method, we showcase its enhanced performance against a Monte Carlo Tree Search (MCTS) opponent, which serves as an approximation to the best response in the Coin Game. This work expands the applicability of multi-agent RL in partially competitive environments and provides a new pathway towards achieving improved social welfare in general-sum games.

Paper Structure (38 sections, 20 equations, 9 figures, 4 tables, 1 algorithm)

Introduction
Background
Multi Agent Reinforcement Learning
Social Dilemmas and the Iterated Prisoner's Dilemma
Related Work
Best Response Shaping
Best Response Agent to the Best Response Opponent
Detective Opponent Training
Conditioning on Agent's Policy
Simulation Based Question Answering
Differentiating Through the Detective
Cooperation Regularization via Self-Play with Reward Sharing
Experiments
Iterated Prisoner's Dilemma
The Coin Game
...and 23 more sections

Figures (9)

Figure 1: The detective is trained using agents sampled from a replay buffer, which contains agents encountered during training. Additional noise is incorporated to broaden the range of policies.
Figure 2: Illustration of the policies of agents trained with BRS and BRS-NOSP in a finite Iterated Prisoner's Dilemma game of length $6$. The agents are trained against a tree search detective maximizing its own return. BRS agents learn tit-for-tat, a policy that cooperates initially and mirrors the opponent's behavior thereafter. BRS-NOSP agents learn cynic-tit-for-tat (CTFT): they defect initially but mirror the opponent's behavior thereafter.
Figure 3: Comparison of BRS and POLA on the Coin Game. We evaluate the agent's returns against different opponents: an Always Defect opponent (AD), an Always Cooperate opponent (AC), a Monte Carlo Tree Search opponent (MCTS), and the agent itself (Self).
Figure 4: BRS-NORB is equivalent to BRS but with no replay buffer and no added noise. Its performance is close to that of BRS, with more variance. BRS-NOSP is equivalent to BRS but with no self-play.
Figure 5: Outcomes of $1$-vs-$1$ Coin Game matches lasting $50$ rounds between a range of agents. Each cell reports the return achieved by the corresponding agent, averaged over 32 independent games. No games are recorded between the MCTS agent and itself, as this pairing is not possible.
...and 4 more figures
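
The setting in the Figure 2 caption (a tit-for-tat agent facing an opponent that maximizes its own return in an Iterated Prisoner's Dilemma of length 6) can be illustrated with a small sketch. This is not the paper's detective or its tree search: it is a brute-force enumeration of all opponent action sequences, which is enough to find a best response against a deterministic agent at this horizon. The payoff table and the names `PAYOFF`, `tit_for_tat`, `play`, and `best_response` are assumptions introduced for illustration only.

```python
from itertools import product

# Assumed payoff table (not given in the text above):
# (my_action, their_action) -> my reward. "C" = cooperate, "D" = defect.
PAYOFF = {("C", "C"): -1, ("C", "D"): -3, ("D", "C"): 0, ("D", "D"): -2}
HORIZON = 6  # game length taken from the Figure 2 caption


def tit_for_tat(opponent_history):
    """Cooperate first, then mirror the opponent's previous action."""
    return "C" if not opponent_history else opponent_history[-1]


def play(agent_policy, opponent_actions):
    """Return (agent_return, opponent_return) for a fixed opponent sequence."""
    agent_return = opponent_return = 0
    for t, opponent_action in enumerate(opponent_actions):
        # The agent only sees the opponent's actions from earlier rounds.
        agent_action = agent_policy(opponent_actions[:t])
        agent_return += PAYOFF[(agent_action, opponent_action)]
        opponent_return += PAYOFF[(opponent_action, agent_action)]
    return agent_return, opponent_return


def best_response(agent_policy, horizon=HORIZON):
    """Exhaustively search opponent sequences for the highest opponent return."""
    return max(
        product("CD", repeat=horizon),
        key=lambda sequence: play(agent_policy, sequence)[1],
    )


if __name__ == "__main__":
    sequence = best_response(tit_for_tat)
    print("".join(sequence), play(tit_for_tat, sequence))
```

Under these assumed payoffs, the best response to tit-for-tat is to cooperate for five rounds and defect in the last one (`CCCCCD`), giving returns of -8 for the agent and -5 for the opponent. Enumeration costs $2^6 = 64$ rollouts here and grows exponentially with the horizon, which is one reason an approximate best response is needed in larger games such as the Coin Game.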
