Scaling Opponent Shaping to High Dimensional Games

Akbir Khan; Timon Willi; Newton Kwan; Andrea Tacchetti; Chris Lu; Edward Grefenstette; Tim Rocktäschel; Jakob Foerster

TL;DR

This paper tackles the challenge of scaling opponent shaping (OS) to high-dimensional, temporally-extended general-sum games. It introduces Shaper, a memory-efficient OS method that captures both context and history with a single recurrent agent and averages hidden states across the batch to align co-player updates. In experiments on the Iterated Prisoner's Dilemma (IPD) and Iterated Matching Pennies (IMP), their "in the Matrix" gridworld variants, and the CoinGame, Shaper achieves better individual and collective outcomes than prior OS methods and Naive Learners, and the results highlight the importance of memory and batch averaging. The results establish OS as a scalable approach for complex multi-agent settings, reveal limitations of existing benchmarks such as the CoinGame, and underscore ethical considerations for shaping in real-world systems.
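
The batch-averaging step mentioned above can be illustrated with a minimal sketch. It assumes the recurrent agent keeps one hidden-state vector per parallel environment and that the average is taken at an episode boundary; the function name `average_hidden_across_batch` and the list-of-lists layout are illustrative assumptions, not the authors' implementation.

```python
from typing import List


def average_hidden_across_batch(hidden: List[List[float]]) -> List[List[float]]:
    """Replace every hidden state in the batch with the batch mean.

    `hidden` has shape (batch_size, hidden_dim). Hypothetical helper:
    the layout and the point at which it is called are assumptions.
    """
    if not hidden:
        raise ValueError("hidden must contain at least one state")
    batch_size = len(hidden)
    hidden_dim = len(hidden[0])
    # Mean over the batch axis, computed per hidden unit.
    mean = [sum(row[j] for row in hidden) / batch_size for j in range(hidden_dim)]
    # Broadcast the mean back so every environment starts from the same state.
    return [list(mean) for _ in range(batch_size)]


# Three parallel environments, hidden size 2 (toy numbers).
hidden = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
averaged = average_hidden_across_batch(hidden)
```

Because the co-player is updated from the whole batch of trajectories, giving every environment the same averaged memory is one plausible way to keep the shaping agent's context consistent with that single shared update.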

Abstract

In multi-agent settings with mixed incentives, methods developed for zero-sum games have been shown to lead to detrimental outcomes. To address this issue, opponent shaping (OS) methods explicitly learn to influence the learning dynamics of co-players and empirically lead to improved individual and collective outcomes. However, OS methods have only been evaluated in low-dimensional environments due to the challenges associated with estimating higher-order derivatives or scaling model-free meta-learning. Alternative methods that scale to more complex settings either converge to undesirable solutions or rely on unrealistic assumptions about the environment or co-players. In this paper, we successfully scale an OS-based approach to general-sum games with temporally-extended actions and long time horizons for the first time. After analysing the representations of the meta-state and history used by previous algorithms, we propose a simplified version called Shaper. We show empirically that Shaper leads to improved individual and collective outcomes in a range of challenging settings from the literature. We further formalize a technique previously implicit in the literature, and analyse its contribution to opponent shaping. We show empirically that this technique is helpful for the functioning of prior methods in certain environments. Lastly, we show that previous environments, such as the CoinGame, are inadequate for analysing temporally-extended general-sum interactions.

Paper Structure (29 sections, 5 equations, 23 figures, 19 tables, 2 algorithms)

Introduction
Background
Shaper: A Scalable OS Method
Experiments
Results
Related Work
Conclusion
Ethics Statement*
Shaper details
Matrix Game Details
Payoff Matrices
Training Details
Evaluation
Matrix Game Results
Generalisability over long time period
...and 14 more sections

Figures (23)

Figure 1: Evaluation results over a single trial (with co-player), comprising 100 seeds, for the CoinGame. (a) Reward, (b) Shaper's frequency of picking up its own colour coin, (c) state visitation, and (d) the number of coins picked up per episode. Shaper successfully exploits its co-player, showing high state visitation for DC and strong competency.
Figure 2: Render of IPDitM, a multi-step, gridworld-based general-sum game. Agents with restricted visibility and orientation traverse a grid picking up either Defect or Cooperate coins. (left) An initial state of the game before either agent has a coin. Once agents pick up a coin, their appearance changes, and they can interact. (right) The orange agent has collected a coin and the blue agent is firing its interact beam.
Figure 3: Evaluation results over a single trial (with co-player), comprising 100 seeds, for the IPDitM. (a) Mean reward per timestep, (b) mean ratio of picking up cooperate coins per soft-reset, (c) total number of coins picked up per soft-reset. The independent learner is shown to contrast what learning without a meta-agent would look like.
Figure 4: Hardstop Challenge: Average reward per timestep over an evaluation trial for Shaper (a) and GS (b) against a Naive Learner in the IPD. Here GS fails to generalise to a co-player that stops learning after an unknown number of timesteps (unseen during training). (c) State visitation throughout the evaluation shows that Shaper responds to the co-player's frozen policy by moving into either DD (the best response to a defective agent) or DC (the best response to a fully cooperative agent).
Figure 5: Reward per timestep throughout training for the "Average" challenge. Results are presented over matrix games for 5 seeds. In (a) and (b) we evaluate OS methods on IPD, and in (c) and (d) we evaluate on IMP. We note that batching only helps M-FOS in the IPD. This indicates that batching is only useful in sufficiently diverse environments, relative to the OS method.
...and 18 more figures
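
Several captions above report state visitation over the joint-action states CC, CD, DC and DD. As a rough illustration of how such a statistic could be computed from a rollout, the sketch below counts joint actions and normalises them. The function name `state_visitation`, the input format, and the convention that the first letter is the shaping agent's action are assumptions made for this example, not details taken from the paper.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Joint-action states; by assumption the first letter is the shaping
# agent's action and the second is the co-player's.
JOINT_STATES = ("CC", "CD", "DC", "DD")


def state_visitation(joint_actions: List[Tuple[str, str]]) -> Dict[str, float]:
    """Return the fraction of timesteps spent in each joint-action state."""
    if not joint_actions:
        raise ValueError("need at least one joint action")
    counts = Counter(own + other for own, other in joint_actions)
    total = len(joint_actions)
    # Counter returns 0 for states that never occurred.
    return {state: counts[state] / total for state in JOINT_STATES}


# Toy rollout of four timesteps (made-up data, not results from the paper).
rollout = [("C", "C"), ("D", "C"), ("D", "C"), ("D", "D")]
visitation = state_visitation(rollout)
```

Under this convention, a high DC fraction means the shaping agent defects while the co-player cooperates, which is the pattern the captions describe as exploitation.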
