Maximizing utility in multi-agent environments by anticipating the behavior of other learners

Angelos Assos; Yuval Dagan; Constantinos Daskalakis

TL;DR

The paper investigates how a planning optimizer can anticipate and exploit the behavior of a learning-based opponent in repeated two-player games. It delivers a positive, computable result for zero-sum games: against a Replicator Dynamics learner, the optimizer can achieve its maximal cumulative payoff by adopting a constant mixed strategy $x^*$, with the optimal reward computable via a convex log-sum-exp formulation. It also analyzes the discrete-time MWU setting, showing that the discrete-time payoff is never lower than the continuous-time benchmark and providing bounds on the possible gains. In contrast, for general-sum games, the authors establish a computational hardness result: unless $P=NP$, no FPTAS exists for maximizing the optimizer’s utility against a learner that best-responds to the history, highlighting a fundamental gap between zero-sum tractability and general-sum intractability. Overall, the work clarifies when planning against learning agents is computationally feasible and when it is provably hard, offering a control-theoretic lens via the Hamilton-Jacobi-Bellman equation for zero-sum cases and suggesting directions for future research on broader learner classes and problem structures.

Abstract

Learning algorithms are often used to make decisions in sequential decision-making environments. In multi-agent settings, the decisions of each agent can affect the utilities/losses of the other agents. Therefore, if an agent is good at anticipating the behavior of the other agents, in particular how they will make decisions in each round as a function of their experience thus far, it could try to judiciously make its own decisions over the rounds of the interaction so as to influence the other agents to behave in a way that ultimately benefits its own utility. In this paper, we study repeated two-player games involving two types of agents: a learner, which employs an online learning algorithm to choose its strategy in each round; and an optimizer, which knows the learner's utility function and the learner's online learning algorithm. The optimizer wants to plan ahead to maximize its own utility, while taking into account the learner's behavior. We provide two results: a positive result for repeated zero-sum games and a negative result for repeated general-sum games. Our positive result is an algorithm for the optimizer, which exactly maximizes its utility against a learner that plays the Replicator Dynamics -- the continuous-time analogue of Multiplicative Weights Update (MWU). Additionally, we use this result to provide an algorithm for the optimizer against MWU, i.e., for the discrete-time setting, which guarantees an average utility for the optimizer that is higher than the value of the one-shot game. Our negative result shows that, unless P=NP, there is no Fully Polynomial Time Approximation Scheme (FPTAS) for maximizing the utility of an optimizer against a learner that best-responds to the history in each round. Yet, this still leaves open the question of whether there exists a polynomial-time algorithm that optimizes the utility up to $o(T)$.
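To make the interaction model in the abstract concrete, the following is a minimal sketch of a discrete-time MWU learner facing an optimizer that commits to one fixed mixed strategy in a zero-sum game. It is not the paper's algorithm, only the setting it plans against. The function names, the uniform starting weights, and the assumption that the learner updates on expected rewards against the optimizer's mixed strategy are illustrative choices, not details taken from the paper.

```python
import math


def mwu_strategy(cum_rewards, eta):
    """Learner's mixed strategy: weights proportional to exp(eta * cumulative reward)."""
    top = max(cum_rewards)  # subtract the max for numerical stability
    weights = [math.exp(eta * (r - top)) for r in cum_rewards]
    total = sum(weights)
    return [w / total for w in weights]


def play_against_mwu(A, x, eta, T):
    """Optimizer plays the fixed mixed strategy x for T rounds against MWU.

    A[i][j] is the optimizer's payoff; the game is zero-sum, so the learner
    receives -A[i][j]. Returns the optimizer's average expected utility.
    """
    n, m = len(A), len(A[0])
    cum = [0.0] * m  # learner's cumulative expected reward per action (assumed start: all zero)
    total = 0.0
    for _ in range(T):
        y = mwu_strategy(cum, eta)
        # Expected optimizer payoff of each learner action against x.
        col = [sum(x[i] * A[i][j] for i in range(n)) for j in range(m)]
        total += sum(col[j] * y[j] for j in range(m))
        for j in range(m):
            cum[j] -= col[j]  # the learner's reward is the negative of the optimizer's
    return total / T


# Matching pennies: the one-shot value is 0 and (1/2, 1/2) is a min-max strategy.
A = [[1.0, -1.0], [-1.0, 1.0]]
avg_minmax = play_against_mwu(A, [0.5, 0.5], eta=0.1, T=100)
avg_pure = play_against_mwu(A, [1.0, 0.0], eta=0.1, T=100)
```

In this toy game the min-max strategy earns exactly the value of the game, while a fixed pure strategy is gradually exploited by the learner and earns less. The abstract's claim concerns an optimizer that does better than the one-shot value, which this sketch does not attempt to reproduce.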

Paper Structure (22 sections, 18 theorems, 65 equations, 1 figure, 2 algorithms)

Introduction
Setting.
Zero-sum games.
General-sum games.
Preliminaries
Algorithms for the learner
Our results
Zero sum games
Continuous-time games.
Discrete-time games.
A lower bound for general-sum games
Related Work
Optimizing against no regret learners.
Optimizing against MWU in $2\times 2$-games.
Regret lower bounds for online learners.
...and 7 more sections

Key Result

Theorem 1

Let $A \in \mathbb{R}^{n\times m}$ be a zero-sum game matrix and let $\eta,T > 0$. There exists an algorithm, running in time polynomial in $n$ and $m$, that finds a strategy $\{x(t)\}_{t \in [0,T]}$ maximizing the utility of an optimizer against the Replicator Dynamics with parameter $\eta$.
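As a hedged illustration of the log-sum-exp formulation mentioned in the TL;DR, the sketch below evaluates the cumulative utility of a constant optimizer strategy $x$ against the Replicator Dynamics. It assumes the learner starts from the uniform distribution and plays $y_j(t) \propto \exp(-\eta t\,(x^\top A)_j)$, in which case integrating $x^\top A\, y(t)$ over $[0,T]$ gives $\frac{1}{\eta}\big(\ln m - \ln \sum_j \exp(-\eta T (x^\top A)_j)\big)$. The function names are hypothetical, and the grid search is only a stand-in for a proper convex solver; it is not the paper's algorithm.

```python
import math


def logsumexp(vals):
    """Numerically stable log of a sum of exponentials."""
    top = max(vals)
    return top + math.log(sum(math.exp(v - top) for v in vals))


def constant_strategy_utility(A, x, eta, T):
    """Closed-form cumulative utility over [0, T] of a constant strategy x.

    Assumes a uniformly initialised Replicator Dynamics learner in a
    zero-sum game where A[i][j] is the optimizer's payoff.
    """
    n, m = len(A), len(A[0])
    col = [sum(x[i] * A[i][j] for i in range(n)) for j in range(m)]
    return (math.log(m) - logsumexp([-eta * T * c for c in col])) / eta


def simulated_utility(A, x, eta, T, steps=20000):
    """Midpoint-rule integration of x^T A y(t), used only as a cross-check."""
    n, m = len(A), len(A[0])
    col = [sum(x[i] * A[i][j] for i in range(n)) for j in range(m)]
    dt = T / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt
        logits = [-eta * t * c for c in col]
        z = logsumexp(logits)
        y = [math.exp(v - z) for v in logits]  # learner's strategy at time t
        total += sum(c * yj for c, yj in zip(col, y)) * dt
    return total


def best_constant_strategy_2rows(A, eta, T, grid=1000):
    """Grid search over x = (p, 1 - p); only for two optimizer actions."""
    def value(k):
        p = k / grid
        return constant_strategy_utility(A, [p, 1.0 - p], eta, T)

    best = max(range(grid + 1), key=value)
    p = best / grid
    return [p, 1.0 - p], value(best)


A = [[2.0, -1.0], [-1.0, 1.0]]
closed = constant_strategy_utility(A, [0.3, 0.7], eta=0.5, T=4.0)
numeric = simulated_utility(A, [0.3, 0.7], eta=0.5, T=4.0)

pennies = [[1.0, -1.0], [-1.0, 1.0]]
x_star, value = best_constant_strategy_2rows(pennies, eta=1.0, T=1.0)
```

Because the log-sum-exp term is convex in $x$, maximizing the closed form over the simplex is a convex problem; the grid search here only works for two optimizer actions and is meant to keep the example dependency-free.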

Figures (1)

Figure 1: Graph $G$

Theorems & Definitions (36)

Definition 1: Historical Rewards for the learner
Definition 2: Value of the game
Definition 3: Best Response Set
Definition 4: Min-max strategies
Definition 5: MWU Algorithm
Definition 6: Replicator Dynamics
Definition 7: Best Response Algorithm
Theorem 1: informal
Proposition 1: informal
Proposition 2: informal
...and 26 more
