Multi-Agent Learning in Contextual Games under Unknown Constraints

Anna M. Maddux; Maryam Kamgarpour

TL;DR

This work tackles learning in repeated contextual games with unknown rewards and unknown constraints. It introduces c.z.AdaNormalGP, a Gaussian-process-based no-regret, no-violation algorithm that leverages kernel-induced similarities across contexts and actions, and it formalizes constrained contextual coarse correlated equilibria (c.z.CCE) that emerge when all players follow such strategies. Theoretical results establish kernel-dependent regret and sublinear cumulative constraint violations, with separate analysis for finite and infinite context spaces. Empirical results on multi-building temperature control and synthetic games validate the method's ability to learn effective, constraint-satisfying policies in complex, context-rich environments.
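To illustrate what "kernel-induced similarities across contexts and actions" can mean in practice, here is a minimal sketch. It assumes a squared-exponential (RBF) kernel over (action, context) pairs; the kernel choice, the lengthscale, and the numeric inputs are all hypothetical and are not taken from the paper.

```python
import math


def rbf_kernel(x, y, lengthscale=1.0):
    """Squared-exponential kernel between two equal-length feature tuples.

    Assumption: the paper's kernel is not specified here; RBF is used
    purely as a familiar example of a similarity measure.
    """
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * lengthscale ** 2))


# Hypothetical points, each encoded as (action, context).
observed = (0.5, 20.0)   # a pair for which feedback was already received
near = (0.5, 20.5)       # same action, slightly different context
far = (0.5, 30.0)        # same action, very different context

sim_near = rbf_kernel(observed, near, lengthscale=2.0)
sim_far = rbf_kernel(observed, far, lengthscale=2.0)

# A larger kernel value means feedback at `observed` is more informative
# about the unknown reward or constraint value at the other point.
print(f"similarity to nearby context:  {sim_near:.4f}")
print(f"similarity to distant context: {sim_far:.6f}")
```

Under this assumed kernel, an observation made in one context says a lot about nearby contexts and almost nothing about distant ones, which is the kind of structure a Gaussian-process model can exploit.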

Abstract

We consider the problem of learning to play a repeated contextual game with unknown reward and unknown constraint functions. Such games arise in applications where each agent's action needs to belong to a feasible set, but the feasible set is a priori unknown. For example, in constrained multi-agent reinforcement learning, the constraints on the agents' policies are a function of the unknown dynamics and hence are themselves unknown. Under kernel-based regularity assumptions on the unknown functions, we develop a no-regret, no-violation approach which exploits similarities among different reward and constraint outcomes. The no-violation property ensures that the time-averaged sum of constraint violations converges to zero as the game is repeated. We show that our algorithm, referred to as c.z.AdaNormalGP, obtains kernel-dependent regret bounds and that the cumulative constraint violations have sublinear kernel-dependent upper bounds. In addition, we introduce the notion of constrained contextual coarse correlated equilibria (c.z.CCE) and show that $\epsilon$-c.z.CCEs can be approached whenever players follow a no-regret, no-violation strategy. Finally, we experimentally demonstrate the effectiveness of c.z.AdaNormalGP on an instance of multi-agent reinforcement learning.
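One way to make the no-violation property concrete is sketched below. The notation is assumed for illustration and is not taken from the abstract: $g_m$ denotes the $m$-th unknown constraint function, the joint action $a^t$ is feasible in context $z^t$ when $g_m(a^t,z^t)\le 0$, and $[x]_+=\max\{x,0\}$.

$$V_m(T)=\sum_{t=1}^{T}\left[g_m(a^t,z^t)\right]_+,\qquad \frac{V_m(T)}{T}\to 0 \ \text{ as } T\to\infty.$$

Under this convention, any sublinear bound on the cumulative violation suffices. For example, if $V_m(T)=O(\sqrt{T})$, then $V_m(T)/T=O(1/\sqrt{T})$, which vanishes as the game is repeated.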

Paper Structure (19 sections, 15 theorems, 79 equations, 3 figures, 3 algorithms)

Introduction
Problem Setup
Feedback model and regularity assumption
The c.z.AdaNormalGP Algorithm
Finite number of contexts
Game equilibria
Experiments
Temperature controller design
Conclusion and further discussion
Appendix
Supplementary Material for Section \ref{sec:c.z.AdaNormalGP}
Finite number of contexts: Proof of Theorem \ref{thm:finite_Z}
Infinite (large) number of contexts
Supplementary Material for Remark \ref{rem:general_bounds}
Expert algorithm bounds for finite context space
...and 4 more sections

Key Result

Theorem 1

Fix $\delta\in(0,1)$. Under Assumptions \ref{ass:feasibility_context}-\ref{ass:regularity_context}, if a player plays according to c.z.AdaNormalGP with $p^t(z^t)$ computed according to Algorithm \ref{alg:strategy_finite_Z} and $\beta_m^t=B_m+\sigma_m\sqrt{2(\gamma_m^{t-1}+1+\log(2(M+1)/\delta))}$ for all $m\in\{0\}\cup\ldots$ […] where $B=1+\frac{3}{2}\frac{1}{K}\sum_{a_i=1}^K (1+\log(1+C_i^t(a_i)))\leq \frac{5}{2}+\frac{3}{2}\ldots$ […]
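The confidence parameter $\beta_m^t$ in the theorem can be evaluated directly from its formula. The sketch below only transcribes that expression; the function name and the example values are hypothetical. The comments read $B_m$ as a norm bound, $\sigma_m$ as a noise parameter, and $\gamma_m^{t-1}$ as an information-gain term, which is the usual interpretation in GP bandit analyses and is an assumption here, since the excerpt does not define these symbols.

```python
import math


def confidence_width(B_m, sigma_m, gamma_prev, M, delta):
    """Evaluate beta_m^t = B_m + sigma_m * sqrt(2*(gamma + 1 + log(2*(M+1)/delta))).

    Assumed meanings (not defined in the excerpt):
      B_m        -- bound on the norm of the m-th unknown function
      sigma_m    -- noise parameter of the feedback for function m
      gamma_prev -- information-gain term gamma_m^{t-1}
      M          -- number of constraint functions
      delta      -- failure probability, must lie in (0, 1)
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    inner = gamma_prev + 1.0 + math.log(2.0 * (M + 1) / delta)
    return B_m + sigma_m * math.sqrt(2.0 * inner)


# Hypothetical values, chosen only to show how the width behaves.
beta_early = confidence_width(B_m=1.0, sigma_m=0.1, gamma_prev=2.0, M=1, delta=0.05)
beta_late = confidence_width(B_m=1.0, sigma_m=0.1, gamma_prev=10.0, M=1, delta=0.05)
print(beta_early, beta_late)
```

The width grows with the information-gain term and with a smaller $\delta$, so a more demanding confidence level or a richer kernel leads to wider confidence bounds.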

Figures (3)

Figure 1: Mean temperature over $48$ hours, where the control inputs are sampled from the weights learned by c.AdaNormalGP (top) and GPMW (bottom).
Figure 2: Mean energy cost achieved by c.AdaNormalGP, GPMW, and control inputs sampled uniformly at random, for each round $t=1,\ldots,T$. The minimum feasible cost and the minimum cost are found exhaustively over the entire action space.
Figure 3: Regret and cumulative constraint violations for players "random", "GPMW", "c.GPMW", "c.AdaNormalGP", and "c.z.AdaNormalGP". Shaded areas represent $\pm$ one standard deviation.

Theorems & Definitions (28)

Remark 1
Theorem 1
Corollary 1
Definition 1
Proposition 1
Lemma 1
Theorem 2
Lemma 2
proof
Lemma 3
...and 18 more