Multi-Agent Learning in Contextual Games under Unknown Constraints

Anna M. Maddux, Maryam Kamgarpour

TL;DR

This work tackles learning in repeated contextual games with unknown rewards and unknown constraints. It introduces c.z.AdaNormalGP, a Gaussian-process-based no-regret, no-violation algorithm that leverages kernel-induced similarities across contexts and actions, and it formalizes constrained contextual coarse correlated equilibria (c.z.CCE) that emerge when all players follow such strategies. Theoretical results establish kernel-dependent regret and sublinear cumulative constraint violations, with separate analysis for finite and infinite context spaces. Empirical results on multi-building temperature control and synthetic games validate the method's ability to learn effective, constraint-satisfying policies in complex, context-rich environments.

Abstract

We consider the problem of learning to play a repeated contextual game with unknown reward and unknown constraint functions. Such games arise in applications where each agent's action needs to belong to a feasible set, but the feasible set is a priori unknown. For example, in constrained multi-agent reinforcement learning, the constraints on the agents' policies are a function of the unknown dynamics and hence are themselves unknown. Under kernel-based regularity assumptions on the unknown functions, we develop a no-regret, no-violation approach which exploits similarities among different reward and constraint outcomes. The no-violation property ensures that the time-averaged sum of constraint violations converges to zero as the game is repeated. We show that our algorithm, referred to as c.z.AdaNormalGP, obtains kernel-dependent regret bounds and that the cumulative constraint violations have sublinear kernel-dependent upper bounds. In addition, we introduce the notion of constrained contextual coarse correlated equilibria (c.z.CCE) and show that $ε$-c.z.CCEs can be approached whenever players follow a no-regret, no-violation strategy. Finally, we experimentally demonstrate the effectiveness of c.z.AdaNormalGP on an instance of multi-agent reinforcement learning.
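The no-violation property can be illustrated with a minimal Python sketch. It rests on assumptions that are not stated in the abstract: an action is taken to be feasible when a constraint value g is at most zero, the per-round violation is measured as max(0, g), and the constraint values are a synthetic sequence decaying like 1/sqrt(t), chosen only for illustration. The function names are hypothetical. Under these assumptions the cumulative violation grows sublinearly, so its time average shrinks toward zero.

```python
import math

def cumulative_violation(constraint_values):
    """Sum of positive parts: only rounds with g > 0 count as violations.

    Assumes the (hypothetical) convention that an action is feasible when g <= 0.
    """
    return sum(max(0.0, g) for g in constraint_values)

def time_averaged_violation(constraint_values):
    """Cumulative violation divided by the number of rounds played."""
    return cumulative_violation(constraint_values) / len(constraint_values)

def synthetic_values(T):
    """Synthetic constraint values decaying like 1/sqrt(t), for illustration only."""
    return [1.0 / math.sqrt(t) for t in range(1, T + 1)]

# Cumulative violation grows roughly like sqrt(T), so the average decreases with T.
averages = {T: time_averaged_violation(synthetic_values(T))
            for T in (10, 100, 1000, 10000)}
for T, avg in averages.items():
    print(f"T={T:>6}  time-averaged violation={avg:.4f}")
```

A sequence whose violations did not decay (for example, a constant positive g) would keep the time average bounded away from zero, which is the behavior a no-violation strategy rules out.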

Paper Structure (19 sections, 15 theorems, 79 equations, 3 figures, 3 algorithms)


Key Result

Theorem 1

Fix $\delta\in(0,1)$. Under Assumptions ass:feasibility_context-ass:regularity_context, if a player plays according to c.z.AdaNormalGP with $p^t(z^t)$ computed according to Algorithm alg:strategy_finite_Z and $\beta_m^t=B_m+\sigma_m\sqrt{2(\gamma_m^{t-1}+1+\log(2(M+1)/\delta))}$ for all $m\in\{0\}\ldots$ where $B=1+\frac{3}{2}\frac{1}{K}\sum_{a_i=1}^K (1+\log(1+C_i^t(a_i)))\leq \frac{5}{2}+\frac{3}{2}\ldots$
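The confidence parameter $\beta_m^t$ in the statement above is a closed-form expression, so it can be evaluated directly. The sketch below assumes the usual GP-bandit reading of the symbols, which this excerpt does not define: $B_m$ as a norm bound on the $m$-th unknown function, $\sigma_m$ as a noise level, $\gamma_m^{t-1}$ as an information-gain term, and $M$ as the number of constraints. The function name and the numeric values are hypothetical.

```python
import math

def confidence_parameter(B_m, sigma_m, gamma_prev, M, delta):
    """Evaluate beta_m^t = B_m + sigma_m * sqrt(2 * (gamma + 1 + log(2 (M + 1) / delta))).

    gamma_prev stands for gamma_m^{t-1}; the interpretation of every symbol
    is an assumption, since the excerpt does not define them.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    inner = gamma_prev + 1.0 + math.log(2.0 * (M + 1) / delta)
    return B_m + sigma_m * math.sqrt(2.0 * inner)

# Hypothetical values, chosen only to show how the parameter reacts to its inputs.
beta = confidence_parameter(B_m=1.0, sigma_m=0.1, gamma_prev=5.0, M=2, delta=0.05)
beta_stricter = confidence_parameter(B_m=1.0, sigma_m=0.1, gamma_prev=5.0, M=2, delta=0.01)
print(f"beta (delta=0.05): {beta:.4f}")
print(f"beta (delta=0.01): {beta_stricter:.4f}")
```

A smaller $\delta$ or a larger information-gain term widens the confidence parameter, which matches the form of the expression: both enter only through the square-root term.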

Figures (3)

  • Figure 1: Mean temperature over $48$ hours, where the control inputs are sampled from the weights learned by c.AdaNormalGP (top) and GPMW (bottom).
  • Figure 2: Mean energy cost achieved by c.AdaNormalGP, GPMW, and control inputs sampled uniformly at random for each round $t=1,\ldots,T$. The minimum feasible cost and the minimum cost are found exhaustively over the entire action space.
  • Figure 3: Regret and cumulative constraint violations for players "random", "GPMW", "c.GPMW", "c.AdaNormalGP", and "c.z.AdaNormalGP". Shaded areas represent $\pm$ one standard deviation.

Theorems & Definitions (28)

  • Remark 1
  • Theorem 1
  • Corollary 1
  • Definition 1
  • Proposition 1
  • Lemma 1
  • Theorem 2
  • Lemma 2
  • proof
  • Lemma 3
  • ...and 18 more