Decentralized Optimal Equilibrium Learning in Stochastic Games via Single-bit Feedback

Seref Taha Kiremitci, Ahmed Said Donmez, Muhammed O. Sayin

TL;DR

This work tackles decentralized equilibrium selection in finite discounted stochastic games under severe information constraints. It introduces DOEL (Decentralized Optimal Equilibrium Learning), which steers agents toward welfare-optimizing equilibria using a single-bit signaling mechanism. The framework supports two learning schemes, Explore-and-Commit (E&C-DOEL) and Online-DOEL, both of which accommodate heterogeneous agent capabilities (model-based or model-free) and come with explicit finite-time regret bounds. Under mild regularity conditions, both schemes achieve expected regret that grows logarithmically with time, even though each agent exchanges only one bit of feedback per round and exploration proceeds in phases. Theoretical analysis and illustrative simulations support the results, showing convergence toward welfare-maximizing equilibria under both sum- and product-based social welfare objectives at a communication cost of one bit per agent per round. The approach opens avenues for scalable, privacy-preserving coordination in dynamic, strategic environments where centralized training or full model knowledge is impractical.

Abstract

We study decentralized equilibrium selection in stochastic games under severe information and communication constraints. In such settings, convergence to equilibrium alone is insufficient, as stochastic games typically admit many equilibria with markedly different welfare properties. We address decentralized optimal equilibrium selection, where agents coordinate on equilibria that optimize a designer-specified social welfare objective while allowing heterogeneous tolerance to deviations from strict best responses. Agents observe only the global state trajectory and their realized rewards, and exchange a single randomized bit of feedback per agent per round. This semantic content/discontent signaling mechanism implicitly aligns decentralized learning dynamics with the global welfare objective. We develop explore-and-commit and online variants applicable to general stochastic games, accommodating heterogeneous model-based or model-free methods for solving the induced Markov decision processes, and establish explicit finite-time regret guarantees, showing logarithmic expected regret under mild conditions.
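
To make concrete why the choice of welfare objective matters when a game has several equilibria, here is a minimal sketch. It is not the paper's algorithm: it compares candidates centrally with full knowledge of every agent's value, which is exactly what the decentralized agents lack. The equilibrium names and per-agent values are hypothetical numbers chosen for illustration, and the sum- and product-based objectives are assumed to take their usual forms (sum and product of agent values).

```python
from math import prod

# Hypothetical per-agent values at three candidate equilibria.
# These numbers are illustrative only and do not come from the paper.
candidates = {
    "eq_a": [4.0, 0.5],  # high total value, very unequal
    "eq_b": [2.0, 2.0],  # slightly lower total value, balanced
    "eq_c": [1.0, 1.0],  # dominated by eq_b for both agents
}


def sum_welfare(values):
    """Sum-based welfare: total value across agents."""
    return sum(values)


def product_welfare(values):
    """Product-based welfare: product of agent values (favors balance)."""
    return prod(values)


def select(candidates, welfare):
    """Return the name of the candidate that maximizes the given welfare."""
    return max(candidates, key=lambda name: welfare(candidates[name]))


best_sum = select(candidates, sum_welfare)
best_prod = select(candidates, product_welfare)
print(best_sum, best_prod)
```

With these numbers the two objectives pick different equilibria: the sum-based objective prefers the unequal `eq_a`, while the product-based objective prefers the balanced `eq_b`. This is the sense in which convergence to some equilibrium does not by itself settle which welfare outcome is reached.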

Paper Structure

This paper contains 17 sections, 80 equations, 2 figures, 2 algorithms.

Figures (2)

  • Figure 1: Evolution of social welfare for content-endorsed joint strategy $\widehat{\pi}_k$ under welfare-maximization.
  • Figure 2: Evolution of social welfare for content-endorsed joint strategy $\widehat{\pi}_k$ under equilibrium selection.