Generalized Individual Q-learning for Polymatrix Games with Partial Observations

Ahmed Said Donmez, Muhammed O. Sayin

TL;DR

This paper presents the generalized individual Q-learning dynamics that combine belief-based and payoff-based learning for the networked interconnections of more than two self-interested agents, and demonstrates a faster convergence to quantal response equilibrium in multi-agent zero-sum and potential polymatrix games.

Abstract

This paper addresses the challenge of limited observations in non-cooperative multi-agent systems where agents can have partial access to other agents' actions. We present the generalized individual Q-learning dynamics that combine belief-based and payoff-based learning for the networked interconnections of more than two self-interested agents. This approach leverages access to opponents' actions whenever possible, demonstrably achieving a faster (guaranteed) convergence to quantal response equilibrium in multi-agent zero-sum and potential polymatrix games. Notably, the dynamics reduce to the well-studied smoothed fictitious play and individual Q-learning under full and no access to opponent actions, respectively. We further quantify the improvement in convergence rate due to observing opponents' actions through numerical simulations.
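The two limiting cases named in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names, the temperature `tau`, the step sizes, and the single-opponent payoff matrix are all assumptions made for illustration. The belief-based branch follows the general shape of smoothed fictitious play (track an opponent's empirical action frequencies, then respond via a logit map), and the payoff-based branch follows the general shape of individual Q-learning (update only the entry of the action actually played, using the realized payoff).

```python
import math

def logit_response(q, tau):
    """Smoothed (logit) best response: softmax of q / tau."""
    m = max(q)  # subtract the max for numerical stability
    weights = [math.exp((x - m) / tau) for x in q]
    total = sum(weights)
    return [w / total for w in weights]

def update_belief(belief, observed_action, step):
    """Belief-based learning: move the belief toward the observed opponent action."""
    return [(1 - step) * p + step * (1.0 if b == observed_action else 0.0)
            for b, p in enumerate(belief)]

def belief_based_q(payoff, belief):
    """Expected payoff of each own action against the belief (payoff[a][b] is hypothetical)."""
    return [sum(row[b] * belief[b] for b in range(len(belief))) for row in payoff]

def payoff_based_update(q, pi, action, reward, step):
    """Payoff-based learning: only the played entry moves, and the step is
    normalized by the probability with which that action was played."""
    q = list(q)
    q[action] += (step / pi[action]) * (reward - q[action])
    return q

# Full access to the opponent's action: respond to beliefs.
belief = update_belief([0.5, 0.5], observed_action=0, step=0.5)
pi_belief = logit_response(belief_based_q([[1, 0], [0, 1]], belief), tau=1.0)

# No access to the opponent's action: respond to payoff-based estimates.
q = payoff_based_update([0.0, 0.0], [0.5, 0.5], action=1, reward=1.0, step=0.1)
pi_payoff = logit_response(q, tau=1.0)
```

In both branches the strategy is a logit response to a local estimate; the branches differ only in how that estimate is formed, which is the distinction the abstract draws between full and no access to opponent actions.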

Paper Structure

This paper contains 8 sections, 46 equations, 4 figures, 1 algorithm.

Figures (4)

  • Figure 1: An illustration of observability and interconnectedness over two layers. Nodes connected by dashed lines across layers represent an agent. In Layer 1, observability is depicted, where directed edges indicate that agents can observe the actions of others, forming the observability graph. In Layer 2, agents interact and can affect each other's payoffs, as represented by the directed edges, forming the interaction graph.
  • Figure 2: Convergence of the QRE-gap (left), and $q_{\mathrm{diff}}$ (right) for different values of edge-connection probability $p$, in the potential polymatrix games. The solid curves represent the mean over $50$ independent trials and the shaded areas are $\pm 0.5$ standard deviations from the mean values.
  • Figure 3: Convergence of the QRE-gap (left), and $q_{\mathrm{diff}}$ (right) for different values of edge-connection probability $p$, in the zero-sum polymatrix games. The solid curves represent the mean over $50$ independent trials and the shaded areas are $\pm 0.5$ standard deviations from the mean values.
  • Figure 4: An illustration of the strategies $\pi^i_k$ over the probability simplex $\Delta^i\subset\mathbb{R}^3$ for each agent $i$ at the end of $10^6$ iterations and over $50$ independent trials for $p=1$. Green and red dots correspond to the potential and zero-sum games, respectively. The shaded areas represent the standard deviations from the mean values.
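
The two-layer structure in Figure 1 and the edge-connection probability $p$ in Figures 2 and 3 can be sketched as plain adjacency sets. This is an illustrative assumption, not the paper's setup: the captions do not say which layer is randomized or how edges are oriented, so here `observability[i]` is taken to be the set of agents whose actions agent `i` can observe, `interaction[i]` the set of agents that affect agent `i`'s payoff, and each directed edge is included independently with probability `p`. All names are hypothetical.

```python
import random

def random_directed_graph(n, p, rng):
    """Adjacency sets over n agents; each directed edge is included with probability p."""
    return {i: {j for j in range(n) if j != i and rng.random() < p}
            for i in range(n)}

def split_neighbors(i, interaction, observability):
    """Partition the agents affecting i's payoff into observed and unobserved ones."""
    observed = interaction[i] & observability[i]
    unobserved = interaction[i] - observability[i]
    return observed, unobserved

rng = random.Random(0)
n = 4
interaction = random_directed_graph(n, 1.0, rng)    # every agent affects every other
observability = random_directed_graph(n, 0.5, rng)  # partial observations
observed, unobserved = split_neighbors(0, interaction, observability)
```

Under these assumptions, `p = 1` makes every payoff-relevant neighbor observed and `p = 0` makes none observed, matching the full-access and no-access extremes described in the abstract.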
