Deep SOR Minimax Q-learning for Two-player Zero-sum Game

Saksham Gautam, Lakshmi Mandal, Shalabh Bhatnagar

TL;DR

This work tackles slow convergence in two-player zero-sum Markov games by introducing a deep SOR Minimax Q-learning (D-SOR-MQL) algorithm that uses neural networks to approximate the Q-function in high-dimensional spaces. It combines Generalized Policy Iteration with a deep minimax Q-learning update that incorporates a successive over-relaxation parameter $w \ge 1$, and provides a finite-time convergence analysis under linear function approximation with a proven $O(\epsilon^{-2})$ sample complexity. Empirically, D-SOR-MQL converges faster and reaches lower error than the baseline minimax Q-learning across competitive multi-agent environments, with ablation studies illustrating the effect of $w$. The findings suggest that integrating SOR into deep minimax Q-learning can significantly speed up learning in multi-agent zero-sum settings, with future work aimed at extending the finite-time guarantees to deep function approximators.

Abstract

In this work, we consider the problem of a two-player zero-sum game. In the literature, the successive over-relaxation Q-learning algorithm has been developed and implemented, and it is seen to result in a lower contraction factor for the associated Q-Bellman operator resulting in a faster value iteration-based procedure. However, this has been presented only for the tabular case and not for the setting with function approximation that typically caters to real-world high-dimensional state-action spaces. Furthermore, such settings in the case of two-player zero-sum games have not been considered. We thus propose a deep successive over-relaxation minimax Q-learning algorithm that incorporates deep neural networks as function approximators and is suitable for high-dimensional spaces. We prove the finite-time convergence of the proposed algorithm. Through numerical experiments, we show the effectiveness of the proposed method over the existing Q-learning algorithm. Our ablation studies demonstrate the effect of different values of the crucial successive over-relaxation parameter.
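To make the role of the over-relaxation parameter concrete, here is a minimal tabular sketch of an SOR-style minimax Q-learning step. It is not the authors' deep algorithm. It assumes the target takes the form used in the earlier tabular SOR Q-learning literature, $w\,(r + \gamma\,\mathrm{val}[Q(s',\cdot,\cdot)]) + (1-w)\,\mathrm{val}[Q(s,\cdot,\cdot)]$, and it replaces the mixed-strategy game value (which generally requires a linear program) with the simpler pure-strategy max-min value. All function and variable names are hypothetical.

```python
def pure_maxmin_value(q_matrix):
    """Pure-strategy max-min value of a matrix game.

    Rows are the maximizer's actions, columns the minimizer's.
    ASSUMPTION: this stands in for the mixed-strategy minimax value,
    which in general needs a linear program to compute.
    """
    return max(min(row) for row in q_matrix)


def sor_minimax_target(reward, gamma, w, q_next, q_curr):
    """SOR-style target (assumed form, taken from tabular SOR Q-learning).

    w = 1 recovers the standard minimax Q-learning target.
    """
    bootstrap = reward + gamma * pure_maxmin_value(q_next)
    return w * bootstrap + (1.0 - w) * pure_maxmin_value(q_curr)


def sor_minimax_update(q, s, a, b, reward, s_next, alpha, gamma, w):
    """One in-place tabular update of q[s][a][b] toward the SOR target."""
    target = sor_minimax_target(reward, gamma, w, q[s_next], q[s])
    q[s][a][b] += alpha * (target - q[s][a][b])
    return q[s][a][b]


# Tiny two-state example with 2x2 action sets (made-up numbers).
q_table = {
    0: [[0.5, 1.0], [2.0, 1.5]],
    1: [[1.0, 2.0], [3.0, 0.0]],
}
standard = sor_minimax_target(1.0, 0.9, 1.0, q_table[1], q_table[0])
relaxed = sor_minimax_target(1.0, 0.9, 1.2, q_table[1], q_table[0])
updated = sor_minimax_update(q_table, 0, 0, 0, 1.0, 1, 0.5, 0.9, 1.2)
```

With $w = 1$ the second term vanishes and the target is the usual one-step bootstrap; with $w > 1$ the target extrapolates beyond it and subtracts a multiple of the current state's game value. In a deep variant, the table lookups would be replaced by network evaluations.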

Paper Structure

This paper contains 13 sections, 4 theorems, 46 equations, 9 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

Assume that Assumptions 1, 2 and 3 are satisfied. Additionally, suppose that there exists a constant $Z$ such that the parameter norm is bounded by $\|\theta^*\|\leq Z$ and the iterates $\theta_t$ satisfy $\|\theta_t\|\leq Z$ almost surely for all $t$. Let the step size be $\alpha$ [...] with probability at least $1-\delta$.

Figures (9)

  • Figure 1: Loss on the Guard-Invader environment with (a) $49$ and (b) $121$ states.
  • Figure 2: Loss on the Soccer environment with (a) $49$ and (b) $121$ states.
  • Figure 3: Minimax Q-value on the Guard-Invader environment with (a) $49$ and (b) $121$ states.
  • Figure 4: Minimax Q-value on the Soccer environment with (a) $49$ and (b) $121$ states.
  • Figure 5: Network parameters $\theta_0$, $\theta_{\text{eval}}$, and $\theta_{\text{target}}$. The current timestep is denoted by $t$; the target network parameters ($\theta_{\text{target}}$) are updated every $T$ iterations, and the evaluation network parameters ($\theta_{\text{eval}}$) are updated every $nT$ iterations, where $n$ is the number of inner loops used for evaluation.
  • ...and 4 more figures
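The update schedule described in the Figure 5 caption can be sketched as follows. This is only an illustration of the timing. It assumes iterations are counted from 1 and that "updated" means a periodic synchronization; the caption does not say what each update copies or computes. The function name and arguments are hypothetical.

```python
def sync_schedule(total_steps, T, n):
    """Return the timesteps at which each parameter set would be updated.

    ASSUMPTION: steps are counted from 1; theta_target is updated every
    T steps and theta_eval every n*T steps, as stated in the caption.
    """
    target_updates, eval_updates = [], []
    for t in range(1, total_steps + 1):
        if t % T == 0:
            target_updates.append(t)   # theta_target update
        if t % (n * T) == 0:
            eval_updates.append(t)     # theta_eval update
    return target_updates, eval_updates


target_updates, eval_updates = sync_schedule(total_steps=12, T=2, n=3)
```

Under these assumptions, every evaluation-network update coincides with a target-network update, and $n$ target updates occur between consecutive evaluation updates.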

Theorems & Definitions (6)

  • Theorem 1
  • proof
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Remark 1