Deep SOR Minimax Q-learning for Two-player Zero-sum Game
Saksham Gautam, Lakshmi Mandal, Shalabh Bhatnagar
TL;DR
This work tackles slow convergence in two-player zero-sum Markov games by introducing a deep SOR Minimax Q-learning (D-SOR-MQL) algorithm that uses neural networks to approximate the Q-function in high-dimensional spaces. It combines Generalized Policy Iteration with a deep minimax Q-learning update that incorporates a successive over-relaxation (SOR) parameter $w \ge 1$, and provides a finite-time convergence analysis under linear function approximation with a proven $O(\epsilon^{-2})$ sample complexity. Empirically, D-SOR-MQL converges faster and attains lower error than baseline minimax Q-learning across competitive multi-agent environments, and ablation studies illustrate the effect of $w$. The findings suggest that integrating SOR into deep minimax Q-learning can speed up learning in multi-agent zero-sum settings; future work aims to extend the finite-time guarantees to deep function approximators.
Abstract
In this work, we consider the problem of a two-player zero-sum game. In the literature, the successive over-relaxation Q-learning algorithm has been developed and implemented, and it has been shown to yield a lower contraction factor for the associated Q-Bellman operator, resulting in a faster value-iteration-based procedure. However, this has been presented only for the tabular case and not for the setting with function approximation, which is typically needed for real-world high-dimensional state-action spaces. Furthermore, such function-approximation settings have not been considered for two-player zero-sum games. We thus propose a deep successive over-relaxation minimax Q-learning algorithm that incorporates deep neural networks as function approximators and is suitable for high-dimensional spaces. We prove the finite-time convergence of the proposed algorithm. Through numerical experiments, we show the effectiveness of the proposed method over the existing Q-learning algorithm. Our ablation studies demonstrate the effect of different values of the crucial successive over-relaxation parameter.
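To make the role of the over-relaxation parameter concrete, the following is a minimal sketch of an SOR-style minimax target for a single transition. It assumes the form used in the tabular SOR Q-learning literature, where the target mixes the usual bootstrapped term with the game value at the current state, weighted by $w$ and $1 - w$; the exact update used by D-SOR-MQL is not stated in this section, so this is an illustration rather than the authors' implementation. The function names, the restriction to 2x2 action sets, and the closed-form game value are all assumptions made for illustration.

```python
def matrix_game_value_2x2(m):
    """Value of a 2x2 zero-sum matrix game for the row (maximizing) player.

    Hypothetical helper: a real implementation with larger action sets
    would solve a linear program instead of using this closed form.
    """
    (a, b), (c, d) = m
    maximin = max(min(a, b), min(c, d))   # best guaranteed payoff for the row player
    minimax = min(max(a, c), max(b, d))   # best guaranteed cap for the column player
    if maximin == minimax:
        # A pure-strategy saddle point exists.
        return float(maximin)
    # No saddle point: closed-form value of the mixed equilibrium.
    return (a * d - b * c) / (a + d - b - c)


def sor_minimax_target(reward, gamma, w, q_next, q_curr):
    """SOR-style target (assumed form):

        w * (reward + gamma * val[Q(s', ., .)]) + (1 - w) * val[Q(s, ., .)]

    q_next and q_curr are 2x2 payoff matrices of Q-values at the next and
    current state. With w = 1 this reduces to the standard minimax target.
    """
    bootstrapped = reward + gamma * matrix_game_value_2x2(q_next)
    return w * bootstrapped + (1.0 - w) * matrix_game_value_2x2(q_curr)


if __name__ == "__main__":
    q_next = [[2.0, 0.0], [0.0, 2.0]]
    q_curr = [[3.0, 1.0], [0.0, 2.0]]
    for w in (1.0, 1.2):
        print(w, sor_minimax_target(reward=1.0, gamma=0.5, w=w,
                                    q_next=q_next, q_curr=q_curr))
```

In a deep variant, a target of this kind would typically serve as the regression label for the Q-network output at the sampled state and joint action, with the matrices read off a target network; that wiring is likewise an assumption here.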
