LOQA: Learning with Opponent Q-Learning Awareness

Milad Aghajohari, Juan Agustin Duque, Tim Cooijmans, Aaron Courville

TL;DR

LOQA tackles reciprocal cooperation in general-sum MARL with a scalable, decentralized approach. By modeling the opponent's policy as proportional to its action-value and differentiating through the opponent's Q-function, LOQA shapes learning without expensive optimization graphs or second-order gradients. Empirically, LOQA achieves state-of-the-art results on the Iterated Prisoner's Dilemma and the Coin Game, while substantially reducing training time compared to prior methods. The method demonstrates robust, reciprocity-based cooperation and scalability to larger grid environments, suggesting broad applicability to real-world multi-agent systems.

Abstract

In various real-world scenarios, interactions among agents often resemble the dynamics of general-sum games, where each agent strives to optimize its own utility. Despite the ubiquitous relevance of such settings, decentralized machine learning algorithms have struggled to find equilibria that maximize individual utility while preserving social welfare. In this paper we introduce Learning with Opponent Q-Learning Awareness (LOQA), a novel, decentralized reinforcement learning algorithm tailored to optimizing an agent's individual utility while fostering cooperation among adversaries in partially competitive environments. LOQA assumes the opponent samples actions proportionally to their action-value function Q. Experimental results demonstrate the effectiveness of LOQA at achieving state-of-the-art performance in benchmark scenarios such as the Iterated Prisoner's Dilemma and the Coin Game. LOQA achieves these outcomes with a significantly reduced computational footprint, making it a promising approach for practical multi-agent applications.
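
The opponent model described in the abstract can be illustrated with a small sketch. It assumes that "proportionally to their action-value function" is realized as a softmax over Q-values, which this summary does not spell out; the function name `opponent_policy_from_q`, the `temperature` parameter, and the example Q-values are hypothetical and chosen only for illustration.

```python
import math


def opponent_policy_from_q(q_values, temperature=1.0):
    """Return action probabilities with pi(a) proportional to exp(Q(a) / temperature).

    Assumption: a softmax is used to turn action-values into a policy.
    """
    # Subtract the maximum before exponentiating for numerical stability;
    # this shift cancels out in the normalization.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical opponent Q-values for the two IPD actions: [cooperate, defect].
probs = opponent_policy_from_q([1.0, 0.0])
uniform = opponent_policy_from_q([0.5, 0.5])
```

Because the modeled policy is a smooth function of the opponent's Q-values, anything that changes those values (for example, the agent's own behavior) shifts the opponent's action probabilities, which is the dependency the TL;DR describes as differentiating through the opponent's Q-function.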

Paper Structure (30 sections, 13 equations, 6 figures, 4 tables, 3 algorithms)

Figures (6)

  • Figure 1: Probability of cooperation of a sigmoid LOQA agent at each possible state in the one-step history IPD after 7000 training iterations. LOQA agents' resulting policy is similar to tit-for-tat, a policy that cooperates at the first step and copies the previous action of the opponent at subsequent time-steps.
  • Figure 2: Average rewards after evaluating 10 fully trained LOQA and POLA seeds against different agents in a 3x3 sized Coin Game lasting 50 episodes. AC stands for Always Cooperate and AD for Always Defect. Notice that a fully cooperative agent achieves an average reward of 0.35 against itself. LOQA is able to generate a policy that demonstrates reciprocity-based cooperation.
  • Figure 3: Wall clock time vs. grid size for three seeds of LOQA and POLA on reaching different thresholds. Each data point indicates the first time its corresponding seed passed a certain threshold. The wall clock time is measured in seconds. Red triangles indicate LOQA's performance while blue circles indicate POLA's performance. Dashed lines pass through the average time of the runs that passed the threshold for each algorithm.
  • Figure 4: Training curves for 3 seeds of POLA and LOQA on the two evaluation metrics for a 7x7 grid size: Normalized return vs. themselves (Self) and vs. always defect (AD). The wall clock time is measured in seconds. Note that the range of the x-axis is different for POLA and LOQA as POLA takes longer to pass the considered thresholds.
  • Figure 5: Average rewards after evaluating 10 fully trained LOQA and POLA seeds against different agents in a 3x3 sized Coin Game. We abbreviate L: LOQA, L-R: LOQA without replay buffer, L-S: LOQA without self-play, P: POLA, AC: Always Cooperate, AD: Always Defect, and R: Random.
  • ...and 1 more figure