Multiagent Cooperation and Competition with Deep Reinforcement Learning

Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, Raul Vicente

TL;DR

This work extends Deep Q-Learning to a decentralized two-agent Pong setting to study how competition and collaboration emerge under different reward structures. By training independent DQNs on raw game frames, the authors show that fully competitive rewards foster scoring prowess, while fully cooperative rewards encourage keeping the ball in play and coordinated ball-passing strategies; intermediate rewards reveal a smooth transition between these modes. The study demonstrates the feasibility of using deep, model-free, multiagent reinforcement learning to analyze emergent social behaviors in complex environments, and discusses limitations like Q-value overestimation and future work with more agents and diverse games. The results have implications for understanding decentralized learning, emergent coordination, and potential applications in distributed control and communication without predefined protocols.

Abstract

Multiagent systems appear in most social, economic, and political situations. In the present work we extend the Deep Q-Learning Network architecture proposed by Google DeepMind to multiagent environments and investigate how two agents controlled by independent Deep Q-Networks interact in the classic videogame Pong. By manipulating the classical reward scheme of Pong we demonstrate how competitive and collaborative behaviors emerge. Competitive agents learn to play and score efficiently. Agents trained under collaborative reward schemes find an optimal strategy to keep the ball in the game as long as possible. We also describe the progression from competitive to collaborative behavior. The present work demonstrates that Deep Q-Networks can become a practical tool for studying the decentralized learning of multiagent systems living in highly complex environments.
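
To make the idea of "manipulating the reward scheme" concrete, the sketch below shows one way a single parameter could interpolate between competitive and collaborative play. The text above does not give the exact reward values, so the parameter `rho`, the function name `pong_rewards`, and the convention that the conceding player always receives -1 are assumptions made for illustration only.

```python
def pong_rewards(scorer: str, rho: float) -> dict[str, float]:
    """Per-agent rewards for one point, under an assumed parametrization.

    scorer: "left" or "right", the agent that put the ball past its opponent.
    rho:    assumed reward for the scoring agent, in [-1, 1].
            rho =  1 -> fully competitive (scorer +1, conceding agent -1)
            rho = -1 -> fully cooperative (both agents penalized when the ball is lost)
    """
    if scorer not in ("left", "right"):
        raise ValueError("scorer must be 'left' or 'right'")
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    conceder = "right" if scorer == "left" else "left"
    # Assumption: the agent that misses the ball is always penalized with -1.
    return {scorer: rho, conceder: -1.0}


if __name__ == "__main__":
    for rho in (1.0, 0.0, -1.0):
        print(rho, pong_rewards("left", rho))
```

Under these assumptions, each agent would feed only its own entry of the returned dictionary into its own Q-learning update, which keeps the learning decentralized. Intermediate values of `rho` correspond to the in-between schemes mentioned in the TL;DR.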

Paper Structure

This paper contains 21 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: The Pong game. Each agent corresponds to one of the paddles.
  • Figure 2: Evolution of the behaviour of the competitive agents during training. (a) The number of paddle-bounces increases, indicating that the players get better at catching the ball. (b) The frequency of the ball hitting the upper and lower walls decreases slowly with training. The first 10 epochs are omitted from the plot because the agents made very few paddle-bounces and the metric was very noisy. (c) Serving time decreases abruptly in the early stages of training: the agents learn to put the ball back into play. Serving time is measured in frames.
  • Figure 3: A competitive game: game situations and the Q-values predicted by the agents. A) The left player predicts that the right player will not reach the ball as it is rapidly moving upwards. B) A change in the direction of the ball causes the left player's reward expectation to drop. C) Players understand that the ball will inevitably go out of play. See the section on videos for illustrations of other game situations and the corresponding agents' Q-values.
  • Figure 4: Evolution of the behaviour of the collaborative agents during training. (a) The number of paddle-bounces increases as the players get better at reaching the ball. (b) The frequency of the ball hitting the upper and lower walls decreases significantly with training. The first 10 epochs are omitted from the plot because the agents made very few paddle-bounces and the metric was very noisy. (c) Serving time increases: the agents learn to postpone putting the ball into play. Serving time is measured in frames.
  • Figure 5: Cooperative game. A) The ball is moving slowly and the future reward expectation is not very low: the agents do not expect to miss the slow balls. B) The ball is moving faster and the reward expectation is much more negative: the agents expect to miss the ball in the near future. C) The ball is inevitably going out of play. Both agents' reward expectations drop accordingly. See the section on videos for illustrations of other game situations and the corresponding agents' Q-values.
  • ...and 3 more figures