The Cell Must Go On: Agar.io for Continual Reinforcement Learning

Mohamed A. Mohamed, Kateryna Nekhomiazh, Vedant Vyas, Marcos M. Jose, Andrew Patterson, Marlos C. Machado

TL;DR

AgarCL introduces Agar.io as a continual reinforcement learning benchmark, featuring non-episodic, high-dimensional, partially observable dynamics with continuous and discrete actions. The paper precisely defines reward, observation, and action spaces, and provides a scaffolded set of mini-games to dissect continual RL challenges. Benchmarking DQN, PPO, and SAC reveals pronounced difficulties in learning in continual, non-stationary settings and highlights critical issues in hyperparameter sensitivity and evaluation methodology. Overall, AgarCL serves as a challenging, open-source platform to drive method development and deeper understanding of continual RL, while underscoring substantial compute and hyperparameter tuning demands.

Abstract

Continual reinforcement learning (RL) concerns agents that are expected to learn continually, rather than converge to a policy that is then fixed for evaluation. Such an approach is well suited to environments the agent perceives as changing, which renders any static policy ineffective over time. The few simulators explicitly designed for empirical research in continual RL are often limited in scope or complexity, and it is now common for researchers to modify episodic RL environments by artificially incorporating abrupt task changes during interaction. In this paper, we introduce AgarCL, a research platform for continual RL that allows for a progression of increasingly sophisticated behaviour. AgarCL is based on the game Agar.io, a non-episodic, high-dimensional problem featuring stochastic, ever-evolving dynamics, continuous actions, and partial observability. Additionally, we provide benchmark results reporting the performance of DQN, PPO, and SAC both in the primary, challenging continual RL problem and across a suite of smaller tasks within AgarCL, each of which isolates aspects of the full environment and allows us to characterize the challenges posed by different aspects of the game.

Paper Structure

This paper contains 41 sections, 5 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Agent-environment interface and main entities in AgarCL. The agent has access to one of two observation types: pixel-based or symbolic. The pixel-based observation includes four channels that represent different game entities: pellets, bots, viruses, and the agent itself. Alternatively, the agent can receive a symbolic observation, consisting of pre-processed features such as the distances to nearby enemies, pellets, and other entities. The reward function is defined as the change in the agent’s mass between two consecutive time steps. The action space in AgarCL is hybrid: at each time step, the agent selects an $\langle x, y \rangle$ coordinate mimicking where a human player would point their mouse. Additionally, the agent decides whether to split, eject mass, or just move. The AgarCL Learning Environment section of the paper provides detailed descriptions of these entities and the game dynamics, including the available actions.
  • Figure 2: Environment dynamics and actions in AgarCL, with the agent shown in black. ① The agent can eat smaller cells to gain mass. ② The Split action divides each of the agent’s cells in half and propels them in a chosen $\langle x, y \rangle$ direction, allowing slower agents to catch faster ones. Each cell moves at a speed inversely related to its mass. ③ Cells can later merge if brought close together. Viruses add complexity: depending on mass, the agent can either ④ be split by a virus or ⑤ consume it. ⑥ The agent can also eject mass in a chosen direction. This mass can be eaten by any agent. Ejecting enough mass into a virus spawns a new one and propels the original, enabling smaller cells to attack larger ones by pushing viruses into them. While players in the online game use these mechanics to communicate, this is outside the scope of our work; bot behaviour is scripted.
  • Figure 3: First 5,000 time steps of an expert trajectory showing game progression. The agent, shown in pink, steadily gains mass: ① starting small, it eventually passes over a virus; ② splits into smaller cells; ③ grows large enough to consume viruses and other agents; ④ splits to attack, increasing its speed. As it grows, the view zooms out to show its full body. ⑤ The agent ultimately surpasses all opponents in mass and ⑥ can see the entire arena. Mass decays over time, and the view would zoom in again if mass dropped. This run was recorded while one of the authors played.
  • Figure 4: Performance of RL baselines on episodic pellet-collection mini-games. Panels ⓐ, ⓑ, and ⓒ show the performance on the square-path tasks (mini-games $1$, $2$, and $3$), while Panels ⓓ, ⓔ, and ⓕ show the performance on randomly regenerated tasks (mini-games $4$, $5$, and $6$). Note that the y-axis scales differ across panels. The dashed line marks the final average human-expert performance, and the green line marks the random policy. The shaded region shows the 95% confidence interval over 10 runs, computed using the t-distribution.
  • Figure 5: Performance of RL baselines on continual pellet-collection mini-games. Panels ⓐ, ⓑ, and ⓒ show the performance on the square-path tasks (mini-games $1$, $2$, and $3$), while Panels ⓓ, ⓔ, and ⓕ show the performance on randomly regenerated tasks (mini-games $4$, $5$, and $6$). Note that the y-axis scales differ across panels. The dashed line marks the final average human-expert performance, and the green line marks the random policy. The shaded region shows the 95% confidence interval over 10 runs, computed using the t-distribution.
  • ...and 5 more figures
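
The Figure 1 caption describes a reward equal to the change in the agent's mass between consecutive time steps, and a hybrid action made of an $\langle x, y \rangle$ target plus a choice among moving, splitting, and ejecting mass. The sketch below illustrates that interface in plain Python. It is not the AgarCL API: the names `DiscreteChoice`, `HybridAction`, `make_action`, and `mass_reward`, the clamping of coordinates to [-1, 1], and the example masses are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class DiscreteChoice(Enum):
    """Hypothetical labels for the discrete part of the hybrid action."""
    MOVE = 0
    SPLIT = 1
    EJECT = 2


@dataclass(frozen=True)
class HybridAction:
    """A continuous <x, y> target plus one discrete choice."""
    x: float
    y: float
    choice: DiscreteChoice


def make_action(x: float, y: float, choice: DiscreteChoice) -> HybridAction:
    # Assumption: coordinates are normalised to [-1, 1]; the real range may differ.
    def clamp(v: float) -> float:
        return max(-1.0, min(1.0, v))

    return HybridAction(clamp(x), clamp(y), choice)


def mass_reward(prev_mass: float, curr_mass: float) -> float:
    """Reward as the change in mass between two consecutive time steps."""
    return curr_mass - prev_mass


# Made-up mass trajectory: the agent grows twice, then loses half its mass.
masses = [10.0, 11.0, 13.5, 6.75]
rewards = [mass_reward(a, b) for a, b in zip(masses, masses[1:])]
action = make_action(0.3, -1.7, DiscreteChoice.SPLIT)
```

Because each reward is a difference of consecutive masses, the undiscounted sum of rewards telescopes to the final mass minus the initial mass, so losing mass is penalised as directly as gaining it is rewarded.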
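
The captions of Figures 4 and 5 report shaded regions showing a 95% confidence interval over 10 runs, computed using the t-distribution. A minimal sketch of that calculation follows, using only the standard library. The function name `t_confidence_interval` and the run scores are hypothetical. The critical value 2.262 is the standard two-sided 95% value for 9 degrees of freedom, which corresponds to 10 runs; other run counts would need a different value.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for df = 9 (i.e. 10 runs).
T_CRIT_95_DF9 = 2.262


def t_confidence_interval(values, t_crit):
    """Return (low, high) bounds of a t-based confidence interval for the mean."""
    n = len(values)
    mean = statistics.fmean(values)
    # Standard error of the mean, using the sample standard deviation (n - 1).
    sem = statistics.stdev(values) / math.sqrt(n)
    half_width = t_crit * sem
    return mean - half_width, mean + half_width


# Hypothetical final scores from 10 independent runs (not results from the paper).
run_scores = [float(i) for i in range(1, 11)]
low, high = t_confidence_interval(run_scores, T_CRIT_95_DF9)
```

In a learning curve, the same computation would be applied independently at each time step across the 10 runs to obtain the shaded band.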