Group-Agent Reinforcement Learning with Heterogeneous Agents
Kaiyue Wu, Xiao-Jun Zeng, Tingting Mu
TL;DR
The paper studies heterogeneous group-agent reinforcement learning (HGARL), in which multiple agents running different algorithms learn in parallel and share knowledge asynchronously to accelerate each agent's individual learning. Agents share policy and value-function parameters together with per-episode rewards. Three action-selection rules are introduced: Probability Addition ($\pi(a_t|s_t)=\sum_m \pi_m(a_t|s_t)$), Probability Multiplication ($\pi(a_t|s_t)=\prod_m \pi_m(a_t|s_t)$), and Reward-Value-Likelihood Combo. The Combo rule, a key contribution, fuses accumulated rewards, value estimates, and action confidence, using a threshold $\phi$ on $-\log \pi$ to filter actions. A periodic model-adoption step additionally replaces an agent's model with a peer's when the peer's policy yields better results. In experiments on 43 Atari 2600 games with A2C, PPO, and ACER, the speed-up $r=T/T_G$ exceeds 1 in $96.12\%$ of tests, and in about $41.09\%$ of cases agents reach higher final rewards within only $5\%$ of the time steps required by solitary learning.
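As a rough illustration of the Probability Addition and Probability Multiplication rules, the sketch below combines per-agent action distributions for a single state. It is not the authors' implementation: the function names are hypothetical, the renormalization step is an added assumption (the formulas in the summary are unnormalized), and the direction of the $\phi$ filter (keep actions with $-\log \pi \le \phi$) is one possible reading of the threshold, which the summary does not specify.

```python
import math


def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def probability_addition(policies):
    """Sum each action's probability over all agents m.

    `policies` is a list of per-agent distributions pi_m(.|s_t) over the
    same action set. Renormalizing is an assumption added here so the
    result is again a distribution.
    """
    n_actions = len(policies[0])
    summed = [sum(p[a] for p in policies) for a in range(n_actions)]
    return normalize(summed)


def probability_multiplication(policies):
    """Multiply each action's probability over all agents m, then renormalize."""
    n_actions = len(policies[0])
    products = [math.prod(p[a] for p in policies) for a in range(n_actions)]
    return normalize(products)


def filter_by_confidence(policy, phi):
    """Return indices of actions with -log pi(a|s) <= phi.

    Hypothetical reading of the threshold: low negative log-likelihood
    is treated as high confidence.
    """
    return [a for a, p in enumerate(policy) if p > 0.0 and -math.log(p) <= phi]


# Three agents, three actions; the probabilities are illustrative only.
policies = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.6, 0.1, 0.3],
]
added = probability_addition(policies)
multiplied = probability_multiplication(policies)
confident = filter_by_confidence([0.6, 0.3, 0.1], phi=1.0)
```

Multiplication sharpens agreement: an action that every agent rates highly dominates the product, whereas addition lets a single confident agent keep an action in play.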
Abstract
Group-agent reinforcement learning (GARL) is an emerging learning scenario in which multiple reinforcement learning agents learn together in a group, sharing knowledge in an asynchronous fashion. The goal is to improve the learning performance of each individual agent. Under a more general heterogeneous setting, where different agents learn using different algorithms, we advance GARL by designing novel and effective group-learning mechanisms. They guide the agents on whether and how to learn from the action choices of the others, and allow the agents to adopt available policy and value function models sent by another agent if they perform better. We have conducted extensive experiments on a total of 43 different Atari 2600 games to demonstrate the superior performance of the proposed method. After group learning, among the 129 agents examined, 96% are able to achieve a learning speed-up, and 72% are able to learn over 100 times faster. Also, around 41% of those agents have achieved a higher accumulated reward score while learning in less than 5% of the time steps required by a single agent learning on its own.
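The speed-up figures can be read as a ratio of time steps. The sketch below assumes that speed-up is the number of time steps an agent needs when learning alone divided by the number it needs under group learning to reach the same score; the exact measurement protocol is not given here. The function names and the sample step counts are hypothetical and are not results from the paper.

```python
def speed_up(solo_steps, group_steps):
    """Ratio of solitary-learning steps to group-learning steps."""
    if solo_steps <= 0 or group_steps <= 0:
        raise ValueError("step counts must be positive")
    return solo_steps / group_steps


def summarize(ratios):
    """Fraction of agents that speed up at all, and by more than 100x."""
    n = len(ratios)
    return {
        "faster": sum(r > 1 for r in ratios) / n,
        "over_100x": sum(r > 100 for r in ratios) / n,
    }


# Illustrative (made-up) step counts for four agents: (solo, group).
runs = [
    (10_000_000, 50_000),
    (10_000_000, 20_000_000),
    (9_000_000, 3_000_000),
    (6_000_000, 40_000),
]
ratios = [speed_up(solo, group) for solo, group in runs]
summary = summarize(ratios)
```

A ratio above 1 means the agent reached the target score in fewer steps with the group than alone; a ratio above 100 corresponds to the "over 100 times faster" category.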
