Group-Agent Reinforcement Learning with Heterogeneous Agents

Kaiyue Wu; Xiao-Jun Zeng; Tingting Mu

TL;DR

The paper studies heterogeneous group-agent reinforcement learning (HGARL), in which multiple agents using different algorithms learn in parallel and share knowledge asynchronously to accelerate each agent's individual learning. Agents share policy and value-function parameters together with per-episode rewards. The paper introduces three action-selection rules: Probability Addition ($\pi(a_t|s_t)=\sum_m \pi_m(a_t|s_t)$), Probability Multiplication ($\pi(a_t|s_t)=\prod_m \pi_m(a_t|s_t)$), and Reward-Value-Likelihood Combo. A key contribution is the Combo rule, which fuses accumulated rewards, value estimates, and action confidence, uses a threshold $\phi$ on $-\log \pi$ to filter actions, and is paired with a periodic model-adoption step that replaces an agent's model with a peer's when the peer's model performs better. Experiments on 43 Atari 2600 games with A2C, PPO, and ACER show that the speed-up ratio $r=T/T_G$ (time steps needed when learning alone, $T$, over time steps needed under group learning, $T_G$) exceeds 1 in $96.12\%$ of tests. In around $41.09\%$ of cases, agents reach higher final rewards within only $5\%$ of the time steps required by solitary learning, indicating that group learning can both accelerate and improve learning in heterogeneous settings.
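
The two simpler rules can be illustrated with a minimal sketch. It assumes each agent $m$ supplies a discrete action distribution $\pi_m(\cdot|s_t)$ over the same action set. The renormalization step, the greedy pick, and the function names (`probability_addition`, `probability_multiplication`, `greedy_action`) are illustrative assumptions and are not taken from the paper. The Combo rule is not sketched because its details are not given here.

```python
from math import prod


def _normalize(scores):
    """Rescale non-negative scores to sum to 1 (assumption: the paper's
    formulas are written without an explicit normalizer)."""
    total = sum(scores)
    if total == 0:
        # Every action was vetoed; fall back to a uniform distribution.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def probability_addition(policies):
    """Combine by summing each action's probability across agents."""
    return _normalize([sum(column) for column in zip(*policies)])


def probability_multiplication(policies):
    """Combine by multiplying each action's probability across agents."""
    return _normalize([prod(column) for column in zip(*policies)])


def greedy_action(distribution):
    """Index of the most probable action (hypothetical selection step)."""
    return max(range(len(distribution)), key=distribution.__getitem__)


# Hypothetical suggestions from three agents over three actions.
policies = [
    [0.9, 0.1, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.4, 0.1],
]
added = probability_addition(policies)
multiplied = probability_multiplication(policies)
print(greedy_action(added), greedy_action(multiplied))
```

In this example the two rules disagree: addition favours the action that one agent is very confident about, while multiplication discards any action that some agent assigns zero probability.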

Abstract

Group-agent reinforcement learning (GARL) is a newly arising learning scenario, where multiple reinforcement learning agents learn together in a group, sharing knowledge in an asynchronous fashion. The goal is to improve the learning performance of each individual agent. Under a more general heterogeneous setting where different agents learn using different algorithms, we advance GARL by designing novel and effective group-learning mechanisms. They guide the agents on whether and how to learn from the action choices of the others, and allow the agents to adopt available policy and value function models sent by another agent if they perform better. We have conducted extensive experiments on a total of 43 different Atari 2600 games to demonstrate the superior performance of the proposed method. After group learning, among the 129 agents examined, 96% are able to achieve a learning speed-up, and 72% are able to learn over 100 times faster. Also, around 41% of those agents have achieved a higher accumulated reward score by learning in less than 5% of the time steps required by a single agent when learning on its own.
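
The reported figures can be cross-checked with a short arithmetic sketch. It assumes the 129 agents are 43 games times 3 algorithms, and that the speed-up is the ratio $r = T/T_G$ from the TL;DR. The agent counts 124 and 53 are not stated in the text; they are inferred here as the integers that reproduce $96.12\%$ and $41.09\%$. The helper names are hypothetical.

```python
def speed_up(solo_steps, group_steps):
    """Speed-up ratio r = T / T_G; r > 1 means group learning was faster."""
    if group_steps <= 0:
        raise ValueError("group_steps must be positive")
    return solo_steps / group_steps


def share_percent(count, total):
    """Percentage of agents in a category, rounded to two decimals."""
    return round(100.0 * count / total, 2)


total_agents = 43 * 3  # 43 games x 3 algorithms (A2C, PPO, ACER)

# Inferred counts (assumption): the integers matching the reported shares.
print(total_agents)                      # 129
print(share_percent(124, total_agents))  # 96.12
print(share_percent(53, total_agents))   # 41.09

# Using only 5% of the solitary time steps corresponds to r = 20.
print(speed_up(1_000_000, 50_000))       # 20.0
```

Under this reading, "less than 5% of the time steps" corresponds to a ratio above 20, and "over 100 times faster" to a ratio above 100.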

Paper Structure (16 sections, 6 equations, 9 figures, 6 tables, 1 algorithm)

Introduction
Related Work and Discussion
Multi-Agent Reinforcement Learning
Ensemble Reinforcement Learning
Proposed HGARL
Agent Simulation
Action Selection and Adoption
Model Adoption
Experiments and Results
Experiment Setting
Results and Analysis
On Learning Speed-up.
On Action Selection Rules.
Conclusion
Selection of hyperparameter $\phi$
...and 1 more section

Figures (9)

Figure 1: Flowchart: Agents of different types share knowledge with each other during their learning processes. The shared knowledge is of two types: the policy and value function model parameters, and the accumulated reward score achieved so far. At each time step, an agent receives a set of suggested actions from itself and all its peers in the learning group, then selects the best action according to one of our action selection rules. After performing the selected action in its environment, the agent collects the trajectory data resulting from this action and trains itself on that data. There is one additional step, model adoption, which happens only when the action selection rule in use is Combo.
Figure 2: Performance comparison for different agents and games in the first 3e6 time steps.
Figure 3: Action Probabilities: The probability that each possible action receives from the agents' decisions. A higher probability means the agents consider the corresponding action better in the current situation. An agent will usually choose the action with the highest probability under the current state.
Figure 4: Atari 2600 Games: Part 1. The Combo rule shows superb performance for all three agents of A2C, ACER and PPO.
Figure 5: Atari 2600 Games: Part 2. The ACER agents are greatly improved under the Combo rule.
...and 4 more figures
