Multi-agent reinforcement learning using echo-state network and its application to pedestrian dynamics

Hisato Komatsu

TL;DR

The paper tackles efficient simulation of coordinated pedestrian dynamics with multi-agent reinforcement learning (MARL) by combining echo-state networks (ESN) with least squares policy iteration (LSPI). By sharing input and reservoir parameters within groups of agents and updating the output weights once per episode, the approach achieves data-efficient learning in two grid-world pedestrian tasks: route choice on a forked road (Task I) and lane formation in bidirectional flow (Task II). Agents receive a reward of $r_{i,t}=+1$ for moving in their target direction and $r_{i,t}=-1$ for moving in the opposite direction. The ESN+LSPI agents learn to move forward while avoiding others at moderate densities and form lanes in Task II up to $n_{\mathrm{agent}}=48$, but they jam at higher densities. The framework generally has a lower computational cost than representative deep RL methods. The work demonstrates the viability of reservoir computing for scalable MARL of collective behavior and points to future improvements in processing capacity, exploration, and scaling to large-scale pedestrian or evacuation scenarios. Code is available for replication.
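
The reward rule above can be made concrete with a small helper. This is a minimal sketch, not the author's code: the name `step_reward`, the representation of moves and target directions as integer grid vectors, and the choice of $r_{i,t}=0$ for moves with no component along the target direction (staying put or sidestepping) are assumptions.

```python
from typing import Tuple

Vec = Tuple[int, int]


def step_reward(displacement: Vec, target_dir: Vec) -> int:
    """Hypothetical per-step reward r_{i,t} for one agent on the grid.

    +1 if the move has a positive component along the target direction,
    -1 if it has a negative component (moving the opposite way),
     0 otherwise (assumption: staying or sidestepping is unrewarded).
    """
    progress = displacement[0] * target_dir[0] + displacement[1] * target_dir[1]
    if progress > 0:
        return 1
    if progress < 0:
        return -1
    return 0


# A right-moving agent (target direction +x) in a bidirectional corridor:
RIGHT = (1, 0)
print(step_reward((1, 0), RIGHT))   # 1  (moves right)
print(step_reward((-1, 0), RIGHT))  # -1 (moves left)
print(step_reward((0, 1), RIGHT))   # 0  (sidesteps)
```

For Task II, the left-moving group would simply use the target direction `(-1, 0)`.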

Abstract

In recent years, simulations of pedestrians using multi-agent reinforcement learning (MARL) have been studied. This study modeled roads as a grid-world environment and implemented pedestrians as MARL agents using an echo-state network and the least squares policy iteration method. In this environment, we investigated whether these agents can learn to move forward while avoiding other agents. Specifically, we considered two types of tasks: the choice between a narrow direct route and a broad detour, and bidirectional pedestrian flow in a corridor. The simulation results indicated that learning was successful when the density of the agents was not too high.
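
A minimal sketch of how an echo-state network can feed least squares policy iteration follows, assuming a linear action-value function $Q(x,a) = W_{\mathrm{out}}[a] \cdot x$ on the reservoir state $x$. The fixed input and reservoir matrices `W_in` and `W` would be shared by all agents of a group, while the output weights are re-fitted by LSTD-Q from one episode's samples. The class and function names, the leaky-tanh update, the spectral-radius scaling, the ridge term, the discount factor, and the fixed number of policy-iteration sweeps are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


class Reservoir:
    """Hypothetical echo-state network reservoir.

    W_in and W are random and fixed (never trained). Sharing one instance
    among the agents of a group mirrors the parameter sharing described in
    the text; each agent still keeps its own state vector x.
    """

    def __init__(self, n_in, n_res, spectral_radius=0.9, leak=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.uniform(-1.0, 1.0, (n_res, n_in))
        W = rng.uniform(-1.0, 1.0, (n_res, n_res))
        # Rescale so that the largest |eigenvalue| equals spectral_radius.
        W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
        self.W = W
        self.leak = leak

    def step(self, x, u):
        """Leaky-tanh update of one agent's state x given observation u."""
        return (1.0 - self.leak) * x + self.leak * np.tanh(self.W_in @ u + self.W @ x)


def phi(x, a, n_actions):
    """State-action feature: copy x into the block belonging to action a."""
    f = np.zeros(n_actions * x.size)
    f[a * x.size:(a + 1) * x.size] = x
    return f


def lstdq(samples, W_out, n_actions, gamma=0.95, ridge=1e-3):
    """One LSTD-Q solve for the greedy policy of the current W_out.

    samples: list of (x, a, r, x_next, done); W_out: (n_actions, n_res).
    """
    d = n_actions * samples[0][0].size
    A = ridge * np.eye(d)  # small ridge term keeps A well conditioned
    b = np.zeros(d)
    for x, a, r, x_next, done in samples:
        f = phi(x, a, n_actions)
        if done:
            f_next = np.zeros(d)
        else:
            a_next = int(np.argmax(W_out @ x_next))  # greedy next action
            f_next = phi(x_next, a_next, n_actions)
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b).reshape(n_actions, -1)


def lspi(samples, n_actions, n_res, n_iter=10, **kw):
    """Policy iteration: repeatedly evaluate (LSTD-Q) and act greedily.

    A fixed number of sweeps replaces a convergence test (simplification).
    """
    W_out = np.zeros((n_actions, n_res))
    for _ in range(n_iter):
        W_out = lstdq(samples, W_out, n_actions, **kw)
    return W_out


# Usage on one synthetic episode with random observations, actions, rewards.
rng = np.random.default_rng(42)
n_in, n_res, n_actions = 4, 20, 3
res = Reservoir(n_in, n_res)
x = np.zeros(n_res)
samples = []
for t in range(50):
    x_next = res.step(x, rng.normal(size=n_in))
    a = int(rng.integers(n_actions))
    r = float(rng.choice([-1.0, 0.0, 1.0]))
    samples.append((x, a, r, x_next, t == 49))
    x = x_next
W_out = lspi(samples, n_actions, n_res)
print(W_out.shape)  # (3, 20)
```

Because only `W_out` is learned and each LSTD-Q step is a single linear solve, the per-episode update needs no gradient descent, which is one plausible source of the lower computational cost relative to deep RL methods.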

Paper Structure (17 sections, 31 equations, 20 figures, 3 tables, 1 algorithm)

Figures (20)

  • Figure 1: Eyesight of each agent. The $3 \times 3$, $7 \times 7$, and $11 \times 11$ regions centered on the agent itself (painted red) are called $\mathcal{A}_{1}$, $\mathcal{A}_{2}$, and $\mathcal{A}_{3}$, respectively.
  • Figure 2: Forked road considered in Task I. Vacant cells and walls are painted black and green, respectively. White lines are drawn to emphasize the cell borders.
  • Figure 3: Initial placement of agents for (a) $n_{\mathrm{agent}} = 12$, (b) $n_{\mathrm{agent}} = 24$, (c) $n_{\mathrm{agent}} = 32$, and (d) $n_{\mathrm{agent}} = 40$ in Task I. Black and green cells have the same meaning as in Fig. 2; agents are painted red.
  • Figure 4: Initial placement of agents for (a) $n_{\mathrm{agent}} = 16$, (b) $n_{\mathrm{agent}} = 32$, (c) $n_{\mathrm{agent}} = 48$, and (d) $n_{\mathrm{agent}} = 64$ in Task II. Black and green cells have the same meaning as in Fig. 2; right-moving (left-moving) agents are painted red (blue). The two groups are drawn in different colors only so that readers can tell them apart; from the agents' own viewpoint, all agents have the same color, $(1,0)$.
  • Figure 5: Learning curves for Task I with (a) $n_{\mathrm{agent}} = 12$, (b) $n_{\mathrm{agent}} = 24$, (c) $n_{\mathrm{agent}} = 32$, and (d) $n_{\mathrm{agent}} = 40$. The green curve is the mean reward over all agents, and the red and blue curves are the rewards of the best- and worst-performing agents, respectively. Each value is averaged over 8 independent trials, and the standard errors over these trials are shown in pale colors.
  • ...and 15 more figures
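
The nested fields of view $\mathcal{A}_{1}$, $\mathcal{A}_{2}$, and $\mathcal{A}_{3}$ in Figure 1 can be illustrated by cropping windows from a padded grid. This is a sketch under assumptions: the grid is integer-coded, out-of-bounds cells are treated as walls, and the `WALL` constant and `eyesight` helper are hypothetical (the Figure 4 caption indicates that agents actually perceive cell contents as color vectors such as $(1,0)$).

```python
import numpy as np

WALL = -1  # hypothetical code for cells outside the map (treated as walls)


def eyesight(grid, row, col):
    """Return the 3x3, 7x7 and 11x11 windows (A_1, A_2, A_3) centred on
    the agent at (row, col). Out-of-bounds cells are padded with WALL."""
    pad = 5  # half-width of the largest (11x11) window
    padded = np.pad(grid, pad, mode="constant", constant_values=WALL)
    r, c = row + pad, col + pad
    views = []
    for half in (1, 3, 5):  # half-widths giving 3x3, 7x7 and 11x11
        views.append(padded[r - half:r + half + 1, c - half:c + half + 1])
    return views


grid = np.zeros((6, 20), dtype=int)
grid[2, 4] = 1  # another agent, directly to the right of the observer
a1, a2, a3 = eyesight(grid, row=2, col=3)
print(a1.shape, a2.shape, a3.shape)  # (3, 3) (7, 7) (11, 11)
```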
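
The statistics plotted in Figure 5 (mean, best, and worst agent reward, averaged over 8 trials with standard errors) can be computed as sketched below. The sketch assumes that rewards are stored as an array of shape (trials, episodes, agents) and that the best and worst agents are chosen separately in each episode; the `learning_curves` helper is hypothetical, and the input is synthetic random data used only to exercise the code, not results from the paper.

```python
import numpy as np


def learning_curves(rewards):
    """rewards: array (n_trials, n_episodes, n_agents) of episode rewards.

    Returns {name: (mean, stderr)} across trials for the agent-mean,
    best-agent and worst-agent curves. Assumption: best/worst are the
    per-episode max/min over agents.
    """
    per_trial = {
        "mean": rewards.mean(axis=2),
        "best": rewards.max(axis=2),
        "worst": rewards.min(axis=2),
    }
    n_trials = rewards.shape[0]
    out = {}
    for name, curve in per_trial.items():  # curve: (n_trials, n_episodes)
        mean = curve.mean(axis=0)
        stderr = curve.std(axis=0, ddof=1) / np.sqrt(n_trials)
        out[name] = (mean, stderr)
    return out


rng = np.random.default_rng(1)
fake = rng.normal(size=(8, 100, 12))  # synthetic: 8 trials, 100 episodes, 12 agents
curves = learning_curves(fake)
print(curves["mean"][0].shape)  # (100,)
```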