Decentralized multi-agent reinforcement learning algorithm using a cluster-synchronized laser network

Shun Kotoku; Takatomo Mihana; André Röhm; Ryoichi Horisaki

TL;DR

This work tackles decentralized multi-agent reinforcement learning (MARL) for the competitive multi-armed bandit problem using a six-laser, cluster-synchronized photonic network. A decentralized coupling adjustment (DCA) algorithm updates inter-laser couplings based on locally observed rewards, enabling two players to avoid collisions and converge on the optimal two slots without information sharing. Numerical simulations show the system achieves three-cluster synchronization, balances exploration and exploitation, and robustly adapts across diverse reward distributions, with performance modulated by hyperparameters that govern exploitation strength and coupling bounds. The results highlight the potential of photonics-based decision-making for edge devices and outline paths to scalability and time-varying environments.

Abstract

Multi-agent reinforcement learning (MARL) studies crucial principles that are applicable to a variety of fields, including wireless networking and autonomous driving. We propose a photonic-based decision-making algorithm to address one of the most fundamental problems in MARL, called the competitive multi-armed bandit (CMAB) problem. Our numerical simulations demonstrate that chaotic oscillations and cluster synchronization of optically coupled lasers, along with our proposed decentralized coupling adjustment, efficiently balance exploration and exploitation while facilitating cooperative decision-making without explicitly sharing information among agents. Our study demonstrates how decentralized reinforcement learning can be achieved by exploiting complex physical processes controlled by simple algorithms.
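To make the CMAB setting concrete, the following is a minimal Python sketch of one round with two players and three slot machines. It is an illustration only, not the paper's simulation code: the function names `play_round` and `is_correct_decision` are hypothetical, and the collision rule (a slot pays out at most once per round, with the reward going to one randomly chosen player among those who selected it) is an assumption. The "correct decision" test, that the players occupy the top slots without colliding, follows the description in the TL;DR.

```python
import random


def play_round(choices, hit_probs, rng):
    """Return one 0/1 reward per player for a single CMAB round.

    choices[p] is the slot index picked by player p.
    Assumed collision rule: each slot pays out at most once per round;
    if several players picked it, one of them (chosen at random) is paid.
    """
    rewards = [0] * len(choices)
    for slot in set(choices):
        players = [p for p, c in enumerate(choices) if c == slot]
        if rng.random() < hit_probs[slot]:
            rewards[rng.choice(players)] = 1
    return rewards


def is_correct_decision(choices, hit_probs):
    """True when the players occupy the best slots with no collision."""
    n = len(choices)
    ranked = sorted(range(len(hit_probs)), key=lambda s: hit_probs[s], reverse=True)
    top = set(ranked[:n])
    return len(set(choices)) == n and set(choices) <= top


if __name__ == "__main__":
    rng = random.Random(0)
    hit_probs = (0.4, 0.6, 0.6)  # one of the settings named in the figure captions
    print(play_round([1, 2], hit_probs, rng))
    print(is_correct_decision([1, 2], hit_probs))
```

Each player sees only their own entry of the returned reward list, which is the locally observed information a decentralized algorithm has to work with. If several slots tie at the boundary of the top set, this sketch breaks the tie by index, which is also an assumption.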

Paper Structure (10 sections, 4 equations, 8 figures, 4 tables)

Introduction
Method and dynamics investigation
System configuration
Leader probability and cluster synchronization
Decentralized coupling adjustment
Decision-making simulations
Results of a single trial
Impact of reward distributions
Effects of hyperparameters
Conclusion

Figures (8)

Figure 1: Schematic illustration of our proposed system. (a) A six-laser network to address the competitive multi-armed bandit problem with two players and three slot machines. $r_{1\sharp}$ and $r_{2\sharp}$ $(\sharp = \mathrm{bl}, \mathrm{or}, \mathrm{ye})$ represent the attenuation rates adjusted by Players 1 and 2. $\kappa_{\sharp}$ represents the total multiplicative coupling strength. (b) Typical laser intensity waveforms of the six-laser network obtained through numerical simulations. Lasers drawn with identical colors are synchronized.
Figure 2: Numerical simulation results to investigate the leader probabilities of the six-laser network shown in Fig. 1. (a) Short-term cross-correlation (STCC) waveforms calculated for coupling strengths $\kappa_{\mathrm{or}} = \kappa_{\mathrm{ye}} = 45\,\mathrm{ns}^{-1}$ and $\kappa_{\mathrm{bl}} = 38\,\mathrm{ns}^{-1}$. (b) $\kappa_{\mathrm{bl}} = 45\,\mathrm{ns}^{-1}$. (c) The relationship between the coupling strengths and leader probabilities. $\kappa_{\mathrm{or}}$ and $\kappa_{\mathrm{ye}}$ are fixed at $45\,\mathrm{ns}^{-1}$, and $\kappa_{\mathrm{bl}}$ is varied from $5$ to $60\,\mathrm{ns}^{-1}$.
Figure 3: Numerical simulation results of a single trial of decision-making. (a) Short-term cross-correlation (STCC). (b) Slot machines selected by Player 1 (red) and Player 2 (black). (c) The excess hit probabilities of slots $Q_{1, \mathrm{X}}$ and $Q_{2, \mathrm{X}}$ $(\mathrm{X} = \mathrm{A}, \mathrm{B}, \mathrm{C})$. (d) Total coupling strengths $\kappa_{\sharp} = r_{1\sharp}r_{2\sharp}\kappa$ $(\sharp = \mathrm{bl}, \mathrm{or}, \mathrm{ye})$.
Figure 4: Correct decision ratio (CDR) for 2000 trials. Various reward distributions are applied. The horizontal dotted line represents $\mathrm{CDR} = 0.95$. (a) Five different reward distributions with symmetric hit probabilities for the top two slots. (b) Five different reward distributions, including ones with asymmetry.
Figure 5: Correct decision ratio (CDR) for 2000 trials. Seven different values of scaling factor $r_{\mathrm{step}}$ are applied for a setting $(P_{\mathrm{A}}, P_{\mathrm{B}}, P_{\mathrm{C}}) = (0.4, 0.6, 0.6)$. The horizontal dotted line represents $\mathrm{CDR} = 0.95$.
...and 3 more figures
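
The multiplicative coupling in the captions of Figs. 1 and 3(d), $\kappa_{\sharp} = r_{1\sharp}r_{2\sharp}\kappa$, can be illustrated with a short Python sketch. The function name `total_couplings` and the numeric values are hypothetical placeholders, not parameters taken from the paper. The sketch shows that each player controls only their own attenuation factor, while the network experiences the product.

```python
COLORS = ("bl", "or", "ye")


def total_couplings(r1, r2, kappa):
    """Total coupling per color: kappa_# = r1_# * r2_# * kappa.

    r1 and r2 map each color to the attenuation rate set by Player 1 and
    Player 2 respectively; kappa is the base coupling strength (ns^-1).
    """
    return {c: r1[c] * r2[c] * kappa for c in COLORS}


if __name__ == "__main__":
    # Placeholder values chosen for illustration only.
    r1 = {"bl": 0.5, "or": 1.0, "ye": 1.0}
    r2 = {"bl": 0.5, "or": 1.0, "ye": 1.0}
    print(total_couplings(r1, r2, kappa=40.0))
```

Because the factors multiply, either player can lower a given coupling by reducing their own rate without knowing the other player's setting.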