Biologically-Plausible Topology Improved Spiking Actor Network for Efficient Deep Reinforcement Learning

Duzhen Zhang; Qingyu Wang; Tielin Zhang; Bo Xu

TL;DR

This work addresses the limited expressivity of conventional artificial actor networks in deep reinforcement learning by introducing the Biologically-Plausible Topology improved Spiking Actor Network (BPT-SAN), which combines spiking neurons that have rich spatial-temporal dynamics with biologically-plausible connectivity. The method encodes continuous states into spikes via population, Poisson, or deterministic coding and processes them through inter-layer connections that model nonlinear dendritic branches and through intra-layer lateral connections. The network is trained through hybrid learning with the TD3 and SAC algorithms, using pseudo backpropagation. On four continuous control tasks from OpenAI Gym MuJoCo (Hopper, Walker2d, Half-Cheetah, and Ant), BPT-SAN outperforms an artificial actor network and a regular spiking actor network, and ablations show that both the nonlinear dendrites and the lateral intra-layer interactions contribute to performance. The study shows how brain-inspired topologies can improve DRL performance and suggests avenues for energy-efficient and robust decision-making in real-world robotics.
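
As a rough illustration of the Poisson coding option mentioned above, the sketch below turns each state value into a spike train by drawing one Bernoulli sample per timestep. It assumes the state has already been normalized to [0, 1] and that each value is used directly as a per-step firing probability; the helper name `poisson_encode` and its parameters are hypothetical and are not taken from the paper.

```python
import random

def poisson_encode(state, timesteps, rng=None):
    """Encode values in [0, 1] as Bernoulli spike trains (hypothetical helper)."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is repeatable
    trains = []
    for value in state:
        p = min(max(value, 0.0), 1.0)  # clip to a valid firing probability
        # One independent draw per timestep: spike with probability p.
        trains.append([1 if rng.random() < p else 0 for _ in range(timesteps)])
    return trains

spikes = poisson_encode([0.0, 0.5, 1.0], timesteps=8)
```

A value of 0 never spikes and a value of 1 spikes at every timestep; intermediate values give stochastic trains whose mean rate approaches the value as the number of timesteps grows.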

Abstract

The success of Deep Reinforcement Learning (DRL) is largely attributed to the use of Artificial Neural Networks (ANNs) as function approximators. Recent advances in neuroscience have revealed that the human brain achieves efficient reward-based learning, at least in part, by integrating spiking neurons that have spatial-temporal dynamics with network topologies that have biologically-plausible connectivity patterns. This integration allows spiking neurons to combine information efficiently across layers via nonlinear dendritic trees and within layers via lateral interactions. The fusion of these two topologies enhances the network's information-processing ability, which is crucial for grasping intricate perceptions and guiding decision-making. However, ANNs and brain networks differ significantly. ANNs lack intricate dynamical neurons and feature only inter-layer connections, typically implemented as a direct linear summation, with no intra-layer connections. This limitation constrains network expressivity. To address this, we propose a novel alternative function approximator, the Biologically-Plausible Topology improved Spiking Actor Network (BPT-SAN), tailored for efficient decision-making in DRL. The BPT-SAN incorporates spiking neurons with intricate spatial-temporal dynamics and introduces intra-layer connections, enhancing spatial-temporal state representation and facilitating more precise biological simulations. Diverging from the conventional direct linear weighted sum, the BPT-SAN models the local nonlinearities of dendritic trees within the inter-layer connections. For the intra-layer connections, the BPT-SAN introduces lateral interactions between adjacent neurons, integrating them into the membrane potential formula to ensure accurate spike firing.
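
To make the idea of integrating lateral interactions into the membrane potential concrete, here is a minimal sketch of one timestep for a layer of leaky integrate-and-fire neurons. It is not the paper's formulation: the decay constant, threshold, hard reset, and the choice to let every other neuron in the layer act as a neighbor are all assumptions made for illustration, and `lif_step_with_lateral` is a hypothetical name.

```python
def lif_step_with_lateral(v, prev_spikes, inputs, lateral_w, decay=0.5, threshold=1.0):
    """One timestep for a layer of LIF neurons with lateral input (illustrative only)."""
    n = len(v)
    new_v, spikes = [], []
    for i in range(n):
        # Lateral term: weighted previous-step spikes of the other neurons in the layer.
        lateral = sum(lateral_w[i][j] * prev_spikes[j] for j in range(n) if j != i)
        # Leak, hard reset after a spike, then add feedforward and lateral currents.
        u = decay * v[i] * (1 - prev_spikes[i]) + inputs[i] + lateral
        new_v.append(u)
        spikes.append(1 if u >= threshold else 0)
    return new_v, spikes

w = [[0.0, 0.6], [0.6, 0.0]]  # assumed symmetric lateral weights
v, s = lif_step_with_lateral([0.0, 0.0], [0, 0], [1.2, 0.5], w)
v, s2 = lif_step_with_lateral(v, s, [0.0, 0.5], w)

# Same second step with lateral weights set to zero, for comparison.
no_w = [[0.0, 0.0], [0.0, 0.0]]
_, s2_no_lateral = lif_step_with_lateral([1.2, 0.5], [1, 0], [0.0, 0.5], no_w)
```

In this toy run neuron 0 fires on the first step, and its spike supplies enough lateral current for neuron 1 to cross the threshold on the second step. With the lateral weights zeroed, neuron 1 stays silent.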

Paper Structure (21 sections, 9 equations, 6 figures)

Introduction
Related Work
DRL
Integrating SNNs with DRL
Method
Input Encoding
Population coding
Poisson coding
Deterministic coding
BPT-SAN
Inter-layer connections
Intra-layer connections
The Hybrid Learning of BPT-SAN
Tuning BPT-SAN with Pseudo Backpropagation
Experimental Settings
...and 6 more sections

Figures (6)

Figure 1: The schematic diagram of our proposed BPT-SAN, which integrates spiking neurons with rich spatial-temporal dynamics and network topologies featuring biologically-plausible connectivity patterns. In the inter-layer connections, the BPT-SAN models the local nonlinearity of dendritic trees by breaking down the standard layer into two stages. In the initial stage, dendritic branches perform a mutually exclusive partition of the input and subsequently execute a weighted summation of the sparsely connected inputs. In the subsequent stage, the outputs of all branches converge to produce the neuron output via a maxout strategy. Furthermore, within the intra-layer connections, the BPT-SAN introduces lateral interactions to incorporate spiking states from neighboring neurons effectively. These two network topologies work synergistically to significantly enhance the information processing capacity of the network, enabling efficient decision-making in DRL.
Figure 2: Four continuous control tasks. (a) Hopper: State dimension: $N=11$, Action dimension: $M=3$, Goal: make a 2D one-legged robot hop forward as fast as possible; (b) Walker2d: State dimension: $N=17$, Action dimension: $M=6$, Goal: make a 2D bipedal robot walk forward as fast as possible; (c) Half-Cheetah: State dimension: $N=17$, Action dimension: $M=6$, Goal: make a 2D cheetah robot run as fast as possible; (d) Ant: State dimension: $N=111$, Action dimension: $M=8$, Goal: make a four-legged creature walk forward as fast as possible.
Figure 3: Comparison of average returns achieved by various actor networks trained with the TD3 algorithm. (a) Performance of AAN, Regular SAN, and BPT-SAN during training on the Hopper-v3 task. (b, c, d) Performances of these three actor networks on Walker2d-v3, Half-Cheetah-v3, and Ant-v3, respectively. Notably, our BPT-SAN outperforms the others consistently across all tasks. The shaded area illustrates half a standard deviation of the average evaluation result across $10$ random seeds, while the curves are smoothed for improved visualization.
Figure 4: Comparison of average returns achieved by various actor networks trained with the SAC algorithm. (a) Performance of AAN, Regular SAN, and BPT-SAN during training on the Hopper-v3 task. (b, c, d) Performances of these three actor networks on Walker2d-v3, Half-Cheetah-v3, and Ant-v3, respectively. Notably, our BPT-SAN outperforms the others consistently across all tasks. The shaded area illustrates half a standard deviation of the average evaluation result across $10$ random seeds, while the curves are smoothed for improved visualization.
Figure 5: Comparison of the max average returns over $10$ random seeds achieved by our BPT-SAN and its two variants: BPT-SAN w/o NDT (replace Nonlinear Dendritic Tree (NDT) with linear weighted summation in inter-layer connections) and BPT-SAN w/o LI (remove Lateral Interaction (LI) in the intra-layer connections), all trained using the TD3 algorithm. (a) Performance of BPT-SAN w/o NDT, BPT-SAN, and BPT-SAN w/o LI on the Hopper-v3 task. (b, c, d) Performances of these three actor networks on Walker2d-v3, Half-Cheetah-v3, and Ant-v3, respectively. The BPT-SAN, which features both inter-layer NDT and intra-layer LI, consistently outperforms the other two variants across all tasks, indicating that both components contribute to its performance.
...and 1 more figure
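
The two-stage inter-layer computation described in the Figure 1 caption (a mutually exclusive partition of the input across dendritic branches, a weighted sum per branch, then a maxout over branches) can be sketched for a single neuron as follows. The strided partition and the name `dendritic_neuron` are assumptions chosen for brevity; the caption does not specify how the partition is constructed.

```python
def dendritic_neuron(inputs, branch_weights):
    """Two-stage neuron: per-branch weighted sums, then max over branches (illustrative)."""
    k = len(branch_weights)
    branch_sums = []
    for b, weights in enumerate(branch_weights):
        # Stage 1: branch b sees only inputs b, b+k, b+2k, ... so no input is shared.
        chunk = inputs[b::k]
        branch_sums.append(sum(w * x for w, x in zip(weights, chunk)))
    # Stage 2: maxout across branches gives the neuron's output.
    return max(branch_sums), branch_sums

out, sums = dendritic_neuron([1.0, 2.0, 3.0, 4.0], [[0.5, 0.5], [1.0, -1.0]])
```

Here branch 0 receives inputs 1.0 and 3.0 and sums to 2.0, branch 1 receives 2.0 and 4.0 and sums to -2.0, and the maxout selects 2.0. Because each branch covers only its own slice of the input, the connections are sparse compared with a fully connected layer of the same width.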
