Learning fast changing slow in spiking neural networks

Cristiano Capone, Paolo Muratore

TL;DR

This work introduces a biologically plausible implementation of proximal policy optimization, referred to as lf-cs (learning fast changing slow). It provides two capabilities: assimilating new information into a new policy without altering the current policy, and replaying experiences without policy divergence.

Abstract

Reinforcement learning (RL) faces substantial challenges when applied to real-life problems, primarily stemming from the scarcity of available data due to limited interactions with the environment. This limitation is exacerbated by the fact that RL often demands a considerable volume of data for effective learning. The complexity escalates further when implementing RL in recurrent spiking networks, where inherent noise introduced by spikes adds a layer of difficulty. Life-long learning machines must inherently resolve the plasticity-stability paradox. Striking a balance between acquiring new knowledge and maintaining stability is crucial for artificial agents. To address this challenge, we draw inspiration from machine learning technology and introduce a biologically plausible implementation of proximal policy optimization, referred to as lf-cs (learning fast changing slow). Our approach results in two notable advancements: firstly, the capacity to assimilate new information into a new policy without requiring alterations to the current policy; and secondly, the capability to replay experiences without experiencing policy divergence. Furthermore, when contrasted with other experience replay (ER) techniques, our method demonstrates the added advantage of being computationally efficient in an online setting. We demonstrate that the proposed methodology enhances the efficiency of learning, showcasing its potential impact on neuromorphic and real-world applications.
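
For orientation, the clipped surrogate objective of standard proximal policy optimization, on which the abstract says lf-cs is based, can be written as below. This is the textbook form, not necessarily the exact expression used in the paper: the probability ratio $\rho^t$, the advantage estimate $\hat{A}^t$, and the parameter vector $\theta$ are notation assumed here, while $\pi^\mathrm{new}$, $\pi^\mathrm{ref}$, and the stiffness $\varepsilon$ follow the figure captions.

$$
\rho^t(\theta) = \frac{\pi^\mathrm{new}_\theta\left(a^t \mid s^t\right)}{\pi^\mathrm{ref}\left(a^t \mid s^t\right)},
\qquad
L^\mathrm{clip}(\theta) = \mathbb{E}_t\left[ \min\left( \rho^t(\theta)\, \hat{A}^t,\; \mathrm{clip}\left(\rho^t(\theta),\, 1 - \varepsilon,\, 1 + \varepsilon\right) \hat{A}^t \right) \right]
$$

Because the objective stops rewarding changes once $\rho^t$ leaves the interval $[1 - \varepsilon, 1 + \varepsilon]$ in the direction favored by $\hat{A}^t$, the same experience can be reused several times without pushing the new policy arbitrarily far from the reference one.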

Paper Structure (2 sections, 17 equations, 6 figures, 1 table)

This paper contains 2 sections, 17 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Learning on separate timescales. (a) The model is composed of two networks: a reference policy network $\pi^\mathrm{ref}$ (left) that interacts with the environment, receiving states $s$ and emitting actions $a$, and a future policy network $\pi^\mathrm{new}$ (right) that quickly accumulates knowledge based on the newly acquired experience and the reward signal $r$. (b) The two networks work on different timescales. During the fast phase (left box, cyan) the reference policy network $\pi^\mathrm{ref}$ interacts with the environment, producing a series of reward signals. This policy is held fixed. The reward signals are used by the second network $\pi^\mathrm{new}$ to update its policy. On a slower timescale (right box, magenta) the updated policy $\pi^\mathrm{new}$ is transferred into the reference policy $\pi^\mathrm{ref}$. (c) Average reward on the Pong-100 environment as a function of the number of interactions with the environment. Separating the timescales into a fast and a slow component (cyan) speeds up learning compared to using only a single slow timescale (magenta), which is equivalent to the e-prop learning algorithm (bellec2020). Solid lines are averages over $10$ independent experiments. Shaded areas span the $\pm \mathrm{std}$ intervals. Thin lines of the respective colors are individual runs.
  • Figure 2: Learning via experience replay. (a) Graphical depiction of the interplay between replay storage and retrieval for the reference policy network $\pi^\mathrm{ref}$ and the fast updating network $\pi^\mathrm{new}$. An external buffer collects the reward, state, action tuples $\left(r^t, s^t, a^t \right)$ generated by the interaction of the reference network $\pi^\mathrm{ref}$ with the environment. The $\pi^\mathrm{new}$ network retrieves the experience from the buffer and updates its internal parameters. The updates are then transferred to the reference network $\pi^\mathrm{ref}$ on a slower timescale. (b) Visual representation of the policy update control mechanism. Given a stiffness $\varepsilon$, the policy network $\pi^\mathrm{new}$ computes eligible updates, the magnitude of which (computed as the ratio $\pi^\mathrm{new} / \pi^\mathrm{ref}$) is compared against the update stiffness $\varepsilon$. If the update is within the stiffness constraints, it is integrated into the reference network $\pi^\mathrm{ref}$; otherwise it is not. (c) Examples of the policy and entropy evolution over many replays of the same memory. The policy control avoids runaway updates and instabilities due to experience overuse, improving training stability.
  • Figure 3: Dependence of learning dynamics on stiffness parameter $\varepsilon$. (a) Dynamics of two key metrics, the surrogate function and the final reward $r^t$ before and after the policy update, for two choices of the stiffness parameter $\varepsilon \in \left\{ 0.05, 0.1 \right\}$. When the allowed policy update is too large ($\varepsilon = 0.1$, bottom row), major synaptic changes are permitted, resulting in large variations of the surrogate function (bottom left panel). This degrades overall performance, as can be seen by comparing the reward measured before and after the update (bottom right panel). A better value of the stiffness parameter ($\varepsilon = 0.05$) ensures sufficient update control (top row), which benefits performance (top right panel). Insets in the right-column panels depict the average pre- and post-update reward. (b) Comparison of performance on the atari-pong environment for different choices of the parameter $\varepsilon$, ranging from no control (right, yellow line, high $\varepsilon$) to strict control (left, purple line, low $\varepsilon$). The reward profiles reveal a clear optimal regime at intermediate values of the stiffness parameter ($\varepsilon \simeq 0.2$). Thick solid lines report averages over $8$ repetitions, dashed lines mark the $20$th and $80$th percentiles, shaded areas cover the $\pm$STD regions, and thin lines are individual runs. (c) Maximum reward achieved by an agent in the atari-pong environment as a function of the $\varepsilon$ parameter. Optimal performance is achieved for intermediate choices of the control parameter. The solid line represents the average performance over $10$ independent repetitions of the experiment, while dashed lines represent the $20$th and $80$th percentiles.
  • Figure 4: Generalization to other conditions. (a) Measured reward in the pong-200 Atari environment (game with a temporal horizon of $200$ frames) as a function of the frame number for two learning configurations employing a different number of experience replays. Thick solid lines are averages over $4$ repetitions, while shaded areas are $\pm$STDs. Thin lines are individual runs. (b) Violin plot reporting the distribution of the top-50 measured rewards across experiments for different numbers of experience replays. Thick whiskers represent the $25$th and $75$th percentiles; solid white markers report the median.
  • Figure S1: Exploration of learning rates. Measured reward in the pong-100 environment as a function of the number of environment interactions. Different colors encode the different learning rates $\eta$ used by the Adam optimizer to update the model parameters. We set $\eta_0 = 1.5 \times 10^{-3}$. Solid lines are averages over $5$ repetitions, while shaded areas represent the $\pm \mathrm{std}$.
  • ...and 1 more figures
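
The two-timescale scheme of Figure 1 and the replay-with-stiffness mechanism of Figure 2 can be illustrated with a minimal sketch. It is not the authors' implementation: it replaces the recurrent spiking network and e-prop with a tabular softmax policy on a hypothetical two-armed bandit, uses plain gradient ascent instead of Adam, and uses the standard PPO clipped surrogate as the update control. All function names, the reward probabilities, and the hyperparameter values are assumptions chosen for illustration.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def clipped_surrogate_grad(logits_new, logits_ref, action, advantage, eps):
    """Gradient of the PPO clipped surrogate w.r.t. logits_new for one sample.

    Returns (gradient, ratio), where ratio = pi_new(action) / pi_ref(action).
    """
    p_new = softmax(logits_new)
    p_ref = softmax(logits_ref)
    ratio = p_new[action] / p_ref[action]
    # The clipped objective is flat once the ratio has left [1 - eps, 1 + eps]
    # in the direction favored by the advantage: no further update is made.
    if (advantage > 0 and ratio > 1 + eps) or (advantage < 0 and ratio < 1 - eps):
        return [0.0] * len(logits_new), ratio
    # d(ratio)/d(logit_k) = ratio * (1[k == action] - p_new[k])
    grad = [
        advantage * ratio * ((1.0 if k == action else 0.0) - p_new[k])
        for k in range(len(logits_new))
    ]
    return grad, ratio

def lf_cs_sketch(reward_probs=(0.2, 0.8), n_slow=20, n_fast=50,
                 n_replay=4, lr=0.1, eps=0.2, seed=0):
    """Toy two-timescale loop; returns the final reference policy."""
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    logits_ref = [0.0] * n_actions
    for _ in range(n_slow):
        logits_new = list(logits_ref)
        # Fast phase: the fixed reference policy acts and fills the buffer.
        p_ref = softmax(logits_ref)
        buffer = []
        for _ in range(n_fast):
            a = rng.choices(range(n_actions), weights=p_ref)[0]
            r = 1.0 if rng.random() < reward_probs[a] else 0.0
            buffer.append((r, a))
        # Mean reward as a simple baseline (an assumption of this sketch).
        baseline = sum(r for r, _ in buffer) / len(buffer)
        # Replay the same experience several times, updating only pi_new.
        for _ in range(n_replay):
            for r, a in buffer:
                grad, _ = clipped_surrogate_grad(
                    logits_new, logits_ref, a, r - baseline, eps)
                logits_new = [w + lr * g for w, g in zip(logits_new, grad)]
        # Slow phase: transfer the updated policy into the reference policy.
        logits_ref = logits_new
    return softmax(logits_ref)

if __name__ == "__main__":
    print(lf_cs_sketch())
```

In this sketch `eps` plays the role of the stiffness: a smaller value limits how far `logits_new` can drift from `logits_ref` within one slow step, while a very large value effectively removes the control, so repeated replays of the same buffer are unconstrained.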