Neuromorphic dreaming: A pathway to efficient learning in artificial agents

Ingo Blakowski; Dmitrii Zendrikov; Cristiano Capone; Giacomo Indiveri

TL;DR

The paper addresses energy and data efficiency in reinforcement learning by implementing a model-based RL framework using spiking neural networks on mixed-signal neuromorphic hardware. It introduces a two-network architecture (agent and world model) and an awake-dreaming training protocol that alternates real-environment interactions with simulated rollouts to boost sample efficiency, employing local learning rules such as e-prop and a policy-gradient objective $E^A = -\sum_t R^t \log(\pi^t_k)$ with $R^t = \sum_{t' \ge t} \gamma^{t'-t} r^{t'}$. The world model is trained with supervised e-prop-based readouts, minimizing a combined state and reward prediction loss $E^M = c_\xi \sum_{t,k} (\xi^{\star t+1}_k - \xi^{t+1}_k)^2 + c_r \sum_t (r^{\star t+1} - r^{t+1})^2$, enabling accurate imagined experiences. Validation on Atari Pong demonstrates that dreaming reduces the required number of real environment interactions while maintaining or improving learning performance, and that the approach runs in real time on the DYNAP-SE neuromorphic processor with sub-milliwatt power consumption, supporting a practical path toward energy-efficient neuromorphic learning for real-world robotics and intelligent agents.
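The two objectives quoted above can be written out as a minimal sketch in plain Python. The function names, argument layout, and default coefficients are hypothetical and chosen only for illustration; this is not the authors' implementation, and it covers only the loss values, not the e-prop weight updates or the spiking dynamics.

```python
import math


def discounted_returns(rewards, gamma):
    """R^t = sum over t' >= t of gamma^(t'-t) * r^t', computed backwards in time."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def policy_loss(action_probs, actions, rewards, gamma):
    """E^A = -sum_t R^t * log(pi^t_k), with k the action taken at step t.

    action_probs[t] is the list of action probabilities at step t (assumed layout).
    """
    returns = discounted_returns(rewards, gamma)
    return -sum(
        ret * math.log(probs[k])
        for ret, probs, k in zip(returns, action_probs, actions)
    )


def world_model_loss(state_targets, state_preds, reward_targets, reward_preds,
                     c_state=1.0, c_reward=1.0):
    """E^M = c_xi * sum_{t,k} (xi*_k - xi_k)^2 + c_r * sum_t (r* - r)^2.

    c_state and c_reward stand in for c_xi and c_r; their defaults are assumptions.
    """
    state_term = sum(
        (target_k - pred_k) ** 2
        for target, pred in zip(state_targets, state_preds)
        for target_k, pred_k in zip(target, pred)
    )
    reward_term = sum(
        (target - pred) ** 2
        for target, pred in zip(reward_targets, reward_preds)
    )
    return c_state * state_term + c_reward * reward_term
```

For example, `discounted_returns([0.0, 0.0, 1.0], 0.5)` propagates a single final reward backwards, so earlier steps receive geometrically smaller credit.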

Abstract

Achieving energy efficiency in learning is a key challenge for artificial intelligence (AI) computing platforms. Biological systems demonstrate remarkable abilities to learn complex skills quickly and efficiently. Inspired by this, we present a hardware implementation of model-based reinforcement learning (MBRL) using spiking neural networks (SNNs) on mixed-signal analog/digital neuromorphic hardware. This approach leverages the energy efficiency of mixed-signal neuromorphic chips while achieving high sample efficiency through an alternation of online learning, referred to as the "awake" phase, and offline learning, known as the "dreaming" phase. The proposed model includes two symbiotic networks: an agent network that learns by combining real and simulated experiences, and a learned world model network that generates the simulated experiences. We validate the model by training the hardware implementation to play the Atari game Pong. We start from a baseline consisting of an agent network that learns without a world model or dreaming, and which successfully learns to play the game. By incorporating dreaming, the number of required real game experiences is reduced significantly compared to the baseline. The networks are implemented using a mixed-signal neuromorphic processor, with the readout layers trained using a computer in the loop, while the other layers remain fixed. These results pave the way toward energy-efficient neuromorphic learning systems capable of rapid learning in real-world applications.

Paper Structure (38 sections, 10 equations, 8 figures, 1 table, 1 algorithm)

Introduction
Reinforcement learning with spiking neural networks.
Neuromorphic hardware.
Methodology
Agent network
Learning rule:
Network architecture:
World model network
Learning rules:
Network architecture:
Awake-dreaming learning
Input Encoding
Environment State Encoding:
Action Encoding:
Experiments and results
...and 23 more sections

Figures (8)

Figure 1: Training with dreaming alternates between two phases. (a) In the awake phase, consisting of 100 frames, the agent network (policy) and model network interact with the real environment. The agent takes the current real state and predicts an action, which is performed in the game and fed into the model network. The real reward is used to compute the policy gradient, while the model network predicts the change in state variables and reward, which are compared to the actual state and reward to update the model frame by frame. (b) In the dreaming phase, lasting 50 frames, the agent network is updated by interacting solely with the model network, detached from the real environment. The agent takes the imaginary current state from the world model and predicts an action, which is fed into the world model to predict the next state and reward. The imaginary reward is used to compute the policy gradient for updating the agent network.
Figure 2: Setup and timing diagram. (a) The setup employs a computer-in-the-loop system, where the DYNAP-SE neuromorphic chip efficiently simulates the neural dynamics of the input and hidden layers, while the computer handles chip-environment synchronization and manages the learning protocol. The readout layers are implemented on the computer, and learning focuses on updating the output weights stored on the computer, while the input-to-hidden connections on the chip remain fixed due to hardware constraints. The computer generates input spike trains based on the game state and action, which are loaded onto the chip. After a short waiting period for the chip to process the new input, the computer reads out the spike events, which are then used as input (in the form of the number of spikes per neuron) to the readout layers to predict the respective outputs. (b) The timing diagram illustrates the duration of one step with and without dreaming. The majority of the time is consumed by updating the input to the agent or model and the subsequent waiting period, while other processing times are negligible. Updating the input takes 8.4 ms, and the waiting period is 10 ms. Consequently, a step without the model takes 18.4 ms, while a step with the model takes 36.8 ms.
Figure 3: (a) Average return per game over the last 50 games as a function of the number of games played, for an agent that uses dreaming (purple) and a baseline that does not (blue). The dashed line represents the mean over 10 independent training realizations, the solid line represents the 80th percentile, and the shaded area represents the standard deviation. With dreaming, the average return increases significantly faster in terms of interactions with the real environment. (b) Evolution of the policy entropy during training, quantifying the uncertainty in action selection at each game. Before training (top right), the policy shows high uncertainty, while after training (bottom right), the policy adapts quickly and assigns high probabilities to single actions, indicating growing confidence as learning progresses.
Figure 4: Network architectures of the agent and model networks. (a) The agent network predicts action probabilities based on the state input encoded by population-coded spike generators. The input-to-hidden connections (gray) are fixed and randomly initialized such that each hidden neuron receives exactly 8 incoming connections. These connections are assigned one of four quantized synaptic weights, chosen randomly, which are implemented by 1 to 4 parallel connections between an input neuron and a hidden neuron. The learnable hidden-to-output connections (purple) are trained using a policy gradient method on the computer in the loop with the neuromorphic chip. (b) The model network predicts the next state and reward based on the current state and action inputs. The fixed input-to-hidden connections (gray) follow a random connectivity pattern similar to that of the agent network. The hidden-to-output connections (purple) are learned in a supervised manner to minimize state and reward prediction errors.
Figure 5: Population coding for the four state variables in Atari Pong. The activity of the input spike generators follows a Gaussian distribution, with the mean shifting according to the corresponding state variable's value between the minimum and maximum spike generator index of the corresponding population.
...and 3 more figures
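
The population coding described in the Figure 5 caption can be illustrated with a short sketch in plain Python. It assumes a linear mapping from the state variable to a position between the first and last spike-generator index, with a Gaussian activity profile centered there. The function name, the population size of 16, the width `sigma`, and the peak rate are all hypothetical values picked for illustration; the caption does not give them.

```python
import math


def population_rates(value, v_min, v_max, n_generators=16, sigma=1.5,
                     peak_rate_hz=100.0):
    """Return one firing rate per spike generator for a single state variable.

    n_generators, sigma, and peak_rate_hz are illustrative assumptions.
    """
    if not v_min <= value <= v_max:
        raise ValueError("value must lie within [v_min, v_max]")
    # Map the value linearly onto the index range [0, n_generators - 1].
    center = (value - v_min) / (v_max - v_min) * (n_generators - 1)
    # Gaussian bump over generator indices, centered on the encoded value.
    return [
        peak_rate_hz * math.exp(-0.5 * ((index - center) / sigma) ** 2)
        for index in range(n_generators)
    ]
```

With these assumptions, the minimum of the variable's range drives the first generator hardest and the maximum drives the last, so the bump of activity slides across the population as the variable changes.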
