Brain-Like Replay Naturally Emerges in Reinforcement Learning Agents

Jiyi Wang; Likai Tang; Huimiao Chen; Marcelo G Mattar; Sen Song

TL;DR

We develop a modular reinforcement learning model that can generate replay and demonstrate that the replay it generates helps the agent complete the task.

Abstract

Replay is a powerful strategy for promoting learning in both artificial intelligence and the brain. However, the conditions under which it arises and its functional advantages are not fully understood. In this study, we develop a modular reinforcement learning model that can generate replay. We demonstrate that replay generated in this way helps the agent complete the task. We also analyze the information contained in the representations and propose a mechanism for how replay makes a difference. Our design avoids complex assumptions and allows replay to emerge naturally within a task-optimized paradigm. Our model also reproduces key phenomena observed in biological agents. This research explores the structural biases in modular ANNs that give rise to replay and the potential utility of replay in developing efficient RL.

Paper Structure (25 sections, 3 equations, 8 figures)

Introduction
Methods
Conditions to generate instrumental replay
Principle of designing different modules
Design of the HF Module
Design of the PFC Module
Training and Test Paradigm
Results
Biologically similar replay sequences emerge after training
Task setting
Simulation result
Ablation study demonstrates that replay helps learning
Information flow during replay entails information about context and action plan
Manifold Analysis Reveals That Replay Flow Bridging Between Contexts
Discussion
...and 10 more sections

Figures (8)

Figure 1: A-C. Model structure. A. The HF GRU module performs path integration and episodic memory. The PFC RNN module is responsible for decision-making. The information passage between the PFC and HF opens only at rest. B. During mobility, the information passage remains closed, and the HF and PFC modules operate independently. The place cell activation reflects the agent's actual current location. C. During immobility, the information passage opens and the HF and PFC modules communicate with each other. The activated place cell outputs constitute a replay sequence. D-F. Task setting. D. The agent starts from S, first reaches checkpoint C to collect a small reward (0.5), and then arrives at goal G to collect a large reward (1.0). Moving directly to G yields no reward. E. A representative trajectory generated by the trained RL agent. It first reaches C and collects the small reward, at which point replay occurs. F. The replay event in E.
Figure 2: A. Performance of the RL agent in the test period. B. Legend for C and E. Different colors represent different parts of the room. C. Change in the replay distribution across segments in the animal data, adapted from Igata et al. (2021). D. Distribution of distances between adjacent replay steps. E. Change in the replay distribution generated by the RL agent.
Figure 3: A. The total reward decreases when the signal from HF to PFC is replaced by random noise (left) or all-zero vectors (right). B. Different numbers of replay steps are masked (right); performance decreases monotonically as more steps are masked (left). C. The number of steps needed to find the new reward increases when the multi-step information emission is replaced by a single step. D. Performance is only slightly impaired when the order of information is shuffled.
Figure 4: A. Decoding accuracy for the reward location from activities in HF (left), PFC (middle), and the information passage (right). The X-axis is composed of three parts: the initial stage before replay, steps 1-4 during replay, and the output stage after replay. B. Decoding error for correct future actions from PFC activities after replay. Red, orange, yellow, and brown denote the first, second, third, and fourth future actions following replay. In A-C, darker colors represent the first time the agent obtains the relocated reward, while lighter colors represent the second time. C. Value map produced through the “stop and scan” method. The C1 map represents the value before the checkpoint change, and the subsequent value maps are from the agent's repeated visits to C2. D. Value advantage of checkpoint 2 over checkpoint 1, calculated by convolving the value map with a Difference of Gaussians (DoG) filter. E. Left, value advantage calculated as the value of the path S-C2 minus the value of S-C1. Right, the value of the path C2-G minus the value of C1-G.
Figure 5: Manifold analysis reveals the context-switch process in detail. A. The 3D embedding of the “neural” manifold through dimension reduction of PFC activities when the small intermediate reward stays at C1 (left), the agent meets the relocated reward at C2 for the first time (middle), and the agent meets C2 for the second time (right). The highlighted trajectory represents the neural manifold at the corresponding stage, but the other trajectories are also plotted for comparison. B. AEV of different PCs of PFC activities during replay when the agent first finds the new reward. C. Left, the dimension of the PFC activities during replay (virtual experience) when the threshold for AEV is set to 70%. Right, the dimension of PFC activities during both movement and replay. The results are the same. D. The mean square distance of data points to their KNN centroids calculated from hidden states in HF and PFC during movement.
...and 3 more figures
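The gating described in the Figure 1 caption (an information passage between HF and PFC that is closed during movement and open at rest) can be illustrated with a minimal sketch. This is not the authors' implementation: both modules are reduced to plain tanh recurrent units (the paper uses a GRU for HF), the weights are random rather than trained, feeding the observation to both modules is an assumption, and every name (`GatedTwoModuleAgent`, `W_hf_to_pfc`, `immobile`, and so on) is hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.5):
    """Small random weights standing in for trained parameters (assumption)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class GatedTwoModuleAgent:
    """Toy HF and PFC modules joined by a passage that opens only at rest."""

    def __init__(self, obs_dim=2, hf_dim=4, pfc_dim=3, seed=0):
        rng = random.Random(seed)
        self.W_hf_in = rand_matrix(hf_dim, obs_dim, rng)
        self.W_hf_rec = rand_matrix(hf_dim, hf_dim, rng)
        self.W_pfc_in = rand_matrix(pfc_dim, obs_dim, rng)
        self.W_pfc_rec = rand_matrix(pfc_dim, pfc_dim, rng)
        # Cross-module weights: these form the "information passage".
        self.W_hf_to_pfc = rand_matrix(pfc_dim, hf_dim, rng)
        self.W_pfc_to_hf = rand_matrix(hf_dim, pfc_dim, rng)
        self.hf = [0.0] * hf_dim
        self.pfc = [0.0] * pfc_dim

    def step(self, obs, immobile):
        """Advance both modules one step; the passage is open only if immobile."""
        gate = 1.0 if immobile else 0.0
        # Messages are computed from the previous hidden states.
        to_hf = matvec(self.W_pfc_to_hf, self.pfc)
        to_pfc = matvec(self.W_hf_to_pfc, self.hf)
        hf_drive = [
            a + b + gate * c
            for a, b, c in zip(matvec(self.W_hf_in, obs),
                               matvec(self.W_hf_rec, self.hf), to_hf)
        ]
        pfc_drive = [
            a + b + gate * c
            for a, b, c in zip(matvec(self.W_pfc_in, obs),
                               matvec(self.W_pfc_rec, self.pfc), to_pfc)
        ]
        self.hf = [math.tanh(v) for v in hf_drive]
        self.pfc = [math.tanh(v) for v in pfc_drive]
        return self.hf, self.pfc


agent = GatedTwoModuleAgent()
agent.step([1.0, 0.0], immobile=False)  # moving: modules evolve independently
agent.step([0.0, 0.0], immobile=True)   # resting: HF and PFC exchange messages
```

With the gate closed, the cross-module weights have no effect on either hidden state, which mirrors the independent operation described for mobility. Setting the HF-to-PFC message to zeros or noise while the gate is open would correspond loosely to the ablations summarized in the Figure 3 caption.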
