Efficient Deep Reinforcement Learning with Predictive Processing Proximal Policy Optimization

Burcu Küçükoğlu; Walraaf Borkent; Bodo Rueckauer; Nasir Ahmad; Umut Güçlü; Marcel van Gerven

TL;DR

The paper tackles sample inefficiency in deep reinforcement learning by introducing P4O, a recurrent PPO-based agent augmented with predictive processing: a world model is integrated into the recurrent hidden state, and a predictive processing loss is added to the objective. The method jointly optimizes the PPO objectives and a multi-step prediction error term, improving learning speed and performance on Atari using a single GPU and without hyperparameter tuning. Empirically, P4O outperforms a recurrent PPO baseline across multiple games, matches or surpasses several state-of-the-art single-GPU agents within the same budget of 10 days of accelerator time, and exceeds human gamer performance on multiple games, notably Seaquest. The findings suggest that predictive processing can yield more data-efficient, robust, and scalable RL agents, with potential implications for neuroscience-inspired AI and efficient control in real-world tasks.

Abstract

Advances in reinforcement learning (RL) often rely on massive compute resources and remain notoriously sample inefficient. In contrast, the human brain is able to efficiently learn effective control strategies using limited resources. This raises the question of whether insights from neuroscience can be used to improve current RL methods. Predictive processing is a popular theoretical framework which maintains that the human brain is actively seeking to minimize surprise. We show that recurrent neural networks which predict their own sensory states can be leveraged to minimize surprise, yielding substantial gains in cumulative reward. Specifically, we present the Predictive Processing Proximal Policy Optimization (P4O) agent, an actor-critic reinforcement learning agent that applies predictive processing to a recurrent variant of the PPO algorithm by integrating a world model in its hidden state. Even without hyperparameter tuning, P4O significantly outperforms a baseline recurrent variant of the PPO algorithm on multiple Atari games using a single GPU. It also outperforms other state-of-the-art agents given the same wall-clock time and exceeds human gamer performance on multiple games including Seaquest, which is a particularly challenging environment in the Atari domain. Altogether, our work underscores how insights from the field of neuroscience may support the development of more capable and efficient artificial agents.
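The idea of adding a prediction error term to the PPO objective can be illustrated with a minimal sketch. This is not the authors' implementation: the squared-error form of the predictive processing term, the coefficients `value_coef` and `pp_coef`, and all function names are assumptions made for illustration, and the usual entropy bonus is omitted for brevity.

```python
def ppo_clip_term(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for one sample (a quantity to maximise)."""
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def prediction_error_loss(latents, predictions):
    """Mean squared error between encoded inputs x and predictions p.

    Both arguments are sequences of equal-length vectors, one per time step.
    The squared-error form is an assumption of this sketch.
    """
    total, count = 0.0, 0
    for x, p in zip(latents, predictions):
        for x_i, p_i in zip(x, p):
            total += (p_i - x_i) ** 2
            count += 1
    return total / count


def joint_loss(ratios, advantages, values, returns, latents, predictions,
               value_coef=0.5, pp_coef=1.0):
    """Policy loss + weighted value loss + weighted prediction error.

    value_coef and pp_coef are hypothetical weights, not values from the paper.
    """
    n = len(ratios)
    policy_loss = -sum(ppo_clip_term(r, a) for r, a in zip(ratios, advantages)) / n
    value_loss = sum((v - g) ** 2 for v, g in zip(values, returns)) / n
    pp_loss = prediction_error_loss(latents, predictions)
    return policy_loss + value_coef * value_loss + pp_coef * pp_loss
```

A single scalar such as this would be minimised by gradient descent, so that the same update improves the policy, the value estimate, and the agent's predictions of its own encoded inputs.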

Paper Structure (24 sections, 4 equations, 9 figures, 2 tables)


Introduction
Methods
P4O architecture
Encoder model
RNN model
Actor-Critic model
P4O algorithm
Empirical evaluation
Hypotheses
Experimental setup
Performance Comparisons
Baseline comparison
Individual runs
Comparison with state-of-the-art
Analysis of the P4O algorithm
...and 9 more sections

Figures (9)

Figure 1: Components of the P4O architecture. A sensory input $s$ is encoded into a low-dimensional latent representation $x$. This encoded game state is subtracted from a prediction $p$ generated by an LSTM layer. The resulting prediction error is passed into both the prediction layer and a second LSTM layer representing the agent's belief states. The LSTM outputs are used by an actor-critic model to select an action $a$ and compute a corresponding state value $v$. The red box highlights the main architectural contribution of P4O. Minimization of the prediction error is added to the P4O agent's objective function.
Figure 2: Comparison of the P4O algorithm against the baselines LSTM-PPO ($k=1024$), LSTM-PPO ($k=800$) and the P4O model optimized without the predictive processing loss (P4O without PP loss). Results are based on the average score over the last 100 episodes. Shaded areas show the standard error of the mean over the 8 runs.
Figure 3: Comparison of individual performance curves. P4O results are shown in the top row and LSTM-PPO baseline ($k=1024$) results are shown in the bottom row. Equivalent colors indicate the use of the same seed for network initialization.
Figure 4: Performance of P4O compared against the IQN, Rainbow and DreamerV2 algorithms and the baseline LSTM-PPO after 10 days of accelerator time. Lighter curves correspond to the individual runs, whereas darker curves are the mean of the 3 individual runs. Circles represent the final reported scores of the IQN, Rainbow and DreamerV2 agents after 10 days. The point at which the P4O mean exceeds each agent's endpoint score is marked with a triangle. The reported average score is shown on a log scale and is based on a rolling mean of the last 100 episodes.
Figure 5: Ablation study of the contribution of individual elements of the P4O algorithm to performance. “P4O without PP Loss” indicates an architecture identical to P4O that does not include the predictive processing loss in the loss computation. “P4O without PP mechanism” indicates an architecture in which the LSTM hidden states are linearly read out to predict the input ($x$) through an additional prediction output, but without providing this prediction as a negative feedback signal (hence, no predictive processing mechanism within the network). “P4O without Belief to Prediction connections” indicates a network in which the connections from the belief LSTM to the prediction LSTM units have been cut. Results are based on the average score over the last 100 episodes. Shaded areas show the standard error of the mean over the 8 runs.
...and 4 more figures
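The data flow described in the Figure 1 caption can be sketched as a single forward step. This is a rough illustration under stated assumptions, not the authors' architecture: the linear-tanh encoder, all layer sizes, the choice to use the prediction LSTM's hidden state directly as the prediction $p$, and feeding both LSTM outputs to the actor-critic heads are assumptions, and every name (`lstm_step`, `p4o_step`, `params`) is hypothetical. The belief-to-prediction connection follows the wording of the Figure 5 caption.

```python
import numpy as np

OBS, LATENT, BELIEF, ACTIONS = 16, 8, 12, 4  # illustrative sizes only


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


def lstm_step(inp, h, c, W, b):
    """One step of a standard LSTM cell; W has shape (4*H, len(inp) + H)."""
    z = W @ np.concatenate([inp, h]) + b
    i, f, g, o = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new


def init_params(seed=0):
    rng = np.random.default_rng(seed)
    w = lambda *shape: rng.normal(0.0, 0.1, shape)
    return {
        "enc": w(LATENT, OBS),
        # prediction LSTM input: prediction error + belief hidden state
        "pred_W": w(4 * LATENT, LATENT + BELIEF + LATENT),
        "pred_b": np.zeros(4 * LATENT),
        # belief LSTM input: prediction error
        "bel_W": w(4 * BELIEF, LATENT + BELIEF),
        "bel_b": np.zeros(4 * BELIEF),
        "actor": w(ACTIONS, LATENT + BELIEF),
        "critic": w(LATENT + BELIEF),
    }


def p4o_step(s, state, params):
    """Encode s, form the prediction error, update both LSTMs, read out a and v."""
    h_pred, c_pred, h_bel, c_bel = state
    x = np.tanh(params["enc"] @ s)   # stand-in for the encoder model
    p = h_pred                       # assumption: prediction = prediction-LSTM output
    err = p - x                      # encoded state subtracted from the prediction
    h_pred_new, c_pred_new = lstm_step(
        np.concatenate([err, h_bel]), h_pred, c_pred,
        params["pred_W"], params["pred_b"])
    h_bel_new, c_bel_new = lstm_step(
        err, h_bel, c_bel, params["bel_W"], params["bel_b"])
    feats = np.concatenate([h_pred_new, h_bel_new])
    logits = params["actor"] @ feats
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()             # softmax over actions
    value = float(params["critic"] @ feats)
    return probs, value, err, (h_pred_new, c_pred_new, h_bel_new, c_bel_new)


def initial_state():
    return (np.zeros(LATENT), np.zeros(LATENT), np.zeros(BELIEF), np.zeros(BELIEF))
```

Calling `p4o_step(np.ones(OBS), initial_state(), init_params())` returns action probabilities, a state value, the prediction error, and the updated recurrent state. In training, the squared magnitude of `err` would be the quantity added to the agent's objective.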
