Learning To Explore With Predictive World Model Via Self-Supervised Learning

Alana Santana, Paula P. Costa, Esther L. Colombini

TL;DR

In environments with scarce extrinsic rewards, the paper tackles exploration by introducing an intrinsically motivated agent built from a predictive world model and a policy network. The world model is modular, hierarchical, and BRIM-based, using attention to allocate computation and generate intrinsic rewards via prediction error, $r^{int}_{t} = \frac{\left \| h_{t}^{p} - h_{t-1}^{f} \right \|^{2}_{2}}{n}$. Trained with PPO on 18 Atari games, the approach achieves superior performance in many cases, demonstrating robust reactive and deliberative behaviors and faster accrual of extrinsic rewards, while highlighting some limitations in highly sparse tasks. These results suggest that integrating sparsity, modularity, independence, hierarchy, and attention into predictive world models can yield scalable intrinsic motivation for complex environments and potential extensions to robotics and more realistic settings.
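The intrinsic reward formula above can be illustrated with a minimal sketch. It only evaluates the stated expression, the squared L2 distance between two vectors divided by $n$. Several things here are assumptions rather than details from the paper: the function name `intrinsic_reward`, the reading of $h_{t}^{p}$ and $h_{t-1}^{f}$ as two equal-length representation vectors, and the reading of $n$ as their dimensionality.

```python
from typing import Sequence


def intrinsic_reward(h_p: Sequence[float], h_f: Sequence[float]) -> float:
    """Squared L2 distance between two vectors, divided by their length.

    h_p stands in for h_t^p and h_f for h_{t-1}^f in the formula.
    Assumption: n is the number of components in each vector.
    """
    if len(h_p) != len(h_f):
        raise ValueError("vectors must have the same length")
    n = len(h_p)
    if n == 0:
        raise ValueError("vectors must be non-empty")
    # Sum of squared component-wise differences, i.e. ||h_p - h_f||_2^2.
    squared_l2 = sum((p - f) ** 2 for p, f in zip(h_p, h_f))
    return squared_l2 / n


if __name__ == "__main__":
    # Identical vectors give zero reward; a mismatch gives a positive one.
    print(intrinsic_reward([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    print(intrinsic_reward([1.0, 2.0, 3.0], [1.0, 0.0, 3.0]))
```

Under this reading, the reward is zero when the two representations coincide and grows with their mismatch, which matches the description of the reward as a prediction error.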

Abstract

Autonomous artificial agents must be able to learn behaviors in complex environments without humans designing tasks and rewards. Designing these functions for each environment is not feasible, which motivates the development of intrinsic reward functions. In this paper, we propose using several long-neglected cognitive elements to build an internal world model for an intrinsically motivated agent. Our agent interacts satisfactorily with the environment, learning complex behaviors without needing previously designed reward functions. We used 18 Atari games to evaluate which cognitive skills emerge in games that require reactive and deliberative behaviors. Our results show superior performance compared to the state of the art in many test cases with dense and sparse rewards.

Paper Structure

This paper contains 8 sections, 9 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Our intrinsically motivated agent architecture. Our approach has two modules: a predictive world model and a policy network. The predictive world model generates intrinsic motivation rewards using attention and modular structures, while the policy network learns a policy to execute actions in the environment.
  • Figure 2: Intrinsically motivated agent architecture. At each time step $t$, the current state $s_{t}$ triggers the predictive world model's modules. Each module can be in an active, expected, null, or inactive state. Active modules use the state information $s_{t}$ to build a representation for choosing the current action. Modules in the expected state build an expected representation for the next state $s_{t+1}$ before the agent sees it. Thus, the agent predicts the future consequences of its action in the world. Finally, inactive/null modules do not participate in the gradient and can be activated as needed in the next iteration. After the action is executed, the intrinsic reward is the difference between the agent's expectations and the real-world state.
  • Figure 3: MsPacman game. The training curve shows the best extrinsic returns per episode in the MsPacman environment. In blue, we have our agent, and in red, we have the baseline. The x-axis represents the training steps, and the y-axis is the game score (i.e., accumulated extrinsic reward) received at the end of the episode.
  • Figure 4: Results of our proposed approach. The training curve shows the best extrinsic returns per episode in the Asterix, Riverraid, and Atlantis environments. The x-axis represents the training steps, and the y-axis is the game score (i.e., accumulated extrinsic reward) received at the end of the episode. In these scenarios, our agent outperforms the baseline (Table 1). The exploratory strategies chosen to stay alive allowed the agent to obtain a much higher score in these games. After exploring states that led to bad scores, the agent quickly changed its exploratory strategy.
  • Figure 5: Freeway game. The training curve shows the best extrinsic returns per episode in the Freeway environment. In blue, we have our agent, and in red, we have the baseline. The x-axis represents the training steps, and the y-axis is the game score (i.e., accumulated extrinsic reward) received at the end of the episode.