World Model Agents with Change-Based Intrinsic Motivation

Jeremias Ferrao, Rafael Cunha

TL;DR

This work examines how Change Based Exploration Transfer (CBET) can be adapted for world-model agents such as DreamerV3, and compares the result against IMPALA in the sparse-reward environments Crafter and Minigrid. It shows that CBET can boost DreamerV3's performance in Crafter but may harm it in Minigrid, and that pre-training with intrinsic rewards does not guarantee immediate gains in extrinsic rewards during transfer. The study introduces a two-instance DreamerV3 transfer approach to accommodate CBET’s policy-transfer notion within a world-model framework. Overall, the results indicate that CBET’s utility depends on both the environment and the architecture, motivating future research into scheduler-based intrinsic rewards and more resource-efficient transfer methods. These findings highlight the importance of aligning exploration incentives with task objectives and model structure when deploying RL agents in sparse-reward settings.

Abstract

Sparse reward environments pose a significant challenge for reinforcement learning due to the scarcity of feedback. Intrinsic motivation and transfer learning have emerged as promising strategies to address this issue. Change Based Exploration Transfer (CBET), a technique that combines these two approaches for model-free algorithms, has shown potential in addressing sparse feedback, but its effectiveness with modern algorithms remains understudied. This paper provides an adaptation of CBET for world model algorithms like DreamerV3 and compares the performance of DreamerV3 and IMPALA agents, both with and without CBET, in the sparse reward environments of Crafter and Minigrid. Our tabula rasa results show that CBET can improve DreamerV3's returns in Crafter, whereas in Minigrid the algorithm attains a suboptimal policy and CBET further reduces returns. In the same vein, our transfer learning experiments show that pre-training DreamerV3 with intrinsic rewards does not immediately lead to a policy that maximizes extrinsic rewards in Minigrid. Overall, our results suggest that CBET has a positive impact on DreamerV3 in more complex environments like Crafter but may be detrimental in environments like Minigrid. In the latter case, the behaviours promoted by CBET in DreamerV3 may not align with the task objectives of the environment, leading to reduced returns and suboptimal policies.
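
The abstract does not state CBET's reward formula, so the following is only an illustrative sketch of what a change-based intrinsic reward could look like. It assumes a simple count-based form in which rarely seen states and rarely seen changes earn a larger bonus. The class `ChangeCountReward`, the element-wise definition of a "change", and the mixing weight `beta` are hypothetical names and choices introduced here, not the paper's implementation.

```python
from collections import Counter


class ChangeCountReward:
    """Hypothetical count-based bonus: rarer states and changes pay more."""

    def __init__(self):
        self.state_counts = Counter()
        self.change_counts = Counter()

    def __call__(self, obs, next_obs):
        # Assumption: a "change" is the element-wise difference between
        # two flat observations; a real agent may define it differently.
        change = tuple(b - a for a, b in zip(obs, next_obs))
        state = tuple(next_obs)
        self.state_counts[state] += 1
        self.change_counts[change] += 1
        # The bonus shrinks as the same state and change are seen again.
        return 1.0 / (self.state_counts[state] + self.change_counts[change])


def mix_rewards(extrinsic, intrinsic, beta=0.1):
    """Combine task reward with the bonus; beta is an assumed weight."""
    return extrinsic + beta * intrinsic


bonus = ChangeCountReward()
first = bonus((0, 0), (0, 1))   # novel state and novel change
second = bonus((0, 0), (0, 1))  # same transition again, smaller bonus
```

In a tabula rasa setting, a reward of this kind would be mixed with the extrinsic reward as in `mix_rewards`; in a transfer setting, it would drive pre-training on its own before the task reward is introduced.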

Paper Structure

This paper contains 21 sections, 4 equations, 6 figures, 1 table, 2 algorithms.

Figures (6)

  • Figure 1: Minigrid environments: Doorkey (left) and Unlock (right). The agent's observations (light coloured squares) consist of a $7 \times 7$ grid in front of it.
  • Figure 2: Crafter environment. The agent is provided a top down view of the game with statistics at the bottom.
  • Figure 3: Mean extrinsic return plotted with standard error. Standard errors in the tabula rasa case represent variability across 5 experiments, while those in the transfer learning case reflect variability across 8 evaluation episodes. Transfer learning experiments exclusively feature the CBET variants of the algorithms. The transfer learning results indicate that DreamerV3 outperforms IMPALA in Crafter but IMPALA acquires higher returns initially in Minigrid. In the tabula rasa experiments, DreamerV3 outperforms IMPALA in Crafter but significantly falls short in Minigrid. CBET is also beneficial for DreamerV3 in Crafter but reduces returns and exhibits higher variance in Minigrid.
  • Figure C.1: Equivalent Training Time Comparison between IMPALA and DreamerV3. IMPALA fails to outperform DreamerV3 even after being provided 5x more training time.
  • Figure D.1: Impact of Planning Ratio on DreamerV3 with and without CBET in Crafter. There does not appear to be a significant difference in performance between the two models as the planning ratio increases.
  • ...and 1 more figure