OptionZero: Planning with Learned Options

Po-Wei Huang, Pei-Chiun Peng, Hung Guei, Ti-Rong Wu

TL;DR

OptionZero addresses planning with temporally extended actions by automatically discovering options and integrating them into MuZero. It introduces an option network that predicts dominant options and a dynamics model capable of unrolling composite actions, enabling deeper planning under the same budget. Empirical results on 26 Atari games show substantial gains in mean human-normalized performance, and GridWorld experiments illustrate dramatic training speedups when using learned options. The findings indicate that learned options adapt to game characteristics and can be leveraged to enhance planning efficiency across domains, with code released for reproducibility.

Abstract

Planning with options -- a sequence of primitive actions -- has been shown to be effective in reinforcement learning within complex environments. Previous studies have focused on planning with predefined options or with options learned from expert demonstration data. Inspired by MuZero, which learns superhuman heuristics without any human knowledge, we propose a novel approach, named OptionZero. OptionZero incorporates an option network into MuZero, providing autonomous discovery of options through self-play games. Furthermore, we modify the dynamics network to provide environment transitions when using options, allowing deeper search under the same simulation constraints. Empirical experiments conducted in 26 Atari games demonstrate that OptionZero outperforms MuZero, achieving a 131.58% improvement in mean human-normalized score. Our behavior analysis shows that OptionZero not only learns options but also acquires strategic skills tailored to different game characteristics. Our findings show promising directions for discovering and using options in planning. Our code is available at https://rlg.iis.sinica.edu.tw/papers/optionzero.
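For readers unfamiliar with the metric, the following is a minimal sketch of how a mean human-normalized score is commonly computed in Atari benchmarks: each game's raw score is rescaled so that a random policy maps to 0% and a human reference maps to 100%, and the per-game values are then averaged. The function names and all raw scores below are hypothetical and chosen only for illustration; they are not taken from the paper, and the sketch does not attempt to reproduce the 131.58% figure or the way the authors derived it.

```python
from statistics import mean


def human_normalized_score(agent_score, random_score, human_score):
    """Rescale a raw game score so that random play is 0% and human play is 100%."""
    gap = human_score - random_score
    if gap == 0:
        raise ValueError("human and random baselines must differ")
    return 100.0 * (agent_score - random_score) / gap


def mean_human_normalized_score(results):
    """Average the per-game normalized scores.

    `results` is an iterable of (agent_score, random_score, human_score) tuples.
    """
    return mean(human_normalized_score(*r) for r in results)


# Hypothetical raw scores for two made-up games (illustration only).
example_results = [
    (150.0, 50.0, 250.0),   # halfway between random and human
    (900.0, 100.0, 500.0),  # twice the human-over-random gap
]
example_mean = mean_human_normalized_score(example_results)
```

Because the mean is taken over per-game percentages, a few games with very large normalized scores can dominate it, which is why Atari results are often reported with the median as well.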

Paper Structure

This paper contains 35 sections, 8 equations, 7 figures, 14 tables.

Figures (7)

  • Figure 1: An illustration of calculating options in a decision tree. Each node represents a state with two possible actions, L and R, corresponding to the left and right transitions to the subsequent state. (a) The decision tree and probabilities for each option at state $s$. (b) The procedure of determining the dominant option from the option network.
  • Figure 2: An illustration of each phase in MCTS in OptionZero.
  • Figure 3: An illustration of optimization in OptionZero. The notation is from the perspective of $s_t$.
  • Figure 4: Visualization of options learned by OptionZero at different stages of training in GridWorld.
  • Figure 5: Sequence of game play from (a) to (e) for OptionZero in hero. The actions R, L, D, and F represent moving right, moving left, placing bombs, and firing, respectively.
  • ...and 2 more figures