AlphaZeroES: Direct score maximization outperforms planning loss minimization

Carlos Martin; Tuomas Sandholm

TL;DR

This work asks whether, in single-agent settings, AlphaZero's planning-loss training can be outperformed by directly maximizing the episode score. It keeps MCTS and the neural architecture fixed and uses OpenAI-ES, a zeroth-order evolution strategy, to optimize the agent’s parameters with respect to the expected episode return $\mathbb{E}[R_0]$, which sidesteps the nondifferentiability of the search. Across five combinatorial and planning domains (navigation, Sokoban, the traveling salesman problem (TSP), the vertex k-center problem (VKCP), and the maximum diversity problem (MDP)), AlphaZeroES consistently achieves higher episode scores than standard AlphaZero, with only modest or task-dependent changes to the value and policy losses. The results indicate that black-box optimization can train agents that contain nondifferentiable planning components, suggesting a broadly applicable approach for planning-intensive RL problems.
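To make the zeroth-order idea concrete, here is a minimal sketch of an OpenAI-ES-style update in plain Python. It is not the authors' implementation: `toy_score` is a hypothetical stand-in for the episode score of an MCTS agent, the names `es_step` and `run_es` and all hyperparameters are illustrative assumptions, and the use of antithetic (mirrored) perturbations is a common variance-reduction choice that this summary does not confirm for the paper.

```python
import random


def es_step(theta, score_fn, rng, sigma=0.1, lr=0.05, pop=50):
    """One ES ascent step: estimate the score gradient from perturbed evaluations only."""
    grad = [0.0] * len(theta)
    for _ in range(pop):
        # Sample a Gaussian perturbation of the parameters.
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        # Evaluate the score at mirrored points; no derivative of score_fn is needed.
        plus = score_fn([t + sigma * e for t, e in zip(theta, eps)])
        minus = score_fn([t - sigma * e for t, e in zip(theta, eps)])
        # Accumulate the antithetic finite-difference estimate along eps.
        for i, e in enumerate(eps):
            grad[i] += (plus - minus) * e / (2.0 * sigma * pop)
    # Gradient ascent, since the goal is to maximize the score.
    return [t + lr * g for t, g in zip(theta, grad)]


def toy_score(theta):
    # Hypothetical stand-in for an episode score; maximized at (1, -2).
    return -((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2)


def run_es(steps=200, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(steps):
        theta = es_step(theta, toy_score, rng)
    return theta
```

Calling `run_es()` moves the parameters toward the maximizer of `toy_score`. In the setting described above, `score_fn` would instead run full episodes with MCTS and return the observed score, which is why only function evaluations are required.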

Abstract

Planning at execution time has been shown to dramatically improve performance for agents in both single-agent and multi-agent settings. A well-known family of approaches to planning at execution time are AlphaZero and its variants, which use Monte Carlo Tree Search together with a neural network that guides the search by predicting state values and action probabilities. AlphaZero trains these networks by minimizing a planning loss that makes the value prediction match the episode return, and the policy prediction at the root of the search tree match the output of the full tree expansion. AlphaZero has been applied to both single-agent environments (such as Sokoban) and multi-agent environments (such as chess and Go) with great success. In this paper, we explore an intriguing question: In single-agent environments, can we outperform AlphaZero by directly maximizing the episode score instead of minimizing this planning loss, while leaving the MCTS algorithm and neural architecture unchanged? To directly maximize the episode score, we use evolution strategies, a family of algorithms for zeroth-order blackbox optimization. Our experiments indicate that, across multiple environments, directly maximizing the episode score outperforms minimizing the planning loss.
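The planning loss described in the abstract can be sketched as follows. This is an illustration under assumptions, not the paper's exact objective: it uses the standard AlphaZero form (squared error on the value plus cross-entropy between the search policy and the predicted policy), omits any regularization or weighting terms, and the function and argument names are hypothetical.

```python
import math


def planning_loss(value_pred, episode_return, policy_pred, search_policy):
    """Value regression plus policy cross-entropy for a single root state."""
    # Value term: push the predicted value toward the observed episode return.
    value_loss = (value_pred - episode_return) ** 2
    # Policy term: push the predicted action probabilities toward the
    # distribution produced by the tree search at the root.
    policy_loss = -sum(
        target * math.log(prob)
        for target, prob in zip(search_policy, policy_pred)
        if target > 0.0
    )
    return value_loss + policy_loss
```

For example, `planning_loss(0.5, 1.0, [0.5, 0.5], [1.0, 0.0])` combines a value error of 0.25 with a policy cross-entropy of ln 2. AlphaZeroES, as summarized here, replaces minimization of this kind of surrogate with direct maximization of the episode score.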

Paper Structure (16 sections, 5 figures, 1 algorithm)

This paper contains 16 sections, 5 figures, 1 algorithm.

Introduction
Problem formulation
Related research
Proposed method
Planning algorithm
Prediction function
Training procedure
Experiments
Navigation problem
Sokoban
Traveling salesman problem
Vertex k-center problem
Maximum diversity problem
Conclusion
Acknowledgments
...and 1 more section

Figures (5)

Figure 1: Navigation state and metrics.
Figure 2: Sokoban state and metrics.
Figure 3: TSP state and metrics.
Figure 4: VKCP state and metrics.
Figure 5: MDP state and metrics.
