Improving Robustness of AlphaZero Algorithms to Test-Time Environment Changes

Isidoro Tamassia, Wendelin Böhmer

TL;DR

AlphaZero’s performance can degrade when the test-time environment diverges from the training environment, because the neural network’s priors become misaligned with the environment the agent actually faces. The authors propose Extra-Deep Planning (EDP), which combines greedy planning (exploration constant $C=0$), tree recycling, and loop blocking to adapt planning under limited budgets without retraining. In MAZE grid-world tests with training/testing mismatches, EDP consistently outperforms standard AlphaZero (AZ) with either UCT or PUCT selection, and ablations show that each component contributes, with loop blocking being particularly crucial. The results demonstrate robust planning under deployment-time shifts and suggest directions for scaling to larger, continuous, and non-stationary environments.
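
For context, the role of $C$ can be read off the PUCT selection rule as it is commonly written for AlphaZero. The paper's exact variant may differ, and the symbols $Q$, $N$, and $\pi_\theta$ below are the usual conventions, assumed here rather than taken from this page.

```latex
a^{*} \;=\; \arg\max_{a}\left[\, Q(s,a) \;+\; C\,\pi_\theta(a \mid s)\,\frac{\sqrt{N(s)}}{1 + N(s,a)} \,\right]
\qquad\xrightarrow{\;C \,=\, 0\;}\qquad
a^{*} \;=\; \arg\max_{a} Q(s,a)
```

With $C=0$ the prior-weighted exploration term vanishes, so selection inside the tree depends only on the current value estimates and no longer on the possibly misaligned policy prior.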

Abstract

The AlphaZero framework provides a standard way of combining Monte Carlo planning with prior knowledge provided by a previously trained policy-value neural network. AlphaZero usually assumes that the environment on which the neural network was trained will not change at test time, which constrains its applicability. In this paper, we analyze the problem of deploying AlphaZero agents in potentially changed test environments and demonstrate how a combination of simple modifications to the standard framework can significantly boost performance, even with a low planning budget. The code is publicly available on GitHub.

Paper Structure

This paper contains 23 sections, 15 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: A single iteration of AZ planning, from left to right: the agent selects a path using a selection policy (e.g., PUCT) until reaching a not-fully-expanded node, expands it through a sampled action, and estimates the value of the newly created node using the value network. Its value is then backpropagated through the path until reaching the root, updating the mean value estimates along the path. This procedure is repeated for $B$ iterations, where $B$ is the planning budget.
  • Figure 2: Example of the tree reuse mechanism. The green node in each tree corresponds to the current state $s_t$ in the real environment at step $t$. After step $1$, we can reuse the right subtree of the previous root node, as the corresponding child is the only one whose state is $s_1$ (the current state in the real environment). The left subtree is discarded (red-crossed nodes in the figure). After step $2$, we could reuse either child of the previous root, but we choose the left one because its subtree is deeper.
  • Figure 3: Example of the loop blocking mechanism. During traversal, we prune the edges of the tree that lead to states already visited along the path.
  • Figure 4: Overview of the MAZE training and test configurations. The green square represents the agent's starting position, the blue squares represent obstacles, and the yellow square represents the goal position. The agent is trained on the MAZE_LR and MAZE_RL configurations and can be tested by moving the holes in the walls into different positions, as shown in the figure.
  • Figure 5: Results of the EDP planning algorithm (red line) on the MAZE grid-world challenges, compared with standard AZ (blue lines). Solid and dashed lines are reported for the AZ+PUCT and AZ+UCT versions, respectively. The displayed metric is the mean discounted return averaged across $10 \times 10$ training/evaluation seeds, and the shaded area represents the standard error across training seeds. The label MAZE_$X$ $\rightarrow$ MAZE_$Y$ on top of each plot indicates training on the MAZE_$X$ configuration and testing on MAZE_$Y$.
  • ...and 5 more figures
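
The mechanisms described in the captions of Figures 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` class, `select_path`, `subtree_depth`, and `recycle_root` are hypothetical names, states are assumed to be hashable and comparable, the score uses the commonly written PUCT form, and the descent is simplified to run until a leaf instead of stopping at the first not-fully-expanded node.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Hashable, List


@dataclass
class Node:
    """Hypothetical search-tree node; not taken from the paper's code."""
    state: Hashable
    prior: float = 1.0      # policy-network prior of the edge into this node
    visits: int = 0
    value_sum: float = 0.0
    children: Dict[int, "Node"] = field(default_factory=dict)

    @property
    def q(self) -> float:
        # Mean value estimate; 0.0 for unvisited nodes (an assumption).
        return self.value_sum / self.visits if self.visits else 0.0


def puct_score(parent: Node, child: Node, c: float) -> float:
    # Commonly written PUCT form; with c == 0 this is just the child's Q.
    return child.q + c * child.prior * math.sqrt(parent.visits) / (1 + child.visits)


def select_path(root: Node, c: float = 0.0) -> List[Node]:
    """Descend from the root, never entering a state already on the path."""
    path, seen, node = [root], {root.state}, root
    while node.children:
        # Loop blocking: drop edges that lead back to an already visited state.
        allowed = {a: ch for a, ch in node.children.items() if ch.state not in seen}
        if not allowed:
            break  # every remaining edge would close a loop
        best = max(allowed, key=lambda a: puct_score(node, allowed[a], c))
        node = allowed[best]
        seen.add(node.state)
        path.append(node)
    return path


def subtree_depth(node: Node) -> int:
    return 1 + max((subtree_depth(ch) for ch in node.children.values()), default=0)


def recycle_root(old_root: Node, current_state: Hashable) -> Node:
    """Reuse the deepest child subtree matching the real environment state."""
    matches = [ch for ch in old_root.children.values() if ch.state == current_state]
    if not matches:
        return Node(state=current_state)  # nothing to reuse: start a fresh tree
    return max(matches, key=subtree_depth)
```

Calling `select_path(root, c=0.0)` gives the greedy variant, and calling `recycle_root(old_root, observed_state)` after each real step keeps the matching subtree while discarding its siblings. Choosing the deepest matching subtree mirrors the tie-breaking described in the Figure 2 caption.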