PUZZLES: A Benchmark for Neural Algorithmic Reasoning

Benjamin Estermann; Luca A. Lanzendörfer; Yannick Niedermayr; Roger Wattenhofer

TL;DR

PUZZLES is a scalable RL benchmark derived from Simon Tatham's Portable Puzzle Collection, designed to probe neural algorithmic reasoning. It provides 40 puzzles with adjustable size and difficulty, two observation modalities (discrete internal game state or RGB pixels), a discrete action space, action masking, and early termination options, all within a Gymnasium-compatible environment. Among the baselines (PPO, TRPO, A2C, DQN, QRDQN, MuZero, DreamerV3), DreamerV3 performs best on average, yet many puzzles remain challenging: in the easiest setting its successful episodes average 1334 steps against an optimal upper bound of 217, and Loopy, Pearl, Pegs, Solo, and Unruly were intractable for all algorithms. The results highlight the importance of reward design, input representation, and inductive biases (e.g., Transformer-based encoders, potentially GNNs) for learning algorithmic reasoning, and they establish PUZZLES as a standardized platform for future research in this area.
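To make the action-masking idea concrete, the following is a minimal sketch, assuming a mask that hides every move which would leave the current state unchanged (the behavior described in the Figure 4 caption). The toy cursor puzzle and the names `action_mask`, `toy_step`, and `masked_random_action` are hypothetical illustrations and are not part of the rlp package or its API.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical toy puzzle (NOT from the rlp package): a cursor on a 1 x 3 board.
# The state is the cursor column, an integer in {0, 1, 2}.
ACTIONS = ("left", "right", "stay")


def toy_step(state: int, action: str) -> int:
    """Deterministic transition: moves are clamped at the board edges."""
    if action == "left":
        return max(state - 1, 0)
    if action == "right":
        return min(state + 1, 2)
    return state  # "stay" never changes the state


def action_mask(
    state: int,
    actions: Sequence[str],
    step_fn: Callable[[int, str], int],
) -> List[bool]:
    """Return one flag per action: True only if the action changes the state."""
    return [step_fn(state, action) != state for action in actions]


def masked_random_action(state: int, rng: random.Random) -> str:
    """Sample uniformly among the actions that the mask allows."""
    mask = action_mask(state, ACTIONS, toy_step)
    valid = [action for action, allowed in zip(ACTIONS, mask) if allowed]
    return rng.choice(valid)


print(action_mask(0, ACTIONS, toy_step))  # → [False, True, False]
print(action_mask(1, ACTIONS, toy_step))  # → [True, True, False]
```

In this sketch the mask is computed by simulating each action, which is only practical when the transition function is cheap and deterministic; an agent that samples through `masked_random_action` never wastes a step on a no-op move.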

Abstract

Algorithmic reasoning is a fundamental cognitive ability that plays a pivotal role in problem-solving and decision-making processes. Reinforcement Learning (RL) has demonstrated remarkable proficiency in tasks such as motor control, handling perceptual input, and managing stochastic environments. These advancements have been enabled in part by the availability of benchmarks. In this work we introduce PUZZLES, a benchmark based on Simon Tatham's Portable Puzzle Collection, aimed at fostering progress in algorithmic and logical reasoning in RL. PUZZLES contains 40 diverse logic puzzles of adjustable sizes and varying levels of complexity; many puzzles also feature a diverse set of additional configuration parameters. The 40 puzzles provide detailed information on the strengths and generalization capabilities of RL agents. Furthermore, we evaluate various RL algorithms on PUZZLES, providing baseline comparisons and demonstrating the potential for future research. All the software, including the environment, is available at https://github.com/ETH-DISCO/rlp.

Paper Structure (42 sections, 44 figures, 11 tables)

Contributions.
Related Work
RL benchmarks.
Logical and algorithmic reasoning within RL.
Reasoning benchmarks.
The PUZZLES Environment
Environment Overview
Difficulty Progression and Generalization
Empirical Evaluation
Baseline Experiments
Difficulty
Effect of Action Masking and Observation Representation
Effect of Episode Length and Early Termination
Generalization
Discussion
...and 27 more sections

Figures (44)

Figure 1: All puzzle classes of Simon Tatham's Portable Puzzle Collection.
Figure 2: Code and library landscape around the PUZZLES Environment, made up of the rlp Package and the puzzle Module. The figure shows how the puzzle Module presented in this paper fits within Tatham's Puzzle Collection code, the Pygame package, and a user's Gymnasium reinforcement learning code. The different parts are also categorized by implementation language (Python or C).
Figure 3: Average episode length of successful episodes for all evaluated algorithms on all puzzles in the easiest setting (lower is better). Some puzzles, namely Loopy, Pearl, Pegs, Solo, and Unruly, were intractable for all algorithms and were therefore excluded from this aggregation. The standard deviation is computed with respect to the performance over all evaluated instances for all trained seeds, aggregated for the total number of puzzles. Optimal refers to the upper bound on the performance of an optimal policy; it therefore has no standard deviation. DreamerV3 performs best, with an average episode length of 1334 steps. However, this is still far worse than the optimal upper bound, which averages 217 steps.
Figure 4: (left) We demonstrate the effect of action masking in both RGB observation and internal game state. By masking moves that do not change the current state, the agent requires fewer actions to explore, and therefore, on average solves a puzzle using fewer steps. (right) Moving average episode length during training for the Flood puzzle. Lower episode length is better, as the episode gets terminated as soon as the agent has solved a puzzle. Different colors describe different algorithms, where different shades of a color indicate different random seeds. Sparse dots indicate that an agent only occasionally managed to find a policy that solves a puzzle. It can be seen that both the use of discrete internal state observations and action masking have a positive effect on the training, leading to faster convergence and a stronger overall performance.
Figure 5: Black Box: Find the hidden balls in the box by bouncing laser beams off them.
...and 39 more figures
