AMaze: An intuitive benchmark generator for fast prototyping of generalizable agents

Kevin Godin-Dubois, Karine Miras, Anna V. Kononova

TL;DR

AMaze is a controllable benchmark generator in which embodied agents navigate 2D mazes with deceptive visual cues, enabling fast prototyping and human-in-the-loop experimentation. It introduces two maze-complexity metrics, $S(M)$ and $D(M)$, and evaluates three training regimes (direct, interpolation, and interactive EDHuCAT) with A2C and PPO, finding that PPO generally excels in dynamic settings and that EDHuCAT combined with PPO delivers the strongest generalization. Depending on the generalization metric, training regime, and algorithm, the median gain in generalization performance ranges from 50% to 100%, and interactive training achieves the best outcomes, which demonstrates the value of human-in-the-loop control for benchmark design. Overall, AMaze offers a scalable, interpretable platform for probing generalization and perception in RL and embodied AI, combining fast, low-cost prototyping with insights into how training regimes and human guidance shape robust, transferable policies.

Abstract

Traditional approaches to training agents have generally involved a single, deterministic environment of minimal complexity to solve various tasks such as robot locomotion or computer vision. However, agents trained in static environments lack generalization capabilities, limiting their potential in broader scenarios. Thus, recent benchmarks frequently rely on multiple environments, for instance, by providing stochastic noise, simple permutations, or altogether different settings. In practice, such collections result mainly from costly human-designed processes or the liberal use of random number generators. In this work, we introduce AMaze, a novel benchmark generator in which embodied agents must navigate a maze by interpreting visual signs of arbitrary complexities and deceptiveness. This generator promotes human interaction through the easy generation of feature-specific mazes and an intuitive understanding of the resulting agents' strategies. As a proof-of-concept, we demonstrate the capabilities of the generator in a simple, fully discrete case with limited deceptiveness. Agents were trained under three different regimes (one-shot, scaffolding, interactive), and the results showed that the latter two cases outperform direct training in terms of generalization capabilities. Indeed, depending on the combination of generalization metric, training regime, and algorithm, the median gain ranged from 50% to 100% and maximal performance was achieved through interactive training, thereby demonstrating the benefits of a controllable human-in-the-loop benchmark generator.

Paper Structure

This paper contains 17 sections, 3 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Generic maze example. Agents start in one corner and must reach the opposite corner. Corridors can be empty or contain easily identifiable misleading signs (lures). Signs placed on intersections may or may not be trustworthy, depending on whether they are a clue or a trap, respectively.
  • Figure 2:
  • Figure 3: Mazes used for generalization evaluation. The first three columns correspond to different maze classes, while the last three all include traps but with different frequencies (1, 3, 16). Each row corresponds to the minimal, median, and maximal complexity of mazes obtained from a random sample of size 10000.
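
The selection described in the Figure 3 caption (minimal, median, and maximal complexity from a random sample) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `sample_mazes`, the dictionary fields, and `toy_complexity` are hypothetical stand-ins, and the score is not the paper's $S(M)$ or $D(M)$, whose definitions are not given in this overview.

```python
import random


def sample_mazes(n, seed=0):
    """Draw n hypothetical maze descriptions (illustrative fields only)."""
    rng = random.Random(seed)
    return [
        {"seed": i, "size": rng.randint(5, 20), "traps": rng.randint(0, 16)}
        for i in range(n)
    ]


def toy_complexity(maze):
    """Placeholder score; NOT the paper's S(M) or D(M)."""
    return maze["size"] + 2 * maze["traps"]


def pick_min_median_max(items, score):
    """Return the items with the lowest, median, and highest score.

    For an even number of items, the upper of the two middle items is used.
    """
    ranked = sorted(items, key=score)
    return ranked[0], ranked[len(ranked) // 2], ranked[-1]


# Sample size taken from the caption; the maze contents are made up.
mazes = sample_mazes(10000)
simplest, median, hardest = pick_min_median_max(mazes, toy_complexity)
```

Swapping `toy_complexity` for a real complexity metric would leave the selection logic unchanged, since `pick_min_median_max` only relies on the ordering induced by the score.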