AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning

Jiayi Zhang, Yiran Peng, Fanqi Kong, Cheng Yang, Yifan Wu, Zhaoyang Yu, Jinyu Xiang, Jianhao Ruan, Jinlin Wang, Maojia Song, HongZhang Liu, Xiangru Tang, Bang Liu, Chenglin Wu, Yuyu Luo

TL;DR

AutoEnv addresses the scarcity of cross-environment benchmarks by automatically generating heterogeneous environments and by providing a unified framework for studying agent learning across rule distributions. It introduces AutoEnv-36, a 36-environment benchmark with 358 validated levels, and formalizes agent learning as a component-centric process with Selection, Optimization, and Evaluation stages, plus a Learning Upper Bound. Experiments show that fixed learning methods scale poorly across diverse environments, while environment-adaptive selection recovers part of the upper bound but leaves room for richer strategy spaces. The work provides a scalable testbed and methodology for advancing robust cross-environment generalization in intelligent agents.

Abstract

Humans naturally adapt to diverse environments by learning underlying rules across worlds with different dynamics, observations, and reward structures. In contrast, existing agents typically demonstrate improvements via self-evolving within a single domain, implicitly assuming a fixed environment distribution. Cross-environment learning has remained largely unmeasured: there is no standard collection of controllable, heterogeneous environments, nor a unified way to represent how agents learn. We address these gaps in two steps. First, we propose AutoEnv, an automated framework that treats environments as factorizable distributions over transitions, observations, and rewards, enabling low-cost (4.12 USD on average) generation of heterogeneous worlds. Using AutoEnv, we construct AutoEnv-36, a dataset of 36 environments with 358 validated levels, on which seven language models achieve 12-49% normalized reward, demonstrating the challenge of AutoEnv-36. Second, we formalize agent learning as a component-centric process driven by three stages of Selection, Optimization, and Evaluation applied to an improvable agent component. Using this formulation, we design eight learning methods and evaluate them on AutoEnv-36. Empirically, the gain of any single learning method quickly decreases as the number of environments increases, revealing that fixed learning methods do not scale across heterogeneous environments. Environment-adaptive selection of learning methods substantially improves performance but exhibits diminishing returns as the method space expands. These results highlight both the necessity and the current limitations of agent learning for scalable cross-environment generalization, and position AutoEnv and AutoEnv-36 as a testbed for studying cross-environment agent learning. The code is available at https://github.com/FoundationAgents/AutoEnv.
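
To make the three-stage formulation in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the improvable component is a prompt string, that Selection picks the highest-scoring candidate so far, and that `toy_evaluate` and `toy_optimize` are hypothetical stand-ins for running the agent in an environment and for an LLM-driven rewrite step.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    component: str  # the improvable agent component (assumed here to be a prompt)
    score: float    # result of the Evaluation stage


def learn(initial: str,
          evaluate: Callable[[str], float],
          optimize: Callable[[str], str],
          rounds: int = 3) -> Candidate:
    """Run a Selection -> Optimization -> Evaluation loop over a candidate pool."""
    pool: List[Candidate] = [Candidate(initial, evaluate(initial))]
    for _ in range(rounds):
        # Selection: pick the best candidate so far (greedy choice is an assumption)
        parent = max(pool, key=lambda c: c.score)
        # Optimization: derive a new component from the selected one
        child = optimize(parent.component)
        # Evaluation: score the new component and add it to the pool
        pool.append(Candidate(child, evaluate(child)))
    return max(pool, key=lambda c: c.score)


# Hypothetical stand-ins, used only to make the sketch runnable.
def toy_evaluate(prompt: str) -> float:
    return min(len(prompt), 20) / 20


def toy_optimize(prompt: str) -> str:
    return prompt + " hint"


best = learn("act", toy_evaluate, toy_optimize, rounds=3)
print(best.component, best.score)
```

In this framing, different learning methods would correspond to different choices for the selection rule, the optimization step, and the component being improved; the loop structure itself stays the same.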

Paper Structure

This paper contains 57 sections, 5 figures, 10 tables, 2 algorithms.

Figures (5)

  • Figure 1: Conceptual comparison between learning in a single environment (left) and cross-environment learning (right). In the single-environment case, trajectories from one environment are used to update the agent's internal components (model and agent code). In the cross-environment case, trajectories collected across many environments update both the agent components and a shared learning procedure, so that the agent learns how to learn from diverse environments.
  • Figure 2: Overview of the AutoEnv environment generation pipeline. The left panel shows an input environment theme and its DSL design in YAML form. The middle panel illustrates how coding agents instantiate basic code from the DSL, then run a self-repair loop followed by three verification stages: execution testing, level generation, and reliability checking with differential models. The right panel presents the core code structure for the three abstraction layers, and an example SkinEnv that renders both text descriptions and an image view.
  • Figure 3: Impact of environment diversity on learning performance across 36 environments. Left: The performance gain of individual learning methods decreases as more environments are included (top-k by gain). Right: Expanding the learning method space (M=1 to M=4) improves the upper bound with diminishing marginal returns. For each M, we evaluate all $\binom{4}{M}$ method subsets by taking the per-environment maximum over methods in each subset, averaging across environments, then averaging over subsets. This shows the performance upper bound achievable with M method choices.
  • Figure A1: Generated multimodal skins for a subset of the environments in AutoEnv-36.
  • Figure A2: Generated multimodal skins for the same base environment, showing how rules can be decoupled from observations.
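
The subset-averaged upper bound described in the Figure 3 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the function name, the method labels, and the score table are hypothetical, and the example uses three environments rather than 36.

```python
from itertools import combinations
from statistics import mean


def upper_bound_for_size(scores, m):
    """Average, over all size-m method subsets, of the mean per-environment best score.

    scores: dict mapping method name -> list of per-environment scores
            (all lists must have the same length).
    """
    methods = sorted(scores)
    n_envs = len(scores[methods[0]])
    subset_means = []
    for subset in combinations(methods, m):
        # Per-environment maximum over the methods in this subset
        per_env_best = [max(scores[name][e] for name in subset) for e in range(n_envs)]
        # Average across environments
        subset_means.append(mean(per_env_best))
    # Average over all subsets of size m
    return mean(subset_means)


# Hypothetical scores for four methods on three environments.
scores = {
    "A": [0.2, 0.6, 0.4],
    "B": [0.5, 0.1, 0.4],
    "C": [0.3, 0.3, 0.9],
    "D": [0.1, 0.7, 0.2],
}

for m in range(1, 5):
    print(m, round(upper_bound_for_size(scores, m), 4))
```

With M=1 the value reduces to the average score of a single fixed method, and with M equal to the total number of methods it is the mean of the per-environment best, so the curve is non-decreasing in M by construction.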