EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies

Xavier Hu, Jinxiang Xia, Shengze Xu, Kangqi Song, Yishuo Yuan, Guibin Zhang, Jincheng Ren, Boyu Feng, Li Lu, Tieyong Zeng, Jiaheng Liu, Minghao Liu, Yuchen Elenor Jiang, Wei Wang, He Zhu, Wangchunshu Zhou

TL;DR

EcoGym introduces an open, generalizable benchmark for long-horizon plan-and-execute decision making in interactive economies, featuring three environments (Vending, Freelance, Operation) and an effectively unbounded horizon of 1000+ steps. The framework grounds evaluation in economic outcomes (net worth, income, DAU) and emphasizes latent mechanics to foster exploratory discovery. Across 11 LLMs, no single model dominates all three environments, and models are suboptimal in either high-level strategy or action execution, while diagnostics show benefits from memory modules and explicit thinking. The results underscore the challenge of sustaining strategic coherence over long horizons and position EcoGym as a transparent, community-driven tool for studying controllability-utility trade-offs in realistic economic settings.

Abstract

Long-horizon planning is widely recognized as a core capability of autonomous LLM-based agents; however, current evaluation frameworks are largely episodic, domain-specific, or insufficiently grounded in persistent economic dynamics. We introduce EcoGym, a generalizable benchmark for continuous plan-and-execute decision making in interactive economies. EcoGym comprises three diverse environments (Vending, Freelance, and Operation), implemented as a unified decision-making process with standardized interfaces and budgeted actions over an effectively unbounded horizon (1000+ steps when evaluated over 365 day-loops). Evaluation is based on business-relevant outcomes (e.g., net worth, income, and DAU) and targets long-term strategic coherence and robustness under partial observability and stochasticity. Experiments across eleven leading LLMs expose a systematic tension: no single model dominates across all three scenarios. Critically, we find that models exhibit significant suboptimality in either high-level strategy or efficient action execution. EcoGym is released as an open, extensible testbed for transparent long-horizon agent evaluation and for studying controllability-utility trade-offs in realistic economic settings.
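
To make the "day-loop with budgeted actions" idea concrete, the following is a minimal Python sketch, not EcoGym's actual interface. The class `ToyEconomyEnv`, its methods, the payoff rule, and the per-day budget of 3 actions are all hypothetical assumptions for illustration. A budget of 3 is chosen only because 365 × 3 = 1095 is consistent with the "1000+ steps" figure; the abstract does not state the real budget.

```python
import random


class ToyEconomyEnv:
    """Hypothetical stand-in for an EcoGym-style environment (not the real API)."""

    def __init__(self, net_worth=100.0, actions_per_day=3, seed=0):
        self.net_worth = net_worth
        self.actions_per_day = actions_per_day  # assumed per-day action budget
        self.day = 0
        self.rng = random.Random(seed)          # source of stochasticity

    def observe(self):
        # Partial observability: the agent sees outcomes, not the payoff model.
        return {"day": self.day, "net_worth": self.net_worth}

    def step(self, action):
        # Toy stochastic payoff: "invest" is risky, anything else is a no-op.
        if action == "invest":
            self.net_worth += self.rng.uniform(-5.0, 8.0)
        return self.observe()

    def end_day(self):
        # Charge a fixed daily cost; report whether the agent is still solvent.
        self.day += 1
        self.net_worth -= 1.0
        return self.net_worth > 0


def run_episode(env, policy, num_days=365):
    """Run up to num_days day-loops; stop early if a failure condition triggers."""
    steps, history = 0, []
    for _ in range(num_days):
        obs = env.observe()
        for _ in range(env.actions_per_day):  # spend the daily action budget
            obs = env.step(policy(obs))
            steps += 1
        alive = env.end_day()
        history.append(env.net_worth)         # daily outcome metric
        if not alive:
            break                              # truncated trajectory
    return steps, history
```

In this sketch, a policy is any callable that maps an observation to an action string, for example `run_episode(ToyEconomyEnv(net_worth=1000.0), lambda obs: "hold")`. The early `break` mirrors the truncated curves that the Figure 1 caption attributes to agents triggering failure conditions.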

Paper Structure (37 sections, 20 equations, 19 figures, 9 tables)

Figures (19)

  • Figure 1: Long-horizon performance across the three environments in EcoGym. The plots illustrate the daily progression of key metrics: Net Worth in Vending (left), Income in Freelance (middle), and DAU in Operation (right). Note: Truncated lines represent agents that failed to survive the full simulation horizon because they triggered failure conditions. Only the top-performing models are shown for clarity; full experimental results are available in Table \ref{tab:main_results}.
  • Figure 2: Design principles (upper left), the three environments in EcoGym (lower left), and a detailed description of the Vending environment (right). Golden leader lines mark how the designs reflect the principles.
  • Figure 3: Stochastic stability analysis of Gemini-3-Pro on Vending (left), Freelance (middle) and Operation (right).
  • Figure 4: Impact of context window length on Operation environment. We compare Gemini-3-Flash and Gemini-3-Pro across lengths from 32 to 1024.
  • Figure 5: Temporal evolution of action frequencies for Gemini-3-Pro in Vending (left), Freelance (middle) and Operation (right).
  • ...and 14 more figures