OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions

Fangzhi Xu, Hang Yan, Qiushi Sun, Jinyang Wu, Zixian Huang, Muye Huang, Jingyang Gong, Zichen Ding, Kanzhi Cheng, Yian Wang, Xinyu Che, Zeyi Sun, Jian Zhang, Zhangyue Yin, Haoran Luo, Xuanjing Huang, Ben Kao, Jun Liu, Qika Lin

TL;DR

OdysseyArena shifts LLM agent evaluation toward long-horizon, active, and inductive interactions. It formalizes four latent-world primitives and instantiates them as four lightweight environments, then provides OdysseyArena-Lite (120 tasks) and OdysseyArena-Challenge (more than 200 interaction steps) to benchmark inductive efficiency. Across 15+ leading LLMs, frontier models show a persistent inductive bottleneck: they perform well on deductive tasks but fail to autonomously discover latent transition dynamics, and humans still outperform them. The work argues that scalable world-model induction mechanisms are needed for robust, coherent autonomy in complex, dynamic environments, and it releases open-source code and data for ongoing benchmarking.

Abstract

The rapid advancement of Large Language Models (LLMs) has catalyzed the development of autonomous agents capable of navigating complex environments. However, existing evaluations primarily adopt a deductive paradigm, where agents execute tasks based on explicitly provided rules and static goals, often within limited planning horizons. Crucially, this neglects the inductive necessity for agents to discover latent transition laws from experience autonomously, which is the cornerstone for enabling agentic foresight and sustaining strategic coherence. To bridge this gap, we introduce OdysseyArena, which re-centers agent evaluation on long-horizon, active, and inductive interactions. We formalize and instantiate four primitives, translating abstract transition dynamics into concrete interactive environments. Building upon this, we establish OdysseyArena-Lite for standardized benchmarking, providing a set of 120 tasks to measure an agent's inductive efficiency and long-horizon discovery. Pushing further, we introduce OdysseyArena-Challenge to stress-test agent stability across extreme interaction horizons (e.g., > 200 steps). Extensive experiments on 15+ leading LLMs reveal that even frontier models exhibit a deficiency in inductive scenarios, identifying a critical bottleneck in the pursuit of autonomous discovery in complex environments. Our code and data are available at https://github.com/xufangzhi/Odyssey-Arena
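To make the deductive versus inductive distinction concrete, the following is a minimal toy sketch, not the benchmark's actual environment or API. The class `HiddenRuleLights`, the function `run_episode`, and the prerequisite rule are all hypothetical illustrations: the agent is never shown the transition rule and must learn from interaction feedback which actions fail in which states.

```python
from typing import Dict, Set, Tuple

State = Tuple[bool, ...]


class HiddenRuleLights:
    """Hypothetical toy world: light i turns on only if its hidden prerequisites are on."""

    def __init__(self, prerequisites: Dict[int, Set[int]]) -> None:
        self._prereq = prerequisites  # latent rule, never exposed to the agent
        self.state = [False] * len(prerequisites)

    def step(self, action: int) -> Tuple[State, bool]:
        # The press takes effect only when the hidden condition holds.
        if all(self.state[j] for j in self._prereq[action]):
            self.state[action] = True
        return tuple(self.state), all(self.state)


def run_episode(env: HiddenRuleLights, max_steps: int = 50) -> int:
    """Explore, remember failed (state, action) pairs, and return steps used (-1 on failure)."""
    known_failures: Set[Tuple[State, int]] = set()
    state: State = tuple(env.state)
    for t in range(1, max_steps + 1):
        # Only try lights that are off and not already known to fail in this state.
        candidates = [
            a for a, on in enumerate(state)
            if not on and (state, a) not in known_failures
        ]
        if not candidates:
            return -1  # every remaining action is known to fail here
        action = candidates[0]
        next_state, done = env.step(action)
        if next_state == state:
            known_failures.add((state, action))  # experience-derived knowledge
        state = next_state
        if done:
            return t
    return -1


if __name__ == "__main__":
    env = HiddenRuleLights({0: {2}, 1: {0}, 2: set()})
    print(run_episode(env))
```

In a deductive setting the prerequisite map would be given in the prompt and the agent could plan the order directly. Here the agent spends extra steps discovering it, which is the kind of cost an inductive-efficiency measure is meant to capture.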

Paper Structure (106 sections, 5 equations, 13 figures, 7 tables)


Figures (13)

  • Figure 1: Comparison between deductive and inductive settings in multi-turn agentic tasks.
  • Figure 2: Demonstrations of four OdysseyArena environments: Turn On Lights, AI Trading, Energy Dispatch, and Repo System. For clarity, we omit the task prompts here and present only the interaction trajectories. Full prompts are provided in the appendix on experimental settings.
  • Figure 3: Overview of the benchmark architecture, illustrating the environment configuration initialization (left) and the interaction loop between the LLM agent and the environment step logic (right).
  • Figure 4: Success rate comparison of w/ and w/o rules in Turn On Lights. We select Llama 3.3 70B Instruct, GLM-4-32B-0414, Qwen3-235B-A22B-Instruct, DeepSeek-V3.2, Grok 4 Fast, GPT-5, Gemini 3 Pro Preview for illustration.
  • Figure 5: Task success status (based on pass@4) of different tasks in Turn On Lights. Each row represents: (a) Human, (b) Gemini 3 Pro Preview, (c) GPT-5, (d) Gemini 2.5 Pro, (e) gpt-oss-120b (high), (f) DeepSeek-V3.2, (g) Grok 4 Fast, (h) Qwen3-235B-A22B-Instruct, (i) gpt-oss-120b (medium), (j) Qwen3-30B-A3B-Instruct, (k) GLM-4-32B-0414, (l) gpt-oss-120b (low), (m) Llama 3.3 70B Instruct, (n) Qwen3-4B-Instruct, (o) Llama 3.1 8B Instruct, (p) GLM-4-9B-Chat. Dark green cells indicate tasks solved by Human. Green cells indicate tasks solved by LLM agents. Gray cells indicate unsolved tasks. For each subset (Easy, Medium, and Hard), we report the average success rate across all LLMs.
  • ...and 8 more figures