EvoEmpirBench: Dynamic Spatial Reasoning with Agent-ExpVer

Pukun Zhao, Longxiang Wang, Miaowei Wang, Chen Chen, Fanqing Zhou, Haojian Huang

TL;DR

EvoEmpirBench tackles dynamic, partially observable spatial reasoning by combining two interactive benchmarks with a cognitively grounded online-learning workflow (Agent-ExpVer). The framework uses three agents to collect experiences, distill subjective insights, and integrate transferable truths into policy, enabling continual adaptation without offline training. Empirical results across both open-source and proprietary models show consistent performance gains and highlight limitations of current models in dynamic reasoning and memory, while ablations emphasize the critical role of truth management. The work delivers a scalable platform for advancing lifelong learning, memory-based transfer, and adaptive planning in dynamic environments.

Abstract

Most existing spatial reasoning benchmarks focus on static or globally observable environments, failing to capture the challenges of long-horizon reasoning and memory utilization under partial observability and dynamic changes. We introduce two dynamic spatial benchmarks, locally observable maze navigation and match-2 elimination, that systematically evaluate models' abilities in spatial understanding and adaptive planning when local perception, environment feedback, and global objectives are tightly coupled. Each action triggers structural changes in the environment, requiring continuous updating of cognition and strategy. We further propose a subjective experience-based memory mechanism for cross-task experience transfer and validation. Experiments show that our benchmarks reveal key limitations of mainstream models in dynamic spatial reasoning and long-term memory, providing a comprehensive platform for future methodological advances. Our code and data are available at https://anonymous.4open.science/r/EvoEmpirBench-143C/.

Paper Structure

This paper contains 14 sections, 5 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Overview of EvoEmpirBench: Locally observable maze navigation (left) and match-2 elimination (right), illustrating dynamic and interactive challenges for language agents.
  • Figure 2: Workflow of the Agent-ExpVer System. The left side showcases EvoEmpirBench (EEB), our dynamic benchmark. The right side presents the Agent-ExpVer framework, comprising three agents: the GeoLink Agent collects and selects key historical actions based on game highlight metrics; the InsightForce Agent summarizes and validates subjective experiences; the TruthWeaver Agent maintains truths (insert, merge, remove) and passes them back to the GeoLink Agent. The figure highlights the processes of selection, summarization, validation, and maintenance.
  • Figure 3: Performance analysis of Agent-ExpVer on EEB, demonstrating enhanced global long-horizon reasoning through dynamic truth refinement: (a) radar plots of model metrics, (b) A.Score and Suc.Rate trends across learning episodes, (c) distribution of steps for maze navigation before and after learning.
  • Figure 4: Subjective experience guided by adventure
  • Figure 5: Subjective experience of survival orientation
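
The Figure 2 caption names three maintenance operations on truths: insert, merge, and remove. The sketch below illustrates what such a store could look like. It is not the authors' implementation: the `Truth` and `TruthStore` classes, the `support` counter, the string-normalization rule for deciding when two truths are "the same", and the example truth texts are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Truth:
    """One transferable rule distilled from experience (hypothetical structure)."""
    text: str
    support: int = 1  # number of validated experiences backing this rule


@dataclass
class TruthStore:
    """Toy stand-in for the truth set that a TruthWeaver-style agent maintains."""
    truths: Dict[str, Truth] = field(default_factory=dict)

    @staticmethod
    def _key(text: str) -> str:
        # Assumption: two truths match if they are equal after lowercasing
        # and whitespace normalization. A real system would likely need
        # semantic matching instead.
        return " ".join(text.lower().split())

    def insert_or_merge(self, text: str) -> Truth:
        key = self._key(text)
        if key in self.truths:
            self.truths[key].support += 1  # merge: reinforce the existing truth
        else:
            self.truths[key] = Truth(text=text.strip())  # insert: new truth
        return self.truths[key]

    def remove(self, text: str) -> bool:
        # Remove a truth; returns True if it was present.
        return self.truths.pop(self._key(text), None) is not None

    def as_prompt(self) -> str:
        # Render truths, best-supported first, as text that could be handed
        # back to the experience-collecting agent.
        ordered = sorted(self.truths.values(), key=lambda t: -t.support)
        return "\n".join(f"- {t.text}" for t in ordered)


# Usage: two near-identical truths merge, a third is inserted and then removed.
store = TruthStore()
store.insert_or_merge("Dead ends waste steps; backtrack early.")
store.insert_or_merge("dead ends waste steps;  backtrack early.")
store.insert_or_merge("Unseen cells may change after a move.")
removed = store.remove("Unseen cells may change after a move.")
print(store.as_prompt())
```

Keying on normalized text keeps the merge step trivial here; the point is only to show how the three operations keep the truth set compact before it is fed back to the collecting agent.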