TALES: Text Adventure Learning Environment Suite
Christopher Zhang Cui, Xingdi Yuan, Ziang Xiao, Prithviraj Ammanabrolu, Marc-Alexandre Côté
TL;DR
TALES addresses the challenge of evaluating large language models on long-horizon, environment-grounded reasoning by unifying multiple text-adventure platforms into a single benchmark. It identifies four core reasoning capabilities (spatial, deductive, inductive, and grounded) and, following the Evidence-Centered Benchmark Design (ECBD) framework, uses minimal scaffolding to probe compositional reasoning across 122 games. The authors report zero-shot results for 34 models, showing that, despite strong performance on synthetic games, even the top models fail to make meaningful progress in human-designed games within a 100-step limit. The work analyzes failure modes and emphasizes the importance of long-context feedback and grounding. The TALES suite, released with open code and visualizations, supports reproducible evaluation of long-horizon reasoning in text-adventure environments.
Abstract
Reasoning is an essential skill that enables Large Language Models (LLMs) to interact with the world. As tasks become more complex, sequential decision-making demands increasingly sophisticated and varied reasoning capabilities, requiring structured reasoning over the context history to determine the next best action. We introduce TALES, a diverse collection of synthetic and human-written text-adventure games designed to challenge and evaluate these capabilities. We present results for a range of open- and closed-weight LLMs and perform a qualitative analysis of the top-performing models. Despite an impressive showing on synthetic games, even the top LLM-driven agents fail to achieve a score of 15% on games designed for human enjoyment. Code and visualizations of the experiments can be found at https://microsoft.github.io/tale-suite.
