TALES: Text Adventure Learning Environment Suite

Christopher Zhang Cui, Xingdi Yuan, Ziang Xiao, Prithviraj Ammanabrolu, Marc-Alexandre Côté

TL;DR

TALES addresses the challenge of evaluating large language models on long-horizon, environment-grounded reasoning by unifying multiple text-adventure platforms into a single benchmark. It identifies four core reasoning capabilities—spatial, deductive, inductive, and grounded—and designs minimal scaffolding under the ECBD framework to probe compositional reasoning across 122 games. The authors report zero-shot results for 34 models, revealing that, despite strong performance on synthetic tasks, top models still fail to progress meaningfully in human-designed games within a 100-step limit. The work provides insights into failure modes, emphasizes the importance of long-context feedback and grounding, and offers a practical benchmark for advancing grounded-language agents. The TALES suite, along with open code and visualizations, enables rigorous, reproducible evaluation of long-horizon reasoning in text-adventure environments.

Abstract

Reasoning is an essential skill to enable Large Language Models (LLMs) to interact with the world. As tasks become more complex, they demand increasingly sophisticated and diverse reasoning capabilities for sequential decision-making, requiring structured reasoning over the context history to determine the next best action. We introduce TALES, a diverse collection of synthetic and human-written text-adventure games designed to challenge and evaluate diverse reasoning capabilities. We present results over a range of open- and closed-weight LLMs and perform a qualitative analysis of the top-performing models. Despite an impressive showing on synthetic games, even the top LLM-driven agents fail to achieve 15% on games designed for human enjoyment. Code and visualization of the experiments can be found at https://microsoft.github.io/tale-suite.

Paper Structure

This paper contains 33 sections, 2 figures, 12 tables.

Figures (2)

  • Figure 1: Example of a gameplay trajectory presenting the conversation between the game engine and an agent. We additionally fabricate the agent's reasoning to demonstrate the reasoning types this work concerns, detailed in Section \ref{sec:reasoning}. Here, the agent made a mistake in its inductive reasoning, which further caused the generation of a sub-optimal action.
  • Figure 2: Max normalized score per step for the hardest mode of Simon Says, ALFWorld, ScienceWorld, and Zork1 for top LLMs. Error bars represent the standard deviation of scores across 5 different seeds. We see that outside of one notable exception (o1), all selected LLMs achieve near the maximum score for the most difficult version of the Simon Says game. However, despite increasing performance in synthetic, training environments, LLMs still struggle immensely with the human-written Zork1.
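The quantity plotted in Figure 2 can be illustrated with a minimal sketch. It rests on assumptions that the caption does not confirm: that "max normalized score per step" means the highest score reached up to each step, divided by the game's maximum score, and that the error bars are the population standard deviation across seeds. The function name and the toy scores are hypothetical and are not taken from the TALES code.

```python
import numpy as np


def max_normalized_score_per_step(scores, max_score):
    """Return (mean, std) across seeds of the running-max normalized score.

    scores: array-like of shape (n_seeds, n_steps), raw game score after
            each step (hypothetical layout, assumed for illustration).
    max_score: maximum achievable score for the game.
    """
    scores = np.asarray(scores, dtype=float)
    normalized = scores / max_score
    # Best normalized score reached so far at each step, per seed.
    running_max = np.maximum.accumulate(normalized, axis=1)
    # Aggregate across seeds (axis 0); ddof=0 is an assumption.
    return running_max.mean(axis=0), running_max.std(axis=0)


# Toy data: 2 seeds, 4 steps, game with a maximum score of 20.
toy_scores = [[0, 10, 5, 20],
              [0, 0, 10, 10]]
mean, std = max_normalized_score_per_step(toy_scores, max_score=20)
print(mean)
print(std)
```

With five seeds, as in the caption, `scores` would have five rows; the per-step `mean` gives the curve and `std` gives the error bars.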