EscapeBench: Towards Advancing Creative Intelligence of Language Model Agents

Cheng Qian, Peixuan Han, Qinyu Luo, Bingxiang He, Xiusi Chen, Yuji Zhang, Hongyi Du, Jiarui Yao, Xiaocheng Yang, Denghui Zhang, Yunzhu Li, Heng Ji

TL;DR

EscapeBench is a public benchmark for evaluating the creative intelligence of language-model agents in room-escape environments, and it exposes substantial gaps in current LM creativity. The authors propose EscapeAgent, a framework that adds Foresight (creative tool use) and Reflection (implicit goal identification) on top of a BaseAgent, sustaining action chains of over 1,000 steps and solving puzzles more efficiently and inventively. Across 36 settings and multiple model families, EscapeAgent reduces hint reliance by roughly 50% and cuts total steps while maintaining coherence, though humans still outperform AI in creativity. The work lays a foundation for measuring and improving AI creativity, and it points to multimodal perception and reinforcement learning as possible extensions for strengthening creative reasoning and human–AI collaboration.

Abstract

Language model agents excel in long-session planning and reasoning, but existing benchmarks primarily focus on goal-oriented tasks with explicit objectives, neglecting creative adaptation in unfamiliar environments. To address this, we introduce EscapeBench, a benchmark suite of room escape game environments designed to challenge agents with creative reasoning, unconventional tool use, and iterative problem-solving to uncover implicit goals. Our results show that current language models, despite employing working memory and Chain-of-Thought reasoning, achieve only 15% average progress without hints, highlighting their limitations in creativity. To bridge this gap, we propose EscapeAgent, a framework designed to enhance creative reasoning through Foresight (innovative tool use) and Reflection (identifying unsolved tasks). Experiments show that EscapeAgent can execute action chains over 1,000 steps while maintaining logical coherence. It navigates and completes games with up to 40% fewer steps and hints, performs robustly across difficulty levels, and achieves higher action success rates with more efficient and innovative puzzle-solving strategies.

Paper Structure

This paper contains 53 sections, 15 figures, 7 tables.

Figures (15)

  • Figure 1: An agent with creative thinking should adapt its observation (e.g. hard texture of wood stick) into a novel tool-use strategy (e.g. prying objects open).
  • Figure 2: An illustration of Scenes, Tools, and Items in the game and their relations with agent action space. Tools can be collected for "Apply" and "Craft", while items require "Input", "Click" or "Apply" of tools to trigger effects.
  • Figure 3: Statistics of total Scenes, Tools, and Items across all game settings. "Key Steps" refer to the essential bottleneck actions required to complete the game.
  • Figure 4: Illustration of the EscapeAgent design. Building on the BaseAgent (Action), we integrate the Foresight and Reflection modules to enhance the agent's capabilities in creative reasoning and implicit goal identification.
  • Figure 5: Distribution of Key Steps Hints Used, categorized by different actions. Colored bars represent the percentage of hints used for each action type relative to the total key steps for that type (see right of \ref{tab:game_statistic}).
  • ...and 10 more figures
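
The per-action percentage described in the Figure 5 caption (hints used for an action type, relative to the total key steps of that type) can be sketched as follows. This is only an illustration of the ratio as the caption words it: the function name `hint_usage_by_action` and the sample counts are hypothetical and are not taken from the paper or its code.

```python
from collections import Counter


def hint_usage_by_action(key_steps, hinted_steps):
    """Return, per action type, the percentage of key steps that needed a hint.

    key_steps:    one action-type string per key step in a game (assumed input format).
    hinted_steps: one action-type string per key step for which a hint was used.
    """
    totals = Counter(key_steps)
    hinted = Counter(hinted_steps)
    # Counter returns 0 for action types that never needed a hint.
    return {
        action: 100.0 * hinted[action] / total
        for action, total in totals.items()
    }


# Made-up counts, using the action names from the Figure 2 caption.
key_steps = ["Apply"] * 4 + ["Craft"] * 2 + ["Input"] * 2 + ["Click"] * 2
hinted_steps = ["Apply", "Apply", "Craft", "Input"]

usage = hint_usage_by_action(key_steps, hinted_steps)
for action, percent in usage.items():
    print(f"{action}: {percent:.1f}% of key steps needed a hint")
```

Normalizing by the number of key steps of each type, rather than by the total number of hints, keeps rarely required actions such as "Craft" comparable with frequent ones such as "Apply".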