VisEscape: A Benchmark for Evaluating Exploration-driven Decision-making in Virtual Escape Rooms

Seungwon Lim, Sungwoong Kim, Jihwan Yu, Sungjae Lee, Jiwan Chung, Youngjae Yu

TL;DR

VisEscape addresses the challenge of exploration-driven decision-making in dynamic, visually rich environments by introducing a benchmark of 20 virtual escape rooms. The authors demonstrate that state-of-the-art multimodal models struggle to escape without guidance, highlighting the need for memory and reasoning components. They propose a modular agent, VisEscaper, that integrates a Memory Management module and a Reasoning module, yielding significant improvements in success and efficiency and revealing a synergistic interaction between memory and reasoning. The work further analyzes module contributions, compares VLM-based input processing with LLM-driven captioning, and emphasizes the potential of structured exploration and iterative hypothesis testing for complex, open-ended tasks.

Abstract

Escape rooms present a unique cognitive challenge that demands exploration-driven planning: with the sole instruction to 'escape the room', players must actively search their environment, collect information, and find solutions through repeated trial and error. Motivated by this, we introduce VisEscape, a benchmark of 20 virtual escape rooms specifically designed to evaluate AI models under these challenging conditions, where success depends not only on solving isolated puzzles but also on iteratively constructing and refining spatial-temporal knowledge of a dynamically changing environment. On VisEscape, we observe that even state-of-the-art multi-modal models generally fail to escape the rooms, showing considerable variation in their progress and problem-solving approaches. We find that integrating memory management and reasoning contributes to efficient exploration and enables successive hypothesis formulation and testing, thereby leading to significant improvements in dynamic and exploration-driven environments.
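The interplay of exploration, memory, and hypothesis testing described above can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' VisEscaper implementation: `ToyRoom`, `Memory`, `reason`, and `run` are hypothetical names, the room is text-only, and the only hypothesis considered is that a number seen on a note might open the keypad.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyRoom:
    """Hypothetical two-wall room: a note on the south wall holds the door code."""
    code: str = "417"
    facing: str = "north"
    escaped: bool = False

    def observe(self) -> str:
        # Text stand-in for the visual observation an agent would receive.
        if self.facing == "north":
            return "a locked door with a keypad"
        return f"a note that reads {self.code}"

    def step(self, action: str) -> str:
        if action == "turn":
            self.facing = "south" if self.facing == "north" else "north"
            return "you turn around"
        if action.startswith("enter "):
            if self.facing != "north":
                return "there is no keypad here"
            if action.split(" ", 1)[1] == self.code:
                self.escaped = True
                return "the door opens"
            return "wrong code"
        return "nothing happens"


@dataclass
class Memory:
    """Assumed memory store: latest observation per view, plus codes already tried."""
    observations: dict = field(default_factory=dict)
    tried: set = field(default_factory=set)

    def update(self, facing: str, observation: str) -> None:
        self.observations[facing] = observation


def reason(memory: Memory, facing: str) -> str:
    """Form a hypothesis from memory: any untried number seen so far may be the code."""
    candidates = [
        token
        for observation in memory.observations.values()
        for token in observation.split()
        if token.isdigit() and token not in memory.tried
    ]
    if candidates and facing == "north":
        memory.tried.add(candidates[0])
        return f"enter {candidates[0]}"  # test the hypothesis at the keypad
    return "turn"  # otherwise keep exploring


def run(room: ToyRoom, max_steps: int = 10) -> Optional[int]:
    """Return the number of steps taken to escape, or None if the budget runs out."""
    memory = Memory()
    for step in range(1, max_steps + 1):
        memory.update(room.facing, room.observe())
        room.step(reason(memory, room.facing))
        if room.escaped:
            return step
    return None


print(run(ToyRoom()))
```

Without the `Memory` object, the agent would have forgotten the note by the time it faced the keypad again, which is the kind of dependency between exploration and later reasoning that the abstract points to.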

Paper Structure

This paper contains 50 sections, 18 figures, and 23 tables.

Figures (18)

  • Figure 1: Depiction of exploration-driven problem-solving in VisEscape. Agents must (1) actively explore to uncover relevant information and (2) subsequently formulate and test hypotheses through interaction to solve puzzles and escape successfully.
  • Figure 2: An illustration of an excerpt of a trajectory from VisEscape. To escape the room successfully, agents must explore multiple directions and diverse views, and interact with various objects. Additionally, they need to infer associations between two or more scenes in different locations to solve creative puzzles.
  • Figure 3: An illustration of each component in VisEscape.
  • Figure 4: Process of VisEscape construction.
  • Figure 5: An overview of the memory management module and the reasoning module, along with examples of inputs (gray boxes) and outputs (colored boxes) for each module.
  • ...and 13 more figures