FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

Jaewoo Ahn, Junseo Kim, Heeseung Yun, Jaehyeon Son, Dongmin Park, Jaewoong Cho, Gunhee Kim

TL;DR

FlashAdventure presents 34 Flash-based adventure games to evaluate GUI agents on full narrative arcs, addressing long-horizon planning and the observation-behavior gap. It introduces CUA-as-a-Judge for automatic milestone validation and COAST, a clue-oriented framework that maintains long-term memory of clues to guide a Seek-Map-Solve planning cycle. Empirical results show that current GUI agents struggle to complete full story arcs, with COAST delivering the best improvements in milestone completion and success rates relative to baselines, though humans still outperform agents by a wide margin. The benchmark, evaluation protocol, and memory-augmented reasoning framework collectively establish a practical, diverse platform for advancing long-horizon GUI reasoning in game-like environments.

Abstract

GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and tackle the observation-behavior gap: the challenge of remembering and acting on earlier gameplay information. We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy between humans and best-performing agents warrants continued research efforts to narrow this divide.
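
To make the observation-behavior gap concrete, the following is a minimal sketch of how a long-term clue memory could feed a Seek-Map-Solve cycle. It is an illustration under stated assumptions, not the paper's implementation: the names `ClueMemory`, `seek`, `map_clues`, and `solve` are hypothetical, the obstacle-to-clue lookup is a stand-in for whatever matching COAST actually performs, and every solve attempt is assumed to succeed.

```python
from dataclasses import dataclass, field


@dataclass
class ClueMemory:
    """Hypothetical long-term store: clues persist for the whole episode."""
    clues: dict = field(default_factory=dict)  # clue name -> note on where it was seen

    def add(self, name, note):
        self.clues[name] = note


def seek(observations, memory):
    """Seek: record every clue observed while exploring."""
    for name, note in observations:
        memory.add(name, note)


def map_clues(memory, obstacles):
    """Map: pair each obstacle with the clue it needs, if that clue is remembered.

    `obstacles` maps obstacle name -> required clue name (an assumed format).
    """
    return [(obstacle, clue) for obstacle, clue in obstacles.items()
            if clue in memory.clues]


def solve(pairs, solved):
    """Solve: attempt each mapped pair; this sketch assumes every attempt succeeds."""
    for obstacle, _clue in pairs:
        solved.add(obstacle)


# Usage: a clue seen early is still available when its obstacle is mapped later.
memory, solved = ClueMemory(), set()
obstacles = {"locked drawer": "brass key", "safe": "door code"}

seek([("brass key", "under the rug")], memory)
solve(map_clues(memory, obstacles), solved)    # only the drawer is solvable so far

seek([("door code", "written on a note")], memory)
solve(map_clues(memory, obstacles), solved)    # the safe becomes solvable
```

The point of the design is that `memory` outlives any single cycle, so information gathered early in play can still be acted on many steps later.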

Paper Structure

This paper contains 38 sections, 11 figures, 18 tables, 1 algorithm.

Figures (11)

  • Figure 1: FlashAdventure consists of 34 Flash-based classic adventure games and supports automatic evaluation of the GUI agent using CUA-as-a-Judge.
  • Figure 2: Comparison of gameplay progression across (a) VisEscape (Lim et al., 2025), (b) Cradle (Tan et al., 2024), and (c) FlashAdventure. Prior benchmarks focus on short-term objectives or include short story arcs, limiting their ability to fully evaluate agents’ capacity to manage the long-term observation-behavior gap. In contrast, FlashAdventure emphasizes completion of full story arcs involving long-term objectives, exemplified by suspect interrogations leading to a verdict.
  • Figure 3: Overview of COAST Framework with Seek-Map-Solve Cycle.
  • Figure 4: Comparison of average milestone completion rates (MCR) across different game subgenres for three GUI agents.
  • Figure 5: An illustration of the Point-and-Click Adventure (mystery/detective) subgenre, showing a human player's walkthrough of Sherlock Holmes: The Tea Shop Murder Mystery. The cumulative number of steps is written at the end of each subfigure caption; the game ends at 718 steps.
  • ...and 6 more figures