How Far Can We Go with Pixels Alone? A Pilot Study on Screen-Only Navigation in Commercial 3D ARPGs

Kaijie Xu, Mustafa Bugti, Clark Verbrugge

TL;DR

The paper targets the problem of quantifying navigability in visually dense 3D ARPGs using pixels alone. It introduces a screen-only navigation agent built on a fixed STP/MSTP perception backbone, a finite-state controller, and a lightweight visual memory, evaluated through a milestone-based protocol that relies on image matching to designer-specified viewpoints $g_1,\dots,g_K$. The main contributions are (i) a general screen-only navigation controller that operates solely on STP/MSTP predictions and discrete camera/forward actions, (ii) a practical milestone-based evaluation toolchain, and (iii) a pilot study across four Souls-like ARPGs showing feasibility and illuminating perception-driven failure modes. Findings indicate that the FSM can boost robustness on selected segments, but memory effects are inconsistent and purely perceptual navigation remains insufficient for robust, general navigation without explicit world modeling. The work provides a concrete baseline and benchmark for screen-only visual navigation, motivating future integration with world representations (e.g., video-based world models or SLAM) and offering a useful tool for level design analysis and automated QA in modern games.
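The milestone-based protocol can be illustrated with a small sketch. It assumes that progress is counted by matching live frames against the ordered viewpoints $g_1,\dots,g_K$ one at a time. The function name `milestones_reached`, the cosine similarity over toy feature vectors, and the threshold of 0.8 are all hypothetical choices for illustration. The paper's actual image-matching method is not specified in this summary.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def milestones_reached(
    frames: Sequence[Vector],
    goals: Sequence[Vector],
    similarity: Callable[[Vector, Vector], float] = cosine_similarity,
    threshold: float = 0.8,  # hypothetical matching threshold
) -> int:
    """Count how many ordered milestones g_1..g_K were matched, in order.

    Only the next unmatched milestone is tested against each frame, so a
    later milestone cannot be credited before an earlier one.
    """
    next_goal = 0
    for frame in frames:
        if next_goal == len(goals):
            break  # every milestone has already been matched
        if similarity(frame, goals[next_goal]) >= threshold:
            next_goal += 1
    return next_goal


# Toy usage: 2-D vectors stand in for image descriptors.
frames = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (1.0, 1.0)]
goals = [(1.0, 0.0), (0.0, 1.0)]
reached = milestones_reached(frames, goals)
```

Testing only the next unmatched milestone keeps the score tied to route order, which suits the linear levels described here. A real pipeline would replace the toy vectors with descriptors computed from screenshots.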

Abstract

Modern 3D game levels rely heavily on visual guidance, yet the navigability of level layouts remains difficult to quantify. Prior work either simulates play in simplified environments or analyzes static screenshots for visual affordances, but neither setting faithfully captures how players explore complex, real-world game levels. In this paper, we build on an existing open-source visual affordance detector and instantiate a screen-only exploration and navigation agent that operates purely from visual affordances. Our agent consumes live game frames, identifies salient interest points, and drives a simple finite-state controller over a minimal action space to explore Dark Souls-style linear levels and attempt to reach expected goal regions. Pilot experiments show that the agent can traverse most required segments and exhibits meaningful visual navigation behavior, but also highlight that limitations of the underlying visual model prevent truly comprehensive and reliable auto-navigation. We argue that this system provides a concrete, shared baseline and evaluation protocol for visual navigation in complex games, and we call for more attention to this necessary task. Our results suggest that purely vision-based sense-making models, with discrete single-modality inputs and without explicit reasoning, can effectively support navigation and environment understanding in idealized settings, but are unlikely to be a general solution on their own.
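To make the idea of a finite-state controller over a minimal action space concrete, here is a minimal sketch. The state names (`SEARCH`, `ALIGN`, `ADVANCE`), the three actions, the normalized horizontal offset `target_x`, and both tolerance values are assumptions made for this example. They are not the states or parameters used by the authors.

```python
from enum import Enum, auto
from typing import Optional, Tuple


class State(Enum):
    SEARCH = auto()   # no interest point visible: rotate the camera
    ALIGN = auto()    # interest point visible but off-center: turn toward it
    ADVANCE = auto()  # interest point roughly centered: walk forward


class Action(Enum):
    TURN_LEFT = "turn_left"
    TURN_RIGHT = "turn_right"
    FORWARD = "forward"


def step(
    state: State,
    target_x: Optional[float],
    align_tol: float = 0.1,  # hypothetical: offset needed to start advancing
    drift_tol: float = 0.3,  # hypothetical: offset tolerated while advancing
) -> Tuple[State, Action]:
    """Return the next state and the action to emit for one frame.

    target_x is the horizontal offset of the strongest interest point in
    [-1, 1] (negative means left of screen center), or None if none is seen.
    """
    if target_x is None:
        return State.SEARCH, Action.TURN_RIGHT
    # Hysteresis: once advancing, tolerate more drift before re-aligning,
    # so the controller does not flip between turning and walking.
    tolerance = drift_tol if state is State.ADVANCE else align_tol
    if abs(target_x) > tolerance:
        action = Action.TURN_LEFT if target_x < 0 else Action.TURN_RIGHT
        return State.ALIGN, action
    return State.ADVANCE, Action.FORWARD


# Toy usage: feed a short sequence of per-frame offsets through the controller.
state = State.SEARCH
trace = []
for offset in [None, -0.5, 0.05, 0.2]:
    state, action = step(state, offset)
    trace.append((state, action))
```

The same offset of 0.2 triggers a turn while aligning but not while advancing, and that is the only place where the current state changes the outcome. Recovery from getting stuck on geometry, which the failure cases in this study point to, is deliberately left out of the sketch.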

Paper Structure (23 sections, 10 equations, 12 figures, 3 tables)

Figures (12)

  • Figure 1: Top-down map of the Dark Souls III Grand Archives test route. The red polyline shows the designer-specified path from the bonfire (bottom) toward the second-floor landing (top). Red boxes mark the six visual milestones $M_1$-$M_6$ used in our experiments; $M_2$ and $M_3$ correspond to the challenging mid-route transitions analyzed in Section \ref{sec:case-study}.
  • Figure 2: Irithyll $M_3$: the MSTP correctly highlights the upper landing, but the avatar gets stuck on local geometry.
  • Figure 3: Raya Lucaria Academy from $M_1$ to $M_2$: the doorway is correctly detected as an MSTP, but the avatar gets stuck between a breakable table and an effigy.
  • Figure 4: Painted World from $M_2$ to $M_3$: the stairs are correctly detected as an MSTP, but require a narrow, oblique approach angle that is easy to miss from the avatar's starting pose.
  • Figure 5: Grand Archives milestone $M_2$: a near-symmetric scene where both the data and the route favor turning left.
  • ...and 7 more figures