GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

Yunzhe Wang, Runhui Xu, Kexin Zheng, Tianyi Zhang, Jayavibhav Niranjan Kogundi, Soham Hans, Volkan Ustun

Abstract

Multimodal LLMs are increasingly deployed as perceptual backbones for autonomous agents in 3D environments, from robotics to virtual worlds. These applications require agents to perceive rapid state changes, attribute actions to the correct entities, and reason about concurrent multi-agent behaviors from a first-person perspective, capabilities that existing benchmarks do not adequately evaluate. We introduce GameplayQA, a framework for evaluating agent-centric perception and reasoning through video understanding. Specifically, we densely annotate multiplayer 3D gameplay videos at 1.22 labels/second with time-synced, concurrent captions of states, actions, and events, structured around a triadic system of Self, Other Agents, and the World, a natural decomposition for multi-agent environments. From these annotations, we curate 2.4K diagnostic QA pairs organized into three levels of cognitive complexity, accompanied by a structured distractor taxonomy that enables fine-grained analysis of where models hallucinate. Evaluation of frontier MLLMs reveals a substantial gap from human performance, with common failures in temporal and cross-video grounding, agent-role attribution, and handling the decision density of gameplay. We hope GameplayQA stimulates future research at the intersection of embodied AI, agentic perception, and world modeling.
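To make the dense annotation concrete, here is a minimal sketch of what one time-synced label record could look like under the Self/Other/World taxonomy described above; the class and field names (`TimelineLabel`, `entity`, `label_type`, `start`, `end`, `is_distractor`) are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class Entity(Enum):
    SELF = "self"    # the first-person player
    OTHER = "other"  # any other agent in the scene
    WORLD = "world"  # the surrounding environment


class LabelType(Enum):
    ACTION = "action"  # agent entities (Self/Other): what an agent does
    STATE = "state"    # agent entities (Self/Other): what an agent is
    OBJECT = "object"  # world entity: things present in the scene
    EVENT = "event"    # world entity: things that happen in the scene


@dataclass
class TimelineLabel:
    """One time-synced caption on a single video's annotation track."""
    video_id: str
    entity: Entity
    label_type: LabelType
    start: float                 # seconds from the start of the video
    end: float                   # seconds from the start of the video
    text: str                    # free-form caption, e.g. "reloads rifle"
    is_distractor: bool = False  # negative label, kept only to build distractors
```

Crossing the two agent entities with Action/State, plus the World entity with Object/Event, yields the six primitive label types referenced in Figure 1.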

Paper Structure

This paper contains 41 sections, 1 equation, 8 figures, and 14 tables.

Figures (8)

  • Figure 1: Question taxonomy of GameplayQA. Questions are organized along two axes: Entity (Self, Other, World) and Temporal Nature (Action/State for agents, Object/Event for the world), yielding six primitive label types. These primitives compose into 15 task categories across three cognitive levels: single-reference perception (L1), temporal reasoning (L2), and cross-video understanding (L3). See Sec. \ref{subsec:question_taxonomy} and Table \ref{tab:question_categories} for details.
  • Figure 2: Overview of GameplayQA. Gameplay videos undergo (1) dense multi-track temporal captioning over the six target entity types (Sec. \ref{subsec:timeline_captioning}), with (2) negative labels recorded alongside positive ones to seed hallucination-inducing distractors, and (3) QA pairs generated by a combinatorial template-based algorithm (Sec. \ref{subsec:qa_generation}); a sketch of this generation step follows the figure list. After (4) quality assurance (Sec. \ref{subsec:quality_assurance}), the benchmark enables (5) model evaluation with (6) fine-grained hallucination analysis (Sec. \ref{subsec:hallucination}).
  • Figure 3: Example questions from GameplayQA across different question codes and cognitive levels. Each example shows video frames paired with the corresponding QA pair, illustrating the progression from basic perception (L1) to temporal reasoning (L2) to cross-video understanding (L3). Additional cross-domain examples from car collision and egocentric human activity videos demonstrate the generalizability of the framework.
  • Figure 4: Error rate analysis across four dimensions. Top-left: Cross-video and temporal distractors cause the most errors. Top-right: Fast-paced shooters (CS2, Battlefield) are hardest. Bottom-left: Error increases with video length. Bottom-right: Error scales with number of synchronized videos.
  • Figure 5: Illustration of the Self-Other-World framework. A first-person player (Self) perceives a teammate (Other) issuing a warning, set against the surrounding game environment (World). These three perspectives define the entity types used in our question taxonomy.
  • ...and 3 more figures
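As a rough illustration of the combinatorial template-based QA generation described in the Figure 2 caption, the sketch below turns one positive timeline label into a multiple-choice question, drawing wrong answers from same-type negative labels. It reuses the hypothetical `TimelineLabel` record sketched after the abstract; the template strings and the `make_question` helper are assumptions, not the paper's actual implementation.

```python
import random

# Hypothetical question templates keyed by (entity, label type);
# {t} is the label's start time in seconds.
TEMPLATES = {
    ("other", "action"): "What is the other agent doing around {t:.0f}s?",
    ("world", "event"): "Which event occurs in the environment around {t:.0f}s?",
}


def make_question(label: TimelineLabel, negatives: list[str],
                  n_options: int = 4) -> dict:
    """Build one multiple-choice QA pair from a positive timeline label.

    `negatives` holds captions from negative labels of the same entity and
    label type, so distractors stay plausible yet verifiably absent
    from the video.
    """
    template = TEMPLATES[(label.entity.value, label.label_type.value)]
    options = random.sample(negatives, n_options - 1) + [label.text]
    random.shuffle(options)
    return {
        "question": template.format(t=label.start),
        "options": options,
        "answer_index": options.index(label.text),
    }
```

Because every distractor originates from an annotated negative label, an error on a given option can be traced back to a specific distractor type, which is what enables the fine-grained hallucination analysis shown in Figure 4.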