How Does a Virtual Agent Decide Where to Look? Symbolic Cognitive Reasoning for Embodied Head Rotation

Juyeong Hwang, Seong-Eun Hong, JaeYoung Seon, Hyeongyeop Kang

TL;DR

The paper tackles the realism gap in embodied agent gaze by introducing SCORE, a framework that grounds head-rotation decisions in symbolic cognitive predicates derived from a VR user study. It uses a two-stage pipeline to produce context-aware head poses without task-specific data: an offline Deliberative Perception–Planning Stage (DPS) built on Vision–Language Models and Large Language Models, followed by an online Reactive Execution Stage (RES) with FastVLM validation. Across single- and multi-agent environments, SCORE yields more human-like trajectories than saliency-based baselines and generalizes to unseen scenes, while remaining responsive to late scene changes and distractors. The work demonstrates the practical impact of integrating cognitive reasoning with perception for believable virtual agents, with potential applications in social VR, telepresence, and interactive simulations.

Abstract

Natural head rotation is critical for believable embodied virtual agents, yet this micro-level behavior remains largely underexplored. While head-rotation prediction algorithms could, in principle, reproduce this behavior, they typically focus on visually salient stimuli and overlook the cognitive motives that guide head rotation. This yields agents that look at conspicuous objects while overlooking obstacles or task-relevant cues, diminishing realism in a virtual environment. We introduce SCORE, a Symbolic Cognitive Reasoning framework for Embodied Head Rotation, a data-agnostic framework that produces context-aware head movements without task-specific training or hand-tuned heuristics. A controlled VR study (N=20) identifies five motivational drivers of human head movements: Interest, Information Seeking, Safety, Social Schema, and Habit. SCORE encodes these drivers as symbolic predicates, perceives the scene with a Vision-Language Model (VLM), and plans head poses with a Large Language Model (LLM). The framework employs a hybrid workflow: the VLM-LLM reasoning is executed offline, after which a lightweight FastVLM performs online validation to suppress hallucinations while maintaining responsiveness to scene dynamics. The result is an agent that predicts not only where to look but also why, generalizing to unseen scenes and multi-agent crowds while retaining behavioral plausibility.
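
As a rough illustration of how the five drivers could be encoded as symbolic predicates, the sketch below pairs each candidate head-rotation action with the driver that motivates it. The names `Driver`, `ActionReason`, and `choose_action`, the example entities, and the fixed priority order are all assumptions made for illustration; they are not taken from the paper, where an LLM performs the selection.

```python
from dataclasses import dataclass
from enum import Enum


class Driver(Enum):
    """The five motivational drivers identified in the VR study."""
    INTEREST = "Interest"
    INFORMATION_SEEKING = "Information Seeking"
    SAFETY = "Safety"
    SOCIAL_SCHEMA = "Social Schema"
    HABIT = "Habit"


@dataclass(frozen=True)
class ActionReason:
    """An action-reason pair: where to look, and which driver explains it."""
    target: str      # entity name from a scene description (hypothetical field)
    yaw_deg: float   # head yaw relative to the body, in degrees (hypothetical field)
    driver: Driver


# Hypothetical priority order, used only by this stand-in selector.
PRIORITY = [
    Driver.SAFETY,
    Driver.INFORMATION_SEEKING,
    Driver.SOCIAL_SCHEMA,
    Driver.INTEREST,
    Driver.HABIT,
]


def choose_action(candidates):
    """Pick the candidate whose driver ranks highest in PRIORITY."""
    return min(candidates, key=lambda c: PRIORITY.index(c.driver))


# Made-up candidates for a street scene.
candidates = [
    ActionReason("street performer", 40.0, Driver.INTEREST),
    ActionReason("approaching car", -60.0, Driver.SAFETY),
]
chosen = choose_action(candidates)
print(f"look at {chosen.target} ({chosen.driver.value})")
```

Keeping the driver attached to the chosen action is what would let such an agent report why it looks somewhere, not only where.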

Paper Structure

This paper contains 23 sections, 1 equation, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Red dashed boxes highlight elements present only in the APC version of each environment: a Santa, a traffic accident, and arguing people. The corresponding MDC scenes omit such distracting events.
  • Figure 2: Categorized distribution of participants’ self-reported head-movement rationales.
  • Figure 3: SCORE pipeline. The two-stage architecture combines offline deliberation with near-online refinement. In the DPS, PEM passes each "novel" view, the entity lists, and the scenario goal to a VLM, which generates contextual descriptions stored in FMM. The Planning Module then invokes an LLM that reasons over the five cognitive drivers to select an action–reason pair that schedules the next head orientation. During runtime, RES evaluates a 2-second look-ahead image with a lightweight FastVLM; if the pre-planned action is inconsistent with the live context, it is replaced before execution.
  • Figure 4: Differences in proportional selections between humans and the model in the MDC and APC versions of the crossing scene, indicating their respective prioritization tendencies.
  • Figure 5: Visualization of the action and reasoning results when Interest is removed from the motivational drivers.
  • ...and 2 more figures
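
The runtime check described in the Figure 3 caption can be sketched as follows. This is a minimal sketch, assuming a validator callable that stands in for the FastVLM call; `run_res`, `validate`, `replan`, and the text-based "frames" are hypothetical names and simplifications, not the paper's API.

```python
from typing import Callable, List

LOOKAHEAD_S = 2.0  # look-ahead horizon mentioned in the Figure 3 caption


def run_res(plan: List[str],
            lookahead_frames: List[str],
            validate: Callable[[str, str], bool],
            replan: Callable[[str], str]) -> List[str]:
    """Execute a pre-planned action list, replacing actions that fail validation.

    plan[i] is the action scheduled offline; lookahead_frames[i] stands in for
    the image rendered LOOKAHEAD_S seconds ahead of that action.
    """
    executed = []
    for action, frame in zip(plan, lookahead_frames):
        if not validate(action, frame):
            # The pre-planned action is inconsistent with the live context.
            action = replan(frame)
        executed.append(action)
    return executed


# Toy stand-ins: a frame is a comma-separated text description, and an
# action is valid when its target is mentioned in the frame.
def toy_validate(action: str, frame: str) -> bool:
    return action.split(":", 1)[1] in frame


def toy_replan(frame: str) -> str:
    return "look:" + frame.split(",")[0]


plan = ["look:crosswalk", "look:traffic light"]
frames = ["crosswalk,traffic light", "cyclist,traffic cone"]
result = run_res(plan, frames, toy_validate, toy_replan)
print(result)
```

In this toy run the first action passes validation and the second is replaced, because its target no longer appears in the look-ahead description.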