How Does a Virtual Agent Decide Where to Look? Symbolic Cognitive Reasoning for Embodied Head Rotation

Juyeong Hwang; Seong-Eun Hong; JaeYoung Seon; Hyeongyeop Kang

TL;DR

The paper tackles the realism gap in embodied agent gaze by introducing SCORE, a framework that grounds head-rotation decisions in symbolic cognitive predicates derived from a VR user study. SCORE uses a two-stage pipeline: an offline Deliberative Perception–Planning Stage (DPS) built on a Vision-Language Model and a Large Language Model, followed by an online Reactive Execution Stage (RES) that validates the plan with FastVLM. The pipeline produces context-aware head poses without task-specific training data. Across single- and multi-agent environments, SCORE yields more human-like trajectories than saliency-based baselines and generalizes to unseen scenes, while remaining responsive to late scene changes and distractors. The work shows the practical value of combining cognitive reasoning with perception for believable virtual agents, with potential applications in social VR, telepresence, and interactive simulations.

Abstract

Natural head rotation is critical for believable embodied virtual agents, yet this micro-level behavior remains largely underexplored. While head-rotation prediction algorithms could, in principle, reproduce this behavior, they typically focus on visually salient stimuli and overlook the cognitive motives that guide head rotation. This yields agents that look at conspicuous objects while overlooking obstacles or task-relevant cues, diminishing realism in a virtual environment. We introduce SCORE, a Symbolic Cognitive Reasoning framework for Embodied Head Rotation, a data-agnostic framework that produces context-aware head movements without task-specific training or hand-tuned heuristics. A controlled VR study (N=20) identifies five motivational drivers of human head movements: Interest, Information Seeking, Safety, Social Schema, and Habit. SCORE encodes these drivers as symbolic predicates, perceives the scene with a Vision-Language Model (VLM), and plans head poses with a Large Language Model (LLM). The framework employs a hybrid workflow: the VLM-LLM reasoning is executed offline, after which a lightweight FastVLM performs online validation to suppress hallucinations while maintaining responsiveness to scene dynamics. The result is an agent that predicts not only where to look but also why, generalizing to unseen scenes and multi-agent crowds while retaining behavioral plausibility.
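The hybrid workflow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `GazePredicate` structure, the `validate_plan` and `next_head_pose` helpers, the angle values, and the object names are all hypothetical. The online FastVLM check is replaced here by a plain set of object names assumed to be confirmed in the current frame. Only the five driver names are taken from the abstract.

```python
from dataclasses import dataclass
from enum import Enum


class Driver(Enum):
    # The five motivational drivers named in the abstract.
    INTEREST = "Interest"
    INFORMATION_SEEKING = "Information Seeking"
    SAFETY = "Safety"
    SOCIAL_SCHEMA = "Social Schema"
    HABIT = "Habit"


@dataclass(frozen=True)
class GazePredicate:
    # Hypothetical symbolic predicate: why to look, at what, and with which pose.
    driver: Driver
    target: str
    yaw_deg: float
    pitch_deg: float


def validate_plan(plan, visible_objects):
    """Keep only planned predicates whose target is confirmed in the current frame.

    `visible_objects` stands in for the output of an online validator;
    here it is just a set of object names (an assumption of this sketch).
    """
    return [p for p in plan if p.target in visible_objects]


def next_head_pose(plan, visible_objects, fallback=(0.0, 0.0)):
    """Return (yaw, pitch) of the first validated predicate, else a neutral pose."""
    valid = validate_plan(plan, visible_objects)
    if not valid:
        return fallback
    first = valid[0]
    return (first.yaw_deg, first.pitch_deg)


# A made-up offline plan, ordered by priority, as a planner might emit it.
offline_plan = [
    GazePredicate(Driver.SAFETY, "wet_floor_sign", -20.0, -15.0),
    GazePredicate(Driver.INFORMATION_SEEKING, "exit_sign", 35.0, 10.0),
    GazePredicate(Driver.INTEREST, "fountain", 60.0, 0.0),
]

if __name__ == "__main__":
    # The sign from the plan is not confirmed online, so it is skipped.
    print(next_head_pose(offline_plan, {"exit_sign", "fountain"}))
```

In this sketch, a planned target that the online check cannot confirm is dropped rather than acted on, which mirrors the stated purpose of the validation step: suppressing hallucinated targets while still reacting to what is currently in the scene.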
