Reading the Mood Behind Words: Integrating Prosody-Derived Emotional Context into Socially Responsive VR Agents

SangYeop Jeong; Yeongseo Na; Seung Gyu Jeong; Jin-Woo Jeong; Seong-Eun Kim

Reading the Mood Behind Words: Integrating Prosody-Derived Emotional Context into Socially Responsive VR Agents

SangYeop Jeong, Yeongseo Na, Seung Gyu Jeong, Jin-Woo Jeong, Seong-Eun Kim

TL;DR

An emotion-context-aware VR interaction pipeline that treats vocal emotion as explicit dialogue context in an LLM-based conversational agent and injects emotion labels into the agent's dialogue context to shape response tone and style is proposed.

Abstract

In VR interactions with embodied conversational agents, users' emotional intent is often conveyed more by how something is said than by what is said. However, most VR agent pipelines rely on speech-to-text processing, discarding prosodic cues and often producing emotionally incongruent responses despite correct semantics. We propose an emotion-context-aware VR interaction pipeline that treats vocal emotion as explicit dialogue context in an LLM-based conversational agent. A real-time speech emotion recognition model infers users' emotional states from prosody, and the resulting emotion labels are injected into the agent's dialogue context to shape response tone and style. Results from a within-subjects VR study (N=30) show significant improvements in dialogue quality, naturalness, engagement, rapport, and human-likeness, with 93.3% of participants preferring the emotion-aware agent.

Reading the Mood Behind Words: Integrating Prosody-Derived Emotional Context into Socially Responsive VR Agents

TL;DR

Abstract

Paper Structure (25 sections, 3 figures, 2 tables)

This paper contains 25 sections, 3 figures, 2 tables.

Introduction
Method
Experimental Design
Stimuli: Disentangling Content and Emotion
Procedure and Measures
Results
RQ1: Effects on Social Presence and Social Agency
RQ2: Interaction Quality under Emotionally Neutral and Ambiguous Language
Emotional Engagement and User Evaluation
Discussion
Affective Resonance Over Mechanical Alignment
The Novelty--Utility Paradox
Limitations and Future Work
AI Use Disclosure
System Prompt
...and 10 more sections

Figures (3)

Figure 1: Comparison between the Emotion Recognition (ER) and Non-Emotion Recognition (NER) conditions across evaluation metrics. (A) Human--agent interaction quality (HAI: Naturalness, Engagement, Rapport, Human-likeness, Synchrony). (B) System performance and quality (Dialogue Quality, Emotional Responsiveness, Reuse Intention). (C) Emotional response (SAM: Valence, Arousal, Dominance). (D) User experience (UEQ subscales). (E) Intrinsic motivation (IMI: Value/Usefulness, Interest/Enjoyment, Effort/Importance). Boxplots indicate medians, interquartile ranges, and 1.5$\times$IQR whiskers; circles denote outliers. Significance markers denote paired comparisons: * $p<.05$, ** $p<.01$, *** $p<.001$.
Figure 2: Interaction with the ER (Emotion-Recognition) Agent. The agent detects the user's vocal prosody (Happy, Sad, Angry) and generates an affectively congruent response, recognizing the emotional intent behind the neutral text.
Figure 3: Interaction with the NER (Non-Emotion-Recognition) Agent. The agent relies solely on semantic content, ignoring vocal cues. Consequently, it provides factual or generic responses about "rain" regardless of the user's emotional tone.

Reading the Mood Behind Words: Integrating Prosody-Derived Emotional Context into Socially Responsive VR Agents

TL;DR

Abstract

Reading the Mood Behind Words: Integrating Prosody-Derived Emotional Context into Socially Responsive VR Agents

Authors

TL;DR

Abstract

Table of Contents

Figures (3)