"May I Speak?": Multi-modal Attention Guidance in Social VR Group Conversations

Geonsun Lee, Dae Yeol Lee, Guan-Ming Su, Dinesh Manocha

TL;DR

This work tackles turn-taking in social VR by introducing a diegetic, multi-modal attention guidance framework that uses light and spatial audio to direct users to new speakers. Grounded in a formative group interview and a controlled evaluation, the method adapts cue intensity via engagement cues such as head-body rotation and gaze, employing environment light, point light, spotlight, signaling sounds, and dynamic volume adjustments. Results show significantly faster response times, higher perceived conversation satisfaction, and stronger user preference for the proposed Light-Audio approach over baselines like Text-Icon and SGD. The study demonstrates the practical potential of diegetic, multi-modal cues to improve presence and conversational smoothness in VR meetings, while acknowledging limitations and outlining directions for personalization, larger-scale multi-speaker scenarios, and real-world deployments.

Abstract

In this paper, we present a novel multi-modal attention guidance method designed to address the challenges of turn-taking dynamics in meetings and enhance group conversations within virtual reality (VR) environments. Recognizing the difficulties posed by a confined field of view and the absence of detailed gesture tracking in VR, our proposed method aims to mitigate the challenges of noticing new speakers attempting to join the conversation. This approach tailors attention guidance, providing a nuanced experience for highly engaged participants while offering subtler cues for those less engaged, thereby enriching the overall meeting dynamics. Through group interview studies, we gathered insights to guide our design, resulting in a prototype that employs "light" as a diegetic guidance mechanism, complemented by spatial audio. The combination creates an intuitive and immersive meeting environment, effectively directing users' attention to new speakers. An evaluation study, comparing our method to state-of-the-art attention guidance approaches, demonstrated significantly faster response times (p < 0.001), heightened perceived conversation satisfaction (p < 0.001), and preference (p < 0.001) for our method. Our findings contribute to the understanding of design implications for VR social attention guidance, opening avenues for future research and development.

Paper Structure (42 sections, 4 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Progressive adjustments in the light manipulator and spatial audio control module components relative to the angular distance ($\theta$) between the user's head-body rotation or gaze direction and the new speaker's coordinate, with subfigures (a) environment light, (b) point light, (c) spotlight, and (d) signaling sound demonstrating the range from maximum to minimum angular thresholds. The area colored in red represents $\theta_{\text{viewport}}$, the angular range within which the new speaker's coordinates fall inside the user's viewport.
  • Figure 2: The point light gradually interpolates from (a) a warm color (yellow) to (b) a cold color (white) as $\theta$ gets closer to $\theta_{\text{min}}$.
  • Figure 3: We illustrate the virtual environment from (a) the top-down view and (b) with the virtual agent avatars.
  • Figure 4: Sequential turn-taking order for speaker and listener scenarios, with virtual agents' animations and voiceovers executed accordingly. The signal for the new speaker is dispatched 5 seconds following the current speaker's turn. Each scenario includes instances of the attention guidance method activation, both when the new speaker is out of and within the participant's field of view.
  • Figure 5: We compared our approach to (a) Text-Icon, a text window fixed to the user's desk space and an icon appearing next to the agent's name tag, and (b) SGD, a flickering effect on the target in the user's peripheral view.
  • ...and 1 more figures
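
The captions for Figures 1 and 2 describe cues that change with the angular distance $\theta$ between the user's facing direction and the new speaker. The following is a minimal sketch of that idea, not the authors' implementation. It assumes linear interpolation between the two thresholds, and the function names, RGB values, and threshold values (30 and 120 degrees) are hypothetical placeholders chosen only for illustration.

```python
import math

# Assumed RGB values for the two ends of the point-light color range (Figure 2).
WARM_YELLOW = (1.0, 0.85, 0.4)
COLD_WHITE = (1.0, 1.0, 1.0)


def angular_distance_deg(forward, to_speaker):
    """Angle in degrees between the user's facing direction and the
    direction from the user to the new speaker (both 3D vectors)."""
    dot = sum(a * b for a, b in zip(forward, to_speaker))
    norm = math.sqrt(sum(a * a for a in forward)) * math.sqrt(
        sum(b * b for b in to_speaker)
    )
    if norm == 0.0:
        raise ValueError("direction vectors must be non-zero")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


def normalized_angle(theta, theta_min, theta_max):
    """Map theta to [0, 1]: 0 at theta_min, 1 at theta_max, clamped outside."""
    t = (theta - theta_min) / (theta_max - theta_min)
    return max(0.0, min(1.0, t))


def point_light_color(theta, theta_min, theta_max):
    """Interpolate from warm yellow (at theta_max) to cold white (at theta_min).
    Linear blending is an assumption; the source only says 'gradually'."""
    t = normalized_angle(theta, theta_min, theta_max)
    return tuple(c + (w - c) * t for c, w in zip(COLD_WHITE, WARM_YELLOW))


if __name__ == "__main__":
    # User faces +z, new speaker sits along +x: a 90 degree angular distance.
    theta = angular_distance_deg((0.0, 0.0, 1.0), (1.0, 0.0, 0.0))
    print(theta, point_light_color(theta, theta_min=30.0, theta_max=120.0))
```

The same normalized value could drive the other cues in Figure 1 (environment light, spotlight, signaling sound), though how each cue scales with $\theta$ is not specified in this excerpt.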