"May I Speak?": Multi-modal Attention Guidance in Social VR Group Conversations
Geonsun Lee, Dae Yeol Lee, Guan-Ming Su, Dinesh Manocha
TL;DR
This work tackles turn-taking in social VR by introducing a diegetic, multi-modal attention guidance framework that uses light and spatial audio to direct users to new speakers. Informed by a formative group interview and assessed in a controlled evaluation, the method adapts cue intensity to each user's engagement, estimated from signals such as head-body rotation and gaze, and delivers guidance through environment light, point light, spotlight, signaling sounds, and dynamic volume adjustments. Results show significantly faster response times, higher perceived conversation satisfaction, and stronger user preference for the proposed Light-Audio approach over baselines such as Text-Icon and SGD. The study demonstrates the practical potential of diegetic, multi-modal cues to improve presence and conversational smoothness in VR meetings, while acknowledging limitations and outlining directions for personalization, larger-scale multi-speaker scenarios, and real-world deployments.
Abstract
In this paper, we present a novel multi-modal attention guidance method designed to address the challenges of turn-taking dynamics in meetings and enhance group conversations within virtual reality (VR) environments. Because VR offers a confined field of view and lacks detailed gesture tracking, participants often fail to notice new speakers attempting to join the conversation; our method aims to mitigate this problem. The approach tailors attention guidance to each participant's engagement, providing a nuanced experience for highly engaged participants while offering subtler cues for those less engaged, thereby enriching the overall meeting dynamics. Through group interview studies, we gathered insights to guide our design, resulting in a prototype that employs "light" as a diegetic guidance mechanism, complemented by spatial audio. The combination creates an intuitive and immersive meeting environment, effectively directing users' attention to new speakers. An evaluation study comparing our method to state-of-the-art attention guidance approaches showed significantly faster response times (p < 0.001), higher perceived conversation satisfaction (p < 0.001), and stronger user preference (p < 0.001) for our method. Our findings contribute to the understanding of design implications for VR social attention guidance, opening avenues for future research and development.
