Since U Been Gone: Augmenting Context-Aware Transcriptions for Re-engaging in Immersive VR Meetings

Geonsun Lee, Yue Yang, Jennifer Healey, Dinesh Manocha

TL;DR

The paper addresses the challenge of sustaining engagement in immersive VR meetings after disruptions by introducing EngageSync, a context-aware avatar-fixed transcription interface that provides live transcripts and LLM-generated summaries to support re-engagement while preserving social presence. EngageSync operates in two modes, Engagement and Re-engagement, driven by gaze, speech activity, and pinch gestures, and it delivers on-demand access with automatic mode switching. Through a formative study and two user studies with small and mid-sized groups, the authors show that EngageSync improves social presence, increases attention to avatars, reduces re-engagement time, and enhances information recall, with stronger effects in the mid-sized groups. The work offers design insights for adaptive transcription in VR, demonstrates the practicality of context-aware captions, and suggests that avatar-fixed, gaze-triggered, on-demand interfaces can better balance immersion with information catch-up in immersive meetings.

Abstract

Maintaining engagement in immersive meetings is challenging, particularly when users must catch up on missed content after disruptions. While transcription interfaces can help, table-fixed panels have the potential to distract users from the group, diminishing social presence, while avatar-fixed captions fail to provide past context. We present EngageSync, a context-aware avatar-fixed transcription interface that adapts based on user engagement, offering live transcriptions and LLM-generated summaries to enhance catching up while preserving social presence. We implemented a live VR meeting setup for a 12-participant formative study and elicited design considerations. In two user studies with small (3 avatars) and mid-sized (7 avatars) groups, EngageSync significantly improved social presence (p < .05) and time spent gazing at others in the group instead of the interface over table-fixed panels. Also, it reduced re-engagement time and increased information recall (p < .05) over avatar-fixed interfaces, with stronger effects in mid-sized groups (p < .01).

Paper Structure

This paper contains 56 sections, 15 figures, 3 tables.

Figures (15)

  • Figure 1: The formative study setup of a four-person meeting. Left: a screenshot of the immersive meeting environment from a first-person view. Right: the text panel interface used in the formative study, consisting of a text panel where the participants' names are color-coded for readability, an auto-scroll button on the left that follows the newest lines in the panel, and scroll-up and scroll-down buttons on the right.
  • Figure 2: Real-time multi-user networking pipeline for our immersive VR meeting setup. Local microphone inputs are processed by Google STT for transcription, while audio streams to a shared server via Fusion Voice Client for voice and avatar synchronization. The Text Server (host mode) manages transcriptions and requests summaries from the GPT-4-Turbo API. All elements (transcription, summaries, audio) synchronize across remote users to maintain seamless real-time interaction.
  • Figure 3: An example of dropouts and rejoins of participants during the formative study. Note that the order of drop-out was randomized between trials.
  • Figure 4: Statistical analyses from the formative study comparing Full Transcript and Summary. The NASA TLX results show lower cognitive load for Summary across multiple subscales (Mental, Physical, Temporal, Frustration, and Effort) with Temporal Demand being significantly lower with Summary. The NMSPI scores highlight significantly higher Attention Allocation (AA) for Summary, with relatively smaller differences in Co-presence (CP), Perceived Message Understanding (PMU) and Perceived Affective Understanding (PAU). Significant differences are indicated by * ($p < 0.05$) and ** ($p < 0.01$). Participants interacted more frequently with Full Transcript, showing similar gaze time spent between interface and avatars, whereas with Summary, participants directed their gaze predominantly toward avatars.
  • Figure 5: An overview of the EngageSync flowchart and demonstration screenshots of key features. In Engagement Mode, (a) if the user performs a pinch gesture while looking at a speaker, the panel attached to the avatar displays a live transcription; (b) if the avatar is a listener (no audio detected), a summary of their previous utterance is shown. Upon rejoining after a dropout, (c) summaries of what each avatar said during the dropout are displayed. (d) Once a summary is 'read', its panel disappears, and when all the summary panels are read, the system returns to Engagement Mode.
  • ...and 10 more figures
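
The mode switching described in the Figure 5 caption can be read as a small state machine. The following is a minimal Python sketch of that reading, not the authors' implementation: the class name, method names, and return strings are hypothetical, and how the system decides that a summary has been 'read' is not stated here, so it is modeled as an external event. Ignoring pinch gestures while in Re-engagement Mode is also an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set


class Mode(Enum):
    ENGAGEMENT = "engagement"
    RE_ENGAGEMENT = "re-engagement"


@dataclass
class EngageSyncSketch:
    """Hypothetical state holder; all names are illustrative, not from the paper."""
    mode: Mode = Mode.ENGAGEMENT
    # Avatars whose dropout summary panel has not been read yet.
    unread: Set[str] = field(default_factory=set)

    def on_pinch(self, gazed_avatar: str, is_speaking: bool) -> str:
        """Engagement Mode: a pinch while gazing at an avatar opens its panel."""
        if self.mode is not Mode.ENGAGEMENT:
            return "ignored"  # assumption: pinch has no effect during catch-up
        # (a) speaker -> live transcription
        # (b) listener (no audio detected) -> summary of previous utterance
        return "live_transcription" if is_speaking else "previous_utterance_summary"

    def on_rejoin(self, avatars_who_spoke: List[str]) -> None:
        """(c) After a dropout, one summary panel per avatar that spoke."""
        self.unread = set(avatars_who_spoke)
        if self.unread:
            self.mode = Mode.RE_ENGAGEMENT

    def on_summary_read(self, avatar: str) -> None:
        """(d) A read panel disappears; when none remain, return to Engagement Mode."""
        self.unread.discard(avatar)
        if not self.unread:
            self.mode = Mode.ENGAGEMENT
```

For example, calling `on_rejoin(["A", "B"])` moves the sketch into Re-engagement Mode, and it returns to Engagement Mode only after `on_summary_read` has been called for both avatars.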