SpeechCap: Leveraging Playful Impact Captions to Facilitate Interpersonal Communication in Social Virtual Reality

Yu Zhang, Yi Wen, Siying Hu, Zhicong Lu

TL;DR

This work tackles the limited expressiveness of interpersonal communication in social VR by introducing SpeechCap, a real-time system that converts speech into interactive impact captions combining verbal content with non-verbal cues. It defines a design space for impact captions through TV-variety-show analysis and expert co-design, then validates the approach with a proof-of-concept implementation and an in-lab study (n=14) showing that captions can clarify and enrich conversations while enabling playful interactions. The study highlights benefits in emotional expression, information highlighting, and speaker identification, but also notes ambiguity risks in non-textual cues and keyword proliferation, which motivate design implications. Overall, the work contributes a concrete design space, a functional system, and evidence-based guidance for deploying expressive, multimodal communication tools in social VR, with applications in accessibility, education, and live streaming.

Abstract

Social Virtual Reality (VR) emerges as a promising platform bringing immersive, interactive, and engaging mechanisms for collaborative activities in virtual spaces. However, interpersonal communication in social VR is still limited with existing mediums and channels. To bridge the gap, we propose a novel method for mediating real-time conversation in social VR, which uses impact captions, a type of typographic visual effect widely used in videos, to convey both verbal and non-verbal information. We first investigated the design space of impact captions by content analysis and a co-design session with four experts. Next, we implemented SpeechCap as a proof-of-concept system, with which users can communicate with each other using speech-driven impact captions in VR. Through a user study (n=14), we evaluated the effectiveness of the visual and interaction design of impact captions, highlighting the interactivity and the integration of verbal and non-verbal information in communication mediums. Finally, we discussed topics of visual rhetoric, interactivity, and ambiguity as the main findings from the study, and further provided design implications for future work for facilitating interpersonal communication in social VR.

Paper Structure

This paper contains 87 sections and 6 figures.

Figures (6)

  • Figure 1: The Visual Design Space of Impact Captions. Two impact captions, a "No Way" and a "Shocked Face" (translated from Chinese), demonstrate the textual and non-textual elements, with the relevant dimensions, that form the visual design space. For textual elements, the three major dimensions are the typeface, color, and size of the text. Non-textual elements include emoji, ornaments, and speech bubbles.
  • Figure 2: The Interaction Design of Impact Captions. The design space includes three dimensions: Physicalization, Motion, and Interaction. With Physicalization, impact captions behave like physical objects: they have mass and are affected by gravity, have velocity when they move, and occupy space with volume. With Motion, impact captions appear to be "alive" and respond to user actions. Interaction describes how users can play with impact captions through embodied interaction.
  • Figure 3: SpeechCap System Overview. The system consists of three key modules: (A) the Voice Interface, which processes real-time voice input and stores the transcribed text in a shared database; (B) the Text Processor, which distills the transcribed text into impact captions and decides the design of each caption; and (C) the VR Application, which continuously polls the Text Processor to fetch upcoming impact captions and renders the VR space. In terms of hardware, the Voice Interface and Text Processor run on a laptop paired with a Bluetooth microphone that collects voice input. A local area network (LAN) connects the laptop to the VR headsets and links the multiple sets of hardware devices used by multiple users.
  • Figure 4: Mapping Semantics to Impact Caption Design for the Proof of Concept. Valence maps to text color: a warm color for positive moods and a cold color for negative moods. Loudness maps to caption size: the louder the speech, the larger the caption. Formality maps to typeface: "Times New Roman" is used for formal words and "Comic Sans" for casual words. Emoji are applied to words expressing happy, embarrassed, or sad feelings. A Speech Bubble and a "shivering" motion are applied to words expressing excitement. Ornaments are applied to words representing specific entities.
  • Figure 5: Interactions with Impact Captions. (A) Grabbing allows users to hold a caption and place it at an arbitrary position; (B) Stretching uses two hands to resize an impact caption; (C) Attaching allows an impact caption to be attached to the head or body of the virtual avatar; (D) Shooting ejects an impact caption forward and triggers the Explosion effect when a collision occurs.
  • ...and 1 more figures
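
The semantics-to-design mapping described in the Figure 4 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation. The class names, field names, numeric ranges (valence in [-1, 1], loudness in [0, 1]), the "neutral" color fallback, the linear size scaling, and the emoji labels are all assumptions introduced here for illustration. Only the rules themselves (warm or cold color by valence, larger size for louder speech, typeface by formality, emoji for happy, embarrassed, or sad words, speech bubble plus "shivering" for excitement, ornaments for entities) come from the caption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WordSemantics:
    """Hypothetical semantic features of one keyword (all fields are assumptions)."""
    text: str
    valence: float                 # assumed range [-1, 1]; > 0 means positive mood
    loudness: float                # assumed range [0, 1]
    formal: bool                   # True for formal words, False for casual ones
    emotion: Optional[str] = None  # e.g. "happy", "embarrassed", "sad", "excited"
    is_entity: bool = False        # True if the word names a specific entity


@dataclass
class CaptionStyle:
    """Hypothetical visual and motion parameters for one impact caption."""
    text: str
    color: str
    size: float
    typeface: str
    emoji: Optional[str]
    speech_bubble: bool
    motion: Optional[str]
    ornament: bool


# Placeholder labels standing in for emoji assets (assumed names).
EMOJI_BY_EMOTION = {
    "happy": "happy_face",
    "embarrassed": "embarrassed_face",
    "sad": "sad_face",
}


def style_caption(word: WordSemantics,
                  base_size: float = 1.0,
                  max_scale: float = 2.0) -> CaptionStyle:
    """Apply the Figure 4 mapping rules to one word."""
    # Valence -> color: warm for positive, cold for negative.
    # The "neutral" fallback for zero valence is an assumption.
    if word.valence > 0:
        color = "warm"
    elif word.valence < 0:
        color = "cold"
    else:
        color = "neutral"

    # Loudness -> size: louder speech gives a larger caption.
    # Linear scaling between base_size and base_size * max_scale is assumed.
    loudness = min(max(word.loudness, 0.0), 1.0)
    size = base_size * (1.0 + loudness * (max_scale - 1.0))

    # Formality -> typeface.
    typeface = "Times New Roman" if word.formal else "Comic Sans"

    # Emotion -> emoji, speech bubble, and motion.
    excited = word.emotion == "excited"
    return CaptionStyle(
        text=word.text,
        color=color,
        size=size,
        typeface=typeface,
        emoji=EMOJI_BY_EMOTION.get(word.emotion),
        speech_bubble=excited,
        motion="shivering" if excited else None,
        ornament=word.is_entity,
    )
```

For example, under these assumptions a loud, casual, excited word with positive valence would be styled as a large, warm-colored "Comic Sans" caption inside a speech bubble with a "shivering" motion, while a quiet, formal, sad word would be a small, cold-colored "Times New Roman" caption accompanied by the sad emoji placeholder.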