Augmented Conversation with Embedded Speech-Driven On-the-Fly Referencing in AR

Shivesh Jadon, Mehrad Faridan, Edward Mah, Rajan Vaish, Wesley Willett, Ryo Suzuki

TL;DR

This work introduces augmented conversation, a framework for real-time, speech-driven referencing in augmented reality to support co-located discussions without breaking eye contact. It delivers RealityChat, a HoloLens 2 prototype that combines real-time speech recognition, transformer-based keyword extraction, and gaze-based interaction to embed contextual visual references around conversational partners. A user study with 13 participants indicates that the approach reduces distraction relative to smartphone searches and is perceived as intuitive and highly useful, with visual references (images, maps, and Wikipedia snippets) being particularly effective. The design-space analysis and prototype demonstrate the potential of AR for seamless, context-aware conversational augmentation, with implications for education, professional settings, and accessibility.

Abstract

This paper introduces the concept of augmented conversation, which aims to support co-located in-person conversations via embedded speech-driven on-the-fly referencing in augmented reality (AR). Today, computing technologies like smartphones allow quick access to a variety of references during a conversation. However, these tools often create distractions, reducing eye contact and forcing users to focus their attention on phone screens and manually enter keywords to access relevant information. In contrast, AR-based on-the-fly referencing provides relevant visual references in real time, based on keywords extracted automatically from the spoken conversation. By embedding these visual references in AR around the conversation partner, augmented conversation reduces distraction and friction, allowing users to maintain eye contact and supporting more natural social interactions. To demonstrate this concept, we developed RealityChat, a HoloLens-based interface that leverages real-time speech recognition, natural language processing, and gaze-based interactions for on-the-fly embedded visual referencing. In this paper, we explore the design space of visual referencing for conversations and describe our implementation, which builds on seven design guidelines identified through a user-centered design process. An initial user study confirms that our system decreases distraction and friction in conversations compared to smartphone searches, while providing highly useful and relevant information.

Paper Structure (26 sections, 5 figures)

Figures (5)

  • Figure 1: On-the-fly conversational support through an interactive augmented reality (AR) application. (Left and Middle) Users interact with an AR interface that transforms real-time speech into visual overlays, displaying information about Paris, including maps, weather forecasts, and landmarks. (Right) A user explores details about Pablo Picasso through an overlay featuring images and textual information. This AR-based system exemplifies the concept of augmented conversation by providing real-time visual references based on spoken conversation keywords.
  • Figure 2: The design space of augmented conversation approaches is large and diverse. Here we identify a variety of possible design options for augmented conversation spanning seven design dimensions. Design possibilities explored in our RealityChat prototype are highlighted in blue.
  • Figure 3: Overview of the key features of RealityChat, each with its corresponding design goals.
  • Figure 4: The RealityChat system as seen from the point of view of a user wearing a HoloLens 2.
  • Figure 5: Close-up views of the visual reference types implemented in RealityChat, including keywords, maps, weather, image search results, and Wikipedia snippets. Keywords are extracted from the live transcript (upper left) and labeled by category (organization, location, date, and person). The transcript itself is not shown in the AR view.
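
The captions above describe keywords labeled by category and several visual reference types. A minimal sketch of how a labeled keyword might be routed to reference types is shown below. The category names and reference types come from the Figure 5 caption; the routing table, the `Keyword` class, and the `references_for` function are hypothetical names introduced for illustration and are not taken from the RealityChat implementation.

```python
from dataclasses import dataclass

# Keyword categories named in the Figure 5 caption.
CATEGORIES = ("organization", "location", "date", "person")

# Hypothetical routing table (an assumption, not the authors' design):
# which reference types to request for each keyword category.
REFERENCES_BY_CATEGORY = {
    "location": ["map", "weather", "images", "wikipedia"],
    "person": ["images", "wikipedia"],
    "organization": ["images", "wikipedia"],
    "date": [],  # assumed: shown as a keyword label only
}


@dataclass(frozen=True)
class Keyword:
    """A keyword extracted from the transcript, with its category label."""
    text: str
    category: str


def references_for(keyword: Keyword) -> list[tuple[str, str]]:
    """Return (query, reference_type) pairs to request for one keyword."""
    if keyword.category not in CATEGORIES:
        raise ValueError(f"unknown category: {keyword.category!r}")
    return [(keyword.text, kind) for kind in REFERENCES_BY_CATEGORY[keyword.category]]


if __name__ == "__main__":
    for kw in (Keyword("Paris", "location"), Keyword("Pablo Picasso", "person")):
        print(kw.text, "->", [kind for _, kind in references_for(kw)])
```

Under this assumed table, a location such as "Paris" would request a map, weather, images, and a Wikipedia snippet, while a person such as "Pablo Picasso" would request images and a Wikipedia snippet, which matches the examples in Figure 1.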