Table of Contents
Fetching ...

User Interaction Patterns and Breakdowns in Conversing with LLM-Powered Voice Assistants

Amama Mahmood, Junxiang Wang, Bingsheng Yao, Dakuo Wang, Chien-Ming Huang

TL;DR

This work investigates how Large Language Models (LLMs) augment voice assistants by prototyping ChatGPT within Alexa and conducting an exploratory study (N=$20$) across three tasks: medical self-diagnosis, creative planning, and opinionated discussion. It identifies diverse interaction patterns and demonstrates that the LLM absorbs the majority of intent-recognition failures ($81\%$, approximately) and proactively recovers from some breakdowns ($\approx 11\%$). The study provides design guidelines for tailoring text-centric LLMs to voice interactions, including hierarchical responses, reduced repetition, and context retention to support resilient, multi-turn conversations. The findings hold practical implications for building more fluid, context-aware, and safer LLM-powered voice assistants across high- and low-stakes scenarios.

Abstract

Conventional Voice Assistants (VAs) rely on traditional language models to discern user intent and respond to their queries, leading to interactions that often lack a broader contextual understanding, an area in which Large Language Models (LLMs) excel. However, current LLMs are largely designed for text-based interactions, thus making it unclear how user interactions will evolve if their modality is changed to voice. In this work, we investigate whether LLMs can enrich VA interactions via an exploratory study with participants (N=20) using a ChatGPT-powered VA for three scenarios (medical self-diagnosis, creative planning, and discussion) with varied constraints, stakes, and objectivity. We observe that LLM-powered VA elicits richer interaction patterns that vary across tasks, showing its versatility. Notably, LLMs absorb the majority of VA intent recognition failures. We additionally discuss the potential of harnessing LLMs for more resilient and fluid user-VA interactions and provide design guidelines for tailoring LLMs for voice assistance.

User Interaction Patterns and Breakdowns in Conversing with LLM-Powered Voice Assistants

TL;DR

This work investigates how Large Language Models (LLMs) augment voice assistants by prototyping ChatGPT within Alexa and conducting an exploratory study (N=) across three tasks: medical self-diagnosis, creative planning, and opinionated discussion. It identifies diverse interaction patterns and demonstrates that the LLM absorbs the majority of intent-recognition failures (, approximately) and proactively recovers from some breakdowns (). The study provides design guidelines for tailoring text-centric LLMs to voice interactions, including hierarchical responses, reduced repetition, and context retention to support resilient, multi-turn conversations. The findings hold practical implications for building more fluid, context-aware, and safer LLM-powered voice assistants across high- and low-stakes scenarios.

Abstract

Conventional Voice Assistants (VAs) rely on traditional language models to discern user intent and respond to their queries, leading to interactions that often lack a broader contextual understanding, an area in which Large Language Models (LLMs) excel. However, current LLMs are largely designed for text-based interactions, thus making it unclear how user interactions will evolve if their modality is changed to voice. In this work, we investigate whether LLMs can enrich VA interactions via an exploratory study with participants (N=20) using a ChatGPT-powered VA for three scenarios (medical self-diagnosis, creative planning, and discussion) with varied constraints, stakes, and objectivity. We observe that LLM-powered VA elicits richer interaction patterns that vary across tasks, showing its versatility. Notably, LLMs absorb the majority of VA intent recognition failures. We additionally discuss the potential of harnessing LLMs for more resilient and fluid user-VA interactions and provide design guidelines for tailoring LLMs for voice assistance.
Paper Structure (56 sections, 10 figures, 14 tables)

This paper contains 56 sections, 10 figures, 14 tables.

Figures (10)

  • Figure 1: We explore user interactions with an LLM-powered voice assistant in three distinct scenarios: medical self-diagnosis, creative planning, and discussion with an opinionated AI. We report interaction patterns and breakdowns based on the style of speech used during the conversations. The interaction pattern and example conversation above depict ChatGPT's reluctance to answer specific medical queries, such as requests for medication brand names. However, upon re-asking, ChatGPT lists brands with an accompanying warning (a statement informing the user that ChatGPT is not an expert and that they should consult an expert).
  • Figure 2: System implementation of integrating ChatGPT 3.5 into an Alexa skill. User query is transcribed and passed to the Alexa skill once the user's intent to interact with the ChatGPT-powered VA is detected by Alexa (1 and 3). User query (appended with conversation history) is sent to ChatGPT through a middleman API mechanism (4). Once ChatGPT's response is retrieved by a secondary middleman API (5), it is transmitted to the smart speaker via the primary middleman API and the Alexa skill (6). The primary and secondary APIs communicate ChatGPT's response via a shared database. The cycle 3 $\rightarrow$ 4 $\rightarrow$ 5 $\rightarrow$ 6 is repeated for all user queries for our ChatGPT-powered VA implementation.
  • Figure 3: Our study tasks: medical self-diagnosis, creative trip planning, and discussion with an opinionated AI.
  • Figure 4: Speech act hierarchy for states and attributes. Speech acts (states), attribute categories (style of speech), and attributes are denoted in purple, blue, and green, respectively. Attributes are leaf nodes in green. Attributes can co-occur in one utterance unless they belong to the same category (blue); for instance, factual and opinion cannot co-occur due to semantic conflicts. We end up with codes that are combinations of a state and one or more attributes (e.g., argument, egocentric statement or specific, opinion question).
  • Figure 5: Common interaction patterns observed across all tasks, including how the user starts the conversation (1) and concludes it (2); common patterns consistent throughout the scenarios for question--answer pairs (3) and (4); and wait patterns emerging from our design including filler and small talk questions. Green indicates user actions (states), while orange denotes VA actions. Arrows signify transitions between states. "User query" encompasses various user speech acts like questions or statements. "By design" refers to VA states emerging from our implementation, such as fillers. "n" indicates the number of times a pattern occurs.
  • ...and 5 more figures