Human and LLM-Based Voice Assistant Interaction: An Analytical Framework for User Verbal and Nonverbal Behaviors

Szeyi Chan, Shihan Fu, Jiachen Li, Bingsheng Yao, Smit Desai, Mirjana Prpa, Dakuo Wang

TL;DR

The paper addresses the lack of a systematic framework for analyzing verbal and nonverbal user behaviors in human-LLM-VA interactions during complex tasks. It introduces a three-dimensional analytical framework grounded in Behavior Characteristics, Interaction Stages (Exploration, Conflict, Integration), and Stage Transitions, and validates it through a focused reanalysis of 3 hours and 39 minutes of video with 12 participants performing a salad-cooking task using Mango Mango. The study highlights specific verbal and nonverbal behaviors across stages and details how users transition between stages, offering design implications such as emotion-aware responses and adaptive VA personas. The work provides a foundation for designing more natural, socially aware LLM-VAs and for developing multimodal assessment methods in human-LLM-VA interactions across diverse task contexts.

Abstract

Recent progress in large language model (LLM) technology has significantly enhanced the interaction experience between humans and voice assistants (VAs). This project aims to explore a user's continuous interaction with an LLM-based VA (LLM-VA) during a complex task. We recruited 12 participants to interact with an LLM-VA during a cooking task, selected for its complexity and the requirement for continuous interaction. We observed that users show both verbal and nonverbal behaviors, even though they know that the LLM-VA cannot capture those nonverbal signals. Despite the prevalence of nonverbal behavior in human-human communication, there is no established analytical methodology or framework for exploring it in human-VA interactions. After analyzing 3 hours and 39 minutes of video recordings, we developed an analytical framework with three dimensions: 1) behavior characteristics, including both verbal and nonverbal behaviors, 2) interaction stages--exploration, conflict, and integration--that illustrate the progression of user interactions, and 3) stage transitions throughout the task. This analytical framework identifies key verbal and nonverbal behaviors that provide a foundation for future research and practical applications in optimizing human and LLM-VA interactions.

Paper Structure (47 sections, 15 figures, 3 tables)

Figures (15)

  • Figure 1: The proposed analytical framework consists of three dimensions: 1) behavior characteristics, 2) the three interaction stages, and 3) stage transition.
  • Figure 2: System diagram of "Mango Mango" (MM). The process begins with users providing voice input to Alexa. Then, Alexa performs a speech-to-text conversion and adds the transcribed input to the conversation log. This log is saved to a database. Next, the conversation histories are processed in the prompt module. The completed prompt is then sent to GPT-3.5 Turbo, and finally, the resulting response is sent back to Alexa, where it is converted into speech for the user, completing the system loop. The complete and detailed system flow is described in previous work [chan2023mango].
  • Figure 3: Pictures of participants actively engaged in the experiment at different stages of their experience with dialogue history. The study took place in the fully operational kitchen in the smart home laboratory. The Alexa device was placed on the left side of the participants (circled in white).
  • Figure 4: Screenshot of ELAN annotation software showing three annotation cycles. All layers mentioned in the video analysis section can be seen here.
  • Figure 5: (a) Each participant's allocation of interaction time with the voice assistant across stages. Each color represents a different stage, with the length corresponding to the duration of time spent in that stage. (b) Total time allocation of each stage among all participants. Each bar represents the range between the minimum and maximum occurrences recorded. The black dot indicates the mean frequency of these behaviors, while the vertical line extending from each bar denotes the standard deviation.
  • ...and 10 more figures