Advancing User-Voice Interaction: Exploring Emotion-Aware Voice Assistants Through a Role-Swapping Approach

Yong Ma, Yuchong Zhang, Di Fu, Stephanie Zubicueta Portales, Danica Kragic, Morten Fjeld

TL;DR

This paper investigates emotion-aware voice assistants (VAs) through a role-swapping approach in which users regulate an AI's emotions during emotionally charged interactions. The authors conducted an online user study with five basic emotional contexts (neutral, happy, sad, angry, fear) and collected acoustic and NLP features to examine how people respond and which cues differentiate emotional states. The analysis identifies key acoustic indicators (RMS, ZCR, jitter) and linguistic metrics (polarity, TTR, word clouds), and shows that users tend to respond to negative cues with neutral or positive tones, supporting emotional de-escalation and regulation. The findings inform the design of adaptive, context-aware VAs that tailor empathetic responses to user tendencies, cultural context, and conversational goals, potentially boosting trust and engagement in human-AI interactions.

Abstract

As voice assistants (VAs) become increasingly integrated into daily life, the need for emotion-aware systems that can recognize and respond appropriately to user emotions has grown. While significant progress has been made in speech emotion recognition (SER) and sentiment analysis, effectively addressing user emotions, particularly negative ones, remains a challenge. This study explores human emotional response strategies in VA interactions using a role-swapping approach, where participants regulate AI emotions rather than receiving pre-programmed responses. Through speech feature analysis and natural language processing (NLP), we examined acoustic and linguistic patterns across various emotional scenarios. Results show that participants favor neutral or positive emotional responses when engaging with negative emotional cues, highlighting a natural tendency toward emotional regulation and de-escalation. Key acoustic indicators such as root mean square (RMS), zero-crossing rate (ZCR), and jitter were identified as sensitive to emotional states, while sentiment polarity and lexical diversity, measured by the type-token ratio (TTR), distinguished between positive and negative responses. These findings provide valuable insights for developing adaptive, context-aware VAs capable of delivering empathetic, culturally sensitive, and user-aligned responses. By understanding how humans naturally regulate emotions in AI interactions, this research contributes to the design of more intuitive and emotionally intelligent voice assistants, enhancing user trust and engagement in human-AI interactions.
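
To make the three acoustic indicators named in the abstract concrete, the following is a minimal sketch of how RMS, ZCR, and jitter are commonly defined. It is not the authors' extraction pipeline: the function names, the synthetic sine-wave frame, and the example pitch periods are all assumptions chosen for illustration, and jitter is computed here with one common "local jitter" definition that may differ from the one used in the paper.

```python
import math

def rms(samples):
    """Root mean square amplitude of a frame (a rough loudness measure)."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)

def local_jitter(periods):
    """Mean absolute difference between consecutive pitch periods,
    divided by the mean period (one common definition of jitter)."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

# Hypothetical input: a 100 Hz sine sampled at 8 kHz (10 full cycles).
sample_rate = 8000
frame = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(800)]

# Hypothetical pitch periods in seconds, e.g. from a separate pitch tracker.
periods = [0.0100, 0.0102, 0.0099, 0.0101]

print("RMS:", rms(frame))
print("ZCR:", zero_crossing_rate(frame))
print("Jitter:", local_jitter(periods))
```

In practice these values would be computed per frame on recorded speech and then aggregated per response; a perfectly periodic voice would give a jitter of zero, while irregular vocal-fold cycles push it higher.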

Paper Structure

This paper contains 29 sections, 7 figures.

Figures (7)

  • Figure 1: The web page is designed to collect voice samples from participants. When an emoji is clicked, it turns yellow and plays an emotional context (e.g., a sad or happy scenario). Participants are then prompted to say something comforting or engage in a conversation with the emoji by clicking the "Start Recording" button. This interactive design allows users to respond naturally to the emotional context, providing valuable data for emotion-aware systems.
  • Figure 2: The Emotional Responding Distribution from Different Emotional Scenarios
  • Figure 3: Comparison of Different Features Using T-Test from Five Basic Emotional Scenarios: (a) Root Mean Square, (b) Zero-Crossing Rate, and (c) Jitter.
  • Figure 4: Word Cloud Representations for Different Emotional Scenarios: (a) Angry, (b) Fear, (c) Happy, (d) Neutral, and (e) Sad.
  • Figure 5: Correlation Heatmap Between NLP Features. The heatmap visualizes the relationships among key linguistic features, including polarity, subjectivity, word count, and type-token ratio (TTR). Warmer colors indicate stronger positive correlations, while cooler colors represent negative correlations.
  • ...and 2 more figures
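
The linguistic features behind Figure 5 can be illustrated with a small sketch. It computes word count and type-token ratio (TTR) for a few invented responses and then the Pearson correlation between the two, which is the kind of value a correlation heatmap cell shows. The tokenizer, the example sentences, and the function names are assumptions for illustration; polarity and subjectivity are left out because they require a sentiment lexicon or library that this summary does not name.

```python
import math
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(text):
    """Unique words divided by total words (lexical diversity)."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Hypothetical participant responses, invented for this example.
responses = [
    "I am sorry you feel sad",
    "It is okay, it is okay, it is going to be okay",
    "That is great news",
]

word_counts = [len(tokenize(r)) for r in responses]
ttrs = [type_token_ratio(r) for r in responses]

print("Word counts:", word_counts)
print("TTRs:", ttrs)
print("Correlation (word count vs. TTR):", pearson(word_counts, ttrs))
```

Longer responses tend to repeat words, so TTR usually falls as word count rises; this is one reason a heatmap of such features can show a negative correlation between the two.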