Speech Command + Speech Emotion: Exploring Emotional Speech Commands as a Compound and Playful Modality

Ilhan Aslan, Timothy Merritt, Stine S. Johansen, Niels van Berkel

TL;DR

This study investigates emotional speech commands as a compound modality by integrating a speech-emotion recognition (SER) system with speech commands to control embodied agents. In a retro game prototype, two agents respond to commands; the affective agent additionally adapts its movement and emoji display based on valence, arousal, and dominance extracted from the speaker's voice. A within-subject user study (N = 14) reveals that the affective agent is more engaging and stimulating but harder to use and less predictable, highlighting trade-offs between social richness and usability. The work provides design considerations for incorporating emotional speech as an input modality, discusses attribution and ethical implications, and lays groundwork for future research on affective, speech-driven interfaces in education and entertainment contexts.

Abstract

In an era of human-computer interaction with increasingly agentic AI systems capable of connecting with users conversationally, speech is an important modality for commanding agents. By recognizing and using speech emotions (i.e., how a command is spoken), we can provide agents with the ability to emotionally accentuate their responses and socially enrich users' perceptions and experiences. To explore the concept and impact of speech emotion commands on user perceptions, we realized a prototype and conducted a user study (N = 14) where speech commands are used to steer two vehicles in a minimalist and retro game style implementation. While both agents execute user commands, only one of the agents uses speech emotion information to adapt its execution behavior. We report on differences in how users perceived each agent, including significant differences in stimulation and dependability, outline implications for designing interactions with agents using emotional speech commands, and provide insights on how users consciously emote, which we describe as "voice acting".

Paper Structure

This paper contains 12 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Overview of the system architecture of the prototype.
  • Figure 2: Cropped screenshots showing examples of different agent behaviors for speech commands expressed in different ways. A lower position, as in a) and b), indicates lower valence, while higher speed indicates higher (emotional) arousal in the speech command. c) is an example of high arousal and high valence, compared to f), which shows lower arousal and lower valence. d) and e) are closer to a command given in a neutral tone.
  • Figure 3: Overview of UEQ scores, measuring the user experience of both agents. Error bars denote 95% confidence intervals.
  • Figure 4: Overview of ratings for the Godspeed questionnaire items of the perceived intelligence module. Error bars denote 95% confidence intervals.
  • Figure 5: Overview of mean opinion scores, provided by participants to describe how they felt about the agents and interacting with them. Error bars denote 95% confidence intervals.
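
The mapping described in the Figure 2 caption (lower valence shown as a lower position, higher arousal shown as higher speed) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the [0, 1] input range, the scaling constants, the emoji thresholds, and the emoji themselves are all assumptions made for illustration. Dominance is left out because the text above does not say how it affects the agent.

```python
from dataclasses import dataclass


@dataclass
class AgentBehavior:
    vertical_offset: float  # pixels relative to a neutral lane; positive = higher
    speed: float            # multiplier applied to the agent's movement
    emoji: str              # hypothetical emoji shown next to the agent


def clamp(x: float, lo: float = 0.0, hi: float = 1.0) -> float:
    """Keep a recognizer score inside the assumed [lo, hi] range."""
    return max(lo, min(hi, x))


def map_emotion_to_behavior(
    valence: float,
    arousal: float,
    base_speed: float = 1.0,   # assumed speed for a neutral command
    max_offset: float = 50.0,  # assumed maximum vertical displacement
) -> AgentBehavior:
    """Map assumed [0, 1] valence/arousal scores to display parameters."""
    v = clamp(valence)
    a = clamp(arousal)

    # Lower valence -> lower position; 0.5 is treated as neutral.
    vertical_offset = (v - 0.5) * 2 * max_offset

    # Higher arousal -> higher speed; neutral arousal keeps the base speed.
    speed = base_speed * (0.5 + a)

    # Hypothetical emoji thresholds, chosen only for this example.
    if v >= 0.6:
        emoji = "🙂"
    elif v <= 0.4:
        emoji = "🙁"
    else:
        emoji = "😐"

    return AgentBehavior(vertical_offset, speed, emoji)


if __name__ == "__main__":
    # An excited, positive command versus a flat, negative one.
    print(map_emotion_to_behavior(valence=0.9, arousal=0.9))
    print(map_emotion_to_behavior(valence=0.1, arousal=0.1))
```

A linear mapping is used here only because it is the simplest choice that preserves the ordering described in the caption; a real system might smooth the scores over time or use a nonlinear curve.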