A Customer Journey in the Land of Oz: Leveraging the Wizard of Oz Technique to Model Emotions in Customer Service Interactions

Sofie Labat, Thomas Demeester, Véronique Hoste

TL;DR

The paper presents EmoWOZ-CS, a bilingual Dutch–English corpus of 2,148 customer-service conversations collected via a controlled Wizard of Oz setup that actively steers emotional trajectories. It provides rich annotations (11 emotions plus neutral, valence/arousal/dominance) and operator-strategy labels, alongside self- versus third-party emotion judgments and participant profiling. Through descriptive analyses and predictive experiments, the authors show moderate annotation agreement, a predominance of neutral messages, and the challenges of forward-looking emotion inference from context, with valence being more tractable than discrete emotions. The work demonstrates the value of WOZ-collected, in-domain data for training and evaluating both detection and forward-looking inference systems, while outlining limitations and future steps toward longer-context, perspectivist, and multimodal emotion modeling in customer service AI.

Abstract

Emotion-aware customer service needs in-domain conversational data, rich annotations, and predictive capabilities, but existing resources for emotion recognition are often out-of-domain, narrowly labeled, and focused on post-hoc detection. To address this, we conducted a controlled Wizard of Oz (WOZ) experiment to elicit interactions with targeted affective trajectories. The resulting corpus, EmoWOZ-CS, contains 2,148 bilingual (Dutch-English) written dialogues from 179 participants across commercial aviation, e-commerce, online travel agencies, and telecommunication scenarios. Our contributions are threefold: (1) Evaluate WOZ-based operator-steered valence trajectories as a design for emotion research; (2) Quantify human annotation performance and variation, including divergences between self-reports and third-party judgments; (3) Benchmark detection and forward-looking emotion inference in real-time support. Findings show neutral dominates participant messages; desire and gratitude are the most frequent non-neutral emotions. Agreement is moderate for multilabel emotions and valence, lower for arousal and dominance; self-reports diverge notably from third-party labels, aligning most for neutral, gratitude, and anger. Objective strategies often elicit neutrality or gratitude, while suboptimal strategies increase anger, annoyance, disappointment, desire, and confusion. Some affective strategies (cheerfulness, gratitude) foster positive reciprocity, whereas others (apology, empathy) can also leave desire, anger, or annoyance. Temporal analysis confirms successful conversation-level steering toward prescribed trajectories, most distinctly for negative targets; positive and neutral targets yield similar final valence distributions. Benchmarks highlight the difficulty of forward-looking emotion inference from prior turns, underscoring the complexity of proactive emotion-aware support.

Paper Structure

This paper contains 53 sections, 5 equations, 11 figures, 17 tables.

Figures (11)

  • Figure 1: WOZ experiment methodology. Purple box: full participant workflow, including the conversation collection, emotion labeling, and profiling questionnaires. Gray box: conversation collection with participants' unique functionalities in red and wizards' unique functionalities in blue. Both interlocutors read the scenario and can type messages. Wizards must label the response strategies (RS category) in their message before sending it, unlike participants. Participants decide about ending the interaction; wizards receive an emotional valence toward which to steer the conversation (end valence).
  • Figure 2: Histograms of message counts by prescribed final valence categories.
  • Figure 3: Big Five personality traits distribution.
  • Figure 4: Bubble pie chart showing the distribution of emotions in a two-dimensional affective space (valence × arousal). Each bubble represents an emotion, positioned by its average valence (horizontal axis) and average arousal (vertical axis). The size of each bubble reflects the relative frequency of that emotion among all messages labeled with an emotion (excluding neutral messages). Each bubble is divided into colored shades indicating the proportion of positive, neutral, and negative target valence outcomes for that emotion.
  • Figure 5: Spider plot comparing emotion annotations between participants (self-reports) and third-party annotators. For each emotion category, the plot shows the average relative frequency of: mutual agreement (purple), only self-reported annotations (red), and only third-party annotations (blue). All frequencies are averaged across three independent third-party annotators.
  • ...and 6 more figures