A Customer Journey in the Land of Oz: Leveraging the Wizard of Oz Technique to Model Emotions in Customer Service Interactions
Sofie Labat, Thomas Demeester, Véronique Hoste
TL;DR
The paper presents EmoWOZ-CS, a bilingual Dutch–English corpus of 2,148 customer-service conversations collected via a controlled Wizard of Oz setup that actively steers emotional trajectories. It provides rich annotations (11 emotions plus neutral, and valence/arousal/dominance) and operator-strategy labels, alongside self- versus third-party emotion judgments and participant profiling. Through descriptive analyses and predictive experiments, the authors show that annotation agreement is moderate, that neutral messages are prominent, and that forward-looking emotion inference from context is challenging, with valence being more tractable than discrete emotions. The work demonstrates the value of WOZ-collected, in-domain data for training and evaluating both detection and forward-looking inference systems, while outlining limitations and future steps toward longer-context, perspectivist, and multimodal emotion modeling in customer service AI.
Abstract
Emotion-aware customer service needs in-domain conversational data, rich annotations, and predictive capabilities, but existing resources for emotion recognition are often out-of-domain, narrowly labeled, and focused on post-hoc detection. To address this, we conducted a controlled Wizard of Oz (WOZ) experiment to elicit interactions with targeted affective trajectories. The resulting corpus, EmoWOZ-CS, contains 2,148 bilingual (Dutch-English) written dialogues from 179 participants across commercial aviation, e-commerce, online travel agency, and telecommunication scenarios. Our contributions are threefold: (1) we evaluate WOZ-based, operator-steered valence trajectories as a design for emotion research; (2) we quantify human annotation performance and variation, including divergences between self-reports and third-party judgments; and (3) we benchmark emotion detection and forward-looking emotion inference for real-time support. Findings show that neutral dominates participant messages, and that desire and gratitude are the most frequent non-neutral emotions. Agreement is moderate for multilabel emotions and valence, and lower for arousal and dominance; self-reports diverge notably from third-party labels, aligning most closely for neutral, gratitude, and anger. Objective strategies often elicit neutrality or gratitude, while suboptimal strategies increase anger, annoyance, disappointment, desire, and confusion. Some affective strategies (cheerfulness, gratitude) foster positive reciprocity, whereas others (apology, empathy) can still be followed by desire, anger, or annoyance. Temporal analysis confirms successful conversation-level steering toward prescribed trajectories, most distinctly for negative targets; positive and neutral targets yield similar final valence distributions. Benchmarks highlight the difficulty of forward-looking emotion inference from prior turns, underscoring the complexity of proactive emotion-aware support.
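To make the distinction between detection and forward-looking inference concrete, the following is a minimal sketch of how the two tasks might frame the same dialogue as training examples. It is not the authors' pipeline: the `Turn` structure, the function names, the integer valence values, and the example dialogue are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Turn:
    speaker: str                   # "customer" or "operator" (hypothetical role names)
    text: str
    valence: Optional[int] = None  # hypothetical label, set on customer turns only


def detection_pairs(dialogue: List[Turn]) -> List[Tuple[str, int]]:
    """Detection: the model sees the customer message it has to label."""
    return [
        (t.text, t.valence)
        for t in dialogue
        if t.speaker == "customer" and t.valence is not None
    ]


def forward_pairs(dialogue: List[Turn]) -> List[Tuple[List[str], int]]:
    """Forward-looking inference: the model sees only the turns BEFORE
    the customer message whose label it has to predict."""
    pairs = []
    for i, t in enumerate(dialogue):
        if t.speaker == "customer" and t.valence is not None and i > 0:
            context = [f"{p.speaker}: {p.text}" for p in dialogue[:i]]
            pairs.append((context, t.valence))
    return pairs


# Invented example dialogue; the valence values are illustrative only.
example = [
    Turn("customer", "My suitcase never arrived.", valence=2),
    Turn("operator", "I am sorry to hear that. I will trace it right away."),
    Turn("customer", "Thank you, that helps.", valence=4),
]

detection = detection_pairs(example)
forward = forward_pairs(example)
```

In this framing, the first customer message yields a detection example but no forward-looking example, because there is no prior context to predict from. The forward-looking target is the label of a message the model has not yet seen, which is one intuition for why the task is harder than detection.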
